[jira] [Created] (LUCENE-9824) Hunspell suggestions: speed up ngram score calculation for each dictionary entry

2021-03-04 Thread Peter Gromov (Jira)
Peter Gromov created LUCENE-9824:


 Summary:  Hunspell suggestions: speed up ngram score calculation 
for each dictionary entry
 Key: LUCENE-9824
 URL: https://issues.apache.org/jira/browse/LUCENE-9824
 Project: Lucene - Core
  Issue Type: Sub-task
Reporter: Peter Gromov






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-15185) Improve "hash" QParser

2021-03-04 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley resolved SOLR-15185.
-
Fix Version/s: master (9.0)
   Resolution: Fixed

> Improve "hash" QParser
> --
>
> Key: SOLR-15185
> URL: https://issues.apache.org/jira/browse/SOLR-15185
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> * Don't use Filter (to be removed)
> * Do use TwoPhaseIterator, not PostFilter
> * Don't pre-compute matching docs (wasteful)
> * Support more fields, and more field types
> * Faster hash on Strings (avoid Char conversion)
> * Stronger hash when using multiple fields






[jira] [Commented] (SOLR-15185) Improve "hash" QParser

2021-03-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295761#comment-17295761
 ] 

ASF subversion and git services commented on SOLR-15185:


Commit ddbd3b88ec8a9c3acc55e351f94f370a11f514b5 in lucene-solr's branch 
refs/heads/master from David Smiley
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ddbd3b8 ]

SOLR-15185: Optimize Hash QParser (#1524)

Used in the parallel() streaming expression.  The hash algorithm is different.
* Simpler
* Don't use Filter (to be removed)
* Do use TwoPhaseIterator, not PostFilter
* Don't pre-compute matching docs (wasteful)
* Support more fields, and more field types
* Faster hash on Strings (avoid Char conversion)
* Stronger hash when using multiple fields
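The bullets above can be sketched roughly as follows. This is an illustrative Python sketch, not Solr's actual implementation: the function names and the CRC32 hash are stand-ins for whatever Solr uses. It shows hashing a field's UTF-8 bytes directly (no per-character conversion pass), mixing per-field hashes so multiple fields strengthen the combined hash, and a lazy two-phase style check instead of pre-computing all matching docs.

```python
import zlib

def combined_hash(field_values):
    # Hash each field's UTF-8 bytes directly (avoids a per-character
    # conversion pass), then mix the per-field hashes so that every
    # field contributes independently to the combined hash.
    h = 0
    for v in field_values:
        h = (h * 31 + zlib.crc32(v.encode("utf-8"))) & 0xFFFFFFFF
    return h

def matches(field_values, num_partitions, partition):
    # Two-phase style: instead of pre-computing the full set of matching
    # docs up front (wasteful), the hash test runs lazily per visited doc.
    return combined_hash(field_values) % num_partitions == partition

# Each doc lands in exactly one of the partitions.
doc = ["some-id", "some-route-key"]
owners = [p for p in range(4) if matches(doc, 4, p)]
```

With four workers each asking "is this doc mine?", exactly one worker claims any given document, which is the property the parallel() streaming expression relies on.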

> Improve "hash" QParser
> --
>
> Key: SOLR-15185
> URL: https://issues.apache.org/jira/browse/SOLR-15185
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> * Don't use Filter (to be removed)
> * Do use TwoPhaseIterator, not PostFilter
> * Don't pre-compute matching docs (wasteful)
> * Support more fields, and more field types
> * Faster hash on Strings (avoid Char conversion)
> * Stronger hash when using multiple fields






[GitHub] [lucene-solr] dsmiley merged pull request #1524: SOLR-15185: Rewrite Hash query

2021-03-04 Thread GitBox


dsmiley merged pull request #1524:
URL: https://github.com/apache/lucene-solr/pull/1524


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Commented] (SOLR-14660) Migrating HDFS into a package

2021-03-04 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295758#comment-17295758
 ] 

David Smiley commented on SOLR-14660:
-

When the build.gradle is created for this contrib, please try to undo the 
dependency flattening that's in most modules only because it was ported from 
Ant.  Basically SOLR-14929 but just scoped to this new contrib.  This means 
removing all/most of the {{transitive=false}} and runtime dependencies, then 
figuring out which deps are added unnecessarily so they can be excluded.  For 
example, I'm seeing that Netty will be transitively included, and thus won't 
need a mention in build.gradle.

BTW in SOLR-15215 I'm removing Netty from SolrJ and in so doing I had to 
explicitly reference Netty in solr-core's build because it will no longer come 
in automatically via SolrJ.  But ideally, hadoop deps would not say 
transitive=false so I wouldn't have had to do this.

Woodstox is a dependency of hadoop that solr-core will continue to provide for 
a while.  Jackson -- same.

There's a help/dependencies.txt file that is helpful.
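As a sketch of what that could look like in the new contrib's build.gradle (the module, version, and exclusion here are hypothetical placeholders, not the real coordinates):

```groovy
// Hypothetical sketch for the new HDFS contrib's build.gradle: declare
// hadoop-client normally (no "transitive = false"), and exclude only the
// dependencies known to be unnecessary, instead of flattening everything.
dependencies {
  implementation('org.apache.hadoop:hadoop-client:x.y.z') {
    // Netty already arrives transitively elsewhere; targeted exclusions
    // like this replace the blanket transitive=false ported from Ant.
    exclude group: 'io.netty'
  }
}
```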

> Migrating HDFS into a package
> -
>
> Key: SOLR-14660
> URL: https://issues.apache.org/jira/browse/SOLR-14660
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Major
>  Labels: package, packagemanager
>
> Following up on the deprecation of HDFS (SOLR-14021), we need to work on 
> isolating it away from Solr core and making a package for this. This issue is 
> to track the efforts for that.






[GitHub] [lucene-solr] zacharymorn commented on pull request #2342: LUCENE-9406: Add IndexWriterEventListener to track events in IndexWriter

2021-03-04 Thread GitBox


zacharymorn commented on pull request #2342:
URL: https://github.com/apache/lucene-solr/pull/2342#issuecomment-791146478


   No problem, and thanks for the review feedback as well Michael!









[jira] [Commented] (SOLR-15217) rename shardsWhitelist and use it more broadly

2021-03-04 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295720#comment-17295720
 ] 

David Smiley commented on SOLR-15217:
-

Maybe use it as an alternative for "allowSolrUrls" in 
CrossCollectionJoinQParser, maybe in ReplicationHandler.
I suppose its current location is not bad, but at the top level (outside of 
{{}}) would be better.  I'm doubtful it's worth moving it 
though.  I don't love that "shards" is in its name... I'd even prefer using the 
name chosen by CrossCollectionJoinQParser: allowSolrUrls.

> rename shardsWhitelist and use it more broadly
> --
>
> Key: SOLR-15217
> URL: https://issues.apache.org/jira/browse/SOLR-15217
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Priority: Major
>
> The {{shardsWhitelist}} is defined on shardHandlerFactory element in 
> solr.xml.  We should rename it to something like "shardsAllowList".  And we 
> could use it in more places.
> https://solr.apache.org/guide/8_7/distributed-requests.html#configuring-the-shardhandlerfactory






[jira] [Created] (SOLR-15217) rename shardsWhitelist and use it more broadly

2021-03-04 Thread David Smiley (Jira)
David Smiley created SOLR-15217:
---

 Summary: rename shardsWhitelist and use it more broadly
 Key: SOLR-15217
 URL: https://issues.apache.org/jira/browse/SOLR-15217
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: David Smiley


The {{shardsWhitelist}} is defined on shardHandlerFactory element in solr.xml.  
We should rename it to something like "shardsAllowList".  And we could use it 
in more places.

https://solr.apache.org/guide/8_7/distributed-requests.html#configuring-the-shardhandlerfactory
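A sketch of what the rename might look like in solr.xml (the attribute name shardsAllowList is the suggestion under discussion, not a shipped option, and the class and hosts are placeholders):

```xml
<solr>
  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <!-- proposed rename of shardsWhitelist; value format unchanged -->
    <str name="shardsAllowList">http://host1:8983/solr,http://host2:8983/solr</str>
  </shardHandlerFactory>
</solr>
```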






[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing

2021-03-04 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295703#comment-17295703
 ] 

Mark Robert Miller commented on SOLR-14788:
---

Most of my reservations about doing this, having stumbled upon it, revolve 
around the time available to me: the time needed now, the time needed in the 
future, and being able to force myself to do it again, or that often. On top of 
that, having lost major work on it and feeling time pressure from multiple 
angles, mostly fearing I wouldn't be able to keep concentrating like that, I 
was set up to be in a vulnerable state. That was fine in general for what I was 
prepared for, but then the unexpected hit me harder than it should have, which 
cost me a ton and made it all even more uncertain. And I just had to keep 
doubling down. A forced concentration march.

I say all that only to try and explain why I’ll now say, ignore most of what I 
said. I was trying to get here, to a point like this, really the whole time.

And if I’d gotten there, I’d have now said:

I’ve got this branch state.  It’s got great characteristics, lots of 
performance and efficiency, tests that can be fast and solid, lots of stuff.

I wish I could have shared this with someone else, or at least the knowledge to 
get there together, or spent time planning together. But I'm as good at that as 
I've always been, so I only have, and had, the same thing to offer there. 

But one other thing I can do is this.

And so what can be done with it is an interesting, open, and hard question. 

A few of us are going to keep pushing on it, not in terms of more changes, but 
toward a personal final state. In the meantime, I can imagine future 
brainstorming about what we might do with it if people end up having an 
interest in some of the results. 

> Solr: The Next Big Thing
> 
>
> Key: SOLR-14788
> URL: https://issues.apache.org/jira/browse/SOLR-14788
> Project: Solr
>  Issue Type: Task
>Reporter: Mark Robert Miller
>Assignee: Mark Robert Miller
>Priority: Critical
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> h3. 
> [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The
>  Policeman is {color:#de350b}NOW{color} {color:#de350b}OFF{color} 
> duty!*{color}
> {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and 
> have some fun. Try to make some progress. Don't stress too much about the 
> impact of your changes or maintaining stability and performance and 
> correctness so much. Until the end of phase 1, I've got your back. I have a 
> variety of tools and contraptions I have been building over the years and I 
> will continue training them on this branch. I will review your changes and 
> peer out across the land and course correct where needed. As Mike D will be 
> thinking, "Sounds like a bottleneck Mark." And indeed it will be to some 
> extent. Which is why once stage one is completed, I will flip The Policeman 
> to off duty. When off duty, I'm always* *occasionally*{color} *down for some 
> vigilante justice, but I won't be walking the beat, all that stuff about sit 
> back and relax goes out the window.*_
> {quote}
>  
> I have stolen this title from Ishan or Noble and Ishan.
> This issue is meant to capture the work of a small team that is forming to 
> push Solr and SolrCloud to the next phase.
> I have kicked off the work with an effort to create a very fast and solid 
> base. That work is not 100% done, but it's ready to join the fight.
> Tim Potter has started giving me a tremendous hand in finishing up. Ishan and 
> Noble have already contributed support and testing and have plans for 
> additional work to shore up some of our current shortcomings.
> Others have expressed an interest in helping and hopefully they will pop up 
> here as well.
> Let's organize and discuss our efforts here and in various sub issues.






[jira] [Commented] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation

2021-03-04 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295653#comment-17295653
 ] 

Robert Muir commented on LUCENE-9823:
-

+1

sounds like this rewrite is unsafe.

> SynonymQuery rewrite can change field boost calculation
> ---
>
> Key: LUCENE-9823
> URL: https://issues.apache.org/jira/browse/LUCENE-9823
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
>
> SynonymQuery accepts a boost per term, which acts as a multiplier on the term 
> frequency in the document. When rewriting a SynonymQuery with a single term, 
> we create a BoostQuery wrapping a TermQuery. This changes the meaning of the 
> boost: it now multiplies the final TermQuery score instead of multiplying the 
> term frequency before it's passed to the score calculation.
> This is a small point, but maybe it's worth avoiding rewriting a single-term 
> SynonymQuery unless the boost is 1.0.
> The same consideration affects CombinedFieldQuery in sandbox.






[jira] [Resolved] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9822.
-
Fix Version/s: master (9.0)
   Resolution: Fixed

Thanks [~gsmiller] !

> Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil
> --
>
> Key: LUCENE-9822
> URL: https://issues.apache.org/jira/browse/LUCENE-9822
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: master (9.0)
>Reporter: Greg Miller
>Priority: Trivial
> Fix For: master (9.0)
>
> Attachments: LUCENE-9822.patch
>
>
> PForUtil assumes that ForUtil.BLOCK_SIZE can be encoded in a single byte when 
> generating "patch offsets". If this assumption doesn't hold, PForUtil will 
> silently encode incorrect positions. While the BLOCK_SIZE isn't particularly 
> configurable, it would be nice to assert this assumption early in PForUtil in 
> the event that the BLOCK_SIZE changes in some future codec version.






[jira] [Commented] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295647#comment-17295647
 ] 

ASF subversion and git services commented on LUCENE-9822:
-

Commit 8e337ab63fac9aeeaf76e91c698cabad0ccbe769 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8e337ab ]

LUCENE-9822: Assert that ForUtil.BLOCK_SIZE can be PFOR-encoded in a single byte

For/PFor code has BLOCK_SIZE=128 as a static final constant, with a lot
of assumptions and optimizations for that case. For example it will
encode 3 exceptions at most and optimizes the exception encoding with a
single byte.

This would not work at all if you changed the constant in the code to
something like 512, but an assertion at an early stage helps make
experimentation less painful, and better "documents" the assumption of how
the exception encoding currently works.
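The invariant the commit documents can be sketched in Python (the names here are illustrative stand-ins, not the actual PForUtil code): with BLOCK_SIZE=128, an exception's position always fits in one byte, and asserting that early fails fast if someone experiments with a larger block.

```python
BLOCK_SIZE = 128

def encode_patches(values, bits):
    # Values too large for the regular bit width become "exceptions"
    # whose positions ("patch offsets") are each stored in one byte.
    assert len(values) == BLOCK_SIZE
    # The single-byte offset encoding only works while BLOCK_SIZE <= 256;
    # asserting it early "documents" the assumption and fails fast if
    # the constant is ever changed experimentally.
    assert BLOCK_SIZE <= 256, "patch offsets must fit in a single byte"
    exceptions = [(i, v >> bits) for i, v in enumerate(values) if v >= (1 << bits)]
    offsets = bytes(i for i, _ in exceptions)
    return offsets, exceptions

# One value (300) does not fit in 8 bits, so it becomes the block's
# single exception, patched at offset 2.
demo = [0] * BLOCK_SIZE
demo[2] = 300
offsets, exceptions = encode_patches(demo, 8)
```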


> Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil
> --
>
> Key: LUCENE-9822
> URL: https://issues.apache.org/jira/browse/LUCENE-9822
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: master (9.0)
>Reporter: Greg Miller
>Priority: Trivial
> Attachments: LUCENE-9822.patch
>
>
> PForUtil assumes that ForUtil.BLOCK_SIZE can be encoded in a single byte when 
> generating "patch offsets". If this assumption doesn't hold, PForUtil will 
> silently encode incorrect positions. While the BLOCK_SIZE isn't particularly 
> configurable, it would be nice to assert this assumption early in PForUtil in 
> the event that the BLOCK_SIZE changes in some future codec version.






[jira] [Created] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation

2021-03-04 Thread Julie Tibshirani (Jira)
Julie Tibshirani created LUCENE-9823:


 Summary: SynonymQuery rewrite can change field boost calculation
 Key: LUCENE-9823
 URL: https://issues.apache.org/jira/browse/LUCENE-9823
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Julie Tibshirani


SynonymQuery accepts a boost per term, which acts as a multiplier on the term 
frequency in the document. When rewriting a SynonymQuery with a single term, we 
create a BoostQuery wrapping a TermQuery. This changes the meaning of the 
boost: it now multiplies the final TermQuery score instead of multiplying the 
term frequency before it's passed to the score calculation.

This is a small point, but maybe it's worth avoiding rewriting a single-term 
SynonymQuery unless the boost is 1.0.

The same consideration affects CombinedFieldQuery in sandbox.
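The distinction can be seen with a toy saturating score function. This is an illustrative Python sketch, not Lucene's BM25: the point is only that a multiplier applied to the term frequency before saturation gives a different result than the same multiplier applied to the final score.

```python
def saturating_score(tf, k1=1.2):
    # Toy BM25-like term-frequency saturation (length norm omitted).
    return tf * (k1 + 1) / (tf + k1)

boost, tf = 0.5, 3.0
# SynonymQuery semantics: the boost scales the term frequency first.
boost_inside = saturating_score(boost * tf)
# BoostQuery(TermQuery) semantics after the rewrite: the boost scales
# the final score instead.
boost_outside = boost * saturating_score(tf)
```

Because the score saturates in tf, scaling tf loses less than scaling the final score, so the two queries rank differently unless the boost is 1.0.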






[GitHub] [lucene-solr] tflobbe commented on pull request #2456: SOLR-15216 Fix for Invalid Reference to data.followers in Admin UI

2021-03-04 Thread GitBox


tflobbe commented on pull request #2456:
URL: https://github.com/apache/lucene-solr/pull/2456#issuecomment-791027598


   @deanpearce, do you want to add an entry to CHANGES.txt in the 8.9 section?









[jira] [Commented] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

2021-03-04 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295633#comment-17295633
 ] 

Robert Muir commented on LUCENE-9754:
-

This tokenizer splits on scripts because it lets you customize the tokenization 
per-script by design.

The reason is some writing systems need different approaches... or even 
choices. See LUCENE-7393 for a great example.

So it is just like the "notes" section says, and I will quote:

{quote}
Normally word breaking does not require breaking between different scripts. 
However, adding that capability may be useful in combination with other 
extensions of word segmentation.
{quote}

And that is what we do, for that exact reason. I am guessing it confuses you 
because it seems to break all kinds of "rules" (e.g. don't break between 
letters). 

If you want a simple state-machine based on those rules without fancy stuff, 
again I recommend using StandardTokenizer instead.

This tokenizer is quite different: it will run entirely different algorithms 
depending on the writing system, and you can customize that with rules and 
options (e.g. break Myanmar with the ICU word dictionary or with syllables).
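The splitting behavior discussed in this thread can be simulated with a deliberately simplified Python sketch. This is not the ICU tokenizer's actual algorithm; the crude script classifier and the carried-over "previous script" flag are illustrative assumptions. Letters carry a script, digits and spaces do not, and a boundary is emitted when two letters with different scripts meet, even across intervening digits and spaces.

```python
import unicodedata

def script_of(ch):
    # Crude classifier for the illustration: Katakana vs. Latin letters;
    # digits and spaces carry no script of their own.
    if ch.isdigit() or ch.isspace():
        return None
    return "Katakana" if unicodedata.name(ch, "").startswith("KATAKANA") else "Latin"

def tokenize(text):
    # The "previous script" flag deliberately survives spaces and digits,
    # so a later script change forces a boundary inside "14th"-like runs.
    tokens, cur, prev = [], "", None
    for ch in text:
        if ch.isspace():
            if cur:
                tokens.append(cur)
            cur = ""
            continue
        s = script_of(ch)
        if s is not None and prev is not None and s != prev:
            if cur:
                tokens.append(cur)
            cur = ""
        cur += ch
        if s is not None:
            prev = s
    if cur:
        tokens.append(cur)
    return tokens
```

Under this simulation, `tokenize("x 14th")` yields two tokens while `tokenize("ァ 14th")` yields three, matching the behavior reported in the issue.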


> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> --
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
>Reporter: Trey Jones
>Priority: Major
> Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> -If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.- _(This inconsistency was included as a side issue that I thought 
> might add more weight to the main problem I am concerned with, but it seems 
> to be more of a distraction. Chunking issues should perhaps be addressed in a 
> different ticket, so I'm striking it out.)_
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  






[jira] [Updated] (SOLR-15216) Invalid JS Object Key data.followers.currentData

2021-03-04 Thread Dean Pearce (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dean Pearce updated SOLR-15216:
---
Affects Version/s: 8.8
   8.8.1

> Invalid JS Object Key data.followers.currentData
> 
>
> Key: SOLR-15216
> URL: https://issues.apache.org/jira/browse/SOLR-15216
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Admin UI
>Affects Versions: 8.7, 8.8, 8.8.1
>Reporter: Dean Pearce
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Minor bug in the Admin UI Angular code: a line was changed to 
> `settings.currentTime = parseDateToEpoch(data.follower.currentDate);` but the 
> underlying API still refers to `data.slave`. I believe this is fixed in the 
> master stream, as the migration to the new leader/follower naming is 
> complete, but it is broken in 8.x (8.7 and onwards).






[jira] [Commented] (SOLR-15216) Invalid JS Object Key data.followers.currentData

2021-03-04 Thread Dean Pearce (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295628#comment-17295628
 ] 

Dean Pearce commented on SOLR-15216:


Opened a PR on GitHub with the minimum viable patch for the 8.x stream: 
https://github.com/apache/lucene-solr/pull/2456

> Invalid JS Object Key data.followers.currentData
> 
>
> Key: SOLR-15216
> URL: https://issues.apache.org/jira/browse/SOLR-15216
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Admin UI
>Affects Versions: 8.7
>Reporter: Dean Pearce
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Minor bug in the Admin UI Angular code: a line was changed to 
> `settings.currentTime = parseDateToEpoch(data.follower.currentDate);` but the 
> underlying API still refers to `data.slave`. I believe this is fixed in the 
> master stream, as the migration to the new leader/follower naming is 
> complete, but it is broken in 8.x (8.7 and onwards).






[GitHub] [lucene-solr] deanpearce opened a new pull request #2456: SOLR-15216 Fix for Invalid Reference to data.followers in Admin UI

2021-03-04 Thread GitBox


deanpearce opened a new pull request #2456:
URL: https://github.com/apache/lucene-solr/pull/2456


   
   # Description
   
   Minor bug in the Admin UI Angular code, introduced when switching to 
followers terminology: the underlying API for 8.x still refers to them as 
slave. In the master (9.x) branch this is resolved, as the full migration to 
the new terminology is complete, but any future 8.x builds would have this 
issue.
   
   This bug prevents the Replication UI for Legacy Replication from loading if 
polling is enabled and there has been a successful run.
   
   # Solution
   
   Changed to use the correct JavaScript attribute.
   
   # Tests
   
   Compiled and ran the UI against my development instance, verified that the 
UI loads correctly.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [X] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [X] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [X] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   









[jira] [Commented] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

2021-03-04 Thread Trey Jones (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295622#comment-17295622
 ] 

Trey Jones commented on LUCENE-9754:


I appreciate that this is frustrating, and I’m sorry that we seem to be 
frustrating each other. You seem to feel that I am not listening to what you 
have to say, which is no surprise, since I feel that you are not listening to 
what I have to say. Can we try again to meet somewhere in the middle?
{quote}That's because this tokenizer first divides on scripts
{quote}
I’m trying my best to hear what you are saying here. The current behavior is 
the result of the tokenizer splitting on scripts before splitting on spaces. 
This does in fact completely explain the output we see in the _p 3a π 3a_ 
example.

However, what the tokenizer _does_ and what the tokenizer is _supposed to do_ 
are not necessarily the same thing.

I read your comments as offering the Word Boundary Rules and related Notes from 
Annex 29 as justification for the tokenizer’s behavior. I read over them, and I 
don’t see a justification there. Rather, I see a specific concrete example of 
what _*not*_ to do—splitting _3a_—yet the tokenizer seems to do exactly that.

So, I do actually like your answer, but I don’t like the question that goes 
with it, which seems to be, “Why does the tokenizer do that?” *The question I’m 
trying to ask is, “Is this what the tokenizer _should_ do?”*

My opinion is obviously that this is not what it should do—but opinions can 
differ. My reading of the documentation you suggested is _also_ that this is 
not what the tokenizer should do. I’m willing to accept the possibility that I 
have read UAX29 and WB10 and the example given there incorrectly, but I’m going 
to need a little help seeing it.

Your previous comments have not provided the elucidation that I seek:
{quote}That's because this tokenizer first divides on scripts
{quote}
This explains why it behaves as it does, not why that is the desired behavior.
{quote}You can find more discussions on that in Notes section of 
[https://unicode.org/reports/tr29/#Word_Boundary_Rules]
{quote}
These rules and notes seem to contradict the behavior of the tokenizer.
{quote}I think this tokenizer works behind-the-scenes differently than you 
imagine
{quote}
I believe that I understand what it does—as you said, it divides on scripts—but 
that doesn’t explain why that is the right thing to do.
{quote}the rules you see don't mean what you might infer
{quote}
I infer that _3a,_ the example given in the rules, should not be split. If that 
is the wrong inference, please make some small attempt to explain _why,_ rather 
than implying that I don’t understand, or telling me _what_ the tokenizer does 
to get this behavior, which seems no less incorrect for being explainable.

I hope we can give this one more go and find a productive consensus on whether 
the current tokenizer behavior is correct, and if so, some insight into why.

Thanks for the time you've put into this discussion.

> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> --
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
>Reporter: Trey Jones
>Priority: Major
> Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> -If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.- _(This inconsistency was included as a side issue that I thought 
> might add more weight to the main problem I am concerned with, but it seems 
> to be more of a distraction. Chunking issues should perhaps be addressed in a 
> different ticket, so I'm striking it out.)_
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ 

[jira] [Created] (SOLR-15216) Invalid JS Object Key data.followers.currentData

2021-03-04 Thread Dean Pearce (Jira)
Dean Pearce created SOLR-15216:
--

 Summary: Invalid JS Object Key data.followers.currentData
 Key: SOLR-15216
 URL: https://issues.apache.org/jira/browse/SOLR-15216
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Admin UI
Affects Versions: 8.7
Reporter: Dean Pearce


Minor bug in the Admin UI Angular code: a line was changed to 
`settings.currentTime = parseDateToEpoch(data.follower.currentDate);` but the 
underlying API still refers to `data.slave`. I believe this is fixed on 
master, as the migration to the new leader/follower naming is complete there, 
but it is broken in 8.x (8.7 and onwards).
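A backwards-compatible fix on 8.x could fall back to the legacy field name when the new one is absent. The following is a hypothetical sketch only; the function name and shape are illustrative, not the actual Angular controller code:

```javascript
function currentEpoch(data, parseDateToEpoch) {
  // Prefer the new field name, fall back to the legacy one still used by the 8.x API
  var followerData = data.follower || data.slave;
  return parseDateToEpoch(followerData.currentDate);
}
```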






[jira] [Commented] (SOLR-2852) SolrJ doesn't need woodstox jar

2021-03-04 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295620#comment-17295620
 ] 

David Smiley commented on SOLR-2852:


Time to do this for 9.0 in at least SolrJ?

> SolrJ doesn't need woodstox jar
> ---
>
> Key: SOLR-2852
> URL: https://issues.apache.org/jira/browse/SOLR-2852
> Project: Solr
>  Issue Type: Improvement
>  Components: clients - java
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
>
> The /dist/solrj-lib/ directory contains wstx-asl-3.2.7.jar (Woodstox StAX 
> API).  SolrJ doesn't actually have any type of dependency on this library. 
> The maven build doesn't have it as a dependency and the tests pass.  Perhaps 
> Woodstox is faster than the JDK's StAX, I don't know, but I find that point 
> quite moot since SolrJ can use the efficient binary format.  Woodstox is not 
> a small library either, weighting in at 524KB, and of course if someone 
> actually wants to use it, they can.
> I propose woodstox be removed as a SolrJ dependency.  I am *not* proposing it 
> be removed as a Solr WAR dependency since it is actually required there due 
> to an obscure XSLT issue.






[jira] [Commented] (LUCENE-9687) Hunspell support improvements

2021-03-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295619#comment-17295619
 ] 

ASF subversion and git services commented on LUCENE-9687:
-

Commit 231e3afe0691e55403d297f99778736798726acb in lucene-solr's branch 
refs/heads/master from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=231e3af ]

LUCENE-9687: Hunspell suggestions: reduce work in the 
findSimilarDictionaryEntries loop (#2451)

The loop is called a lot of times, and some allocations and method calls can be 
spared

> Hunspell support improvements
> -
>
> Key: LUCENE-9687
> URL: https://issues.apache.org/jira/browse/LUCENE-9687
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Peter Gromov
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I'd like Lucene's Hunspell support to be on a par with the native C++ 
> Hunspell for spellchecking and suggestions, at least for some languages. So I 
> propose to:
> * support the affix rules necessary for English, German, French, Spanish and
> Russian dictionaries, possibly more languages later
> * mirror Hunspell's suggestion algorithm in Lucene
> * provide public APIs for spellchecking, suggestion, stemming, and 
> morphological data
> * check corpora for specific languages to find and fix 
> spellchecking/suggestion discrepancies between Lucene's implementation and 
> Hunspell/C++






[GitHub] [lucene-solr] rmuir merged pull request #2451: LUCENE-9687: Hunspell suggestions: reduce work in the findSimilarDictionaryEntries loop

2021-03-04 Thread GitBox


rmuir merged pull request #2451:
URL: https://github.com/apache/lucene-solr/pull/2451


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Created] (SOLR-15215) SolrJ: Remove needless Netty dependency

2021-03-04 Thread David Smiley (Jira)
David Smiley created SOLR-15215:
---

 Summary: SolrJ: Remove needless Netty dependency
 Key: SOLR-15215
 URL: https://issues.apache.org/jira/browse/SOLR-15215
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrJ
Reporter: David Smiley
Assignee: David Smiley


SolrJ depends on Netty transitively via ZooKeeper.  But ZooKeeper's Netty 
dependency should be considered optional -- you have to opt-in.

BTW it's only needed in Solr-core because of Hadoop/HDFS which ought to move to 
a contrib and take this dependency with it over there.
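For consumers affected by this today, excluding the transitive dependency is one workaround. A hypothetical Gradle sketch (the coordinates and version are illustrative only):

```groovy
dependencies {
    // ZooKeeper's Netty dependency is effectively optional for SolrJ's usage,
    // so it can be excluded from the transitive graph
    implementation('org.apache.solr:solr-solrj:8.8.1') {
        exclude group: 'io.netty'
    }
}
```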






[jira] [Updated] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

2021-03-04 Thread Trey Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trey Jones updated LUCENE-9754:
---
Description: 
The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
the character that comes before the preceding whitespace.

For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 14 
| th.

In general, in a letter-space-number-letter sequence, if the writing system 
before the space is the same as the writing system after the number, then you 
get two tokens. If the writing systems differ, you get three tokens.

-If the conditions are just right, the chunking that the ICU tokenizer does 
(trying to split on spaces to create <4k chunks) can create an artificial 
boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
unexpected split of the second token (_14th_). Because chunking changes can 
ripple through a long document, editing text or the effects of a character 
filter can cause changes in tokenization thousands of lines later in a 
document.- _(This inconsistency was included as a side issue that I thought 
might add more weight to the main problem I am concerned with, but it seems to 
be more of a distraction. Chunking issues should perhaps be addressed in a 
different ticket, so I'm striking it out.)_

My guess is that some "previous character set" flag is not reset at the space, 
and numbers are not in a character set, so _t_ is compared to _ァ_ and they are 
not the same—causing a token split at the character set change—but I'm not sure.

 

  was:
The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
the character that comes before the preceding whitespace.

For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 14 
| th.

In general, in a letter-space-number-letter sequence, if the writing system 
before the space is the same as the writing system after the number, then you 
get two tokens. If the writing systems differ, you get three tokens.

If the conditions are just right, the chunking that the ICU tokenizer does 
(trying to split on spaces to create <4k chunks) can create an artificial 
boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
unexpected split of the second token (_14th_). Because chunking changes can 
ripple through a long document, editing text or the effects of a character 
filter can cause changes in tokenization thousands of lines later in a document.

My guess is that some "previous character set" flag is not reset at the space, 
and numbers are not in a character set, so _t_ is compared to _ァ_ and they are 
not the same—causing a token split at the character set change—but I'm not sure.

 


> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> --
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
>Reporter: Trey Jones
>Priority: Major
> Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> -If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.- _(This inconsistency was included as a side issue that I thought 
> might add more weight to the main problem I am concerned with, but it seems 
> to be more of a distraction. Chunking issues should perhaps be addressed in a 
> different ticket, so I'm striking it out.)_
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  




[jira] [Commented] (SOLR-14788) Solr: The Next Big Thing

2021-03-04 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295595#comment-17295595
 ] 

Mark Robert Miller commented on SOLR-14788:
---

Ok, Mark's effort to conquer severe ADD in the face of a myriad of ongoing 
setbacks, lost work, team changes, relationship effects, and trials is complete. 
Whew, now I know why I don't attempt these things. That's the end of my phases. 
The milestone in December gave me some peace of mind; this gives me some 
conclusion. What is to come is not on my shoulders in anything but a normal and 
standard way. Maybe step away a bit first and let the overworked mind settle.

> Solr: The Next Big Thing
> 
>
> Key: SOLR-14788
> URL: https://issues.apache.org/jira/browse/SOLR-14788
> Project: Solr
>  Issue Type: Task
>Reporter: Mark Robert Miller
>Assignee: Mark Robert Miller
>Priority: Critical
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> h3. 
> [!https://www.unicode.org/consortium/aacimg/1F46E.png!|https://www.unicode.org/consortium/adopted-characters.html#b1F46E]{color:#00875a}*The
>  Policeman is {color:#de350b}NOW{color} {color:#de350b}OFF{color} 
> duty!*{color}
> {quote}_{color:#de350b}*When The Policeman is on duty, sit back, relax, and 
> have some fun. Try to make some progress. Don't stress too much about the 
> impact of your changes or maintaining stability and performance and 
> correctness so much. Until the end of phase 1, I've got your back. I have a 
> variety of tools and contraptions I have been building over the years and I 
> will continue training them on this branch. I will review your changes and 
> peer out across the land and course correct where needed. As Mike D will be 
> thinking, "Sounds like a bottleneck Mark." And indeed it will be to some 
> extent. Which is why once stage one is completed, I will flip The Policeman 
> to off duty. When off duty, I'm always* *occasionally*{color} *down for some 
> vigilante justice, but I won't be walking the beat, all that stuff about sit 
> back and relax goes out the window.*_
> {quote}
>  
> I have stolen this title from Ishan or Noble and Ishan.
> This issue is meant to capture the work of a small team that is forming to 
> push Solr and SolrCloud to the next phase.
> I have kicked off the work with an effort to create a very fast and solid 
> base. That work is not 100% done, but it's ready to join the fight.
> Tim Potter has started giving me a tremendous hand in finishing up. Ishan and 
> Noble have already contributed support and testing and have plans for 
> additional work to shore up some of our current shortcomings.
> Others have expressed an interest in helping and hopefully they will pop up 
> here as well.
> Let's organize and discuss our efforts here and in various sub issues.






[jira] [Commented] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295591#comment-17295591
 ] 

Thomas Wöckinger commented on SOLR-15213:
-

So, you want to save the RTG query by id of the child document?

More information can be found in 
[SOLR-15064|https://issues.apache.org/jira/browse/SOLR-15064].

> Add support for "merge" atomic update operation for child documents
> ---
>
> Key: SOLR-15213
> URL: https://issues.apache.org/jira/browse/SOLR-15213
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: James Ashbourne
>Priority: Major
> Attachments: SOLR-15213.patch
>
>
> Solr has "add", "set", "add-distinct" which work but all have their 
> limitations. Namely, there's currently no way to atomically update a document 
> where that document may or may not be present already by merging if it is 
> present and inserting if it isn't.
> i.e. in the scenario where we have a document with two nested children: 
>   
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "Doe", 
>  "_isParent":"false"}, 
> {
>  "id": "fish2", 
>  "type_s": "fish", 
>  "name_s": "Hans", 
>  "_isParent":"false"}]
> }{noformat}
>  
>  If we later want to update that child doc e.g.:
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "James", // new name
>  "_isParent":"false"}, 
> ]
> }{noformat}
>  
>  Existing operations:
>  - "add" - will add another nested doc with the same id leaving us with two 
> children with the same id.
>  - "set" - replaces the whole list of child docs with the single doc, we 
> could use this but would first have to fetch all the existing children.
>  - "add-distinct" - will reject the update based on the doc already being 
> present.
> I've got some changes (see patch) that add a new option "merge", which checks 
> based on the id and merges the new document with the old, with a fallback to 
> add if there is no id match.
>  
>  
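Under the proposed operation, an update like the following might merge the new child into the existing one by id. This is hypothetical syntax modeled on the existing atomic-update operations; the exact request shape depends on the patch:

```json
{"id": "ocean1",
 "fish": {"merge": [
   {"id": "fish1", "name_s": "James"}
 ]}}
```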






[jira] [Commented] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

2021-03-04 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295585#comment-17295585
 ] 

Robert Muir commented on LUCENE-9754:
-

I explained here what happens in the first comment, but you didn't like my 
answer. Please re-read my answer, especially "That's because this tokenizer 
first divides on scripts". 
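A toy model may help show why script-first division yields three tokens for _ァ 14th_ but two for _x 14th_. This is illustrative Python only, not the actual ICU rule-based break iterator, and the script detection below is a crude stand-in covering just the characters in the example:

```python
def script(ch):
    # Crude stand-in for Unicode script detection (illustrative only)
    if '\u30a0' <= ch <= '\u30ff':
        return 'Katakana'
    if ch.isascii() and ch.isalpha():
        return 'Latin'
    return None  # digits and spaces carry no script of their own

def toy_tokenize(text):
    # Pass 1: break the text wherever a letter's script differs from the
    # script of the last letter seen, even across spaces and digits.
    chunks, current, last = [], [], None
    for ch in text:
        s = script(ch)
        if s is not None and last is not None and s != last:
            chunks.append(''.join(current))
            current = []
        if s is not None:
            last = s
        current.append(ch)
    if current:
        chunks.append(''.join(current))
    # Pass 2: ordinary whitespace tokenization within each script chunk.
    return [tok for c in chunks for tok in c.split()]
```

In this model, the digits of _14_ inherit the Katakana context from _ァ_, so the script change lands between _14_ and _th_, splitting the token.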

> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> --
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
>Reporter: Trey Jones
>Priority: Major
> Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  






[GitHub] [lucene-solr] dsmiley opened a new pull request #2455: Make SolrInputField name optional

2021-03-04 Thread GitBox


dsmiley opened a new pull request #2455:
URL: https://github.com/apache/lucene-solr/pull/2455


   Prevents other bugs by failing fast
   
   Very minor change.






[jira] [Commented] (SOLR-14759) Separate the Lucene and Solr builds

2021-03-04 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295568#comment-17295568
 ] 

Dawid Weiss commented on SOLR-14759:


Made the documentation task work for Solr. It still emits invalid links (and link 
checkers fail), but at least it passes and generates something. I think the 
follow-up will have to happen after the split, when changes can be made to 
templates (removing direct links to Lucene changes, etc.).

> Separate the Lucene and Solr builds
> ---
>
> Key: SOLR-14759
> URL: https://issues.apache.org/jira/browse/SOLR-14759
> Project: Solr
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Jan Høydahl
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> While still in same git repo, separate the builds, so Lucene and Solr can be 
> built independently.
> The preparation step includes optional building of just Lucene from current 
> master (prior to any code removal):
> Current status of joint and separate builds:
>  * (/) joint build
> {code}
> gradlew assemble check
> {code}
>  * (/) Lucene-only
> {code}
> gradlew -Dskip.solr=true assemble check
> {code}
>  * (/) Solr-only (with documentation exclusions)
> {code}
> gradlew -Dskip.lucene=true assemble check -x test -x checkBrokenLinks -x 
> checkLocalJavadocLinksSite
> {code}






[jira] [Updated] (SOLR-14759) Separate the Lucene and Solr builds

2021-03-04 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated SOLR-14759:
---
Description: 
While still in same git repo, separate the builds, so Lucene and Solr can be 
built independently.

The preparation step includes optional building of just Lucene from current 
master (prior to any code removal):

Current status of joint and separate builds:
 * (/) joint build
{code}
gradlew assemble check
{code}
 * (/) Lucene-only
{code}
gradlew -Dskip.solr=true assemble check
{code}
 * (/) Solr-only (with documentation exclusions)
{code}
gradlew -Dskip.lucene=true assemble check -x test -x checkBrokenLinks -x 
checkLocalJavadocLinksSite
{code}

  was:
While still in same git repo, separate the builds, so Lucene and Solr can be 
built independently.

The preparation step includes optional building of just Lucene from current 
master (prior to any code removal):

Current status of joint and separate builds:
 * (/) joint build
{code}
gradlew assemble check
{code}
 * (/) Lucene-only
{code}
gradlew -Dskip.solr=true assemble check
{code}
 * (/) Solr-only (with documentation exclusions)
{code}
gradlew -Dskip.lucene=true assemble check -x test -x documentation -x 
checkBrokenLinks -x checkLocalJavadocLinksSite
{code}


> Separate the Lucene and Solr builds
> ---
>
> Key: SOLR-14759
> URL: https://issues.apache.org/jira/browse/SOLR-14759
> Project: Solr
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Jan Høydahl
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> While still in same git repo, separate the builds, so Lucene and Solr can be 
> built independently.
> The preparation step includes optional building of just Lucene from current 
> master (prior to any code removal):
> Current status of joint and separate builds:
>  * (/) joint build
> {code}
> gradlew assemble check
> {code}
>  * (/) Lucene-only
> {code}
> gradlew -Dskip.solr=true assemble check
> {code}
>  * (/) Solr-only (with documentation exclusions)
> {code}
> gradlew -Dskip.lucene=true assemble check -x test -x checkBrokenLinks -x 
> checkLocalJavadocLinksSite
> {code}






[jira] [Commented] (SOLR-15185) Improve "hash" QParser

2021-03-04 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295564#comment-17295564
 ] 

Joel Bernstein commented on SOLR-15185:
---

The *parallel* Streaming Expression is what is using this currently. 

> Improve "hash" QParser
> --
>
> Key: SOLR-15185
> URL: https://issues.apache.org/jira/browse/SOLR-15185
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> * Don't use Filter (to be removed)
> * Do use TwoPhaseIterator, not PostFilter
> * Don't pre-compute matching docs (wasteful)
> * Support more fields, and more field types
> * Faster hash on Strings (avoid Char conversion)
> * Stronger hash when using multiple fields
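Conceptually, the hash qparser lets N parallel workers each match a disjoint slice of a collection by hashing field values. A minimal Python sketch of the idea (illustrative only; Solr's actual hash function and field handling differ):

```python
def worker_matches(doc, fields, num_workers, worker_id):
    # Combine per-field hashes; each document then matches exactly one
    # worker's slice, so the slices partition the collection.
    h = 0
    for f in fields:
        h = h * 31 + hash(doc[f])
    return h % num_workers == worker_id
```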






[GitHub] [lucene-solr] magibney commented on pull request #2449: SOLR-15045: Execute local leader commit in parallel with distributed commits in DistributedZkUpdateProcessor

2021-03-04 Thread GitBox


magibney commented on pull request #2449:
URL: https://github.com/apache/lucene-solr/pull/2449#issuecomment-790865983


   fwiw, I think the gradle precommit is failing on a nocommit comment marking 
a question about why TOLEADER distrib commit errors aren't propagated back to 
the client ... not really an issue with this PR _per se_.






[jira] [Updated] (SOLR-15214) Provision an optional client configurable ZK port

2021-03-04 Thread Rohan Ganpatye (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohan Ganpatye updated SOLR-15214:
--
Description: 
Expose an optional parameter that defines the port to start the ZK server on.
 This might be particularly helpful for testing infrastructure, by giving the 
ability to specify what port the ZK server should use instead of relying on the 
current default assignment or having to manually edit zoo.cfg.

Note: The new optional ZK port parameter is applicable with `zkRun` when used 
to run embedded ZooKeeper with Solr.

  was:
Expose an optional parameter that defines the port to start the ZK server on.
This might be particularly helpful with testing infrastructure by having the 
ability to specify what port ZK server should use to initialize instead of the 
current default assignment or have to manually edit via zoo.cfg.


> Provision an optional client configurable ZK port
> -
>
> Key: SOLR-15214
> URL: https://issues.apache.org/jira/browse/SOLR-15214
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.8.1
>Reporter: Rohan Ganpatye
>Priority: Minor
>  Labels: ZooKeeper
>
> Expose an optional parameter that defines the port to start the ZK server on.
>  This might be particularly helpful for testing infrastructure, by giving the 
> ability to specify what port the ZK server should use instead of relying on 
> the current default assignment or having to manually edit zoo.cfg.
> Note: The new optional ZK port parameter is applicable with `zkRun` when used 
> to run embedded ZooKeeper with Solr.






[jira] [Created] (SOLR-15214) Provision an optional client configurable ZK port

2021-03-04 Thread Rohan Ganpatye (Jira)
Rohan Ganpatye created SOLR-15214:
-

 Summary: Provision an optional client configurable ZK port
 Key: SOLR-15214
 URL: https://issues.apache.org/jira/browse/SOLR-15214
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Affects Versions: 8.8.1
Reporter: Rohan Ganpatye


Expose an optional parameter that defines the port to start the ZK server on.
This might be particularly helpful for testing infrastructure, by giving the 
ability to specify what port the ZK server should use instead of relying on the 
current default assignment or having to manually edit zoo.cfg.






[GitHub] [lucene-solr] HoustonPutman opened a new pull request #2454: Test docker change, just to see if the docker github action works

2021-03-04 Thread GitBox


HoustonPutman opened a new pull request #2454:
URL: https://github.com/apache/lucene-solr/pull/2454


   Ignore this. Just testing the github action.






[GitHub] [lucene-solr] balmukundblr commented on a change in pull request #2345: Benchmark custom

2021-03-04 Thread GitBox


balmukundblr commented on a change in pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345#discussion_r587723043



##
File path: 
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, 
IOException {
 Path f = null;
 String name = null;
-synchronized (this) {
-  if (nextFile >= inputFiles.size()) {
-// exhausted files, start a new round, unless forever set to false.
-if (!forever) {
-  throw new NoMoreDataException();
-}
-nextFile = 0;
-iteration++;
-  }
-  f = inputFiles.get(nextFile++);
-  name = f.toRealPath() + "_" + iteration;
+int inputFilesSize = inputFiles.size();
+
+/*
+ * synchronized (this) {
+ * if (nextFile >= inputFiles.size()) { // exhausted files, start a new 
round, unless forever set to false.
+ * if (!forever) {
+ *throw new NoMoreDataException();
+ * }
+ * nextFile = 0;
+ * iteration++;
+ * }
+ * f = inputFiles.get(nextFile++);
+ * name = f.toRealPath() + "_" +iteration;
+ * }
+ */
+if (!threadIndexCreated) {
+  createThreadIndex();
+}
+
+int index = (int) Thread.currentThread().getId() % threadIndex.length;
+int fIndex = index + threadIndex[index] * threadIndex.length;
+threadIndex[index]++;

Review comment:
   Although getId() is controlled by the JVM, in our case all threadIndex 
entries are initialized at once, so in practice we observed a predictable 
sequence of thread ids. However, we understand your concern and have tweaked 
the code so that it is guaranteed to reach every possible int from 0 .. 
threadIndex.length. We achieved this by setting a unique thread name and 
parsing it to calculate the index value.
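An alternative that avoids thread-id assumptions entirely is a shared atomic counter handing out files round-robin. An illustrative Python sketch of that design (not the patch's approach; the original code is Java):

```python
import itertools

class RoundRobinFiles:
    """Hand out (file, iteration) pairs round-robin, independent of thread ids."""

    def __init__(self, files, forever=True):
        self.files = files
        self.forever = forever
        self.counter = itertools.count()  # next() on itertools.count is atomic in CPython

    def next_file(self):
        i = next(self.counter)
        if not self.forever and i >= len(self.files):
            raise RuntimeError("no more data")
        # iteration counts how many full passes over the file list have completed
        return self.files[i % len(self.files)], i // len(self.files)
```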








[jira] [Commented] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

2021-03-04 Thread Trey Jones (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295504#comment-17295504
 ] 

Trey Jones commented on LUCENE-9754:


Any chance there is an update/reply coming? I know everyone is very busy, but I 
very much appreciate the conversation so far, and I'd like to understand why 
the way _3a_ is tokenized following _p_ vs _π_ is the expected behavior, so—as 
I said—I can understand it and explain it to other people on my end. Thanks!

> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> --
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
>Reporter: Trey Jones
>Priority: Major
> Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  
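
As a self-contained illustration of the script values involved (this uses the JDK's Character.UnicodeScript data, not ICU itself, so it only approximates what the ICU tokenizer sees): digits report script COMMON, so a "previous script" comparison that skips over them would compare the character before the space against the letter after the digits, seeing KATAKANA vs. LATIN in _ァ 14th_ and splitting at the apparent script change.

```java
// Sketch using the JDK's script data (not ICU itself) to show why a stale
// "previous script" comparison can split "14th": digits carry script COMMON,
// so the comparison may reach back to the character before the space.
public class ScriptSketch {
  public static String scriptOf(String s) {
    return Character.UnicodeScript.of(s.codePointAt(0)).name();
  }

  public static void main(String[] args) {
    System.out.println(scriptOf("x")); // LATIN: matches "t", so "14th" stays one token
    System.out.println(scriptOf("ァ")); // KATAKANA: differs from "t", token splits
    System.out.println(scriptOf("1")); // COMMON: digits have no script of their own
  }
}
```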



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [lucene-solr] balmukundblr commented on a change in pull request #2345: Benchmark custom

2021-03-04 Thread GitBox


balmukundblr commented on a change in pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345#discussion_r587707919



##
File path: 
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##
@@ -146,4 +172,11 @@ public synchronized void resetInputs() throws IOException {
 nextFile = 0;
 iteration = 0;
   }
+
+  private synchronized void createThreadIndex() {
+if (!threadIndexCreated) {

Review comment:
   Sure, will do the required changes.








[GitHub] [lucene-solr] balmukundblr commented on a change in pull request #2345: Benchmark custom

2021-03-04 Thread GitBox


balmukundblr commented on a change in pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345#discussion_r587707735



##
File path: 
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, 
IOException {
 Path f = null;
 String name = null;
-synchronized (this) {
-  if (nextFile >= inputFiles.size()) {
-// exhausted files, start a new round, unless forever set to false.
-if (!forever) {
-  throw new NoMoreDataException();
-}
-nextFile = 0;
-iteration++;
-  }
-  f = inputFiles.get(nextFile++);
-  name = f.toRealPath() + "_" + iteration;
+int inputFilesSize = inputFiles.size();
+
+/*
+ * synchronized (this) {

Review comment:
   Sure, will delete the commented code.








[jira] [Resolved] (SOLR-15191) Faceting on EnumFieldType does not work if allBuckets, numBuckets or missing is set

2021-03-04 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley resolved SOLR-15191.
-
Resolution: Fixed

> Faceting on EnumFieldType does not work if allBuckets, numBuckets or missing 
> is set
> ---
>
> Key: SOLR-15191
> URL: https://issues.apache.org/jira/browse/SOLR-15191
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module, FacetComponent, faceting, search, 
> streaming expressions
>Affects Versions: 8.7, 8.8, 8.8.1
>Reporter: Thomas Wöckinger
>Assignee: David Smiley
>Priority: Major
>  Labels: easy-fix, pull-request-available
> Fix For: 8.9
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Due to SOLR-14514, FacetFieldProcessorByEnumTermsStream is not used if the 
> allBuckets, numBuckets or missing param is true.
> As a fallback, FacetFieldProcessorByHashDV is used, which calls 
> FacetRangeProcessor.getNumericCalc(sf) on the field. EnumFieldType is not 
> handled currently, so a SolrException is thrown with BAD_REQUEST and 
> 'Expected numeric field type'.






[GitHub] [lucene-solr] balmukundblr commented on a change in pull request #2345: Benchmark custom

2021-03-04 Thread GitBox


balmukundblr commented on a change in pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345#discussion_r587706698



##
File path: 
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, 
IOException {
 Path f = null;
 String name = null;
-synchronized (this) {
-  if (nextFile >= inputFiles.size()) {
-// exhausted files, start a new round, unless forever set to false.
-if (!forever) {
-  throw new NoMoreDataException();
-}
-nextFile = 0;
-iteration++;
-  }
-  f = inputFiles.get(nextFile++);
-  name = f.toRealPath() + "_" + iteration;
+int inputFilesSize = inputFiles.size();
+
+/*
+ * synchronized (this) {
+ * if (nextFile >= inputFiles.size()) { // exhausted files, start a new 
round, unless forever set to false.
+ * if (!forever) {
+ *throw new NoMoreDataException();
+ * }
+ * nextFile = 0;
+ * iteration++;
+ * }
+ * f = inputFiles.get(nextFile++);
+ * name = f.toRealPath() + "_" +iteration;
+ * }
+ */
+if (!threadIndexCreated) {

Review comment:
   Sure, will do the required changes.








[jira] [Commented] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295492#comment-17295492
 ] 

Greg Miller commented on LUCENE-9822:
-

Yeah, interesting. Looking at the code, we're packing the number of bits used 
per entry along with the number of patches in a single byte. Because we max out 
at 32 bits/entry, we can encode the number of bits/entry in 5 bits, leaving 3 
more for the number of patches. Seems like an interesting experiment to bring 
in one more byte for encoding the number of patches, significantly raising the 
ceiling on how many entries we can patch in. Just a quick thought from looking 
at the code, but I'll see if I can dig into the literature a little.
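
A rough sketch of the single-byte header layout described above (illustrative only, not the actual PForUtil code; it assumes bits-per-value is stored in the range 0..31 and the patch count in 0..7):

```java
// Illustrative header packing (not PForUtil itself): the low 5 bits hold
// bits-per-value (0..31), the high 3 bits hold the number of patched
// exceptions (0..7). Raising the patch ceiling would need a second byte.
public class HeaderSketch {
  public static int pack(int bitsPerValue, int numExceptions) {
    if (bitsPerValue > 31 || numExceptions > 7) {
      throw new IllegalArgumentException("does not fit in one byte");
    }
    return (numExceptions << 5) | bitsPerValue;
  }

  public static int bitsPerValue(int header) {
    return header & 0x1F; // low 5 bits
  }

  public static int numExceptions(int header) {
    return header >>> 5; // high 3 bits
  }
}
```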

> Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil
> --
>
> Key: LUCENE-9822
> URL: https://issues.apache.org/jira/browse/LUCENE-9822
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: master (9.0)
>Reporter: Greg Miller
>Priority: Trivial
> Attachments: LUCENE-9822.patch
>
>
> PForUtil assumes that ForUtil.BLOCK_SIZE can be encoded in a single byte when 
> generating "patch offsets". If this assumption doesn't hold, PForUtil will 
> silently encode incorrect positions. While the BLOCK_SIZE isn't particularly 
> configurable, it would be nice to assert this assumption early in PForUtil in 
> the event that the BLOCK_SIZE changes in some future codec version.






[GitHub] [lucene-solr] dsmiley commented on pull request #2438: SOLR-14928: add exponential backoff for distributed cluster state updates

2021-03-04 Thread GitBox


dsmiley commented on pull request #2438:
URL: https://github.com/apache/lucene-solr/pull/2438#issuecomment-790820910


   I review what I commit locally in my IDE tooling.  I think IntelliJ does a 
nice job of this.






[jira] [Commented] (SOLR-15185) Improve "hash" QParser

2021-03-04 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295490#comment-17295490
 ] 

David Smiley commented on SOLR-15185:
-

Okay.  Given that most users of this are indirect users via streaming 
expressions (I presume), can you recommend how I might say that... i.e. _what_ 
part/expression is affected here?  Such users would not even know this 
optimization affects them otherwise.
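
For context, the "stronger hash when using multiple fields" bullet in the issue description can be sketched like this (an illustrative example, not Solr's actual implementation): fold per-field hashes together with a polynomial step rather than XOR, so equal values in different fields do not cancel out.

```java
// Illustrative sketch (not Solr's implementation): a polynomial combination
// of per-field hashes keeps field order significant, unlike plain XOR, where
// two fields holding the same value would cancel to zero.
public class CombinedHashSketch {
  public static long combine(long[] fieldHashes) {
    long h = 0;
    for (long fh : fieldHashes) {
      h = h * 31 + fh; // multiplier makes position matter
    }
    return h;
  }
}
```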

> Improve "hash" QParser
> --
>
> Key: SOLR-15185
> URL: https://issues.apache.org/jira/browse/SOLR-15185
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> * Don't use Filter (to be removed)
> * Do use TwoPhaseIterator, not PostFilter
> * Don't pre-compute matching docs (wasteful)
> * Support more fields, and more field types
> * Faster hash on Strings (avoid Char conversion)
> * Stronger hash when using multiple fields






[jira] [Commented] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295480#comment-17295480
 ] 

Adrien Grand commented on LUCENE-9822:
--

I think that the number 3 came from me looking at query throughput vs. size of 
the .doc/.pos files for our Wikipedia dataset and figuring out the best 
trade-off.

> Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil
> --
>
> Key: LUCENE-9822
> URL: https://issues.apache.org/jira/browse/LUCENE-9822
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: master (9.0)
>Reporter: Greg Miller
>Priority: Trivial
> Attachments: LUCENE-9822.patch
>
>
> PForUtil assumes that ForUtil.BLOCK_SIZE can be encoded in a single byte when 
> generating "patch offsets". If this assumption doesn't hold, PForUtil will 
> silently encode incorrect positions. While the BLOCK_SIZE isn't particularly 
> configurable, it would be nice to assert this assumption early in PForUtil in 
> the event that the BLOCK_SIZE changes in some future codec version.






[GitHub] [lucene-solr] sigram opened a new pull request #2453: SOLR-15210: ParallelStream should execute hashing & filtering directly in ExportWriter

2021-03-04 Thread GitBox


sigram opened a new pull request #2453:
URL: https://github.com/apache/lucene-solr/pull/2453


   See Jira for details.






[jira] [Commented] (SOLR-15185) Improve "hash" QParser

2021-03-04 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295448#comment-17295448
 ] 

Joel Bernstein commented on SOLR-15185:
---

I don't think we need to bother unless you feel like including it under 
optimizations.

> Improve "hash" QParser
> --
>
> Key: SOLR-15185
> URL: https://issues.apache.org/jira/browse/SOLR-15185
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> * Don't use Filter (to be removed)
> * Do use TwoPhaseIterator, not PostFilter
> * Don't pre-compute matching docs (wasteful)
> * Support more fields, and more field types
> * Faster hash on Strings (avoid Char conversion)
> * Stronger hash when using multiple fields






[jira] [Commented] (SOLR-15185) Improve "hash" QParser

2021-03-04 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295407#comment-17295407
 ] 

David Smiley commented on SOLR-15185:
-

[~jbernste] can you please recommend CHANGES.txt and/or ref guide upgrade notes 
pertaining to the hash changing?  Or maybe don't bother if nobody would care?  
RE the perf change; I'll just be vague to say it's more efficient.

> Improve "hash" QParser
> --
>
> Key: SOLR-15185
> URL: https://issues.apache.org/jira/browse/SOLR-15185
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> * Don't use Filter (to be removed)
> * Do use TwoPhaseIterator, not PostFilter
> * Don't pre-compute matching docs (wasteful)
> * Support more fields, and more field types
> * Faster hash on Strings (avoid Char conversion)
> * Stronger hash when using multiple fields






[jira] [Commented] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295404#comment-17295404
 ] 

Michael McCandless commented on LUCENE-9822:


+1, no unit test needed for one-line {{assert}} addition.  Thanks [~gsmiller]!
{quote}But if you are trying to do something like blocksize=512, seems like you 
would need to allow for more exceptions (e.g. 12 or something) for the patching 
to be effective for general purposes. Maybe worth checking literature as I 
don't know off the top of my head where these numbers (128, 3) etc came from.
{quote}
+1 – seems (naively) like the number of exceptions should probably grow 
linearly?  We could probably make some crazy offline tool that gathers all the 
ints we are encoding into a given index and then measures what compression we 
could achieve with different numbers of patched exceptions.
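
A toy version of such a measurement (a sketch under the simplifying assumption that cost is dominated by the bits-per-value left after patching out the k largest values; not a real offline tool):

```java
import java.util.Arrays;

// Toy measurement sketch: for one block of ints, how many bits per value does
// a plain FOR encoding need, vs. FOR where the k largest values are patched
// out as exceptions?
public class PatchCostSketch {
  static int bitsRequired(long v) {
    return 64 - Long.numberOfLeadingZeros(v | 1); // at least 1 bit
  }

  public static int bitsWithPatches(long[] block, int k) {
    long[] copy = block.clone();
    Arrays.sort(copy);
    // After patching the k largest values, the remaining max drives the width.
    return bitsRequired(copy[copy.length - 1 - k]);
  }
}
```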

> Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil
> --
>
> Key: LUCENE-9822
> URL: https://issues.apache.org/jira/browse/LUCENE-9822
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: master (9.0)
>Reporter: Greg Miller
>Priority: Trivial
> Attachments: LUCENE-9822.patch
>
>
> PForUtil assumes that ForUtil.BLOCK_SIZE can be encoded in a single byte when 
> generating "patch offsets". If this assumption doesn't hold, PForUtil will 
> silently encode incorrect positions. While the BLOCK_SIZE isn't particularly 
> configurable, it would be nice to assert this assumption early in PForUtil in 
> the event that the BLOCK_SIZE changes in some future codec version.






[jira] [Commented] (LUCENE-3320) Explore Proximity Scoring

2021-03-04 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-3320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295393#comment-17295393
 ] 

Tomoko Uchida commented on LUCENE-3320:
---

Thanks [~mikemccand] for the pointer!

I think this will bring great improvement especially for long queries or 
natural language queries. I'd need proximity scoring for a project I'm 
currently working on... will give it a try.

> Explore Proximity Scoring 
> --
>
> Key: LUCENE-3320
> URL: https://issues.apache.org/jira/browse/LUCENE-3320
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: core/search
>Affects Versions: Positions Branch
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: Positions Branch
>
>
> Positions will be first-class citizens sooner rather than later. We should 
> explore proximity scoring possibilities as well as collection / scoring 
> algorithms like those proposed in LUCENE-2878 (2-phase collection)
> This paper might provide some basis for actual scoring implementation: 
> http://plg.uwaterloo.ca/~claclark/sigir2006_term_proximity.pdf






[jira] [Updated] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread James Ashbourne (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Ashbourne updated SOLR-15213:
---
Description: 
Solr has "add", "set", "add-distinct" which work but all have their 
limitations. Namely, there's currently no way to atomically update a document 
where that document may or may not be present already by merging if it is 
present and inserting if it isn't.

i.e. in the scenario where we have a document with two nested children: 
  
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "Doe", 
 "_isParent":"false"}, 
{
 "id": "fish2", 
 "type_s": "fish", 
 "name_s": "Hans", 
 "_isParent":"false"}]
}{noformat}
 
 If we later want to update that child doc e.g.:
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "James", // new name
 "_isParent":"false"}, 
]
}{noformat}
 
 Existing operations:
 - "add" - will add another nested doc with the same id leaving us with two 
children with the same id.
 - "set" - replaces the whole list of child docs with the single doc, we could 
use this but would first have to fetch all the existing children.
 - "add-distinct" - will reject the update based on the doc already being 
present.

I've got some changes (see patch) that add a new option, "merge", which matches 
based on the id and merges the new document into the old, falling back to add if 
there is no id match.
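
The proposed merge-or-add behavior can be sketched like this (illustrative only; it operates on plain maps rather than SolrInputDocuments, and the class and method names are invented for the example):

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the proposed "merge" semantics (not the patch
// itself): if a child with the same id already exists, overlay the incoming
// fields onto it; otherwise fall back to adding the incoming child.
public class ChildMergeSketch {
  public static void mergeChild(List<Map<String, Object>> children,
                                Map<String, Object> incoming) {
    for (Map<String, Object> child : children) {
      if (child.get("id").equals(incoming.get("id"))) {
        child.putAll(incoming); // merge: incoming field values win
        return;
      }
    }
    children.add(incoming); // no id match: behave like "add"
  }
}
```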

 

 

  was:
Solr has "add", "set", "add-distinct" which work but all have their 
limitations. Namely, there's currently no way to atomically update a document 
where that document may or may not be present already and merge if it is 
present.

i.e. in the scenario where we have a document with two nested children: 
  
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "Doe", 
 "_isParent":"false"}, 
{
 "id": "fish2", 
 "type_s": "fish", 
 "name_s": "Hans", 
 "_isParent":"false"}]
}{noformat}
 
 If we later want to update that child doc e.g.:
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "James", // new name
 "_isParent":"false"}, 
]
}{noformat}
 
 Existing operations:
 - "add" - will add another nested doc with the same id leaving us with two 
children with the same id.
 - "set" - replaces the whole list of child docs with the single doc, we could 
use this but would first have to fetch all the existing children.
 - "add-distinct" - will reject the update based on the doc already being 
present.

I've got some changes (see patch) that a new option "merge" which checks based 
on the id and merges the new document with the old with a fall back to add if 
there is no id match.

 

 


> Add support for "merge" atomic update operation for child documents
> ---
>
> Key: SOLR-15213
> URL: https://issues.apache.org/jira/browse/SOLR-15213
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: James Ashbourne
>Priority: Major
> Attachments: SOLR-15213.patch
>
>
> Solr has "add", "set", "add-distinct" which work but all have their 
> limitations. Namely, there's currently no way to atomically update a document 
> where that document may or may not be present already by merging if it is 
> present and inserting if it isn't.
> i.e. in the scenario where we have a document with two nested children: 
>   
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "Doe", 
>  "_isParent":"false"}, 
> {
>  "id": "fish2", 
>  "type_s": "fish", 
>  "name_s": "Hans", 
>  "_isParent":"false"}]
> }{noformat}
>  
>  If we later want to update that child doc e.g.:
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "James", // new name
>  "_isParent":"false"}, 
> ]
> }{noformat}
>  
>  Existing operations:
>  - "add" - will add another nested doc with the same id leaving us with two 
> children with the same id.
>  - "set" - replaces the whole list of child docs with the single doc, we 
> could use this but would first have to fetch all the existing children.
>  - "add-distinct" - will reject the update based on the doc already being 
> present.
> I've got some changes (see patch) that add a new option, "merge", which 
> matches based on the id and merges the new document into the old, falling 
> back to add if there is no id match.
>  
>  




[jira] [Comment Edited] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread Endika Posadas (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295388#comment-17295388
 ] 

Endika Posadas edited comment on SOLR-15213 at 3/4/21, 4:24 PM:


The problem with directly updating the child document using '_route_' is that 
there's no "upsert" mechanism. If the parent doesn't contain a child with the 
same id then it will fail the request. e.g.:   
{noformat}
"msg":"Did not find child ID fish1 in parent ocean1"
{noformat}
. 

By allowing a merge mechanism a child can either be inserted or updated with a 
single request.


was (Author: enpos):
The problem with directly updating the child document using '_route_' is that 
there's no "upsert" mechanism. If the parent doesn't contain a child with the 
same id then it will fail the request. e.g.:   
{noformat}
"msg":"Did not find child ID fish1 in parent ocean1"
{noformat}
. 

By allowing a merge mechanism a child can either be inserted or updated.

> Add support for "merge" atomic update operation for child documents
> ---
>
> Key: SOLR-15213
> URL: https://issues.apache.org/jira/browse/SOLR-15213
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: James Ashbourne
>Priority: Major
> Attachments: SOLR-15213.patch
>
>
> Solr has "add", "set", "add-distinct" which work but all have their 
> limitations. Namely, there's currently no way to atomically update a document 
> where that document may or may not be present already and merge if it is 
> present.
> i.e. in the scenario where we have a document with two nested children: 
>   
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "Doe", 
>  "_isParent":"false"}, 
> {
>  "id": "fish2", 
>  "type_s": "fish", 
>  "name_s": "Hans", 
>  "_isParent":"false"}]
> }{noformat}
>  
>  If we later want to update that child doc e.g.:
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "James", // new name
>  "_isParent":"false"}, 
> ]
> }{noformat}
>  
>  Existing operations:
>  - "add" - will add another nested doc with the same id leaving us with two 
> children with the same id.
>  - "set" - replaces the whole list of child docs with the single doc, we 
> could use this but would first have to fetch all the existing children.
>  - "add-distinct" - will reject the update based on the doc already being 
> present.
> I've got some changes (see patch) that add a new option, "merge", which 
> matches based on the id and merges the new document into the old, falling 
> back to add if there is no id match.
>  
>  






[jira] [Comment Edited] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread Endika Posadas (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295388#comment-17295388
 ] 

Endika Posadas edited comment on SOLR-15213 at 3/4/21, 4:23 PM:


The problem with directly updating the child document using '_route_' is that 
there's no "upsert" mechanism. If the parent doesn't contain a child with the 
same id then it will fail the request. e.g.:   
{noformat}
"msg":"Did not find child ID fish1 in parent ocean1"
{noformat}
. 

By allowing a merge mechanism a child can either be inserted or updated.


was (Author: enpos):
The problem with directly updating the child document using '_route_' is that 
there's no "upsert" mechanism. If the parent doesn't contain a child with the 
same id then it will fail the request. e.g.:   `"msg":"Did not find child ID 
fish1 in parent ocean1"`. 

By allowing a merge mechanism a child can either be inserted or updated.

> Add support for "merge" atomic update operation for child documents
> ---
>
> Key: SOLR-15213
> URL: https://issues.apache.org/jira/browse/SOLR-15213
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: James Ashbourne
>Priority: Major
> Attachments: SOLR-15213.patch
>
>
> Solr has "add", "set", "add-distinct" which work but all have their 
> limitations. Namely, there's currently no way to atomically update a document 
> where that document may or may not be present already and merge if it is 
> present.
> i.e. in the scenario where we have a document with two nested children: 
>   
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "Doe", 
>  "_isParent":"false"}, 
> {
>  "id": "fish2", 
>  "type_s": "fish", 
>  "name_s": "Hans", 
>  "_isParent":"false"}]
> }{noformat}
>  
>  If we later want to update that child doc e.g.:
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "James", // new name
>  "_isParent":"false"}, 
> ]
> }{noformat}
>  
>  Existing operations:
>  - "add" - will add another nested doc with the same id leaving us with two 
> children with the same id.
>  - "set" - replaces the whole list of child docs with the single doc, we 
> could use this but would first have to fetch all the existing children.
>  - "add-distinct" - will reject the update based on the doc already being 
> present.
> I've got some changes (see patch) that add a new option, "merge", which 
> matches based on the id and merges the new document into the old, falling 
> back to add if there is no id match.
>  
>  






[jira] [Commented] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread Endika Posadas (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295388#comment-17295388
 ] 

Endika Posadas commented on SOLR-15213:
---

The problem with directly updating the child document using '_route_' is that 
there's no "upsert" mechanism. If the parent doesn't contain a child with the 
same id then it will fail the request. e.g.:   `"msg":"Did not find child ID 
fish1 in parent ocean1"`. 

By allowing a merge mechanism a child can either be inserted or updated.

> Add support for "merge" atomic update operation for child documents
> ---
>
> Key: SOLR-15213
> URL: https://issues.apache.org/jira/browse/SOLR-15213
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: James Ashbourne
>Priority: Major
> Attachments: SOLR-15213.patch
>
>
> Solr has "add", "set", "add-distinct" which work but all have their 
> limitations. Namely, there's currently no way to atomically update a document 
> where that document may or may not be present already and merge if it is 
> present.
> i.e. in the scenario where we have a document with two nested children: 
>   
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "Doe", 
>  "_isParent":"false"}, 
> {
>  "id": "fish2", 
>  "type_s": "fish", 
>  "name_s": "Hans", 
>  "_isParent":"false"}]
> }{noformat}
>  
>  If we later want to update that child doc e.g.:
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "James", // new name
>  "_isParent":"false"}, 
> ]
> }{noformat}
>  
>  Existing operations:
>  - "add" - will add another nested doc with the same id leaving us with two 
> children with the same id.
>  - "set" - replaces the whole list of child docs with the single doc, we 
> could use this but would first have to fetch all the existing children.
>  - "add-distinct" - will reject the update based on the doc already being 
> present.
> I've got some changes (see patch) that add a new option "merge", which checks 
> based on the id and merges the new document with the old, with a fallback to 
> add if there is no id match.
>  
>  






[jira] [Commented] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread James Ashbourne (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295383#comment-17295383
 ] 

James Ashbourne commented on SOLR-15213:


[~thomas.woeckinger] you're right that '_root_' works for updating if you know 
the child already exists, but in some cases you don't know whether you have 
already added that child. "merge" would be an update if present, or an insert 
if not.







[jira] [Commented] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295367#comment-17295367
 ] 

Robert Muir commented on LUCENE-9822:
-

Looks good. The single-byte assumption reminds me, though: with such huge 
block sizes, the patching may not even work very well without changing how the 
class works completely. Currently it allows 3 exceptions for blocks of 128, so 
that 3 large values don't blow up compression for the whole block.

But if you are trying to do something like blocksize=512, it seems like you 
would need to allow more exceptions (e.g. 12 or so) for the patching to be 
effective for general purposes. It may be worth checking the literature, as I 
don't know off the top of my head where these numbers (128, 3, etc.) came from.
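As a toy illustration of the trade-off (hypothetical numbers and names, not Lucene's actual PForUtil encoding): the packed bit width for a block is chosen after setting aside the largest `numExceptions` values, so a handful of outliers no longer inflates the width of every value:

```java
import java.util.Arrays;

class PforWidthSketch {
    // Bits needed to represent v (at least 1 bit, even for zero).
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v | 1L);
    }

    // Base bit width for the packed part of the block, ignoring the
    // numExceptions largest values, which would be stored separately.
    static int baseBitWidth(long[] block, int numExceptions) {
        long[] sorted = block.clone();
        Arrays.sort(sorted);
        return bitsRequired(sorted[sorted.length - 1 - numExceptions]);
    }
}
```

For a 128-value block of small values with three large outliers, allowing 3 exceptions drops the base width from 10 bits to 3 bits; the open question above is how many exceptions a 512-value block would need for the same effect.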

> Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil
> --
>
> Key: LUCENE-9822
> URL: https://issues.apache.org/jira/browse/LUCENE-9822
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: master (9.0)
>Reporter: Greg Miller
>Priority: Trivial
> Attachments: LUCENE-9822.patch
>
>
> PForUtil assumes that ForUtil.BLOCK_SIZE can be encoded in a single byte when 
> generating "patch offsets". If this assumption doesn't hold, PForUtil will 
> silently encode incorrect positions. While the BLOCK_SIZE isn't particularly 
> configurable, it would be nice to assert this assumption early in PForUtil in 
> the event that the BLOCK_SIZE changes in some future codec version.






[jira] [Comment Edited] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295297#comment-17295297
 ] 

Greg Miller edited comment on LUCENE-9822 at 3/4/21, 3:40 PM:
--

I think this is just a one-liner in the PForUtil ctor. Patch uploaded. I 
verified this works on a local branch I have set up for 512 block sizes. I 
can't think of a good way to add unit testing around this, though, since the 
BLOCK_SIZE definition is static/final.


was (Author: gsmiller):
I think this is just a one-liner in the PForUtil ctor. Patch uploaded.
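The guard might look something like the following (our sketch, not the attached LUCENE-9822.patch; `BLOCK_SIZE` here is a stand-in for `ForUtil.BLOCK_SIZE`):

```java
class PForUtilSketch {
    static final int BLOCK_SIZE = 128; // stand-in for ForUtil.BLOCK_SIZE

    PForUtilSketch() {
        // Patch offsets are written as single unsigned bytes, so every
        // position within a block (0..BLOCK_SIZE-1) must fit in 0..255.
        if (BLOCK_SIZE > 256) {
            throw new AssertionError("BLOCK_SIZE must fit in one byte: " + BLOCK_SIZE);
        }
    }
}
```

The check costs nothing at runtime for the current block size and fails fast if a future codec ever raises it past what a single-byte offset can address.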







[jira] [Comment Edited] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295297#comment-17295297
 ] 

Greg Miller edited comment on LUCENE-9822 at 3/4/21, 3:39 PM:
--

I think this is just a one-liner in the PForUtil ctor. Patch uploaded.


was (Author: gsmiller):
I think this is just a one-liner in the PForUtil ctor. I'll attach a patch 
shortly.







[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-03-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295358#comment-17295358
 ] 

Bruno Roustant commented on SOLR-15038:
---

I reverted this specific line in both master and branch_8x.

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
> Fix For: 8.9
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with the Query Elevation component recently and were 
> missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: only show the representative if the 
> elevated documents have the same collapse field value.
> Because of this, we've added the two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  






[jira] [Updated] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-9822:

Attachment: LUCENE-9822.patch







[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-03-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295354#comment-17295354
 ] 

ASF subversion and git services commented on SOLR-15038:


Commit e791fb00a9452081d417d43fb7713d95ca73663b in lucene-solr's branch 
refs/heads/branch_8x from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e791fb0 ]

SOLR-15038: Restore read-only permission in security.policy








[jira] [Comment Edited] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295323#comment-17295323
 ] 

Thomas Wöckinger edited comment on SOLR-15213 at 3/4/21, 3:04 PM:
--

You simply update only the child document, and you should use '\_root\_' field 
to get the right shard, your scenario is already working.


was (Author: thomas.woeckinger):
You simply update only the child document, and you should use '_root_' field to 
get the right shard, your scenario is already working.







[jira] [Commented] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295323#comment-17295323
 ] 

Thomas Wöckinger commented on SOLR-15213:
-

You simply update only the child document, and you should use '_root_' field to 
get the right shard, your scenario is already working.







[jira] [Commented] (SOLR-13071) Add JWT Auth support in bin/solr

2021-03-04 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295322#comment-17295322
 ] 

David Eric Pugh commented on SOLR-13071:


I've heard of some CLIs that actually pop open a browser window to do the 
authentication and then, I think by running a local webserver, capture the 
redirect, which lets you get the authorization code and use that to get the 
access_token.

Having said that, I haven't found an example written in Java of a CLI doing 
this, and I'm not sure that I could grok how to do that from scratch.

So what you are suggesting is that we just document how to get the access_token, 
and then have someone put that in a file that is read? That seems easier, and/or 
a good first step.
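For what it's worth, the browser-plus-local-webserver flow can be sketched with the JDK's built-in `com.sun.net.httpserver.HttpServer` (everything below, the port, the `/callback` path, the class name, is an assumption for illustration and not from any Solr patch). The CLI would open the IdP's authorize URL in a browser with `redirect_uri` pointing at the loopback listener, then exchange the captured authorization code for the access_token:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.CompletableFuture;

class LoopbackSketch {
    // Pull the "code" query parameter out of the redirect URI's query string.
    static String extractCode(String query) {
        if (query == null) return null;
        for (String pair : query.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals("code")) {
                return kv[1];
            }
        }
        return null;
    }

    // Start a one-shot loopback listener that resolves with the auth code.
    static CompletableFuture<String> waitForCode(int port) throws Exception {
        CompletableFuture<String> code = new CompletableFuture<>();
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/callback", exchange -> {
            code.complete(extractCode(exchange.getRequestURI().getQuery()));
            byte[] body = "You may close this window.".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
            server.stop(0); // one-shot: shut down after capturing the code
        });
        server.start();
        return code;
    }
}
```

The documentation-only approach (user pastes a token into a file) could then be the first step, with this flow layered on later.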

> Add JWT Auth support in bin/solr
> 
>
> Key: SOLR-13071
> URL: https://issues.apache.org/jira/browse/SOLR-13071
> Project: Solr
>  Issue Type: Improvement
>  Components: scripts and tools
>Reporter: Jan Høydahl
>Priority: Major
>
> Once SOLR-12121 gets in, we should add support to {{bin/solr}} start scripts 
> so they can authenticate with Solr using a JWT token. A preferred way would 
> perhaps be through {{solr.in.sh}} and add new
> {noformat}
> SOLR_AUTH_TYPE=token
> SOLR_AUTHENTICATION_OPTS=-DjwtToken=
> {noformat}
> A disadvantage with this method is that the user needs to know how to obtain 
> the token, and the token needs to be long-lived. A more sophisticated way 
> would be a {{bin/solr auth login}} command that opens a browser window with 
> the IDP login screen and saves the short-lived access token and optionally 
> refresh token, in the file system.






[jira] [Updated] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread James Ashbourne (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Ashbourne updated SOLR-15213:
---
Attachment: SOLR-15213.patch







[jira] [Updated] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread James Ashbourne (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Ashbourne updated SOLR-15213:
---
Attachment: (was: solr-merge.patch)







[jira] [Commented] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295297#comment-17295297
 ] 

Greg Miller commented on LUCENE-9822:
-

I think this is just a one-liner in the PForUtil ctor. I'll attach a patch 
shortly.







[jira] [Updated] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread James Ashbourne (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Ashbourne updated SOLR-15213:
---
Description: 
Solr has "add", "set", "add-distinct" which work but all have their 
limitations. Namely, there's currently no way to atomically update a document 
where that document may or may not be present already and merge if it is 
present.

i.e. in the scenario where we have a document with two nested children: 
  
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "Doe", 
 "_isParent":"false"}, 
{
 "id": "fish2", 
 "type_s": "fish", 
 "name_s": "Hans", 
 "_isParent":"false"}]
}{noformat}
 
 If we later want to update that child doc e.g.:
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "James", // new name
 "_isParent":"false"}, 
]
}{noformat}
 
 Existing operations:
 - "add" - will add another nested doc with the same id leaving us with two 
children with the same id.
 - "set" - replaces the whole list of child docs with the single doc, we could 
use this but would first have to fetch all the existing children.
 - "add-distinct" - will reject the update based on the doc already being 
present.

I've got some changes (see patch) that add a new option "merge", which checks 
based on the id and merges the new document with the old, with a fallback to add 
if there is no id match.

 

 

  was:
Solr has "add", "set", "add-distinct" which work but all have their 
limitations. Namely, there's currently no way to atomically update a document 
where that document may or may not be present already with merging if it is 
present.

i.e. in the scenario where we have a document with two nested children: 
  
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "Doe", 
 "_isParent":"false"}, 
{
 "id": "fish2", 
 "type_s": "fish", 
 "name_s": "Hans", 
 "_isParent":"false"}]
}{noformat}
 
 If we later want to update that child doc e.g.:
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "James", // new name
 "_isParent":"false"}, 
]
}{noformat}
 
 Existing operations:
 - "add" - will add another nested doc with the same id leaving us with two 
children with the same id.
 - "set" - replaces the whole list of child docs with the single doc, we could 
use this but would first have to fetch all the existing children.
 - "add-distinct" - will reject the update based on the doc already being 
present.

I've got some changes (see patch) that a new option "merge" which checks based 
on the id and merges the new document with the old with a fall back to add if 
there is no id match.

 

 







[jira] [Created] (LUCENE-9822) Assert that ForUtil.BLOCK_SIZE can be encoded in a single byte in PForUtil

2021-03-04 Thread Greg Miller (Jira)
Greg Miller created LUCENE-9822:
---

 Summary: Assert that ForUtil.BLOCK_SIZE can be encoded in a single 
byte in PForUtil
 Key: LUCENE-9822
 URL: https://issues.apache.org/jira/browse/LUCENE-9822
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Affects Versions: master (9.0)
Reporter: Greg Miller


PForUtil assumes that ForUtil.BLOCK_SIZE can be encoded in a single byte when 
generating "patch offsets". If this assumption doesn't hold, PForUtil will 
silently encode incorrect positions. While the BLOCK_SIZE isn't particularly 
configurable, it would be nice to assert this assumption early in PForUtil in 
the event that the BLOCK_SIZE changes in some future codec version.






[jira] [Updated] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread James Ashbourne (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Ashbourne updated SOLR-15213:
---
Attachment: solr-merge.patch

> Add support for "merge" atomic update operation for child documents
> ---
>
> Key: SOLR-15213
> URL: https://issues.apache.org/jira/browse/SOLR-15213
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: James Ashbourne
>Priority: Major
> Attachments: solr-merge.patch
>
>
> Solr has "add", "set", "add-distinct" which work but all have their 
> limitations. Namely, there's currently no way to atomically update a document 
> where that document may or may not be present already with merging if it is 
> present.
> i.e. in the scenario where we have a document with two nested children: 
>   
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "Doe", 
>  "_isParent":"false"}, 
> {
>  "id": "fish2", 
>  "type_s": "fish", 
>  "name_s": "Hans", 
>  "_isParent":"false"}]
> }{noformat}
>  
>  If we later want to update that child doc e.g.:
> {noformat}
> {"id": "ocean1", 
> "_isParent":"true", 
> "fish": [ 
> {
>  "id": "fish1", 
>  "type_s": "fish", 
>  "name_s": "James", // new name
>  "_isParent":"false"}, 
> ]
> }{noformat}
>  
>  Existing operations:
>  - "add" - will add another nested doc with the same id leaving us with two 
> children with the same id.
>  - "set" - replaces the whole list of child docs with the single doc, we 
> could use this but would first have to fetch all the existing children.
>  - "add-distinct" - will reject the update based on the doc already being 
> present.
> I've got some changes (see patch) that add a new option "merge" which checks 
> based on the id and merges the new document with the old, with a fallback to 
> add if there is no id match.
>  
>  






[jira] [Created] (SOLR-15213) Add support for "merge" atomic update operation for child documents

2021-03-04 Thread James Ashbourne (Jira)
James Ashbourne created SOLR-15213:
--

 Summary: Add support for "merge" atomic update operation for child 
documents
 Key: SOLR-15213
 URL: https://issues.apache.org/jira/browse/SOLR-15213
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: James Ashbourne


Solr has "add", "set", "add-distinct" which work but all have their 
limitations. Namely, there's currently no way to atomically update a document 
where that document may or may not be present already with merging if it is 
present.

i.e. in the scenario where we have a document with two nested children: 
  
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "Doe", 
 "_isParent":"false"}, 
{
 "id": "fish2", 
 "type_s": "fish", 
 "name_s": "Hans", 
 "_isParent":"false"}]
}{noformat}
 
 If we later want to update that child doc e.g.:
{noformat}
{"id": "ocean1", 
"_isParent":"true", 
"fish": [ 
{
 "id": "fish1", 
 "type_s": "fish", 
 "name_s": "James", // new name
 "_isParent":"false"}, 
]
}{noformat}
 
 Existing operations:
 - "add" - will add another nested doc with the same id leaving us with two 
children with the same id.
 - "set" - replaces the whole list of child docs with the single doc, we could 
use this but would first have to fetch all the existing children.
 - "add-distinct" - will reject the update based on the doc already being 
present.

I've got some changes (see patch) that add a new option "merge" which checks based 
on the id and merges the new document with the old, with a fallback to add if 
there is no id match.
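The merge-by-id semantics described above could be sketched roughly like this (a hedged illustration of the idea only, not the attached patch; class and method names are invented):

```java
import java.util.*;

// Hedged sketch (not the actual patch): merge incoming child documents into
// an existing list by "id", falling back to plain add when no id matches.
public class ChildMerge {
    static List<Map<String, Object>> merge(List<Map<String, Object>> existing,
                                           List<Map<String, Object>> updates) {
        LinkedHashMap<Object, Map<String, Object>> byId = new LinkedHashMap<>();
        for (Map<String, Object> doc : existing) {
            byId.put(doc.get("id"), new HashMap<>(doc));
        }
        for (Map<String, Object> doc : updates) {
            Map<String, Object> prev = byId.get(doc.get("id"));
            if (prev != null) {
                prev.putAll(doc);                          // merge fields into existing child
            } else {
                byId.put(doc.get("id"), new HashMap<>(doc)); // fall back to add
            }
        }
        return new ArrayList<>(byId.values());
    }

    public static void main(String[] args) {
        List<Map<String, Object>> fish = new ArrayList<>();
        fish.add(new HashMap<>(Map.of("id", "fish1", "name_s", "Doe")));
        fish.add(new HashMap<>(Map.of("id", "fish2", "name_s", "Hans")));
        List<Map<String, Object>> update =
            List.of(Map.of("id", "fish1", "name_s", "James")); // rename fish1
        List<Map<String, Object>> out = merge(fish, update);
        System.out.println(out.size() + " " + out.get(0).get("name_s"));
    }
}
```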

 

 






[jira] [Commented] (LUCENE-9406) Make it simpler to track IndexWriter's events

2021-03-04 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295278#comment-17295278
 ] 

Michael McCandless commented on LUCENE-9406:


Should we maybe backport this to 8.x?  It is only adding a new experimental 
API, so it would not be an API break?

> Make it simpler to track IndexWriter's events
> -
>
> Key: LUCENE-9406
> URL: https://issues.apache.org/jira/browse/LUCENE-9406
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> This is the second spinoff from a [controversial PR to add a new index-time 
> feature to Lucene to merge small segments during 
> commit|https://github.com/apache/lucene-solr/pull/1552].  That change can 
> substantially reduce the number of small index segments to search.
> In that PR, there was a new proposed interface, {{IndexWriterEvents}}, giving 
> the application a chance to track when {{IndexWriter}} kicked off merges 
> during commit, how many, how long it waited, how often it gave up waiting, 
> etc.
> Such telemetry from production usage is really helpful when tuning settings 
> like which merges (e.g. a size threshold) to attempt on commit, and how long 
> to wait during commit, etc.
> I am splitting out this issue to explore possible approaches to do this.  
> E.g. [~simonw] proposed using a statistics class instead, but if I understood 
> that correctly, I think that would put the role of aggregation inside 
> {{IndexWriter}}, which is not ideal.
> Many interesting events, e.g. how many merges are being requested, how large 
> are they, how long did they take to complete or fail, etc., can be gleaned by 
> wrapping expert Lucene classes like {{MergePolicy}} and {{MergeScheduler}}.  
> But for those events that cannot (e.g. {{IndexWriter}} stopped waiting for 
> merges during commit), it would be very helpful to have some simple way to 
> track so applications can better tune.
> It is also possible to subclass {{IndexWriter}} and override key methods, but 
> I think that is inherently risky as {{IndexWriter}}'s protected methods are 
> not considered to be a stable API, and the synchronization used by 
> {{IndexWriter}} is confusing.






[GitHub] [lucene-solr] mikemccand commented on pull request #2342: LUCENE-9406: Add IndexWriterEventListener to track events in IndexWriter

2021-03-04 Thread GitBox


mikemccand commented on pull request #2342:
URL: https://github.com/apache/lucene-solr/pull/2342#issuecomment-790609431


   Woops, sorry for the belated response, and thank you @zacharymorn for 
creating this and @dweiss for merging -- it looks great!  We can now add other 
events to track incrementally over time ...






[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on a change in pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

2021-03-04 Thread GitBox


pawel-bugalski-dynatrace commented on a change in pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#discussion_r587450225



##
File path: lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java
##
@@ -31,18 +31,21 @@
  * to the id is encapsulated inside {@link BytesRefHash} and is guaranteed to 
be increased for each
  * added {@link BytesRef}.
  *
+ * Note that this implementation is not synchronized. If 
multiple threads access
+ * a {@link BytesRefHash} instance concurrently, and at least one of the 
threads modifies it
+ * structurally, it must be synchronized externally. (A structural 
modification is any
+ * operation on the map except operations explicitly listed in {@link 
UnmodifiableBytesRefHash}
+ * interface).
+ *
  * Note: The maximum capacity {@link BytesRef} instance passed to {@link 
#add(BytesRef)} must not
  * be longer than {@link ByteBlockPool#BYTE_BLOCK_SIZE}-2. The internal 
storage is limited to 2GB
  * total byte storage.
  *
  * @lucene.internal
  */
-public final class BytesRefHash implements Accountable {
+public final class BytesRefHash implements Accountable, 
UnmodifiableBytesRefHash {

Review comment:
   Based on comments I'm going to remove UnmodifiableBytesRefHash altogether








[GitHub] [lucene-solr] uschindler edited a comment on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

2021-03-04 Thread GitBox


uschindler edited a comment on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790589517


   Hi,
   I agree with Mike. I like the equals() method to be thread safe. That was my 
original proposal.
   Generally: BytesRefHash is my favourite class if you need a `Set<BytesRef>`. 
Although it's marked internal, I prefer to use it. Especially if you need a set 
of millions of strings, this is fast and does not produce millions of Strings. 
I personally used it only single threaded, but in all cases a method called 
equals should never ever change state. Sorry!
   
   +1 for the fix
   -1 to add the unmodifiable interface. That's over-engineered.
   
   Uwe
   






[GitHub] [lucene-solr] uschindler commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

2021-03-04 Thread GitBox


uschindler commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790589517


   Hi,
   I agree with Mike. I like the equals() method to be thread safe. That was my 
original proposal.
   Generally: BytesRefHash is my favourite class if you need a Set<BytesRef>. 
Although it's marked internal, I prefer to use it. Especially if you need a set 
of millions of strings, this is fast and does not produce millions of Strings. 
I personally used it only single threaded, but in all cases a method called 
equals should never ever change state. Sorry!
   
   +1 for the fix
   -1 to add the unmodifiable interface. That's over-engineered.
   
   Uwe
   






[GitHub] [lucene-solr] mikemccand commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

2021-03-04 Thread GitBox


mikemccand commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790583058


   +1 for changing `equals` to not require allocation, enabling us to remove 
the thread-unsafe shared `BytesRef scratch1`!  This makes `find` thread-safe 
(as long as no other threads are making structural changes), and would suffice 
to fix `Luwak`'s usage, right?  This is a nice improvement by itself!
   
   I'm also not a fan of adding the `UnmodifiableBytesRefHash` wrapper -- this 
is indeed an `@lucene.internal` API, not a generic JDK Collections class.
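   The allocation-free comparison being suggested might look roughly like this (illustrative only, with invented names; not the actual BytesRefHash code). Comparing the candidate bytes directly against the pooled bytes means no shared scratch `BytesRef` needs to be filled, so read-only lookups stay safe across threads as long as no thread modifies the hash structurally:

```java
// Illustrative sketch only (not BytesRefHash's actual code): equality is
// checked directly against the backing byte pool, so no shared mutable
// scratch object is touched during a read-only lookup.
public class BytesEqual {
    static boolean bytesEqual(byte[] pool, int offset, int length, byte[] candidate) {
        if (length != candidate.length) {
            return false; // different lengths can never be equal
        }
        for (int i = 0; i < length; i++) {
            if (pool[offset + i] != candidate[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pool = {1, 2, 3, 4, 5};
        System.out.println(bytesEqual(pool, 1, 3, new byte[] {2, 3, 4})
            + " " + bytesEqual(pool, 0, 2, new byte[] {9, 9}));
    }
}
```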






[GitHub] [lucene-solr] iverase commented on pull request #2452: LUCENE-9580: Don't introduce collinear edges when splitting polygon

2021-03-04 Thread GitBox


iverase commented on pull request #2452:
URL: https://github.com/apache/lucene-solr/pull/2452#issuecomment-790571887


   @nknize  would you mind having a look?






[GitHub] [lucene-solr] iverase opened a new pull request #2452: LUCENE-9580: Don't introduce collinear edges when splitting polygon

2021-03-04 Thread GitBox


iverase opened a new pull request #2452:
URL: https://github.com/apache/lucene-solr/pull/2452


   I had a look into this failing polygon and it seems the issue comes from the 
logic that splits the polygon for further processing. It might happen that the 
newly introduced edge is collinear with edges of the polygon. That makes these 
edges ineligible for filtering and causes the logic to fail.
   
   This change makes sure we don't introduce collinear edges when splitting 
polygons.






[GitHub] [lucene-solr] donnerpeter commented on pull request #2451: LUCENE-9687: Hunspell suggestions: reduce work in the findSimilarDictionaryEntries loop

2021-03-04 Thread GitBox


donnerpeter commented on pull request #2451:
URL: https://github.com/apache/lucene-solr/pull/2451#issuecomment-790570709


   Sorry, I couldn't create a separate JIRA issue for this change due to some 
"XSRF Security Token Missing" error in JIRA






[GitHub] [lucene-solr] donnerpeter opened a new pull request #2451: LUCENE-9687: Hunspell suggestions: reduce work in the findSimilarDictionaryEntries loop

2021-03-04 Thread GitBox


donnerpeter opened a new pull request #2451:
URL: https://github.com/apache/lucene-solr/pull/2451


   
   
   
   # Description
   
   The loop is called a lot of times, and some allocations and method calls can 
be spared
   
   # Solution
   
   Extract some code outside the loop
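   As a hedged illustration of the general technique (not the actual Hunspell change; all names here are invented), loop-invariant work is computed once before the hot loop instead of on every iteration:

```java
import java.util.Locale;

// Generic illustration of hoisting loop-invariant work, in the spirit of
// the change described above (names and data are invented).
public class HoistExample {
    // Before: lowercasing the query inside the loop repeats work and
    // allocates a new String per dictionary entry.
    static int countPrefixedSlow(String[] entries, String query) {
        int n = 0;
        for (String e : entries) {
            if (e.startsWith(query.toLowerCase(Locale.ROOT))) n++; // repeated per entry
        }
        return n;
    }

    // After: the invariant is computed once outside the loop.
    static int countPrefixedFast(String[] entries, String query) {
        String lowered = query.toLowerCase(Locale.ROOT); // hoisted out of the loop
        int n = 0;
        for (String e : entries) {
            if (e.startsWith(lowered)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String[] dict = {"apple", "apricot", "banana"};
        System.out.println(countPrefixedSlow(dict, "AP") + " " + countPrefixedFast(dict, "AP"));
    }
}
```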
   
   # Tests
   
   No new tests, ~5% speedup in `TestPerformance.en_suggest`
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   






[GitHub] [lucene-solr] rmuir commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

2021-03-04 Thread GitBox


rmuir commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790554086


   Just to emphasize it even more, this class is marked `@lucene.internal`. The 
class shouldn't even be exposed to the outside in the public API to start with, 
so let's please not increase the exposure.
   






[GitHub] [lucene-solr] rmuir commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

2021-03-04 Thread GitBox


rmuir commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790551382


   I still don't like the unmodifiable-interface. Sorry, I disagree with 
exposing thread-safe methods officially in the API for a class that should only 
be used by one thread, just because one user of the class did it the wrong way.
   
   It was my understanding that the problem is being solved this way because 
it's "too hard" to fix lucene-monitor to instead do things correctly: I'll 
accept that we should do a "quick fix" to work around its bugginess, but we 
should ultimately file a JIRA issue to fix it (it should not use such a class 
with multiple threads).
   
   We shouldn't expose what we have done in public APIs; it is just a temporary 
solution. If someone wants such a general-purpose hashtable they can use 
`HashMap` from their JDK, we aren't a hashtable library.






[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-03-04 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295207#comment-17295207
 ] 

Dawid Weiss commented on SOLR-15038:


Hi Bruno. No, I've no idea... I think the goal was to disallow solr from 
writing into source locations - I'm not sure if 8x allows that or if it's 
something introduced on master (sorry!).

I only noticed this while changing some bits before repo splitting.

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
> Fix For: 8.9
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with Query Elevation component in the last time and we 
> were missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: Only show the representative if the 
> elevated documents does have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  






[jira] [Comment Edited] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-03-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295203#comment-17295203
 ] 

Bruno Roustant edited comment on SOLR-15038 at 3/4/21, 10:37 AM:
-

Ouch, yes I'll revert that. I played with this permission but didn't intend to 
commit it.

When running the tests I noticed many Solr tests have warning about being 
unable to create some test resources.
{code:java}
java.security.AccessControlException: access denied ("java.io.FilePermission" 
"lucene-solr/solr/core/build/resources/test/solr/userfiles" "write")
at 
java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
 ~[?:?]
at java.security.AccessController.checkPermission(AccessController.java:897) 
~[?:?]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?]
at java.lang.SecurityManager.checkWrite(SecurityManager.java:752) ~[?:?]
at sun.nio.fs.UnixPath.checkWrite(UnixPath.java:824) ~[?:?]
at 
sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:377)
 ~[?:?]
at java.nio.file.Files.createDirectory(Files.java:689) ~[?:?]
at java.nio.file.Files.createAndCheckIsDirectory(Files.java:796) ~[?:?]
at java.nio.file.Files.createDirectories(Files.java:742) ~[?:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:383) [main/:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:344) 
[main/:?]{code}
I noticed they disappeared when I changed the permission for write-access in 
solr-tests.policy.

[~dweiss] do you know how to get rid of these (many) warnings?
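
If the tests genuinely need to create directories like {{userfiles}}, one option might be a narrower grant scoped to the build output rather than the whole solr source tree (hypothetical path pattern, untested):
{code}
permission java.io.FilePermission "${common.dir}${/}..${/}solr${/}core${/}build${/}-", "read,write";
{code}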


was (Author: broustant):
Ouch, yes I'll revert that. I played with this permission but didn't intend to 
commit it.

When running the tests I noticed many Solr tests have warning about being 
unable to create some test resources.
{code:java}
java.security.AccessControlException: access denied ("java.io.FilePermission" 
"lucene-solr/solr/core/build/resources/test/solr/userfiles" "write")
at 
java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
 ~[?:?]
at java.security.AccessController.checkPermission(AccessController.java:897) 
~[?:?]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?]
at java.lang.SecurityManager.checkWrite(SecurityManager.java:752) ~[?:?]
at sun.nio.fs.UnixPath.checkWrite(UnixPath.java:824) ~[?:?]
at 
sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:377)
 ~[?:?]
at java.nio.file.Files.createDirectory(Files.java:689) ~[?:?]
at java.nio.file.Files.createAndCheckIsDirectory(Files.java:796) ~[?:?]
at java.nio.file.Files.createDirectories(Files.java:742) ~[?:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:383) [main/:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:344) 
[main/:?]{code}
I noticed they disappeared when I changed the permission for write-access in 
solr-tests.policy.

Do you know how to get rid of these (many) warnings?

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
> Fix For: 8.9
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with Query Elevation component in the last time and we 
> were missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: Only show the representative if the 
> elevated documents does have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  






[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-03-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295203#comment-17295203
 ] 

Bruno Roustant commented on SOLR-15038:
---

Ouch, yes I'll revert that. I played with this permission but didn't intend to 
commit it.

When running the tests I noticed many Solr tests have warning about being 
unable to create some test resources.
{code:java}
java.security.AccessControlException: access denied ("java.io.FilePermission" 
"lucene-solr/solr/core/build/resources/test/solr/userfiles" "write")
at 
java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
 ~[?:?]
at java.security.AccessController.checkPermission(AccessController.java:897) 
~[?:?]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?]
at java.lang.SecurityManager.checkWrite(SecurityManager.java:752) ~[?:?]
at sun.nio.fs.UnixPath.checkWrite(UnixPath.java:824) ~[?:?]
at 
sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:377)
 ~[?:?]
at java.nio.file.Files.createDirectory(Files.java:689) ~[?:?]
at java.nio.file.Files.createAndCheckIsDirectory(Files.java:796) ~[?:?]
at java.nio.file.Files.createDirectories(Files.java:742) ~[?:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:383) [main/:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:344) 
[main/:?]{code}
I noticed they disappeared when I changed the permission for write-access in 
solr-tests.policy.

Do you know how to get rid of these (many) warnings?

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
> Fix For: 8.9
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with Query Elevation component in the last time and we 
> were missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: Only show the representative if the 
> elevated documents does have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  






[GitHub] [lucene-solr] pawel-bugalski-dynatrace commented on pull request #2429: LUCENE-9791 Allow calling BytesRefHash#find concurrently

2021-03-04 Thread GitBox


pawel-bugalski-dynatrace commented on pull request #2429:
URL: https://github.com/apache/lucene-solr/pull/2429#issuecomment-790467163


   @rmuir @madrob what do you think about current state of this PR? Any more 
comments? What else needs to be done to merge it?






[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-03-04 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295083#comment-17295083
 ] 

Dawid Weiss commented on SOLR-15038:


This change introduced write-access to sources:
{code}
-  permission java.io.FilePermission "${common.dir}${/}..${/}solr${/}-", "read";
+  permission java.io.FilePermission "${common.dir}${/}..${/}solr${/}-", 
"read,write";
{code}
I think this bit should be reverted?

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
> Fix For: 8.9
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with Query Elevation component in the last time and we 
> were missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: Only show the representative if the 
> elevated documents does have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  






[jira] [Updated] (SOLR-14759) Separate the Lucene and Solr builds

2021-03-04 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated SOLR-14759:
---
Description: 
While still in same git repo, separate the builds, so Lucene and Solr can be 
built independently.

The preparation step includes optional building of just Lucene from current 
master (prior to any code removal):

Current status of joint and separate builds:
 * (/) joint build
{code}
gradlew assemble check
{code}
 * (/) Lucene-only
{code}
gradlew -Dskip.solr=true assemble check
{code}
 * (/) Solr-only (with documentation exclusions)
{code}
gradlew -Dskip.lucene=true assemble check -x test -x documentation -x 
checkBrokenLinks -x checkLocalJavadocLinksSite
{code}

  was:
While still in same git repo, separate the builds, so Lucene and Solr can be 
built independently.

The preparation step includes optional building of just Lucene from current 
master (prior to any code removal):

Current status of joint and separate builds:
 * (/) joint build

{code:java}
gradlew assemble check{code}

 
{code:java}
# Current build (no tests, Lucene+Solr)
gradlew assemble check

# Lucene-only build
gradlew -Dskip.solr=true assemble check

# Solr-only build
gradlew -Dskip.lucene=true assemble check -x test -x documentation -x checkBrokenLinks -x checkLocalJavadocLinksSite
{code}


> Separate the Lucene and Solr builds
> ---
>
> Key: SOLR-14759
> URL: https://issues.apache.org/jira/browse/SOLR-14759
> Project: Solr
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Jan Høydahl
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> While still in the same git repo, separate the builds so that Lucene and Solr
> can be built independently.
> The preparation step includes optional building of just Lucene from current 
> master (prior to any code removal):
> Current status of joint and separate builds:
>  * (/) joint build
> {code}
> gradlew assemble check
> {code}
>  * (/) Lucene-only
> {code}
> gradlew -Dskip.solr=true assemble check
> {code}
>  * (/) Solr-only (with documentation exclusions)
> {code}
> gradlew -Dskip.lucene=true assemble check -x test -x documentation -x checkBrokenLinks -x checkLocalJavadocLinksSite
> {code}






[jira] [Updated] (SOLR-14759) Separate the Lucene and Solr builds

2021-03-04 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated SOLR-14759:
---
Description: 
While still in the same git repo, separate the builds so that Lucene and Solr
can be built independently.

The preparation step includes optional building of just Lucene from current 
master (prior to any code removal):

Current status of joint and separate builds:
 * (/) joint build

{code:java}
gradlew assemble check
{code}

 
{code:java}
# Current build (no tests, Lucene+Solr)
gradlew assemble check

# Lucene-only build
gradlew -Dskip.solr=true assemble check

# Solr-only build
gradlew -Dskip.lucene=true assemble check -x test -x documentation -x checkBrokenLinks -x checkLocalJavadocLinksSite
{code}

  was:
While still in the same git repo, separate the builds so that Lucene and Solr
can be built independently.

The preparation step includes optional building of just Lucene from current 
master (prior to any code removal):
{code:java}
 gradlew -Dskip.solr=true check -x checkUnusedConstraints -x verifyLocks
{code}


> Separate the Lucene and Solr builds
> ---
>
> Key: SOLR-14759
> URL: https://issues.apache.org/jira/browse/SOLR-14759
> Project: Solr
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Jan Høydahl
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> While still in the same git repo, separate the builds so that Lucene and Solr
> can be built independently.
> The preparation step includes optional building of just Lucene from current 
> master (prior to any code removal):
> Current status of joint and separate builds:
>  * (/) joint build
> {code:java}
> gradlew assemble check
> {code}
>  
> {code:java}
> # Current build (no tests, Lucene+Solr)
> gradlew assemble check
>
> # Lucene-only build
> gradlew -Dskip.solr=true assemble check
>
> # Solr-only build
> gradlew -Dskip.lucene=true assemble check -x test -x documentation -x checkBrokenLinks -x checkLocalJavadocLinksSite
> {code}






[jira] [Commented] (SOLR-14759) Separate the Lucene and Solr builds

2021-03-04 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295076#comment-17295076
 ] 

Dawid Weiss commented on SOLR-14759:


Current status will be tracked at the main issue description level. 

> Separate the Lucene and Solr builds
> ---
>
> Key: SOLR-14759
> URL: https://issues.apache.org/jira/browse/SOLR-14759
> Project: Solr
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Jan Høydahl
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> While still in the same git repo, separate the builds so that Lucene and Solr
> can be built independently.
> The preparation step includes optional building of just Lucene from current 
> master (prior to any code removal):
> {code:java}
>  gradlew -Dskip.solr=true check -x checkUnusedConstraints -x verifyLocks
> {code}






[GitHub] [lucene-solr] dweiss commented on a change in pull request #2448: SOLR-14759: a few initial changes so that Lucene can be built independently while Solr code is still in place.

2021-03-04 Thread GitBox


dweiss commented on a change in pull request #2448:
URL: https://github.com/apache/lucene-solr/pull/2448#discussion_r587241698



##
File path: gradle/documentation/documentation.gradle
##
@@ -20,34 +20,41 @@ configure(rootProject) {
  def refguideUrlVersion = project.baseVersion.replaceFirst(/^(\d+)\.(\d+).*$/, '$1_$2')
 
   ext {
-luceneDocUrl = project.propertyOrDefault('lucene.javadoc.url', {
-  if (project.version != project.baseVersion) {
-// non-release build
-new File(project('lucene:documentation').buildDir, 
'site').toURI().toASCIIString().minus(~'/$')
-  } else {
-// release build
-"https://lucene.apache.org/core/${urlVersion}"
-  }
-}())
-
-solrDocUrl = project.propertyOrDefault('solr.javadoc.url', {
-  if (project.version != project.baseVersion) {
-// non-release build
-new File(project('solr:documentation').buildDir, 
'site').toURI().toASCIIString().minus(~'/$')
-  } else {
-// release build
-"https://lucene.apache.org/solr/${urlVersion}"
-  }
-}())
+if (!skipLucene) {
+  luceneDocUrl = project.propertyOrDefault('lucene.javadoc.url', {
+if (project.version != project.baseVersion) {
+  // non-release build
+  new File(project('lucene:documentation').buildDir, 
'site').toURI().toASCIIString().minus(~'/$')
+} else {
+  // release build
+  "https://lucene.apache.org/core/${urlVersion}"
+}
+  }())
+}
 
-solrRefguideUrl = project.propertyOrDefault('solr.refguide.url', "https://lucene.apache.org/solr/guide/${refguideUrlVersion}")
+// SOLR ONLY
+if (!skipSolr) {
+  solrDocUrl = project.propertyOrDefault('solr.javadoc.url', {
+if (project.version != project.baseVersion) {
+  // non-release build
+  new File(project('solr:documentation').buildDir, 
'site').toURI().toASCIIString().minus(~'/$')
+} else {
+  // release build
+  "https://lucene.apache.org/solr/${urlVersion}"
+}
+  }())
+
+  solrRefguideUrl = project.propertyOrDefault('solr.refguide.url', "https://lucene.apache.org/solr/guide/${refguideUrlVersion}")
+}
   }
 
   task documentation() {
 group = 'documentation'
 description = 'Generate all documentation'
 
-dependsOn ':lucene:documentation:assemble'
+if (!skipLucene) {
+  dependsOn ':lucene:documentation:assemble'
+}
 dependsOn ':solr:documentation:assemble'

Review comment:
   top-level documentation wasn't part of assemble, that's why it worked.
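The gated blocks in the diff above pick a local `site` directory URI for non-release (snapshot) builds and the published URL for release builds. A minimal Java sketch of that selection logic (method and parameter names are illustrative, not the build's API):

```java
import java.io.File;

public class DocUrlDemo {
    // Non-release builds (version != baseVersion) point at the locally
    // generated site; release builds point at the public documentation URL.
    static String docUrl(String version, String baseVersion,
                         File localSiteDir, String releaseUrl) {
        if (!version.equals(baseVersion)) {
            // Strip a trailing slash, as the gradle script does with minus(~'/$').
            return localSiteDir.toURI().toASCIIString().replaceAll("/$", "");
        }
        return releaseUrl;
    }

    public static void main(String[] args) {
        File site = new File("/tmp/site");
        String release = "https://lucene.apache.org/core/9_0_0";
        System.out.println(docUrl("9.0.0-SNAPSHOT", "9.0.0", site, release));
        System.out.println(docUrl("9.0.0", "9.0.0", site, release));
    }
}
```

Wrapping each assignment in `if (!skipLucene)` / `if (!skipSolr)` then simply avoids touching the `lucene:documentation` or `solr:documentation` projects when that half of the build is excluded.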





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


