[jira] [Updated] (LUCENE-9028) Public method for MultiTermIntervalSource

2019-11-01 Thread Mikhail Khludnev (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-9028:
-
Fix Version/s: 8.4
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Public method for MultiTermIntervalSource
> -
>
> Key: LUCENE-9028
> URL: https://issues.apache.org/jira/browse/LUCENE-9028
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Fix For: 8.4
>
> Attachments: LUCENE-9028.patch
>
>
> Right now we have prefix and wildcard for multiterm Intervals. Sometimes it's 
> necessary to provide the set of terms in a more generic way, as an automaton.
> {code:java}
> Intervals.multiterm(CompiledAutomaton automaton, int maxExpansions, String 
> pattern)
> {code}
> As a benefit, we can handle it more efficiently than an OR over terms. 
> What do you think? 
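
A minimal usage sketch of the proposed factory method (the package locations, 
the field name {{body}}, and the 128-term cap below are illustrative 
assumptions, not part of the proposal itself):

{code:java}
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;
import org.apache.lucene.util.automaton.CompiledAutomaton;
import org.apache.lucene.util.automaton.RegExp;

public class MultitermIntervalsSketch {
  public static void main(String[] args) {
    // Compile an automaton matching every term that fits a regular expression.
    CompiledAutomaton automaton =
        new CompiledAutomaton(new RegExp("lucen.*").toAutomaton());

    // The proposed entry point: expand to at most 128 terms; the pattern
    // string presumably serves as a label in toString()/error messages.
    IntervalsSource source = Intervals.multiterm(automaton, 128, "lucen.*");

    // An IntervalsSource is evaluated against a field via IntervalQuery.
    System.out.println(new IntervalQuery("body", source));
  }
}
{code}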






[jira] [Commented] (LUCENE-9028) Public method for MultiTermIntervalSource

2019-11-01 Thread Mikhail Khludnev (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965237#comment-16965237
 ] 

Mikhail Khludnev commented on LUCENE-9028:
--

I haven't got a reply, so I put it in as is; hopefully we can review and polish 
it before 8.4. Just let me know if there are any concerns. 

> Public method for MultiTermIntervalSource
> -
>
> Key: LUCENE-9028
> URL: https://issues.apache.org/jira/browse/LUCENE-9028
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Attachments: LUCENE-9028.patch
>
>
> Right now we have prefix and wildcard for multiterm Intervals. Sometimes it's 
> necessary to provide the set of terms in a more generic way, as an automaton.
> {code:java}
> Intervals.multiterm(CompiledAutomaton automaton, int maxExpansions, String 
> pattern)
> {code}
> As a benefit, we can handle it more efficiently than an OR over terms. 
> What do you think? 






[jira] [Commented] (LUCENE-9031) UnsupportedOperationException on highlighting Interval Query

2019-11-01 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965225#comment-16965225
 ] 

Lucene/Solr QA commented on LUCENE-9031:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  0m 28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  0m 24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  0m 24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
48s{color} | {color:green} highlighter in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m  
7s{color} | {color:green} queries in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}  8m 37s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-9031 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12984655/LUCENE-9031.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP 
Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / 5c6a299 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/218/testReport/ |
| modules | C: lucene/highlighter lucene/queries U: lucene |
| Console output | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/218/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> UnsupportedOperationException on highlighting Interval Query
> 
>
> Key: LUCENE-9031
> URL: https://issues.apache.org/jira/browse/LUCENE-9031
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/queries
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Fix For: 8.4
>
> Attachments: LUCENE-9031.patch, LUCENE-9031.patch, LUCENE-9031.patch, 
> LUCENE-9031.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When UnifiedHighlighter highlights an Interval Query, it encounters an 
> UnsupportedOperationException. 
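
For context, a self-contained reproduction sketch of the failing combination 
(package locations assume the 8.4 layout where intervals live in 
{{org.apache.lucene.queries.intervals}}; the field name and text are made up):

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.lucene.store.ByteBuffersDirectory;

public class IntervalHighlightRepro {
  public static void main(String[] args) throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    StandardAnalyzer analyzer = new StandardAnalyzer();
    // Index a single document with one text field.
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("body", "intervals are a flexible proximity query", Field.Store.YES));
      writer.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = new IntervalQuery("body", Intervals.term("intervals"));
      TopDocs topDocs = searcher.search(query, 10);
      UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher, analyzer);
      // Before the LUCENE-9031 fix, this call was reported to throw
      // UnsupportedOperationException.
      String[] fragments = highlighter.highlight("body", query, topDocs);
      System.out.println(fragments[0]);
    }
  }
}
{code}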






[jira] [Commented] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965223#comment-16965223
 ] 

Noble Paul commented on SOLR-13888:
---

OK, so let me summarize:
 * We don't know what problems we are going to solve.
 * We don't know the new design. But the current design is awesome, and it's 
just some code issues.
 * We don't know what the success criteria are.

But we know it's all going to be awesome when it's done.

I love it. I'll wait. 

> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.
>  
> This is not about an architecture change - the architecture is fine. The 
> implementation is broken and getting worse.






[jira] [Commented] (SOLR-13841) Add jackson databind annotations to SolrJ classpath

2019-11-01 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965221#comment-16965221
 ] 

Noble Paul commented on SOLR-13841:
---


I have opened a PR with the new Jackson {{AnnotationIntrospector}}:

https://github.com/apache/lucene-solr/pull/992

Reviews welcome.
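
For readers following along, a minimal sketch of the technique (the annotation 
and class names below are placeholders, not the ones in the PR): a custom 
Jackson {{AnnotationIntrospector}} teaches {{ObjectMapper}} to honor a 
project-owned annotation, so SolrJ classes need no compile-time dependency on 
jackson-annotations.

{code:java}
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import com.fasterxml.jackson.core.Version;
import com.fasterxml.jackson.databind.AnnotationIntrospector;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.PropertyName;
import com.fasterxml.jackson.databind.introspect.Annotated;

// Stand-in for a SolrJ-owned annotation; it references no Jackson types.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.METHOD})
@interface OurJsonProperty {
  String value() default "";
}

// Maps the custom annotation onto Jackson's property-naming machinery.
class OurAnnotationIntrospector extends AnnotationIntrospector {
  @Override
  public Version version() {
    return Version.unknownVersion();
  }

  @Override
  public PropertyName findNameForSerialization(Annotated a) {
    OurJsonProperty prop = a.getAnnotation(OurJsonProperty.class);
    if (prop == null) return null;
    return prop.value().isEmpty() ? PropertyName.USE_DEFAULT
                                  : new PropertyName(prop.value());
  }

  @Override
  public PropertyName findNameForDeserialization(Annotated a) {
    return findNameForSerialization(a);
  }
}

public class IntrospectorSketch {
  static class Status {
    @OurJsonProperty("numNodes") public int nodeCount = 3;
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper()
        .setAnnotationIntrospector(new OurAnnotationIntrospector());
    System.out.println(mapper.writeValueAsString(new Status())); // {"numNodes":3}
  }
}
{code}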

> Add jackson databind annotations to SolrJ classpath
> ---
>
> Key: SOLR-13841
> URL: https://issues.apache.org/jira/browse/SOLR-13841
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We can start using annotations in SolrJ to minimize the amount of code we 
> write & improve readability. Jackson is a widely used library and everyone is 
> already familiar with it.






[GitHub] [lucene-solr] noblepaul opened a new pull request #992: SOLR-13841: removed jackson dependencies from SolrJ and provided a mapping to our annotation

2019-11-01 Thread GitBox
noblepaul opened a new pull request #992: SOLR-13841: removed jackson 
dependencies from SolrJ and provided a mapping to our annotation
URL: https://github.com/apache/lucene-solr/pull/992
 
 
   





[jira] [Commented] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965215#comment-16965215
 ] 

Mark Miller commented on SOLR-13888:


{quote} And I think that's a separate discussion, affecting documentation more 
than anything.
{quote}
Indeed - I'll name this effort what I want - the project can release things as 
it wants.

 
{quote}I do know from mailing list threads and user-filed issues that we have 
users who have tons of stability problems with SolrCloud, and those are 
affecting even users who haven't built massive clusters.
{quote}
Just give me a bit of time and I'll demonstrate. Lots to come.

The fixing of all these impls includes proper logging, short tests, easier dev, 
reusable patterns, and all sorts of goodies. The proof is in the pudding. I'm 
pulling the fire alarm on the current track. For a brief bit you can see that 
however you'd like. I have a lot in my pocket. If I hadn't lost so much I'd have 
more, but I have a lot; given a little time, I'll show it to you and the 
comparison will be stark.

> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.
>  
> This is not about an architecture change - the architecture is fine. The 
> implementation is broken and getting worse.






[jira] [Commented] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Shawn Heisey (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965209#comment-16965209
 ] 

Shawn Heisey commented on SOLR-13888:
-

* I don't care too much whether we still call it SolrCloud or migrate to 
calling it cluster mode.  And I think that's a separate discussion, affecting 
documentation more than anything.
 * I do not know the code that drives this.  I'd like to understand the code, 
but I'm betting that that rabbit hole will take me at least a few weeks to 
traverse ... assuming I devote ALL of my spare time to it.  That is something 
that I just can't do right now.
 * I do know from mailing list threads and user-filed issues that we have users 
who have tons of stability problems with SolrCloud, and those are affecting 
even users who haven't built massive clusters.

> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.
>  
> This is not about an architecture change - the architecture is fine. The 
> implementation is broken and getting worse.






[jira] [Updated] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Mark Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-13888:
---
Description: 
As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
titled SolrCloud 2.

A couple times now I've pulled on the sweater thread that is our broken tests. 
It leads to one place - SolrCloud is sick and devs are adding spotty code on 
top of it at a rate that will lead to the system falling in on itself. As it 
is, it's a very slow, very inefficient, very unreliable, very buggy system.

This is not why I am here. This is the opposite of why I am here.

So please, let's stop. We can't build on that thing as it is.

 

I need some time, I lost a lot of work at one point, the scope has expanded 
since I realized how problematic some things really are, but I have an 
alternative path that is not so duct tape and straw. As the building climbs, 
that foundation is going to kill us all.

 

This is not about an architecture change - the architecture is fine. The 
implementation is broken and getting worse.

  was:
As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
titled SolrCloud 2.

A couple times now I've pulled on the sweater thread that is our broken tests. 
It leads to one place - SolrCloud is sick and devs are adding spotty code on 
top of it at a rate that will lead to the system falling in on itself. As it 
is, it's a very slow, very inefficient, very unreliable, very buggy system.

This is not why I am here. This is the opposite of why I am here.

So please, let's stop. We can't build on that thing as it is.

 

I need some time, I lost a lot of work at one point, the scope has expanded 
since I realized how problematic some things really are, but I have an 
alternative path that is not so duct tape and straw. As the building climbs, 
that foundation is going to kill us all.


> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.
>  
> This is not about an architecture change - the architecture is fine. The 
> implementation is broken and getting worse.






[jira] [Comment Edited] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965204#comment-16965204
 ] 

Mark Miller edited comment on SOLR-13888 at 11/2/19 2:36 AM:
-

{quote}We should not repeat those mistakes.
{quote}
The original design was never fully implemented, and lots of code that is not 
thread safe, not documented, and horribly against the proper flow of the system, 
much of it written by you, is now layered across it.

You guys are headed for disaster - feel free to head that way to a degree, but I 
will be blocking any more of your disastrous rewrites.

In the meantime, I'll show other devs what SolrCloud was meant to be - 1000x 
what it is.


was (Author: markrmil...@gmail.com):
{quote}We should not repeat those mistakes.
{quote}
The original design was never fully implemented, and lots of code that is not 
thread safe, not documented, and horribly against the proper flow of the system, 
much of it written by you, is now layered across it.

You guys are headed for disaster - feel free to head that way to a degree, but I 
will be blocking any more of your disastrous rewrites.

> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.






[jira] [Commented] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965204#comment-16965204
 ] 

Mark Miller commented on SOLR-13888:


{quote}We should not repeat those mistakes.
{quote}
The original design was never fully implemented, and lots of code that is not 
thread safe, not documented, and horribly against the proper flow of the system, 
much of it written by you, is now layered across it.

You guys are headed for disaster - feel free to head that way to a degree, but I 
will be blocking any more of your disastrous rewrites.

> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.






[jira] [Commented] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965203#comment-16965203
 ] 

Mark Miller commented on SOLR-13888:


It's not code; you can't veto it.

It will be SolrCloud 2 and it will be on a branch that accepts nothing new from 
master.

Eventually, devs can use the mess you guys are creating, or something that 
works.

> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.






[jira] [Comment Edited] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965199#comment-16965199
 ] 

Noble Paul edited comment on SOLR-13888 at 11/2/19 2:10 AM:


-1 on a large-scale change without a proper design.

SolrCloud 1 was a fundamentally flawed design which could not scale to a large 
cluster. We should not repeat those mistakes.





was (Author: noble.paul):
-1 on a large-scale rewrite.

You are the one who introduced this huge mess of a design in SolrCloud 1, and we 
are all living with the consequences of your bad design choices.
Let's decide the design first, get buy-in for what you wish to do, and let 
everyone get on board.


> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.






[jira] [Comment Edited] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965199#comment-16965199
 ] 

Noble Paul edited comment on SOLR-13888 at 11/2/19 2:05 AM:


-1 on a large-scale rewrite.

You are the one who introduced this huge mess of a design in SolrCloud 1, and we 
are all living with the consequences of your bad design choices.
Let's decide the design first, get buy-in for what you wish to do, and let 
everyone get on board.



was (Author: noble.paul):
-1 on a large-scale rewrite.

You are the one who introduced this huge mess of a design in SolrCloud 1, and we 
are all living with the consequences of your bad design choices.
Let's decide first of all: get buy-in for what you wish to do, and let everyone 
get on board.


> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.






[jira] [Commented] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965200#comment-16965200
 ] 

Ishan Chattopadhyaya commented on SOLR-13888:
-

bq. As devs discuss dropping the SolrCloud name on the dev list, here is an 
issue titled SolrCloud 2.
-1 on any attempt to continue this abomination of a name. "SolrCloud 2" is just 
doubling down on past mistakes.

> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.






[jira] [Commented] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965199#comment-16965199
 ] 

Noble Paul commented on SOLR-13888:
---

-1 on a large-scale rewrite.

You are the one who introduced this huge mess of a design in SolrCloud 1, and we 
are all living with the consequences of your bad design choices.
Let's decide first of all: get buy-in for what you wish to do, and let everyone 
get on board.


> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.






[jira] [Commented] (SOLR-12993) Split the state.json into 2. a small frequently modified data + a large unmodified data

2019-11-01 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965185#comment-16965185
 ] 

Noble Paul commented on SOLR-12993:
---

[~markrmil...@gmail.com]

Let's set aside ad hominem attacks and focus on the problem at hand.
We have a huge issue with scaling SolrCloud to a large number of nodes. The 
design is fundamentally flawed and we need to address this.

You are the one who created this design. Yes, I had to split the 
clusterstate.json into multiple per-collection state.json files, and you were 
opposed to that as well. If you have a better solution, I'm all ears.

bq. doing crazy things with the overseer.

Really? I would like to see the specific (crazy) changes I've made to the 
Overseer. I haven't done anything in the Overseer other than refactoring it 
into multiple pieces.

> Split the state.json into 2. a small frequently modified data + a large 
> unmodified data
> ---
>
> Key: SOLR-12993
> URL: https://issues.apache.org/jira/browse/SOLR-12993
> Project: Solr
>  Issue Type: Improvement
>Reporter: Noble Paul
>Priority: Major
>
> The new design is posted here
> https://docs.google.com/document/d/1AZyiRA_bRhAWkUM1Nj5kg__xpPM9Fd_iwHhYazG38xI/edit#






[jira] [Updated] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Mark Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-13888:
---
Description: 
As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
titled SolrCloud 2.

A couple times now I've pulled on the sweater thread that is our broken tests. 
It leads to one place - SolrCloud is sick and devs are adding spotty code on 
top of it at a rate that will lead to the system falling in on itself. As it 
is, it's a very slow, very inefficient, very unreliable, very buggy system.

This is not why I am here. This is the opposite of why I am here.

So please, let's stop. We can't build on that thing as it is.

 

I need some time, I lost a lot of work at one point, the scope has expanded 
since I realized how problematic some things really are, but I have an 
alternative path that is not so duct tape and straw. As the building climbs, 
that foundation is going to kill us all.

  was:
As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
titled SolrCloud 2.

A couple times now I've pulled on the sweater thread that is our broken tests. 
It leads to one place - SolrCloud is sick and devs are adding spotty code on 
top of it at a rate that will lead to the system falling in on itself. As it 
is, it's a very slow, very inefficient, very unreliable, very buggy system.

This is not why I am here. This is the opposite of why I am here.

So please, let's stop. We can build on that thing as it is.

 

I need some time, I lost a lot of work at one point, the scope has expanded 
since I realized how problematic some things really are, but I have an 
alternative path that is not so duct tape and straw. As the building climbs, 
that foundation is going to kill us all.


> SolrCloud 2
> ---
>
> Key: SOLR-13888
> URL: https://issues.apache.org/jira/browse/SOLR-13888
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
> titled SolrCloud 2.
> A couple times now I've pulled on the sweater thread that is our broken 
> tests. It leads to one place - SolrCloud is sick and devs are adding spotty 
> code on top of it at a rate that will lead to the system falling in on 
> itself. As it is, it's a very slow, very inefficient, very unreliable, very 
> buggy system.
> This is not why I am here. This is the opposite of why I am here.
> So please, let's stop. We can't build on that thing as it is.
>  
> I need some time, I lost a lot of work at one point, the scope has expanded 
> since I realized how problematic some things really are, but I have an 
> alternative path that is not so duct tape and straw. As the building climbs, 
> that foundation is going to kill us all.






[jira] [Created] (SOLR-13888) SolrCloud 2

2019-11-01 Thread Mark Miller (Jira)
Mark Miller created SOLR-13888:
--

 Summary: SolrCloud 2
 Key: SOLR-13888
 URL: https://issues.apache.org/jira/browse/SOLR-13888
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Mark Miller
Assignee: Mark Miller


As devs discuss dropping the SolrCloud name on the dev list, here is an issue 
titled SolrCloud 2.

A couple times now I've pulled on the sweater thread that is our broken tests. 
It leads to one place - SolrCloud is sick and devs are adding spotty code on 
top of it at a rate that will lead to the system falling in on itself. As it 
is, it's a very slow, very inefficient, very unreliable, very buggy system.

This is not why I am here. This is the opposite of why I am here.

So please, let's stop. We can build on that thing as it is.

 

I need some time, I lost a lot of work at one point, the scope has expanded 
since I realized how problematic some things really are, but I have an 
alternative path that is not so duct tape and straw. As the building climbs, 
that foundation is going to kill us all.






[jira] [Commented] (SOLR-12045) Move Analytics Component from contrib to core

2019-11-01 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-12045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965177#comment-16965177
 ] 

Jan Høydahl commented on SOLR-12045:


In light of the recent plugin/package work being done and the goal of making 
solr-core slimmer, not fatter, we should close this Jira as Won't Fix and 
instead aim to make the analytics component a Solr package, installable with 
e.g. "{{bin/solr package install analytics-component}}". 

> Move Analytics Component from contrib to core
> -
>
> Key: SOLR-12045
> URL: https://issues.apache.org/jira/browse/SOLR-12045
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 8.0
>Reporter: Houston Putman
>Priority: Major
> Fix For: 8.1, master (9.0)
>
> Attachments: SOLR-12045.rb-visibility.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Analytics Component currently lives in contrib. Since it includes no 
> external dependencies, there is no harm in moving it into core Solr.
> The analytics component would be included as a default search component and 
> the analytics handler (currently only used for analytics shard requests, 
> might be transitioned to handle user requests in the future) would be 
> included as an implicit handler.






[jira] [Commented] (SOLR-12993) Split the state.json into 2. a small frequently modified data + a large unmodified data

2019-11-01 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965168#comment-16965168
 ] 

Mark Miller commented on SOLR-12993:


-1, SolrCloud core is too busted to add yet another state format. We can't keep 
adding to the problem with new legacy modes and state modes and optimizations. 
We can't fix it while you keep optimizing. Your code is among the biggest 
problems by far: under-tested, under-documented, bad or no thread concurrency, 
doing crazy things with the overseer. Veto on this change.

> Split the state.json into 2. a small frequently modified data + a large 
> unmodified data
> ---
>
> Key: SOLR-12993
> URL: https://issues.apache.org/jira/browse/SOLR-12993
> Project: Solr
>  Issue Type: Improvement
>Reporter: Noble Paul
>Priority: Major
>
> The new design is posted here
> https://docs.google.com/document/d/1AZyiRA_bRhAWkUM1Nj5kg__xpPM9Fd_iwHhYazG38xI/edit#






[jira] [Commented] (LUCENE-9028) Public method for MultiTermIntervalSource

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965155#comment-16965155
 ] 

ASF subversion and git services commented on LUCENE-9028:
-

Commit 1f6c06f30579b6a85336934e3edd795a29062943 in lucene-solr's branch 
refs/heads/branch_8x from Mikhail Khludnev
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1f6c06f ]

LUCENE-9028: fix test compile


> Public method for MultiTermIntervalSource
> -
>
> Key: LUCENE-9028
> URL: https://issues.apache.org/jira/browse/LUCENE-9028
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Attachments: LUCENE-9028.patch
>
>
> Right now we have prefix and wildcard for multiterm Intervals. Sometimes it's 
> necessary to provide the set of terms in a more generic way, as an automaton.
> {code:java}
> Intervals.multiterm(CompiledAutomaton automaton, int maxExpansions, String 
> pattern)
> {code}
> As a benefit, we can handle it more efficiently than an OR over terms. 
> What do you think? 






[jira] [Commented] (SOLR-13841) Add jackson databind annotations to SolrJ classpath

2019-11-01 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965150#comment-16965150
 ] 

Ishan Chattopadhyaya commented on SOLR-13841:
-

bq. Reverting this also means I'll have to move all the classes I have created 
as part of the Package APIs to core. 
Ouch. We had better use JIRA linking (depends on / blocked by) to bring out 
these dependencies, especially if all the issues in question are under 
development at the same time.

> Add jackson databind annotations to SolrJ classpath
> ---
>
> Key: SOLR-13841
> URL: https://issues.apache.org/jira/browse/SOLR-13841
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We can start using annotations in SolrJ to minimize the amount of code we 
> write & improve readability. Jackson is a widely used library and everyone is 
> already familiar with it.






[jira] [Commented] (SOLR-13841) Add jackson databind annotations to SolrJ classpath

2019-11-01 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965149#comment-16965149
 ] 

Noble Paul commented on SOLR-13841:
---

My plan was to implement Jackson's {{AnnotationIntrospector}} and change the 
JIRA description.

Reverting this also means I'll have to move all the classes I have created as 
part of the Package APIs to core. 

> Add jackson databind annotations to SolrJ classpath
> ---
>
> Key: SOLR-13841
> URL: https://issues.apache.org/jira/browse/SOLR-13841
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We can start using annotations in SolrJ to minimize the amount of code we 
> write & improve readability. Jackson is a widely used library and everyone is 
> already familiar with it.






[jira] [Commented] (SOLR-12993) Split the state.json into 2. a small frequently modified data + a large unmodified data

2019-11-01 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965147#comment-16965147
 ] 

Noble Paul commented on SOLR-12993:
---

[~dsmiley] Irrespective of the number of collections/shards, it can benefit 
everyone. The key is to minimize the overall data sent and received, so that 
our overseer does only a fraction of the work it does today.
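
Roughly, the idea splits each collection's state into a large, rarely modified 
layout document and a small, frequently updated one (the field names below are 
purely illustrative; the authoritative layout is in the linked design doc):

{noformat}
# large, mostly static: shard/replica layout, rewritten only on
# collection API changes
{"shards": {"shard1": {"replicas": {
  "core_node1": {"core": "films_shard1_replica_n1",
                 "node_name": "node1:8983_solr"}}}}}

# small, frequently modified: only the volatile per-replica bits
{"core_node1": {"state": "active", "leader": "true"}}
{noformat}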

> Split the state.json into 2. a small frequently modified data + a large 
> unmodified data
> ---
>
> Key: SOLR-12993
> URL: https://issues.apache.org/jira/browse/SOLR-12993
> Project: Solr
>  Issue Type: Improvement
>Reporter: Noble Paul
>Priority: Major
>
> The new design is posted here
> https://docs.google.com/document/d/1AZyiRA_bRhAWkUM1Nj5kg__xpPM9Fd_iwHhYazG38xI/edit#






[jira] [Commented] (SOLR-13207) AIOOBE in calculateMinShouldMatch

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965141#comment-16965141
 ] 

ASF subversion and git services commented on SOLR-13207:


Commit 332f1d7741cf33f5aa0fc41ed02a1f62f1ea8830 in lucene-solr's branch 
refs/heads/branch_8x from Tomas Eduardo Fernandez Lobbe
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=332f1d7 ]

SOLR-13207: Fix tests


> AIOOBE in calculateMinShouldMatch
> -
>
> Key: SOLR-13207
> URL: https://issues.apache.org/jira/browse/SOLR-13207
> Project: Solr
>  Issue Type: Bug
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> * Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection and reproducing the bug
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html].
> {noformat}
> mkdir -p /tmp/home
> echo '<solr></solr>' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> curl -v “URL_BUG”
> {noformat}
> Please check the issue description below to find the “URL_BUG” that will 
> allow you to reproduce the issue reported.
>Reporter: Johannes Kloos
>Priority: Major
>  Labels: diffblue, newdev
> Fix For: master (9.0), 8.4
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?mm=%3C=edismax=fq=field(id,1)
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.solr.util.SolrPluginUtils.calculateMinShouldMatch(SolrPluginUtils.java:683)
> at 
> org.apache.solr.util.SolrPluginUtils.setMinShouldMatch(SolrPluginUtils.java:641)
> at 
> org.apache.solr.util.SolrPluginUtils.setMinShouldMatch(SolrPluginUtils.java:660)
> at 
> org.apache.solr.search.ExtendedDismaxQParser.parseOriginalQuery(ExtendedDismaxQParser.java:415)
> at 
> org.apache.solr.search.ExtendedDismaxQParser.parse(ExtendedDismaxQParser.java:173)
> at org.apache.solr.search.QParser.getQuery(QParser.java:173)
> at 
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:158)
> at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:272)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2559)
> {noformat}
> The mm parameter is given as ‘<’. It is (after some string mangling) split 
> into sub-strings separated by ‘<’, putatively giving the left-hand and 
> right-hand argument of the operator. In the example, there are no such 
> arguments, so the resulting array “parts” is empty (cf. String.split 
> documentation). But we immediately try to access parts[0], leading to an 
> AIOOBE.
> To set up an environment to reproduce this bug, follow the description in the 
> ‘Environment’ field.
> We automatically found this issue and ~70 more like this using [Diffblue 
> Microservices Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. 
> Find more information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].
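
The split behavior described above is easy to confirm in isolation; a minimal 
sketch (the guard at the end is illustrative, not the committed fix):

{code:java}
public class MmSplitDemo {
  public static void main(String[] args) {
    // String.split discards trailing empty strings, so splitting the bare
    // operator "<" on "<" yields a zero-length array...
    String[] parts = "<".split("<");
    System.out.println(parts.length); // prints 0

    // ...and parts[0] would then throw ArrayIndexOutOfBoundsException,
    // which is exactly what calculateMinShouldMatch ran into.

    // Illustrative guard: reject an operator with no arguments instead of
    // indexing into an empty array.
    if (parts.length < 2) {
      System.out.println("mm clause '<' has no arguments; rejecting");
    }
  }
}
{code}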






[jira] [Commented] (SOLR-13207) AIOOBE in calculateMinShouldMatch

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965137#comment-16965137
 ] 

ASF subversion and git services commented on SOLR-13207:


Commit 5c6a299eff9950d7a2640c8efd2496795ad7156d in lucene-solr's branch 
refs/heads/master from Tomas Eduardo Fernandez Lobbe
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5c6a299 ]

SOLR-13207: Fix tests


> AIOOBE in calculateMinShouldMatch
> -
>
> Key: SOLR-13207
> URL: https://issues.apache.org/jira/browse/SOLR-13207
> Project: Solr
>  Issue Type: Bug
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> * Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection and reproducing the bug
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html].
> {noformat}
> mkdir -p /tmp/home
> echo '<solr></solr>' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> curl -v “URL_BUG”
> {noformat}
> Please check the issue description below to find the “URL_BUG” that will 
> allow you to reproduce the issue reported.
>Reporter: Johannes Kloos
>Priority: Major
>  Labels: diffblue, newdev
> Fix For: master (9.0), 8.4
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?mm=%3C=edismax=fq=field(id,1)
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.solr.util.SolrPluginUtils.calculateMinShouldMatch(SolrPluginUtils.java:683)
> at 
> org.apache.solr.util.SolrPluginUtils.setMinShouldMatch(SolrPluginUtils.java:641)
> at 
> org.apache.solr.util.SolrPluginUtils.setMinShouldMatch(SolrPluginUtils.java:660)
> at 
> org.apache.solr.search.ExtendedDismaxQParser.parseOriginalQuery(ExtendedDismaxQParser.java:415)
> at 
> org.apache.solr.search.ExtendedDismaxQParser.parse(ExtendedDismaxQParser.java:173)
> at org.apache.solr.search.QParser.getQuery(QParser.java:173)
> at 
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:158)
> at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:272)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2559)
> {noformat}
> The mm parameter is given as ‘<’. It is (after some string mangling) split 
> into sub-strings separated by ‘<’, putatively giving the left-hand and 
> right-hand argument of the operator. In the example, there are no such 
> arguments, so the resulting array “parts” is empty (cf. String.split 
> documentation). But we immediately try to access parts[0], leading to an 
> AIOOBE.
> To set up an environment to reproduce this bug, follow the description in the 
> ‘Environment’ field.
> We automatically found this issue and ~70 more like this using [Diffblue 
> Microservices Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. 
> Find more information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].






[jira] [Commented] (LUCENE-9028) Public method for MultiTermIntervalSource

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965121#comment-16965121
 ] 

ASF subversion and git services commented on LUCENE-9028:
-

Commit d49b36ec9c1c51f900c8d89cb51fb70eba026ba0 in lucene-solr's branch 
refs/heads/branch_8x from Mikhail Khludnev
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d49b36e ]

LUCENE-9028: Introduce Intervals.multiterm()


> Public method for MultiTermIntervalSource
> -
>
> Key: LUCENE-9028
> URL: https://issues.apache.org/jira/browse/LUCENE-9028
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Attachments: LUCENE-9028.patch
>
>
> Right now we have prefix and wildcard for multiterm Intervals. Sometimes it's 
> necessary to provide the set of terms in a more generic way, as an automaton.
> {code:java}
> Intervals.multiterm(CompiledAutomaton automaton, int maxExpansions, String 
> pattern)
> {code}
> As a benefit, we can handle it more efficiently than an OR over terms. 
> What do you think? 
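
A hedged usage sketch of the new method, going only by the signature quoted 
above (package locations differ across versions; in the 8.x queries module the 
interval classes lived under org.apache.lucene.search.intervals):

{code:java}
import org.apache.lucene.search.intervals.Intervals;
import org.apache.lucene.search.intervals.IntervalsSource;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CompiledAutomaton;
import org.apache.lucene.util.automaton.RegExp;

public class MultitermIntervalsSketch {
  public static void main(String[] args) {
    // Any automaton will do; a regexp is just a convenient way to build one.
    Automaton a = new RegExp("foo.*").toAutomaton();
    CompiledAutomaton automaton = new CompiledAutomaton(a);
    // 128 caps the term expansions; "foo.*" here is only a label used for toString/error messages.
    IntervalsSource source = Intervals.multiterm(automaton, 128, "foo.*");
    System.out.println(source);
  }
}
{code}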



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9028) Public method for MultiTermIntervalSource

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965120#comment-16965120
 ] 

ASF subversion and git services commented on LUCENE-9028:
-

Commit 3cf131de527741e903dd3f7d687e7cef7fef9d6d in lucene-solr's branch 
refs/heads/master from Mikhail Khludnev
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3cf131d ]

LUCENE-9028: Introduce Intervals.multiterm()


> Public method for MultiTermIntervalSource
> -
>
> Key: LUCENE-9028
> URL: https://issues.apache.org/jira/browse/LUCENE-9028
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Attachments: LUCENE-9028.patch
>
>
> Right now we have prefix and wildcard for multiterm Intervals. Sometimes it's 
> necessary to provide the set of terms in a more generic way, as an automaton.
> {code:java}
> Intervals.multiterm(CompiledAutomaton automaton, int maxExpansions, String 
> pattern)
> {code}
> As a benefit, we can handle it more efficiently than an OR over terms. 
> What do you think? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Reopened] (SOLR-13207) AIOOBE in calculateMinShouldMatch

2019-11-01 Thread Tomas Eduardo Fernandez Lobbe (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Eduardo Fernandez Lobbe reopened SOLR-13207:
--

There are Jenkins failures. I'm taking a look

> AIOOBE in calculateMinShouldMatch
> -
>
> Key: SOLR-13207
> URL: https://issues.apache.org/jira/browse/SOLR-13207
> Project: Solr
>  Issue Type: Bug
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> * Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection and reproducing the bug
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html].
> {noformat}
> mkdir -p /tmp/home
> echo '<solr></solr>' > /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> curl -v “URL_BUG”
> {noformat}
> Please check the issue description below to find the “URL_BUG” that will 
> allow you to reproduce the issue reported.
>Reporter: Johannes Kloos
>Priority: Major
>  Labels: diffblue, newdev
> Fix For: master (9.0), 8.4
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?mm=%3C=edismax=fq=field(id,1)
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.solr.util.SolrPluginUtils.calculateMinShouldMatch(SolrPluginUtils.java:683)
> at 
> org.apache.solr.util.SolrPluginUtils.setMinShouldMatch(SolrPluginUtils.java:641)
> at 
> org.apache.solr.util.SolrPluginUtils.setMinShouldMatch(SolrPluginUtils.java:660)
> at 
> org.apache.solr.search.ExtendedDismaxQParser.parseOriginalQuery(ExtendedDismaxQParser.java:415)
> at 
> org.apache.solr.search.ExtendedDismaxQParser.parse(ExtendedDismaxQParser.java:173)
> at org.apache.solr.search.QParser.getQuery(QParser.java:173)
> at 
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:158)
> at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:272)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2559)
> {noformat}
> The mm parameter is given as ‘<’. It is (after some string mangling) split 
> into sub-strings separated by ‘<’, putatively giving the left-hand and 
> right-hand arguments of the operator. In the example, there are no such 
> arguments, so the resulting array “parts” is empty (cf. the String.split 
> documentation). But we immediately try to access parts[0], leading to an 
> AIOOBE.
> To set up an environment to reproduce this bug, follow the description in the 
> ‘Environment’ field.
> We automatically found this issue and ~70 more like this using [Diffblue 
> Microservices Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. 
> Find more information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13887) socketTimeout of 0 causing timeouts in the Http2SolrClient

2019-11-01 Thread Houston Putman (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Houston Putman updated SOLR-13887:
--
Priority: Minor  (was: Major)

> socketTimeout of 0 causing timeouts in the Http2SolrClient
> --
>
> Key: SOLR-13887
> URL: https://issues.apache.org/jira/browse/SOLR-13887
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public (Default Security Level. Issues are Public) 
>  Components: http2
>Affects Versions: master (9.0), 8.4
>Reporter: Houston Putman
>Priority: Minor
> Fix For: master (9.0), 8.4
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In Solr 7 and previous versions, both the *socketTimeout* and 
> *connTimeout* defaults in _solr.xml_ have accepted 0 as values. This is even 
> [documented in the ref 
> guide|https://lucene.apache.org/solr/guide/8_2/format-of-solr-xml.html#defining-solr-xml].
>  Using these same defaults with Solr 8 results in timeouts when trying to 
> manually create replicas. The major change here seems to be that the 
> Http2SolrClient is being used instead of the HttpSolrClient used in Solr 7 
> and previous versions.
> After some digging, I think that the issue lies in the Http2SolrClient, 
> [specifically 
> here|https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L399].
>  Since the idleTimeout is set to 0, which is what Solr pulls from 
> solr.xml, the listener immediately responds with a timeout.
> The fix here is pretty simple: just set a default if 0 is provided. Basically 
> treat an idleTimeout (or socketTimeout) of 0 the same as null. The ref-guide 
> should also likely be updated with the same defaults as used in the solr.xml 
> packaged in Solr.
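
A minimal sketch of the "treat 0 as default" idea described above (the default 
value below is a hypothetical placeholder, and this is not the actual 
Http2SolrClient code):

{code:java}
public final class IdleTimeoutDefaults {
  // Hypothetical default; the real value should match the solr.xml shipped with Solr.
  static final long DEFAULT_IDLE_TIMEOUT_MILLIS = 120_000;

  static long effectiveIdleTimeout(long configured) {
    // 0 historically meant "no explicit timeout"; passing it straight to the
    // HTTP/2 client makes the idle listener fire immediately, so map it to the default.
    return configured <= 0 ? DEFAULT_IDLE_TIMEOUT_MILLIS : configured;
  }

  public static void main(String[] args) {
    System.out.println(effectiveIdleTimeout(0));      // 120000 (default applied)
    System.out.println(effectiveIdleTimeout(30_000)); // 30000 (explicit value kept)
  }
}
{code}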



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341695837
 
 

 ##
 File path: 
lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java
 ##
 @@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.icu;
+
+
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.BaseTokenStreamTestCase;
+import org.apache.lucene.analysis.MockTokenizer;
+import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.core.KeywordTokenizer;
+
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.UnicodeSet;
+
+
+/**
+ * Test the ICUTransformCharFilter with some basic examples.
+ */
+public class TestICUTransformCharFilter extends BaseTokenStreamTestCase {
+
+  public void testBasicFunctionality() throws Exception {
+    checkToken(Transliterator.getInstance("Traditional-Simplified"),
+        "簡化字", "简化字");
+    checkToken(Transliterator.getInstance("Katakana-Hiragana"),
+        "ヒラガナ", "ひらがな");
+    checkToken(Transliterator.getInstance("Fullwidth-Halfwidth"),
+        "アルアノリウ", "アルアノリウ");
+    checkToken(Transliterator.getInstance("Any-Latin"),
+        "Αλφαβητικός Κατάλογος", "Alphabētikós Katálogos");
+    checkToken(Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove"),
+        "Alphabētikós Katálogos", "Alphabetikos Katalogos");
+    checkToken(Transliterator.getInstance("Han-Latin"),
+        "中国", "zhōng guó");
+  }
+
+  public void testRollbackBuffer() throws Exception {
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"),
+        "я", "â"); // final NFC transform applied
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 0,
+        "я", "a\u0302a\u0302a\u0302a\u0302a\u0302"); // final NFC transform never applied
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 2,
+        "я", "ââa\u0302a\u0302a\u0302");
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 4,
+        "яя", "ââa\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302");
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 8,
+        "", "ââa\u0302a\u0302a\u0302âââ");
+  }
+
+  public void testCustomFunctionality() throws Exception {
+    String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+    checkToken(Transliterator.createFromRules("test", rules, Transliterator.FORWARD), "abacadaba", "bcbcbdbcb");
+  }
+
+  public void testCustomFunctionality2() throws Exception {
+    String rules = "c { a > b; a > d;"; // after a 'c', convert a's to b's; otherwise to d's
+    checkToken(Transliterator.createFromRules("test", rules, Transliterator.FORWARD), "caa", "cbd");
+  }
+
+  public void testOptimizer() throws Exception {
+    String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+    Transliterator custom = Transliterator.createFromRules("test", rules, Transliterator.FORWARD);
+    assertTrue(custom.getFilter() == null);
+    new ICUTransformCharFilter(new StringReader(""), custom);
+    assertTrue(custom.getFilter().equals(new UnicodeSet("[ab]")));
+  }
+
+  public void testOptimizer2() throws Exception {
+    checkToken(Transliterator.getInstance("Traditional-Simplified; CaseFold"),
+        "ABCDE", "abcde");
+  }
+
+  public void testOptimizerSurrogate() throws Exception {
+    String rules = "\\U00020087 > x;"; // convert CJK UNIFIED IDEOGRAPH-20087 to an x
+    Transliterator custom = Transliterator.createFromRules("test", rules, Transliterator.FORWARD);
+    assertTrue(custom.getFilter() == null);
+    new ICUTransformCharFilter(new StringReader(""), custom);
+    assertTrue(custom.getFilter().equals(new UnicodeSet("[\\U00020087]")));
+  }
+
+  private void checkToken(Transliterator transform, String input, String expected) throws IOException {
+    checkToken(transform, ICUTransformCharFilter.DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY, input, 
[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341740602
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * 
+ * 
+ * Some useful transformations for search are built-in:
+ * 
+ * Conversion from Traditional to Simplified Chinese characters
+ * Conversion from Hiragana to Katakana
+ * Conversion from Fullwidth to Halfwidth forms.
+ * Script conversions, for example Serbian Cyrillic to Latin
+ * 
+ * 
+ * Example usage: stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));
+ * 
+ * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+    this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate
+   * transliteration, at the cost of 
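
For what it's worth, a hedged sketch of the capacity clamp the javadoc above 
describes (assumed behaviour, not the actual constructor code):

{code:java}
// Resolve a capacity hint to "the greatest power of 2 (excluding '1') less
// than or equal to the specified value", as the javadoc puts it.
static int enforcedRollbackCapacity(int maxRollbackBufferCapacityHint) {
  if (maxRollbackBufferCapacityHint < 0) {
    throw new IllegalArgumentException("negative capacity hint");
  }
  // Integer.highestOneBit(n) is the greatest power of 2 <= n (and 0 for n == 0),
  // so hints of 0 or 1 resolve to 0, which disables rollback.
  int limit = Integer.highestOneBit(maxRollbackBufferCapacityHint);
  return limit <= 1 ? 0 : limit;
}
{code}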

[GitHub] [lucene-solr] HoustonPutman opened a new pull request #991: SOLR-13887: Use default instead of idleTimeouts of 0 for HTTP2 requests

2019-11-01 Thread GitBox
HoustonPutman opened a new pull request #991: SOLR-13887: Use default instead 
of idleTimeouts of 0 for HTTP2 requests
URL: https://github.com/apache/lucene-solr/pull/991
 
 
   # Description
   
    In Solr 7 and previous versions, both the `socketTimeout` and 
`connTimeout` defaults in `solr.xml` have accepted 0 as values. This is even 
[documented in the ref 
guide](https://lucene.apache.org/solr/guide/8_2/format-of-solr-xml.html#defining-solr-xml).
 Using these same defaults with Solr 8 results in timeouts when trying to 
manually create replicas. The major change here seems to be that the 
Http2SolrClient is being used instead of the HttpSolrClient used in Solr 7 and 
previous versions.
   
   After some digging, I think that the issue lies in the Http2SolrClient, 
specifically in how it enforces idleTimeouts. Since the idleTimeout is set to 
0, which is what Solr pulls from solr.xml, the listener immediately 
responds with a timeout.
   
   
   # Solution
   
   Set a default idleTimeout if 0 is provided. Basically treat an idleTimeout 
(or socketTimeout) of 0 the same as null. The ref-guide should also likely be 
updated with the same defaults as used in the solr.xml packaged in Solr.
   
   # Tests
   
   Added a test to make sure that the use of an idleTimeout of 0 does not 
immediately timeout requests, and that a default is used.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I am authorized to contribute this code to the ASF and have removed 
any code I do not have a license to distribute.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-13887) socketTimeout of 0 causing timeouts in the Http2SolrClient

2019-11-01 Thread Houston Putman (Jira)
Houston Putman created SOLR-13887:
-

 Summary: socketTimeout of 0 causing timeouts in the Http2SolrClient
 Key: SOLR-13887
 URL: https://issues.apache.org/jira/browse/SOLR-13887
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: http2
Affects Versions: master (9.0), 8.4
Reporter: Houston Putman
 Fix For: master (9.0), 8.4


In Solr 7 and previous versions, both the *socketTimeout* and 
*connTimeout* defaults in _solr.xml_ have accepted 0 as values. This is even 
[documented in the ref 
guide|https://lucene.apache.org/solr/guide/8_2/format-of-solr-xml.html#defining-solr-xml].
 Using these same defaults with Solr 8 results in timeouts when trying to 
manually create replicas. The major change here seems to be that the 
Http2SolrClient is being used instead of the HttpSolrClient used in Solr 7 and 
previous versions.

After some digging, I think that the issue lies in the Http2SolrClient, 
[specifically 
here|https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L399].
 Since the idleTimeout is set to 0, which is what Solr pulls from 
solr.xml, the listener immediately responds with a timeout.

The fix here is pretty simple: just set a default if 0 is provided. Basically 
treat an idleTimeout (or socketTimeout) of 0 the same as null. The ref-guide 
should also likely be updated with the same defaults as used in the solr.xml 
packaged in Solr.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8996) maxScore is sometimes missing from distributed grouped responses

2019-11-01 Thread Christine Poerschke (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965060#comment-16965060
 ] 

Christine Poerschke commented on LUCENE-8996:
-

Attached the LUCENE-8996.04.patch file; it simply combines the Oct 22nd 
LUCENE-8996.patch and the Oct 23rd LUCENE-8996.02.patch files. Thoughts?

> maxScore is sometimes missing from distributed grouped responses
> 
>
> Key: LUCENE-8996
> URL: https://issues.apache.org/jira/browse/LUCENE-8996
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.3
>Reporter: Julien Massenet
>Assignee: Christine Poerschke
>Priority: Minor
> Attachments: LUCENE-8996.02.patch, LUCENE-8996.03.patch, 
> LUCENE-8996.04.patch, LUCENE-8996.patch, LUCENE-8996.patch, 
> lucene_6_5-GroupingMaxScore.patch, lucene_solr_5_3-GroupingMaxScore.patch, 
> master-GroupingMaxScore.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue occurs when using the grouping feature in distributed mode and 
> sorting by score.
> Each group's {{docList}} in the response is supposed to contain a 
> {{maxScore}} entry that holds the maximum score for that group. Using the 
> current releases, it sometimes happens that this piece of information is not 
> included:
> {code}
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 42,
> "params": {
>   "sort": "score desc",
>   "fl": "id,score",
>   "q": "_text_:\"72\"",
>   "group.limit": "2",
>   "group.field": "group2",
>   "group.sort": "score desc",
>   "group": "true",
>   "wt": "json",
>   "fq": "group2:72 OR group2:45"
> }
>   },
>   "grouped": {
> "group2": {
>   "matches": 567,
>   "groups": [
> {
>   "groupValue": 72,
>   "doclist": {
> "numFound": 562,
> "start": 0,
> "maxScore": 2.0378063,
> "docs": [
>   {
> "id": "29!26551",
> "score": 2.0378063
>   },
>   {
> "id": "78!11462",
> "score": 2.0298104
>   }
> ]
>   }
> },
> {
>   "groupValue": 45,
>   "doclist": {
> "numFound": 5,
> "start": 0,
> "docs": [
>   {
> "id": "72!8569",
> "score": 1.8988966
>   },
>   {
> "id": "72!14075",
> "score": 1.5191172
>   }
> ]
>   }
> }
>   ]
> }
>   }
> }
> {code}
> Looking into the issue, it comes from the fact that if a shard does not 
> contain a document from that group, trying to merge its {{maxScore}} with 
> real {{maxScore}} entries from other shards is invalid (it results in NaN).
> I'm attaching a patch containing a fix.
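
The NaN propagation is easy to demonstrate: a single shard that lacks the group 
contributes Float.NaN, and Math.max then turns the whole merge into NaN.

{code:java}
public class NaNMergeDemo {
  public static void main(String[] args) {
    float merged = Float.MIN_VALUE; // how TopGroups.merge initialises maxScore
    float[] shardMaxScores = {2.0378063f, Float.NaN, 1.8988966f}; // middle shard lacks the group
    for (float s : shardMaxScores) {
      merged = Math.max(merged, s); // Math.max(x, NaN) == NaN, so NaN sticks
    }
    System.out.println(merged); // prints NaN; the maxScore entry then can't be written
  }
}
{code}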



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8996) maxScore is sometimes missing from distributed grouped responses

2019-11-01 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke updated LUCENE-8996:

Status: Patch Available  (was: Open)

> maxScore is sometimes missing from distributed grouped responses
> 
>
> Key: LUCENE-8996
> URL: https://issues.apache.org/jira/browse/LUCENE-8996
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.3
>Reporter: Julien Massenet
>Assignee: Christine Poerschke
>Priority: Minor
> Attachments: LUCENE-8996.02.patch, LUCENE-8996.03.patch, 
> LUCENE-8996.04.patch, LUCENE-8996.patch, LUCENE-8996.patch, 
> lucene_6_5-GroupingMaxScore.patch, lucene_solr_5_3-GroupingMaxScore.patch, 
> master-GroupingMaxScore.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue occurs when using the grouping feature in distributed mode and 
> sorting by score.
> Each group's {{docList}} in the response is supposed to contain a 
> {{maxScore}} entry that holds the maximum score for that group. Using the 
> current releases, it sometimes happens that this piece of information is not 
> included:
> {code}
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 42,
> "params": {
>   "sort": "score desc",
>   "fl": "id,score",
>   "q": "_text_:\"72\"",
>   "group.limit": "2",
>   "group.field": "group2",
>   "group.sort": "score desc",
>   "group": "true",
>   "wt": "json",
>   "fq": "group2:72 OR group2:45"
> }
>   },
>   "grouped": {
> "group2": {
>   "matches": 567,
>   "groups": [
> {
>   "groupValue": 72,
>   "doclist": {
> "numFound": 562,
> "start": 0,
> "maxScore": 2.0378063,
> "docs": [
>   {
> "id": "29!26551",
> "score": 2.0378063
>   },
>   {
> "id": "78!11462",
> "score": 2.0298104
>   }
> ]
>   }
> },
> {
>   "groupValue": 45,
>   "doclist": {
> "numFound": 5,
> "start": 0,
> "docs": [
>   {
> "id": "72!8569",
> "score": 1.8988966
>   },
>   {
> "id": "72!14075",
> "score": 1.5191172
>   }
> ]
>   }
> }
>   ]
> }
>   }
> }
> {code}
> Looking into the issue, it comes from the fact that if a shard does not 
> contain a document from that group, trying to merge its {{maxScore}} with 
> real {{maxScore}} entries from other shards is invalid (it results in NaN).
> I'm attaching a patch containing a fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8996) maxScore is sometimes missing from distributed grouped responses

2019-11-01 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke updated LUCENE-8996:

Attachment: LUCENE-8996.04.patch

> maxScore is sometimes missing from distributed grouped responses
> 
>
> Key: LUCENE-8996
> URL: https://issues.apache.org/jira/browse/LUCENE-8996
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.3
>Reporter: Julien Massenet
>Assignee: Christine Poerschke
>Priority: Minor
> Attachments: LUCENE-8996.02.patch, LUCENE-8996.03.patch, 
> LUCENE-8996.04.patch, LUCENE-8996.patch, LUCENE-8996.patch, 
> lucene_6_5-GroupingMaxScore.patch, lucene_solr_5_3-GroupingMaxScore.patch, 
> master-GroupingMaxScore.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue occurs when using the grouping feature in distributed mode and 
> sorting by score.
> Each group's {{docList}} in the response is supposed to contain a 
> {{maxScore}} entry that holds the maximum score for that group. Using the 
> current releases, it sometimes happens that this piece of information is not 
> included:
> {code}
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 42,
> "params": {
>   "sort": "score desc",
>   "fl": "id,score",
>   "q": "_text_:\"72\"",
>   "group.limit": "2",
>   "group.field": "group2",
>   "group.sort": "score desc",
>   "group": "true",
>   "wt": "json",
>   "fq": "group2:72 OR group2:45"
> }
>   },
>   "grouped": {
> "group2": {
>   "matches": 567,
>   "groups": [
> {
>   "groupValue": 72,
>   "doclist": {
> "numFound": 562,
> "start": 0,
> "maxScore": 2.0378063,
> "docs": [
>   {
> "id": "29!26551",
> "score": 2.0378063
>   },
>   {
> "id": "78!11462",
> "score": 2.0298104
>   }
> ]
>   }
> },
> {
>   "groupValue": 45,
>   "doclist": {
> "numFound": 5,
> "start": 0,
> "docs": [
>   {
> "id": "72!8569",
> "score": 1.8988966
>   },
>   {
> "id": "72!14075",
> "score": 1.5191172
>   }
> ]
>   }
> }
>   ]
> }
>   }
> }
> {code}
> Looking into the issue, it comes from the fact that if a shard does not 
> contain a document from that group, trying to merge its {{maxScore}} with 
> real {{maxScore}} entries from other shards is invalid (it results in NaN).
> I'm attaching a patch containing a fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8996) maxScore is sometimes missing from distributed grouped responses

2019-11-01 Thread Christine Poerschke (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965059#comment-16965059
 ] 

Christine Poerschke commented on LUCENE-8996:
-

Thanks [~munendrasn] and [~diegoceccarelli] for your input and insights!

Let me try to sum up things so far and what could be next.
 * The main objective of this bug fix ticket is to correctly return 
{{maxScore}} for groups that are _not_ represented on all shards.
 ** Currently {{maxScore}} is missing because {{TopGroups.merge}} does a 
{{maxScore = Math.max(maxScore, shardGroupDocs.maxScore);}} computation with 
assumptions.
 ** The computation assumes that {{shardGroupDocs.maxScore}} is always a number 
when in fact it can be {{NaN}} if that group is not represented on that shard.
 ** {{Math.max(a, b)}} returns {{NaN}} if either value is {{NaN}}
 ** code links:
 *** 
[https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.3.0/lucene/grouping/src/java/org/apache/lucene/search/grouping/TopGroups.java#L172]
 *** 
[https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html#max-float-float-]

 * We now know that {{shardGroupDocs.maxScore}} could also be absent if the 
group was represented on the shard but _scores were not requested_.
 ** In this scenario changing {{TopGroups.merge}} to ignore any {{NaN}} scores 
had a side effect as illustrated by the {{TestDistributedGrouping}} test 
failure.
 ** If all shards return a {{maxScore}} of {{NaN}} then the {{TopGroups.merge}} 
computation (with {{nonNANmax}} instead of {{Math.max}}) will return as the 
overall result whatever the initial value of the local {{maxScore}} variable 
was.
 ** The local {{maxScore}} variable is currently initialised to 
{{Float.MIN_VALUE}}.
 ** If scores were not requested then they should not be returned in the 
response (but as per SOLR-6612 they currently are returned in distributed mode 
but not in non-distributed mode).

 * The {{TestDistributedGrouping}} logic could be adjusted to skip the 
{{maxScore}} comparison (until SOLR-6612 is fixed), or the 
{{GroupedEndResultTransformer}} response composition logic could be adjusted to 
screen out {{Float.MIN_VALUE}} scores. Irrespective of that, 
{{TopGroups.merge}} returning a {{MIN_VALUE}} score if given (say) three 
{{NaN}} scores seems surprising, and the {{NaN}} that is currently being 
returned seems more logical.

 * If we changed the initialisation of the local {{maxScore}} variable from 
{{MIN_VALUE}} to {{NaN}} then groups without scores would merge into an overall 
{{NaN}} i.e. no overall score.
 * Local initialisation changes pose a backwards compatibility question: are 
there cases where previously {{MIN_VALUE}} was returned and where now {{NaN}} 
would be returned instead?
 ** Code inspection shows that {{MIN_VALUE}} would be returned if the groups 
and shards for-loops do not run.
 *** 
[https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.3.0/lucene/grouping/src/java/org/apache/lucene/search/grouping/TopGroups.java#L138-L227]
 *** The shards for-loop will always run since the no-shards case already 
resulted in a {{return}} at L105.
 *** In the Apache Lucene/Solr code base there is only one non-test caller of 
the {{TopGroups.merge}} method: 
[https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.3.0/solr/core/src/java/org/apache/solr/search/grouping/distributed/responseprocessor/TopGroupsShardResponseProcessor.java#L153-L168]
 *** The groups for-loop will always run since no-groups results in a 
{{continue}} at L156.

 * Per the above the local initialisation changes would pose no backwards 
compatibility concerns within the Apache Lucene/Solr code base.
 * Considering any {{TopGroups.merge}} callers outside the Lucene/Solr code 
base, some {{lucene/CHANGES.txt}} wording like the following could be used to 
document the change in behaviour. _"TopGroups.merge now returns Float.NaN 
instead of Float.MIN_VALUE if no groups are passed as input."_
 * However, I would suggest that such wording would be unnecessary since 
merging empty group arrays seems unlikely and exotic?
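
For illustration, a hedged sketch of a NaN-tolerant merge helper along the 
lines of the {{nonNANmax}} mentioned above (not the actual patch):

{code:java}
final class MaxScoreMath {
  /** Ignore NaN contributions; NaN survives only if both inputs are NaN,
   *  which pairs naturally with initialising the running maxScore to Float.NaN. */
  static float nonNANmax(float a, float b) {
    if (Float.isNaN(a)) {
      return b;
    }
    if (Float.isNaN(b)) {
      return a;
    }
    return Math.max(a, b);
  }
}
{code}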

> maxScore is sometimes missing from distributed grouped responses
> 
>
> Key: LUCENE-8996
> URL: https://issues.apache.org/jira/browse/LUCENE-8996
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.3
>Reporter: Julien Massenet
>Assignee: Christine Poerschke
>Priority: Minor
> Attachments: LUCENE-8996.02.patch, LUCENE-8996.03.patch, 
> LUCENE-8996.patch, LUCENE-8996.patch, lucene_6_5-GroupingMaxScore.patch, 
> lucene_solr_5_3-GroupingMaxScore.patch, master-GroupingMaxScore.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue occurs when using the grouping feature in distributed mode and 
> sorting by score.
> Each 

[jira] [Updated] (LUCENE-9031) UnsupportedOperationException on highlighting Interval Query

2019-11-01 Thread Mikhail Khludnev (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-9031:
-
Attachment: LUCENE-9031.patch

> UnsupportedOperationException on highlighting Interval Query
> 
>
> Key: LUCENE-9031
> URL: https://issues.apache.org/jira/browse/LUCENE-9031
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/queries
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Fix For: 8.4
>
> Attachments: LUCENE-9031.patch, LUCENE-9031.patch, LUCENE-9031.patch, 
> LUCENE-9031.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When UnifiedHighlighter highlights Interval Query it encounters 
> UnsupportedOperationException. 
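
A hedged repro sketch based on the one-line description above (8.x-era package 
and class names assumed; adjust to your version). Before the fix, the highlight 
call reportedly threw UnsupportedOperationException:

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.intervals.IntervalQuery;
import org.apache.lucene.search.intervals.Intervals;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class IntervalHighlightRepro {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("body", "an interval query over some text", Store.YES));
      writer.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query q = new IntervalQuery("body",
          Intervals.ordered(Intervals.term("interval"), Intervals.term("query")));
      TopDocs top = searcher.search(q, 10);
      UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher, analyzer);
      String[] snippets = highlighter.highlight("body", q, top); // threw before the fix
      System.out.println(snippets[0]);
    }
  }
}
{code}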



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9031) UnsupportedOperationException on highlighting Interval Query

2019-11-01 Thread Mikhail Khludnev (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-9031:
-
Attachment: (was: LUCENE-9031.patch)

> UnsupportedOperationException on highlighting Interval Query
> 
>
> Key: LUCENE-9031
> URL: https://issues.apache.org/jira/browse/LUCENE-9031
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/queries
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Fix For: 8.4
>
> Attachments: LUCENE-9031.patch, LUCENE-9031.patch, LUCENE-9031.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When UnifiedHighlighter highlights Interval Query it encounters 
> UnsupportedOperationException. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341624565
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * 
+ * 
+ * Some useful transformations for search are built-in:
+ * 
+ * Conversion from Traditional to Simplified Chinese characters
+ * Conversion from Hiragana to Katakana
+ * Conversion from Fullwidth to Halfwidth forms.
+ * Script conversions, for example Serbian Cyrillic to Latin
+ * 
+ * 
+ * Example usage: stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));
+ * 
+ * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new 
StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = 
Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // 
must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link 
Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to 
which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial 
transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest 
power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a 
negative value. There is no
+   * power of 2 greater than 
Integer.highestOneBit(Integer.MAX_VALUE)), so to prevent overflow, 
values
+   * in this range will resolve to an enforced limit of 
Integer.highestOneBit(Integer.MAX_VALUE)).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can 
in some cases yield more accurate
+   * transliteration, at the cost of 

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341628865
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * 
+ * 
+ * Some useful transformations for search are built-in:
+ * 
+ * Conversion from Traditional to Simplified Chinese characters
+ * Conversion from Hiragana to Katakana
+ * Conversion from Fullwidth to Halfwidth forms.
+ * Script conversions, for example Serbian Cyrillic to Latin
+ * 
+ * 
+ * Example usage: stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));
+ * 
+ * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new 
StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = 
Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // 
must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link 
Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to 
which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial 
transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest 
power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a 
negative value. There is no
+   * power of 2 greater than 
Integer.highestOneBit(Integer.MAX_VALUE)), so to prevent overflow, 
values
+   * in this range will resolve to an enforced limit of 
Integer.highestOneBit(Integer.MAX_VALUE)).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can 
in some cases yield more accurate
+   * transliteration, at the cost of 

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341604330
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * <p>
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * </p>
+ * <p>
+ * Some useful transformations for search are built-in:
+ * <ul>
+ * <li>Conversion from Traditional to Simplified Chinese characters
+ * <li>Conversion from Hiragana to Katakana
+ * <li>Conversion from Fullwidth to Halfwidth forms.
+ * <li>Script conversions, for example Serbian Cyrillic to Latin
+ * </ul>
+ * </p>
+ * <p>
+ * Example usage: <blockquote>stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));</blockquote>
+ * </p>
+ * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+    this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * above that bound will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more
+   * accurate transliteration, at the cost of 
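
As an aside for digest readers: a minimal sketch of the capacity-normalization rule this javadoc paragraph describes, written as a hypothetical standalone helper rather than the patch's actual code:

{code:java}
// Hypothetical helper, not from the patch: rounds a capacity hint down to
// the greatest power of 2 (excluding 1), per the javadoc above.
static int normalizeRollbackCapacity(int hint) {
  if (hint < 0) {
    throw new IllegalArgumentException("negative capacity hint: " + hint);
  }
  if (hint <= 1) {
    return 0; // "0" (or "1", in practice) disables rollback
  }
  // Greatest power of 2 <= hint; this can never exceed
  // Integer.highestOneBit(Integer.MAX_VALUE), so overflow cannot occur.
  return Integer.highestOneBit(hint);
}
{code}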

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341603797
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341725715
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341294234
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[jira] [Commented] (SOLR-13844) Remove recovering shard term with corresponding core shard term.

2019-11-01 Thread Houston Putman (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965053#comment-16965053
 ] 

Houston Putman commented on SOLR-13844:
---

Thanks for getting this merged [~caomanhdat]! Are you planning on pushing this 
to 8.x as well?

> Remove recovering shard term with corresponding core shard term.
> 
>
> Key: SOLR-13844
> URL: https://issues.apache.org/jira/browse/SOLR-13844
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (9.0)
>Reporter: Houston Putman
>Priority: Minor
> Fix For: master (9.0), 8.4
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently if a recovering replica (solr 7.3+) is deleted, the term for that 
> core in the shard's terms in ZK is removed. However the {{_recovering}} 
> term is not removed as well. This can create clutter and confusion in the 
> shard terms ZK node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-10786) Add DBSCAN clustering Streaming Evaluator

2019-11-01 Thread Joel Bernstein (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-10786:
--
Attachment: Screen Shot 2019-11-01 at 12.47.07 PM.png

> Add DBSCAN clustering Streaming Evaluator
> -
>
> Key: SOLR-10786
> URL: https://issues.apache.org/jira/browse/SOLR-10786
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Fix For: 8.1, master (9.0)
>
> Attachments: SOLR-10786.patch, SOLR-10786.patch, SOLR-10786.patch, 
> SOLR-10786.patch, Screen Shot 2019-11-01 at 12.47.07 PM.png
>
>
> The DBSCAN clustering Stream Evaluator will cluster numeric vectors using the 
> DBSCAN clustering algorithm.
> Clustering implementation will be provided by Apache Commons Math.
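
As context for digest readers, a hedged illustration of the plain Apache Commons Math API the description refers to (this is ordinary commons-math3 usage, not the Solr evaluator itself; the eps and minPts values are arbitrary):

{code:java}
import java.util.Arrays;
import java.util.List;
import org.apache.commons.math3.ml.clustering.Cluster;
import org.apache.commons.math3.ml.clustering.DBSCANClusterer;
import org.apache.commons.math3.ml.clustering.DoublePoint;

public class DbscanDemo {
  public static void main(String[] args) {
    List<DoublePoint> points = Arrays.asList(
        new DoublePoint(new double[] {1.0, 1.1}),
        new DoublePoint(new double[] {1.2, 0.9}),
        new DoublePoint(new double[] {9.0, 9.1}));
    // eps = neighborhood radius, minPts = neighbors required for a core point
    DBSCANClusterer<DoublePoint> dbscan = new DBSCANClusterer<>(0.5, 1);
    for (Cluster<DoublePoint> cluster : dbscan.cluster(points)) {
      System.out.println(cluster.getPoints());
    }
  }
}
{code}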



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-10786) Add DBSCAN clustering Streaming Evaluator

2019-11-01 Thread Joel Bernstein (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-10786:
--
Attachment: SOLR-10786.patch

> Add DBSCAN clustering Streaming Evaluator
> -
>
> Key: SOLR-10786
> URL: https://issues.apache.org/jira/browse/SOLR-10786
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Fix For: 8.1, master (9.0)
>
> Attachments: SOLR-10786.patch, SOLR-10786.patch, SOLR-10786.patch, 
> SOLR-10786.patch
>
>
> The DBSCAN clustering Stream Evaluator will cluster numeric vectors using the 
> DBSCAN clustering algorithm.
> Clustering implementation will be provided by Apache Commons Math.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-10786) Add DBSCAN clustering Streaming Evaluator

2019-11-01 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965025#comment-16965025
 ] 

Joel Bernstein commented on SOLR-10786:
---

New implementation attached.

> Add DBSCAN clustering Streaming Evaluator
> -
>
> Key: SOLR-10786
> URL: https://issues.apache.org/jira/browse/SOLR-10786
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Fix For: 8.1, master (9.0)
>
> Attachments: SOLR-10786.patch, SOLR-10786.patch, SOLR-10786.patch, 
> SOLR-10786.patch
>
>
> The DBSCAN clustering Stream Evaluator will cluster numeric vectors using the 
> DBSCAN clustering algorithm.
> Clustering implementation will be provided by Apache Commons Math.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341696621
 
 

 ##
 File path: 
lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java
 ##
 @@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.icu;
+
+
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.BaseTokenStreamTestCase;
+import org.apache.lucene.analysis.MockTokenizer;
+import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.core.KeywordTokenizer;
+
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.UnicodeSet;
+
+
+/**
+ * Test the ICUTransformCharFilter with some basic examples.
+ */
+public class TestICUTransformCharFilter extends BaseTokenStreamTestCase {
+
+  public void testBasicFunctionality() throws Exception {
+    checkToken(Transliterator.getInstance("Traditional-Simplified"),
+        "簡化字", "简化字");
+    checkToken(Transliterator.getInstance("Katakana-Hiragana"),
+        "ヒラガナ", "ひらがな");
+    checkToken(Transliterator.getInstance("Fullwidth-Halfwidth"),
+        "アルアノリウ", "アルアノリウ");
+    checkToken(Transliterator.getInstance("Any-Latin"),
+        "Αλφαβητικός Κατάλογος", "Alphabētikós Katálogos");
+    checkToken(Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove"),
+        "Alphabētikós Katálogos", "Alphabetikos Katalogos");
+    checkToken(Transliterator.getInstance("Han-Latin"),
+        "中国", "zhōng guó");
+  }
+
+  public void testRollbackBuffer() throws Exception {
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"),
+        "я", "â"); // final NFC transform applied
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 0,
+        "я", "a\u0302a\u0302a\u0302a\u0302a\u0302"); // final NFC transform never applied
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 2,
+        "я", "ââa\u0302a\u0302a\u0302");
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 4,
+        "яя", "ââa\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302");
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 8,
+        "", "ââa\u0302a\u0302a\u0302âââ");
+  }
+
+  public void testCustomFunctionality() throws Exception {
+    String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+    checkToken(Transliterator.createFromRules("test", rules, Transliterator.FORWARD), "abacadaba", "bcbcbdbcb");
+  }
+
+  public void testCustomFunctionality2() throws Exception {
+    String rules = "c { a > b; a > d;"; // an 'a' after 'c' becomes 'b', otherwise 'd'
+    checkToken(Transliterator.createFromRules("test", rules, Transliterator.FORWARD), "caa", "cbd");
+  }
+
+  public void testOptimizer() throws Exception {
+    String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+    Transliterator custom = Transliterator.createFromRules("test", rules, Transliterator.FORWARD);
+    assertTrue(custom.getFilter() == null);
+    new ICUTransformCharFilter(new StringReader(""), custom);
+    assertTrue(custom.getFilter().equals(new UnicodeSet("[ab]")));
+  }
+
+  public void testOptimizer2() throws Exception {
+    checkToken(Transliterator.getInstance("Traditional-Simplified; CaseFold"),
+        "ABCDE", "abcde");
+  }
+
+  public void testOptimizerSurrogate() throws Exception {
+    String rules = "\\U00020087 > x;"; // convert CJK UNIFIED IDEOGRAPH-20087 to an x
+    Transliterator custom = Transliterator.createFromRules("test", rules, Transliterator.FORWARD);
+    assertTrue(custom.getFilter() == null);
+    new ICUTransformCharFilter(new StringReader(""), custom);
+    assertTrue(custom.getFilter().equals(new UnicodeSet("[\\U00020087]")));
+  }
+
+  private void checkToken(Transliterator transform, String input, String expected) throws IOException {
+    checkToken(transform, ICUTransformCharFilter.DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY, input, 
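
For readers skimming the digest, a minimal usage sketch derived from the class javadoc example quoted earlier; the package-private constructor is assumed reachable (same-package code), so treat this as illustrative rather than as the PR's test code:

{code:java}
import java.io.Reader;
import java.io.StringReader;
import com.ibm.icu.text.Transliterator;

// Sketch only: mirrors the javadoc example in the PR.
public class TransformCharFilterDemo {
  public static void main(String[] args) throws Exception {
    Reader input = new StringReader("簡化字");
    Reader filtered = new ICUTransformCharFilter(input,
        Transliterator.getInstance("Traditional-Simplified"));
    char[] buf = new char[64];
    int n = filtered.read(buf);
    System.out.println(new String(buf, 0, n)); // expected: 简化字
  }
}
{code}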

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341695837
 
 

 ##
 File path: 
lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java
 ##

[jira] [Resolved] (SOLR-13207) AIOOBE in calculateMinShouldMatch

2019-11-01 Thread Tomas Eduardo Fernandez Lobbe (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Eduardo Fernandez Lobbe resolved SOLR-13207.
--
Fix Version/s: 8.4
   master (9.0)
   Resolution: Fixed

> AIOOBE in calculateMinShouldMatch
> -
>
> Key: SOLR-13207
> URL: https://issues.apache.org/jira/browse/SOLR-13207
> Project: Solr
>  Issue Type: Bug
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> * Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection and reproducing the bug
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html].
> {noformat}
> mkdir -p /tmp/home
> echo '' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> curl -v “URL_BUG”
> {noformat}
> Please check the issue description below to find the “URL_BUG” that will 
> allow you to reproduce the issue reported.
>Reporter: Johannes Kloos
>Priority: Major
>  Labels: diffblue, newdev
> Fix For: master (9.0), 8.4
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?mm=%3C=edismax=fq=field(id,1)
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.solr.util.SolrPluginUtils.calculateMinShouldMatch(SolrPluginUtils.java:683)
> at 
> org.apache.solr.util.SolrPluginUtils.setMinShouldMatch(SolrPluginUtils.java:641)
> at 
> org.apache.solr.util.SolrPluginUtils.setMinShouldMatch(SolrPluginUtils.java:660)
> at 
> org.apache.solr.search.ExtendedDismaxQParser.parseOriginalQuery(ExtendedDismaxQParser.java:415)
> at 
> org.apache.solr.search.ExtendedDismaxQParser.parse(ExtendedDismaxQParser.java:173)
> at org.apache.solr.search.QParser.getQuery(QParser.java:173)
> at 
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:158)
> at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:272)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2559)
> {noformat}
> The mm parameter is given as ‘<’. It is (after some string mangling) split 
> into sub-strings separated by ‘<’, putatively giving the left-hand and 
> right-hand argument of the operator. In the example, there are no such 
> arguments, so the resulting array “parts” is empty (cf. String.split 
> documentation). But we immediately try to access parts[0], leading to an 
> AIOOBE.
> To set up an environment to reproduce this bug, follow the description in the 
> ‘Environment’ field.
> We automatically found this issue and ~70 more like this using [Diffblue 
> Microservices Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. 
> Find more information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13207) AIOOBE in calculateMinShouldMatch

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965018#comment-16965018
 ] 

ASF subversion and git services commented on SOLR-13207:


Commit 543d0b79aa9d028bb79485b14648a12beca9b8aa in lucene-solr's branch 
refs/heads/branch_8x from Chris Hennick
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=543d0b7 ]

SOLR-13207: Handle query errors in calculateMinShouldMatch (#978)

Traps error that arises when the < operator is used at the end of a query field.
Also handles NumberFormatException when the operand isn't a number.
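
A hedged sketch of the kind of guard the commit message describes; the actual change lives in SolrPluginUtils.calculateMinShouldMatch and may differ in detail (for instance, it may surface a query-parsing error rather than fall back silently):

{code:java}
// Illustrative only, not the committed SolrPluginUtils code.
static int minShouldMatchOperand(String spec, int defaultMin) {
  String[] parts = spec.split("<");
  if (parts.length < 2) {
    return defaultMin; // a bare or trailing '<' no longer triggers an AIOOBE
  }
  try {
    return Integer.parseInt(parts[1].trim());
  } catch (NumberFormatException e) {
    return defaultMin; // a non-numeric operand is tolerated as well
  }
}
{code}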

> AIOOBE in calculateMinShouldMatch
> -
>
> Key: SOLR-13207
> URL: https://issues.apache.org/jira/browse/SOLR-13207
> (full issue description quoted in the resolution notification above)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For 

[jira] [Commented] (SOLR-13841) Add jackson databind annotations to SolrJ classpath

2019-11-01 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965013#comment-16965013
 ] 

Ishan Chattopadhyaya commented on SOLR-13841:
-

bq. At this juncture, I think the commits should be reverted and this issue 
closed as Won't-Fix.  File a new issue.
+1


> Add jackson databind annotations to SolrJ classpath
> ---
>
> Key: SOLR-13841
> URL: https://issues.apache.org/jira/browse/SOLR-13841
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We can start using annotations in SolrJ to minimize the amount of code we 
> write & improve readability. Jackson is a widely used library and everyone is 
> already familiar with it
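
For illustration only, the style of Jackson databind annotation usage the description has in mind (the class and field names below are invented, not SolrJ API):

{code:java}
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

// Invented example type; not part of SolrJ.
public class CollectionStatus {
  @JsonProperty("name")
  public String name;

  @JsonProperty("numShards")
  public int numShards;
}

// Typical round-trip with jackson-databind:
// CollectionStatus s = new ObjectMapper()
//     .readValue("{\"name\":\"films\",\"numShards\":1}", CollectionStatus.class);
{code}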



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-01 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965008#comment-16965008
 ] 

Adrien Grand commented on LUCENE-8920:
--

Wouldn't the bitTable be the same as the bitTable of the current arc?

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Minor
> Attachments: TestTermsDictRamBytesUsed.java
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
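
To make the quoted idea concrete, a toy sketch of the label->id->offset indirection (emphatically not Lucene's FST code; all names are invented):

{code:java}
import java.util.Arrays;

// Toy model: map each arc label to a dense id via a sorted lookup, then
// index arc offsets by that id, so gaps between labels cost nothing.
final class DenseArcIndex {
  private final int[] sortedLabels; // all labels on a node, ascending
  private final long[] offsetsById; // one arc offset per dense id

  DenseArcIndex(int[] sortedLabels, long[] offsetsById) {
    this.sortedLabels = sortedLabels;
    this.offsetsById = offsetsById;
  }

  /** Returns the arc offset for {@code label}, or -1 if absent. */
  long arcOffset(int label) {
    int id = Arrays.binarySearch(sortedLabels, label); // label -> dense id
    return id >= 0 ? offsetsById[id] : -1;             // id -> arc offset
  }
}
{code}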



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13207) AIOOBE in calculateMinShouldMatch

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965006#comment-16965006
 ] 

ASF subversion and git services commented on SOLR-13207:


Commit b17d630e509adc9e62b07c599583558e715a869f in lucene-solr's branch 
refs/heads/master from Chris Hennick
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b17d630 ]

SOLR-13207: Handle query errors in calculateMinShouldMatch (#978)

Traps error that arises when the < operator is used at the end of a query field.
Also handles NumberFormatException when the operand isn't a number.

> AIOOBE in calculateMinShouldMatch
> -
>
> Key: SOLR-13207
> URL: https://issues.apache.org/jira/browse/SOLR-13207
> (full issue description quoted in the resolution notification above)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional 

[GitHub] [lucene-solr] tflobbe merged pull request #978: SOLR-13207: Handle query errors in calculateMinShouldMatch

2019-11-01 Thread GitBox
tflobbe merged pull request #978: SOLR-13207: Handle query errors in 
calculateMinShouldMatch
URL: https://github.com/apache/lucene-solr/pull/978
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-13886) SyncSliceTest started failing frequently

2019-11-01 Thread Tomas Eduardo Fernandez Lobbe (Jira)
Tomas Eduardo Fernandez Lobbe created SOLR-13886:


 Summary: SyncSliceTest started failing frequently
 Key: SOLR-13886
 URL: https://issues.apache.org/jira/browse/SOLR-13886
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Tests
Reporter: Tomas Eduardo Fernandez Lobbe


While I can see some failures of this test in the past, they weren't frequent 
and were usually things like port bindings (maybe SOLR-13871) or timeouts. I've 
started seeing this failure in Jenkins (and locally) frequently:
{noformat}
Build: https://jenkins.thetaphi.de/job/Lucene-Solr-master-MacOSX/5410/
Java: 64bit/jdk-13 -XX:-UseCompressedOops -XX:+UseParallelGC

2 tests failed.
FAILED:  org.apache.solr.cloud.SyncSliceTest.test

Error Message:
expected:<5> but was:<4>

Stack Trace:
java.lang.AssertionError: expected:<5> but was:<4>
at 
__randomizedtesting.SeedInfo.seed([F8E3B768E16E848D:70B788B24F92E975]:0)
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:631)
at org.apache.solr.cloud.SyncSliceTest.test(SyncSliceTest.java:150)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
at 
org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:1082)
at 
org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:1054)
at 
com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 

[jira] [Commented] (SOLR-13841) Add jackson databind annotations to SolrJ classpath

2019-11-01 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964998#comment-16964998
 ] 

David Smiley commented on SOLR-13841:
-

At this juncture, I think the commits should be reverted and this issue closed 
as Won't-Fix.  File a new issue.

> Add jackson databind annotations to SolrJ classpath
> ---
>
> Key: SOLR-13841
> URL: https://issues.apache.org/jira/browse/SOLR-13841
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We can start using annotations in SolrJ to minimize the amount of code we 
> write & improve readability. Jackson is a widely used library and everyone is 
> already familiar with it



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341672527
 
 

 File path: lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java
 @@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.icu;
+
+
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.BaseTokenStreamTestCase;
+import org.apache.lucene.analysis.MockTokenizer;
+import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.core.KeywordTokenizer;
+
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.UnicodeSet;
+
+
+/**
+ * Test the ICUTransformCharFilter with some basic examples.
+ */
+public class TestICUTransformCharFilter extends BaseTokenStreamTestCase {
+  
+  public void testBasicFunctionality() throws Exception {
+    checkToken(Transliterator.getInstance("Traditional-Simplified"),
+        "簡化字", "简化字");
+    checkToken(Transliterator.getInstance("Katakana-Hiragana"),
+        "ヒラガナ", "ひらがな");
+    checkToken(Transliterator.getInstance("Fullwidth-Halfwidth"),
+        "アルアノリウ", "アルアノリウ");
+    checkToken(Transliterator.getInstance("Any-Latin"),
+        "Αλφαβητικός Κατάλογος", "Alphabētikós Katálogos");
+    checkToken(Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove"),
+        "Alphabētikós Katálogos", "Alphabetikos Katalogos");
+    checkToken(Transliterator.getInstance("Han-Latin"),
+        "中国", "zhōng guó");
+  }
+
+  public void testRollbackBuffer() throws Exception {
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"),
+        "я", "â"); // final NFC transform applied
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 0,
+        "я", "a\u0302a\u0302a\u0302a\u0302a\u0302"); // final NFC transform never applied
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 2,
+        "я", "ââa\u0302a\u0302a\u0302");
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 4,
+        "яя", "ââa\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302");
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 8,
+        "", "ââa\u0302a\u0302a\u0302âââ");
+  }
+
+  public void testCustomFunctionality() throws Exception {
+    String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+    checkToken(Transliterator.createFromRules("test", rules, Transliterator.FORWARD), "abacadaba", "bcbcbdbcb");
+  }
+
+  public void testCustomFunctionality2() throws Exception {
+    String rules = "c { a > b; a > d;"; // convert a's to b's when preceded by c, otherwise to d's
+    checkToken(Transliterator.createFromRules("test", rules, Transliterator.FORWARD), "caa", "cbd");
+  }
+
+  public void testOptimizer() throws Exception {
+    String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+    Transliterator custom = Transliterator.createFromRules("test", rules, Transliterator.FORWARD);
+    assertTrue(custom.getFilter() == null);
+    new ICUTransformCharFilter(new StringReader(""), custom);
+    assertTrue(custom.getFilter().equals(new UnicodeSet("[ab]")));
+  }
+
+  public void testOptimizer2() throws Exception {
+    checkToken(Transliterator.getInstance("Traditional-Simplified; CaseFold"),
+        "ABCDE", "abcde");
+  }
+
+  public void testOptimizerSurrogate() throws Exception {
+    String rules = "\\U00020087 > x;"; // convert CJK UNIFIED IDEOGRAPH-20087 to an x
+    Transliterator custom = Transliterator.createFromRules("test", rules, Transliterator.FORWARD);
+    assertTrue(custom.getFilter() == null);
+    new ICUTransformCharFilter(new StringReader(""), custom);
+    assertTrue(custom.getFilter().equals(new UnicodeSet("[\\U00020087]")));
+  }
+
+  private void checkToken(Transliterator transform, String input, String expected) throws IOException {
+    checkToken(transform, ICUTransformCharFilter.DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY, input, 

[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964993#comment-16964993
 ] 

Bruno Roustant commented on LUCENE-8920:


{quote}Maybe we should update the naming with your proposed approach
{quote}
In “direct-addressing” I like the “direct” because the goal of this structure 
is to access an arc directly, in constant time (not linear or logarithmic). With 
this approach we still access the arcs based on their label within the range, 
just indirectly. “indexed” seems too general to me, maybe because I’m thinking 
of another use of the bitTable for fixed-array binary search. I’m ok keeping 
“direct addressing”, or another option, for example “range addressing”.

 
{quote}no longer explicitly serializing labels
{quote}
Initially I thought we could remove labels to save memory. Indeed, we can infer 
the label from the index in the range (this was already the case with the 
previous direct-addressing patch). But the label per arc still seems to be 
required to support FST.readNextArcLabel(), which peeks at the label of the next 
arc without changing the current arc (so in this case we don’t have the bitTable 
or label-range metadata of the next arc). This method is used by FSTEnum to 
seekFloor.
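
A minimal sketch of the indirect, range-based lookup being described (illustrative only; the class and field names are invented, and this is not the actual FST encoding):

{code:java}
import java.util.BitSet;

/**
 * Illustrative sketch of "range addressing": arcs are addressed by label
 * within [firstLabel, firstLabel + range); a presence bit per label says
 * whether that arc exists, and an arc's slot is the rank (count of set
 * bits) below its position. A real encoding would use word-level popcount
 * to keep the rank effectively constant-time.
 */
class RangeAddressedArcs {
  final int firstLabel;
  final BitSet present;    // bit i set => arc for label (firstLabel + i) exists
  final long[] arcOffsets; // offsets of the arcs that exist, in label order

  RangeAddressedArcs(int firstLabel, BitSet present, long[] arcOffsets) {
    this.firstLabel = firstLabel;
    this.present = present;
    this.arcOffsets = arcOffsets;
  }

  /** Returns the offset of the arc for {@code label}, or -1 if absent. */
  long lookup(int label) {
    int idx = label - firstLabel;
    if (idx < 0 || !present.get(idx)) {
      return -1;
    }
    int rank = 0; // number of arcs present before idx
    for (int i = present.nextSetBit(0); i >= 0 && i < idx; i = present.nextSetBit(i + 1)) {
      rank++;
    }
    return arcOffsets[rank];
  }
}
{code}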

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Minor
> Attachments: TestTermsDictRamBytesUsed.java
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341669927
 
 

 File path: lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java
 (same TestICUTransformCharFilter.java hunk as quoted in the previous review comment)

[GitHub] [lucene-solr] KoenDG commented on a change in pull request #990: [SOLR-13885] Typo corrections.

2019-11-01 Thread GitBox
KoenDG commented on a change in pull request #990: [SOLR-13885] Typo 
corrections.
URL: https://github.com/apache/lucene-solr/pull/990#discussion_r341652528
 
 

 File path: solr/solr-ref-guide/src/spatial-search.adoc
 @@ -415,7 +415,7 @@ The format, either `ints2D` (default) or `png`.
 
 [TIP]
 
-You'll experiment with different `distErrPct` values (probably 0.10 - 0.20) 
with various input geometries till the default size is what you're looking for. 
The specific details of how it's computed isn't important. For high-detail 
grids used in point-plotting (loosely one cell per pixel), set `distErr` to be 
the number of decimal-degrees of several pixels or so of the map being 
displayed. Also, you probably don't want to use a geohash-based grid because 
the cell orientation between grid levels flip-flops between being square and 
rectangle. Quad is consistent and has more levels, albeit at the expense of a 
larger index.
+You'll experiment with different `distErrPct` values (probably 0.10 - 0.20) 
with various input geometries till the default size is what you're looking for. 
The specific details of how it's computed aren't important. For high-detail 
grids used in point-plotting (loosely one cell per pixel), set `distErr` to be 
the number of decimal-degrees of several pixels or so of the map being 
displayed. Also, you probably don't want to use a geohash-based grid because 
the cell orientation between grid levels flip-flops between being square and 
rectangle. Quad is consistent and has more levels, albeit at the expense of a 
larger index.
 
 Review comment:
   isn't -> aren't


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] KoenDG opened a new pull request #990: [SOLR-13885] Typo corrections.

2019-11-01 Thread GitBox
KoenDG opened a new pull request #990: [SOLR-13885] Typo corrections.
URL: https://github.com/apache/lucene-solr/pull/990
 
 
   https://issues.apache.org/jira/browse/SOLR-13885
   
   I noticed this while reading the documentation.
   
   Try pronouncing `it's` as `it is` and you'll hear that the lines no longer make sense.
   
   All fixes should be correct, though perhaps not complete across the documentation. I only checked the `adoc` files.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-13885) Typos in documentation, it's vs its

2019-11-01 Thread Koen De Groote (Jira)
Koen De Groote created SOLR-13885:
-

 Summary: Typos in documentation, it's vs its
 Key: SOLR-13885
 URL: https://issues.apache.org/jira/browse/SOLR-13885
 Project: Solr
  Issue Type: Improvement
  Components: documentation
Reporter: Koen De Groote


I ran into this while reading the documentation and decided to grep all the 
adoc files for it.

 

The fixes should be correct, though perhaps not complete.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341637488
 
 

 File path: lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java
 (same TestICUTransformCharFilter.java hunk as quoted above, continuing into:)
+  public void testRollbackBuffer() throws Exception {
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"),
+        "я", "â"); // final NFC transform applied
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 0,
+        "я", "a\u0302a\u0302a\u0302a\u0302a\u0302"); // final NFC transform never applied
+    checkToken(Transliterator.getInstance("Cyrillic-Latin"), 2,
 
 Review comment:
   This happens because "Cyrillic-Latin" is a compound transliterator, composed 
of child transliterators "NFD, [main Cyrillic-Latin transliteration], NFC". 
After the inner child transliterator changes a single "я" into "â" 
(decomposed), the final NFC transformation sees the two-character decomposition 
of "â" but doesn't do anything "yet", because it's not sure whether it will see 
more combining diacritics. But NFC never sees those input characters again, 
because on subsequent invocations they are excluded by the top-level filter on 
the main "Cyrillic-Latin" transliterator (which only pays attention to Cyrillic 
input characters). So the partially transliterated, decomposed characters are 
passed through to the output. There's a solid, more general explanation in the 
code for ICU 
[Transliterator](https://github.com/unicode-org/icu/blob/a075ac9c/icu4j/main/classes/translit/src/com/ibm/icu/text/Transliterator.java#L1137-L1163).
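
For a concrete feel of the one-shot vs. incremental distinction at play here, a minimal ICU4J sketch (independent of this PR; the exact printed output may vary by ICU version):

{code:java}
import com.ibm.icu.text.ReplaceableString;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.Transliterator.Position;

public class IncrementalTransliterationDemo {
  public static void main(String[] args) {
    Transliterator t = Transliterator.getInstance("Cyrillic-Latin");

    // One-shot: the input is known to be complete, so the trailing NFC
    // child transliterator gets to compose the decomposed intermediate.
    System.out.println(t.transliterate("я"));

    // Incremental: more combining marks might still arrive, so the
    // transliterator may leave text pending/uncomposed after this call.
    ReplaceableString buf = new ReplaceableString(new StringBuffer("я"));
    Position pos = new Position(0, buf.length(), 0);
    t.transliterate(buf, pos);
    System.out.println(buf + " (committed up to index " + pos.start + ")");

    // Flushing declares the input finished and forces pending text through.
    t.finishTransliteration(buf, pos);
    System.out.println(buf);
  }
}
{code}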


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-13836) Streaming Expression Query Parser

2019-11-01 Thread Gus Heck (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964923#comment-16964923
 ] 

Gus Heck edited comment on SOLR-13836 at 11/1/19 3:51 PM:
--

It might be good to think about whether this circumvents the protections added 
in SOLR-12891 (particularly the MacroExpander changes that treated expr 
differently).


was (Author: gus_heck):
It might be good to think about whether this circumvents the protections added 
in SOLR-12891

> Streaming Expression Query Parser
> -
>
> Key: SOLR-13836
> URL: https://issues.apache.org/jira/browse/SOLR-13836
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers, streaming expressions
>Reporter: Trey Grainger
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It is currently possible to hit the search handler in a streaming expression 
> ("search(...)"), but it is not currently possible to invoke a streaming 
> expression from within a regular search within the search handler. In some 
> cases, it would be useful to leverage the power of streaming expressions to 
> generate a result set and then join that result set with a normal set of 
> search results.
> This isn't expected to be particularly efficient for high cardinality 
> streaming expression results, but it would be a pretty powerful feature that 
> could enable a bunch of use cases that aren't possible today within a normal 
> search.
> h2. Example:
> *Docs:*
> {code:java}
> curl -X POST -H "Content-Type: application/json" 
> http://localhost:8983/solr/food_collection/update?commit=true  --data-binary '
> [
> {"id": "1", "name_s":"donut","vector_fs":[5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0]},
> {"id": "2", "name_s":"apple 
> juice","vector_fs":[1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0]},
> {"id": "3", 
> "name_s":"cappuccino","vector_fs":[0.0,5.0,3.0,0.0,4.0,1.0,2.0,3.0]},
> {"id": "4", "name_s":"cheese 
> pizza","vector_fs":[5.0,0.0,4.0,4.0,0.0,1.0,5.0,2.0]},
> {"id": "5", "name_s":"green 
> tea","vector_fs":[0.0,5.0,0.0,0.0,2.0,1.0,1.0,5.0]},
> {"id": "6", "name_s":"latte","vector_fs":[0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0]},
> {"id": "7", "name_s":"soda","vector_fs":[0.0,5.0,0.0,0.0,3.0,5.0,5.0,0.0]},
> {"id": "8", "name_s":"cheese bread 
> sticks","vector_fs":[5.0,0.0,4.0,5.0,0.0,1.0,4.0,2.0]},
> {"id": "9", "name_s":"water","vector_fs":[0.0,5.0,0.0,0.0,0.0,0.0,0.0,5.0]},
> {"id": "10", "name_s":"cinnamon bread 
> sticks","vector_fs":[5.0,0.0,1.0,5.0,0.0,3.0,4.0,2.0]}
> ]
> {code}
>  
> *Query:*
> {code:java}
> http://localhost:8983/solr/food/select?q=*:*&fq={!streaming_expression}top(select(search(food,%20q=%22*:*%22,%20fl=%22id,vector_fs%22,%20sort=%22id%20asc%22),%20cosineSimilarity(vector_fs,%20array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0))%20as%20cos,%20id),%20n=5,%20sort=%22cos%20desc%22)&fl=id,name_s
> {code}
>  
> *Response:*
> {code:java}
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":7,
> "params":{
>   "q":"*:*",
>   "fl":"id,name_s",
>   "fq":"{!streaming_expression}top(select(search(food, q=\"*:*\", 
> fl=\"id,vector_fs\", sort=\"id asc\"), cosineSimilarity(vector_fs, 
> array(5.2,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as cos, id), n=5, sort=\"cos 
> desc\")"}},
>   "response":{"numFound":5,"start":0,"docs":[
>   {
> "name_s":"donut",
> "id":"1"},
>   {
> "name_s":"apple juice",
> "id":"2"},
>   {
> "name_s":"cheese pizza",
> "id":"4"},
>   {
> "name_s":"cheese bread sticks",
> "id":"8"},
>   {
> "name_s":"cinnamon bread sticks",
> "id":"10"}]
>   }}
> {code}
> The current implementation also supports the following additional parameters:
>  *f*: (optional) The field name from the streaming expression containing the 
> document ids upon which to filter. Defaults to the same uniqueKey field name 
> from your documents. 
>  *method*: (optional) Any of termsFilter (default), booleanQuery, automaton, 
> docValuesTermsFilter.
> The method may go away, especially if we find a more efficient way to join 
> the stream to the main query doc set.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13836) Streaming Expression Query Parser

2019-11-01 Thread Gus Heck (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964923#comment-16964923
 ] 

Gus Heck commented on SOLR-13836:
-

It might be good to think about whether this circumvents the protections added 
in SOLR-12891




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341618155
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * <p>
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * </p>
+ * <p>
+ * Some useful transformations for search are built-in:
+ * <ul>
+ *   <li>Conversion from Traditional to Simplified Chinese characters
+ *   <li>Conversion from Hiragana to Katakana
+ *   <li>Conversion from Fullwidth to Halfwidth forms.
+ *   <li>Script conversions, for example Serbian Cyrillic to Latin
+ * </ul>
+ * <p>
+ * Example usage: <blockquote>stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));</blockquote>
+ * <p>
+ * For more details, see the <a
+ * href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+    this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a negative value. There is no
+   * power of 2 greater than <code>Integer.highestOneBit(Integer.MAX_VALUE)</code>, so to prevent overflow,
+   * values in this range will resolve to an enforced limit of
+   * <code>Integer.highestOneBit(Integer.MAX_VALUE)</code>. Specifying "0" (or "1", in practice) disables
+   * rollback. Larger values can in some cases yield more accurate transliteration, at the cost of 
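
The capacity normalization described in the (truncated) javadoc above can be sketched in a few lines (an illustration of the documented contract, not the PR's code):

{code:java}
// Greatest power of two <= hint; per the javadoc, "1" behaves like "0"
// (rollback disabled), and negative hints are illegal.
static int enforcedRollbackCapacity(int hint) {
  if (hint < 0) {
    throw new IllegalArgumentException("negative capacity hint: " + hint);
  }
  return Integer.highestOneBit(hint); // e.g. 8192 -> 8192, 8000 -> 4096, 0 -> 0
}
{code}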

[jira] [Commented] (SOLR-13875) Check why LBSolrClient thinks it has to sort on _docid_

2019-11-01 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964920#comment-16964920
 ] 

Erick Erickson commented on SOLR-13875:
---

It may be the fastest sort possible, but my question is more: why are we sorting 
at all just to see if the core is alive?

At minimum, it'll take nearly 1B comparisons in this case, admittedly out there 
on the edge, but still...
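
For contrast, a liveness probe that avoids sorting entirely (an illustrative SolrJ sketch, not what LBSolrClient currently sends):

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AlivenessProbe {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrQuery ping = new SolrQuery("*:*");
      ping.setRows(0);              // match everything, return nothing: no sort work
      ping.set("distrib", "false"); // probe only the targeted core
      QueryResponse rsp = client.query("collection1", ping);
      System.out.println("status=" + rsp.getStatus());
    }
  }
}
{code}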

> Check why LBSolrClient thinks it has to sort on _docid_
> ---
>
> Key: SOLR-13875
> URL: https://issues.apache.org/jira/browse/SOLR-13875
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
>Priority: Major
>
> Related to SOLR-13874. LBSolrClient issues a query that sorts on \_docid\_. 
> This can take over 12 seconds on a 1B doc replica. Is the sorting necessary 
> at all? If it's removed, what are the effects on performance? If we change 
> SOLR-13874, we can close this.
> Frankly I suspect that no matter what, there have to be over 1B comparisons 
> done and that's what's taking the time, but I wanted to call this out 
> explicitly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341628865
 
 

 File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 (same ICUTransformCharFilter.java hunk as quoted in the previous review comment)

[jira] [Commented] (SOLR-13785) Optimistic concurrency issue for nested documents

2019-11-01 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964913#comment-16964913
 ] 

David Smiley commented on SOLR-13785:
-

It's definitely erroneous for a child document's ID to overlap with a parent 
document's. Really _any_ document's... IDs must be unique. When it comes to 
nested documents, there are no protections here (as adding them seems expensive), 
so you can really shoot yourself in the foot if you're not careful. So if you take 
IDs from some external system, you might need to prefix or otherwise manipulate 
them to ensure they are globally unique within Solr.
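
For example, one workable convention (illustrative only; Solr does not enforce or require this) is to namespace the externally sourced IDs by role, so a parent and its descendants can never collide:

{code:java}
import org.apache.solr.common.SolrInputDocument;

public class UniqueIdExample {
  public static void main(String[] args) {
    // Hypothetical convention: prefix external IDs so the uniqueKey of a
    // parent can never equal the uniqueKey of any of its descendants.
    SolrInputDocument parent = new SolrInputDocument();
    parent.setField("id", "order/12345");

    SolrInputDocument child = new SolrInputDocument();
    child.setField("id", "order/12345/line/1");
    parent.addChildDocument(child);

    System.out.println(parent);
  }
}
{code}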

> Optimistic concurrency issue for nested documents
> -
>
> Key: SOLR-13785
> URL: https://issues.apache.org/jira/browse/SOLR-13785
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: update
>Reporter: Vishalaksh Aggarwal
>Priority: Minor
>
> I am using Solr v7.7.1 in cloud mode. I was facing an issue related to 
> optimistic concurrency:
> I have a nested document which can be updated concurrently multiple times 
> before committing the updates hence using __version__ to enable optimistic 
> concurrency. During the process of indexing, we fetch the document which we 
> want to modify along with its __version__ , modify it and then send it to 
> solr along with the same __version__. However, even when I am sure that there 
> is no concurrent update for a particular document, the following error is 
> still thrown:
> {noformat}
> Caused by: 
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
> from server at http://1.2.3.4:8983/solr/mcollection_shard1_replica_n2: 
> version conflict for  expected=1645085633861910528 
> actual=1645090791527284737
> {noformat}
> Here is the explanation of why this might have happened:
> While indexing a document, when __version__ is mentioned, I believe the solr 
> checks the version of the already existing latest document by doing a 
> real-time get.
> Now if you have 2 nested documents where, in one document (doc1), parent has 
> id="" and in other document(doc2), the child has id="", then it may 
> be possible that solr might check version of doc2 when you intended to index 
> doc1. This might be because solr still indexes all the documents in flat 
> structure and doesn't consider parent-child relationship while doing 
> real-time get.
> For a fix, I had to make the id of parent and child documents different from 
> each other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341624565
 
 

 File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 (same ICUTransformCharFilter.java hunk as quoted in the previous review comment)

[jira] [Resolved] (SOLR-13830) [child] transformer documentation has a little error

2019-11-01 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley resolved SOLR-13830.
-
Fix Version/s: 8.4
 Assignee: David Smiley
   Resolution: Fixed

Thanks for reporting the problem.  I simply omitted the characterization of the 
structure.
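
Illustrative shape of the corrected description, with descendants nested under their parents rather than returned as a flat list (hypothetical documents and pseudo-field name):

{code:java}
"response": {"numFound": 1, "start": 0, "docs": [
  { "id": "parent-1",
    "title_s": "Parent",
    "children": [
      { "id": "child-1",
        "children": [
          { "id": "grandchild-1" }
        ]
      }
    ]
  }
]}
{code}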

> [child] transformer documentation has a little error
> 
>
> Key: SOLR-13830
> URL: https://issues.apache.org/jira/browse/SOLR-13830
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Affects Versions: 8.2
>Reporter: Bram Biesbrouck
>Assignee: David Smiley
>Priority: Minor
> Fix For: 8.4
>
>
> I think there's a bug in the Solr docs about [child].
> See 
> https://lucene.apache.org/solr/guide/8_2/transforming-result-documents.html#child-childdoctransformerfactory
> It states:
> "This transformer returns all descendant documents of each parent document 
> matching your query in a flat list nested inside the matching parent 
> document."
> This is not correct: the descendant documents are "wired into" the parent, 
> creating a hierarchical structure (which is nice).
> Or am I misinterpreting the docs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13830) [child] transformer documentation has a little error

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964898#comment-16964898
 ] 

ASF subversion and git services commented on SOLR-13830:


Commit bb2215bebf8ffa2d51bea8395f31e73b990e0091 in lucene-solr's branch 
refs/heads/branch_8_3 from David Smiley
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bb2215b ]

SOLR-13830: Correct ref guide on [child] response structure.

(cherry picked from commit 124d38a597f3ed45ea0879ea1573edf0b9de2809)





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13830) [child] transformer documentation has a little error

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964896#comment-16964896
 ] 

ASF subversion and git services commented on SOLR-13830:


Commit 124d38a597f3ed45ea0879ea1573edf0b9de2809 in lucene-solr's branch 
refs/heads/master from David Smiley
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=124d38a ]

SOLR-13830: Correct ref guide on [child] response structure.


> [child] transformer documentation has a little error
> 
>
> Key: SOLR-13830
> URL: https://issues.apache.org/jira/browse/SOLR-13830
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Affects Versions: 8.2
>Reporter: Bram Biesbrouck
>Priority: Minor
>
> I think there's a bug in the Solr docs about [child].
> See 
> https://lucene.apache.org/solr/guide/8_2/transforming-result-documents.html#child-childdoctransformerfactory
> It states:
> "This transformer returns all descendant documents of each parent document 
> matching your query in a flat list nested inside the matching parent 
> document."
> This is not correct: the descendant documents are "wired into" the parent, 
> creating a hierarchical structure (which is nice).
> Or am I misinterpreting the docs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341610744
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * 
+ * 
+ * Some useful transformations for search are built-in:
+ * 
+ * Conversion from Traditional to Simplified Chinese characters
+ * Conversion from Hiragana to Katakana
+ * Conversion from Fullwidth to Halfwidth forms.
+ * Script conversions, for example Serbian Cyrillic to Latin
+ * 
+ * 
+ * Example usage: stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));
+ * 
+ * For more details, see the ICU User Guide:
+ * http://userguide.icu-project.org/transforms/general
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new 
StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = 
Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // 
must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link 
Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to 
which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial 
transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest 
power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a 
negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow,
+   * values in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can 
in some cases yield more accurate
+   * transliteration, at the cost of 
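
To make the clamping rule described in the javadoc above concrete, here is a minimal sketch (an illustration of the documented rule only, not the patch's code; the helper name is invented):

    // Clamp a non-negative capacity hint down to the greatest power of 2 <= hint.
    static int enforcedRollbackCapacity(int maxRollbackBufferCapacityHint) {
      if (maxRollbackBufferCapacityHint < 0) {
        throw new IllegalArgumentException("negative capacity hint");
      }
      if (maxRollbackBufferCapacityHint <= 1) {
        return 0; // "0" (or "1", in practice) disables rollback
      }
      // Hints above Integer.highestOneBit(Integer.MAX_VALUE) clamp to that
      // hard maximum, preventing overflow.
      return Integer.highestOneBit(maxRollbackBufferCapacityHint);
    }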

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341604330
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341603797
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[jira] [Commented] (SOLR-13844) Remove recovering shard term with corresponding core shard term.

2019-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964872#comment-16964872
 ] 

ASF subversion and git services commented on SOLR-13844:


Commit 6e1ecd1218b1d5b8ebfad3e24191c09a85e3902a in lucene-solr's branch 
refs/heads/master from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6e1ecd1 ]

SOLR-13844: Remove replica recovery terms with the replica term (#951)



> Remove recovering shard term with corresponding core shard term.
> 
>
> Key: SOLR-13844
> URL: https://issues.apache.org/jira/browse/SOLR-13844
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (9.0)
>Reporter: Houston Putman
>Priority: Minor
> Fix For: master (9.0), 8.4
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently if a recovering replica (solr 7.3+) is deleted, the term for that 
> core in the shard's terms in ZK is removed. However, the {{_recovering}} 
> term is not removed along with it. This can create clutter and confusion in the 
> shard terms ZK node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat merged pull request #951: SOLR-13844: Remove replica recovery terms with the replica term

2019-11-01 Thread GitBox
CaoManhDat merged pull request #951: SOLR-13844: Remove replica recovery terms 
with the replica term
URL: https://github.com/apache/lucene-solr/pull/951
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat commented on issue #951: SOLR-13844: Remove replica recovery terms with the replica term

2019-11-01 Thread GitBox
CaoManhDat commented on issue #951: SOLR-13844: Remove replica recovery terms 
with the replica term
URL: https://github.com/apache/lucene-solr/pull/951#issuecomment-548810642
 
 
   Nice work @HoustonPutman and thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9025) Add more efficient lookupTerm() overload to SortedSetDocValues

2019-11-01 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964871#comment-16964871
 ] 

David Smiley commented on LUCENE-9025:
--

+1 to close this as a duplicate and work on LUCENE-8836 instead.  No new API, just 
make it faster/smarter.

> Add more efficient lookupTerm() overload to SortedSetDocValues
> --
>
> Key: LUCENE-9025
> URL: https://issues.apache.org/jira/browse/LUCENE-9025
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (9.0)
>Reporter: Jason Gerlowski
>Priority: Minor
> Attachments: LUCENE-9025.patch
>
>
> {{SortedSetDocValues.lookupTerm(BytesRef)}} performs a binary search of the 
> entire docValues range to find the ordinal of the requested BytesRef.
> For an individual invocation, this is optimal.  Without other context, binary 
> search needs to cover the entire space.
> But there are some common uses of {{lookupTerm}} where this shouldn't be 
> necessary.  For example: making multiple {{lookupTerm}} calls to fetch the 
> ordinals for each value in a sorted list of terms.  {{lookupTerm}} will 
> binary-search the whole space on each invocation, even though the caller 
> knows that there's no point searching anything before the ordinal that came 
> back from the previous {{lookupTerm}} call.
> I propose we add a {{SortedSetDocValues.lookupTerm}} overload which takes a 
> lower-bound to start the binary search at: {{public long lookupTerm(BytesRef 
> key, long lowerSearchBound) throws IOException}}  This saves each 
> binary-search a few iterations in usage scenarios like the one described 
> above, which can conceivably add up.
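
For illustration, the intended call pattern against the proposed overload might look like this (a sketch only; dv is a SortedSetDocValues, and sortedTerms is assumed to be in increasing BytesRef order):

{code:java}
long bound = 0;
for (BytesRef term : sortedTerms) {
  long ord = dv.lookupTerm(term, bound);  // proposed overload: search starts at bound
  if (ord >= 0) {
    bound = ord;       // found: later terms cannot sort before this ordinal
  } else {
    bound = -ord - 1;  // not found: decode the insertion point as the next bound
  }
}
{code}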



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13394) Change default GC from CMS to G1

2019-11-01 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964869#comment-16964869
 ] 

David Smiley commented on SOLR-13394:
-

I'd love to see a {{-debug}} option.  It could not only ensure that "jps" etc. 
work but also toggle some settings that we disable for security reasons 
(e.g. DIH debug).

Also major +1 to documenting *why* these flags are set.  This stuff is magic 
voodoo without commentary.

> Change default GC from CMS to G1
> 
>
> Key: SOLR-13394
> URL: https://issues.apache.org/jira/browse/SOLR-13394
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Major
> Fix For: 8.1
>
> Attachments: SOLR-13394.patch, SOLR-13394.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> CMS has been deprecated in new versions of Java 
> (http://openjdk.java.net/jeps/291). This issue is to switch Solr default from 
> CMS to G1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341593957
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2019-11-01 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964860#comment-16964860
 ] 

David Smiley commented on SOLR-13749:
-

Might it make more sense to use a query parser based on streaming expressions 
-- SOLR-13836 (CC [~solrtrey]) -- as a more generalized way of joining from 
external places?  It's so general that this would even allow querying a JDBC source!

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first one is the cross-collection join filter (XCJF) query parser. It can 
> call out to a remote collection to get a set of join keys to be used as a 
> filter against the local collection.
> The second one is the hash range query parser: you specify a field name and a 
> hash range, and only the documents whose values would have hashed to that 
> range will be returned.
> This query parser will do an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request will execute a query 
> against the remote collection.  If the local collection is setup with the 
> compositeId router to be routed on the join key field, a hash range query is 
> applied to the remote collection query to only match the documents that 
> contain a potential match for the documents that are in the local shard/core. 
>  
>  
> Here's some vocab to help with the descriptions of the various parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The lucene query that executes a search to get back a set of join 
> keys from a remote collection|
> |HashRangeQuery|The lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values ( required )|
> |zkHost|Optional|The connection string to be used to connect to Zookeeper.  
> zkHost and solrUrl are both optional parameters, and at most one of them 
> should be specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried ( optional 
> )|
> |from|Required|The join key field name in the external collection ( required 
> )|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  Defaults to 3600 (one hour).  
> The XCJF query will not be aware of changes to the remote collection, so 
> if the remote collection is updated, cached XCJF queries may give inaccurate 
> results.  
> After the ttl period has expired, the XCJF query will re-execute the join 
> against the remote collection.|
> |_All others_| |Any normal Solr parameter can also be specified as a local 
> param.|
>  
> Example Solr Config.xml changes:
>  
>   <cache name="hash_vin"
>          class="solr.LRUCache"
>          size="128"
>          initialSize="0"
>          regenerator="solr.NoOpRegenerator"/>
>
>   <queryParser 

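For illustration, SolrJ usage of the shape described above might look like this (collection and field names are invented for the example, and it assumes the XCJF parser is registered under the name "xcjf"):

{code:java}
SolrQuery query = new SolrQuery("*:*");
// Filter the local collection by join keys fetched from the remote collection.
query.addFilterQuery("{!xcjf collection=remoteCollection from=join_key to=join_key v=$joinQuery}");
// Pass the remote query via parameter substitution, as recommended above.
query.set("joinQuery", "status:active");
{code}
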
[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-11-01 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964854#comment-16964854
 ] 

Adrien Grand commented on LUCENE-8920:
--

Maybe we should update the naming with your proposed approach, e.g. "indexed" 
or something along those lines, instead of "direct addressing", which suggests 
nodes are directly addressed by their label as in Mike's previous patch?

Out of curiosity, have you looked into no longer explicitly serializing labels 
with this new approach, since labels can be inferred from the bitset?
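
Roughly, that inference could look like the following (a sketch, not the patch; it assumes bitTable packs presence bits into longs, bit i marking label firstLabel + i as present, with arcIdx counting only present arcs):

{code:java}
static int inferLabel(long[] bitTable, int firstLabel, int arcIdx) {
  int seen = 0;
  for (int word = 0; word < bitTable.length; word++) {
    int count = Long.bitCount(bitTable[word]);
    if (seen + count > arcIdx) {
      long bits = bitTable[word];
      for (int k = arcIdx - seen; k > 0; k--) {
        bits &= bits - 1; // clear the lowest set bit
      }
      // The target is now the lowest remaining set bit in this word.
      return firstLabel + word * Long.SIZE + Long.numberOfTrailingZeros(bits);
    }
    seen += count;
  }
  throw new IllegalArgumentException("arcIdx out of range");
}
{code}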

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Minor
> Attachments: TestTermsDictRamBytesUsed.java
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12993) Split the state.json into 2. a small frequently modified data + a large unmodified data

2019-11-01 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964848#comment-16964848
 ] 

David Smiley commented on SOLR-12993:
-

Looks interesting.  For the case of clusters large in the number of collections but 
_not_ in the number of replicas (i.e. thousands of collections, each with 1 replica), will 
this issue add more overhead?  Hopefully not much.

> Split the state.json into 2. a small frequently modified data + a large 
> unmodified data
> ---
>
> Key: SOLR-12993
> URL: https://issues.apache.org/jira/browse/SOLR-12993
> Project: Solr
>  Issue Type: Improvement
>Reporter: Noble Paul
>Priority: Major
>
> The new design is posted here
> https://docs.google.com/document/d/1AZyiRA_bRhAWkUM1Nj5kg__xpPM9Fd_iwHhYazG38xI/edit#



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13875) Check why LBSolrClient thinks it has to sort on _docid_

2019-11-01 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964831#comment-16964831
 ] 

David Smiley commented on SOLR-13875:
-

It's awfully suspicious to me that {{_docid_}} sort is slow no matter how many 
docs.  In theory, it's the fastest sort possible!  It's basically no-sorting.

> Check why LBSolrClient thinks it has to sort on _docid_
> ---
>
> Key: SOLR-13875
> URL: https://issues.apache.org/jira/browse/SOLR-13875
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
>Priority: Major
>
> Related to SOLR-13874. LBSolrClient issues a query that sorts on \_docid\_. 
> This can take over 12 seconds on a 1B doc replica. Is the sorting necessary 
> at all? If it's removed, what are the effects on performance? If we change 
> SOLR-13874, we can close this.
> Frankly I suspect that no matter what, there have to be over 1B comparisons 
> done and that's what's taking the time, but I wanted to call this out 
> explicitly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] gerlowskija commented on issue #883: SOLR-13762: Add binary support to XMLCodec

2019-11-01 Thread GitBox
gerlowskija commented on issue #883: SOLR-13762: Add binary support to XMLCodec
URL: https://github.com/apache/lucene-solr/pull/883#issuecomment-548790282
 
 
   Thanks @thomaswoeckinger, and sorry for the delay.  This updated code lgtm 
and I'll merge today after running the tests.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on a change in pull request #980: LUCENE-8920: Reduce the memory used by direct addressing of arcs

2019-11-01 Thread GitBox
msokolov commented on a change in pull request #980: LUCENE-8920: Reduce the 
memory used by direct addressing of arcs
URL: https://github.com/apache/lucene-solr/pull/980#discussion_r341523588
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
 ##
 @@ -864,6 +958,64 @@ private long readUnpackedNodeTarget(BytesReader in) 
throws IOException {
 return readNextRealArc(arc, in);
   }
 
+  /**
+   * Reads the presence bits of a direct-addressing node, store them in the 
provided arc {@link Arc#bitTable()}
+   * and returns the number of presence bytes.
+   */
+  private int readPresenceBytes(Arc<T> arc, BytesReader in) throws IOException {
+int numPresenceBytes = getNumPresenceBytes(arc.numArcs());
+long[] presenceBits = new long[(numPresenceBytes + 7) / Long.BYTES];
+for (int i = 0; i < numPresenceBytes; i++) {
+  // Read the next unsigned byte, shift it to the left, and append it to 
the current long.
+  presenceBits[i / Long.BYTES] |= (in.readByte() & 0xFFL) << (i * 
Byte.SIZE);
+}
+arc.bitTable = presenceBits;
+assert checkPresenceBytesAreValid(arc);
+return numPresenceBytes;
+  }
+
+  static boolean checkPresenceBytesAreValid(Arc arc) {
+assert (arc.bitTable()[0] & 1L) != 0; // First bit must be set.
+assert (arc.bitTable()[arc.bitTable().length - 1] & (1L << (arc.numArcs() 
- 1))) != 0; // Last bit must be set.
+assert countBits(arc.bitTable()) <= arc.numArcs(); // Total bit set (real 
num arcs) must be <= label range (stored in arc.numArcs()).
+return true;
+  }
+
+  /**
+   * Counts all bits set in the provided longs.
+   */
+  static int countBits(long[] bits) {
 
 Review comment:
   If we make these Arc members we can keep the bit array private to Arc, or at 
least we can make Arc.bitTable() package private
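
   For reference, one straightforward body for this helper would be (sketched here, not quoted from the patch):

       static int countBits(long[] bits) {
         int count = 0;
         for (long word : bits) {
           count += Long.bitCount(word); // popcount of each 64-bit word
         }
         return count;
       }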


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on a change in pull request #980: LUCENE-8920: Reduce the memory used by direct addressing of arcs

2019-11-01 Thread GitBox
msokolov commented on a change in pull request #980: LUCENE-8920: Reduce the 
memory used by direct addressing of arcs
URL: https://github.com/apache/lucene-solr/pull/980#discussion_r341526242
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
 ##
 @@ -676,26 +697,38 @@ long addNode(Builder builder, 
Builder.UnCompiledNode nodeIn) throws IOExce
 */
 
 if (doFixedArray) {
-  final int MAX_HEADER_SIZE = 11; // header(byte) + numArcs(vint) + 
numBytes(vint)
   assert maxBytesPerArc > 0;
   // 2nd pass just "expands" all arcs to take up a fixed byte size
 
+  // If more than (1 / DIRECT_ARC_LOAD_FACTOR) of the "slots" would be 
occupied, write an arc
+  // array that may have holes in it so that we can address the arcs 
directly by label without
 
 Review comment:
   We should re-word this comment since the array *storage* doesn't really have 
holes in it? Maybe something like "write a bit vector indicating which array 
elements are present so that we can ..."


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on a change in pull request #980: LUCENE-8920: Reduce the memory used by direct addressing of arcs

2019-11-01 Thread GitBox
msokolov commented on a change in pull request #980: LUCENE-8920: Reduce the 
memory used by direct addressing of arcs
URL: https://github.com/apache/lucene-solr/pull/980#discussion_r341528510
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
 ##
 @@ -726,6 +759,57 @@ private void writeArrayPacked(Builder builder, 
Builder.UnCompiledNode node
 }
   }
 
+  private void writeArrayDirectAddressing(Builder<T> builder, 
Builder.UnCompiledNode<T> nodeIn, long fixedArrayStart, int maxBytesPerArc, int 
labelRange) {
+int numPresenceBytes = getNumPresenceBytes(labelRange);
+// expand the arcs in place, backwards
+long srcPos = builder.bytes.getPosition();
+long destPos = fixedArrayStart + numPresenceBytes + nodeIn.numArcs * 
maxBytesPerArc;
+// if destPos == srcPos it means all the arcs were the same length, and 
the array of them is *already* direct
+assert destPos >= srcPos;
+if (destPos > srcPos) {
+  builder.bytes.skipBytes((int) (destPos - srcPos));
+  assert builder.bytes.getPosition() == destPos;
+  for (int arcIdx = nodeIn.numArcs - 1; arcIdx >= 0; arcIdx--) {
+destPos -= maxBytesPerArc;
+int arcLen = builder.reusedBytesPerArc[arcIdx];
+srcPos -= arcLen;
+if (srcPos != destPos) {
+  assert destPos > srcPos: "destPos=" + destPos + " srcPos=" + srcPos 
+ " arcIdx=" + arcIdx + " maxBytesPerArc=" + maxBytesPerArc + " 
reusedBytesPerArc[arcIdx]=" + builder.reusedBytesPerArc[arcIdx] + " 
nodeIn.numArcs=" + nodeIn.numArcs;
+  builder.bytes.copyBytes(srcPos, destPos, arcLen);
+}
+  }
+}
+assert destPos - numPresenceBytes == fixedArrayStart;
+writePresenceBits(builder, nodeIn, labelRange, fixedArrayStart);
+  }
+
+  private void writePresenceBits(Builder<T> builder, Builder.UnCompiledNode<T> 
nodeIn, int labelRange, long dest) {
+long bytePos = dest;
+byte presenceBits = 1; // The first arc is always present.
+int presenceIndex = 0;
+int previousLabel = nodeIn.arcs[0].label;
+for (int arcIdx = 1; arcIdx < nodeIn.numArcs; arcIdx++) {
+  int label = nodeIn.arcs[arcIdx].label;
+  presenceIndex += label - previousLabel;
+  while (presenceIndex >= 8) {
 
 Review comment:
   Maybe we should also use `Byte.SIZE` here for consistency, since the point 
is to write a byte at a time here?
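
   The layout being written here can be illustrated standalone (a sketch under assumed inputs, not the patch: labels sorted ascending, with bit 0 of byte 0 corresponding to the first label):

       static byte[] encodePresenceBits(int[] sortedLabels) {
         int first = sortedLabels[0];
         int labelRange = sortedLabels[sortedLabels.length - 1] - first + 1;
         byte[] presence = new byte[(labelRange + Byte.SIZE - 1) / Byte.SIZE];
         for (int label : sortedLabels) {
           int bit = label - first;
           presence[bit / Byte.SIZE] |= 1 << (bit % Byte.SIZE);
         }
         return presence; // the lowest bit of presence[0] is always set
       }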


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341515854
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341511030
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-11-01 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341511084
 
 

 ##
 File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * <p>
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * </p>
+ * <p>
+ * Some useful transformations for search are built-in:
+ * <ul>
+ * <li>Conversion from Traditional to Simplified Chinese characters
+ * <li>Conversion from Hiragana to Katakana
+ * <li>Conversion from Fullwidth to Halfwidth forms.
+ * <li>Script conversions, for example Serbian Cyrillic to Latin
+ * </ul>
+ * </p>
+ * <p>
+ * Example usage: <blockquote>stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));</blockquote>
+ * </p>
+ * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+    this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a negative value. There is no
+   * power of 2 greater than <code>Integer.highestOneBit(Integer.MAX_VALUE)</code>, so to prevent overflow,
+   * values in this range will resolve to an enforced limit of <code>Integer.highestOneBit(Integer.MAX_VALUE)</code>.
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate
+   * transliteration, at the cost of 
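
To make the hint-to-limit conversion described in that javadoc concrete, here is a minimal sketch of the documented normalization. The helper name is hypothetical and not from the patch; the actual private implementation may differ:

{code:java}
// Hypothetical helper (not from the patch) illustrating the documented
// conversion: greatest power of 2 (excluding 1) <= hint; negative is illegal.
static int enforcedRollbackBufferLimit(int hint) {
  if (hint < 0) {
    throw new IllegalArgumentException("illegal negative capacity hint: " + hint);
  }
  // Integer.highestOneBit returns the greatest power of 2 <= hint (0 for 0),
  // and never exceeds Integer.highestOneBit(Integer.MAX_VALUE), so the
  // overflow case described in the javadoc is handled for free.
  int limit = Integer.highestOneBit(hint);
  // A limit of 1 is excluded, so hints of 0 and 1 both disable rollback.
  return limit == 1 ? 0 : limit;
}
{code}

For example, a hint of 8192 (the default) is kept as-is, a hint of 1000 resolves to 512, and hints of 0 or 1 disable rollback entirely.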
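For anyone wanting to try the patch, here is a minimal sketch of wiring the char filter into an analysis chain via Analyzer#initReader. It is illustrative only, not part of the patch: it assumes code living in org.apache.lucene.analysis.icu so the package-private constructor above is reachable, and the analyzer name is made up.

{code:java}
package org.apache.lucene.analysis.icu;

import java.io.Reader;

import com.ibm.icu.text.Transliterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Illustrative sketch only; not part of the patch.
public class TraditionalToSimplifiedAnalyzer extends Analyzer {

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new StandardTokenizer();
    return new TokenStreamComponents(tokenizer);
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // The transform runs over the raw character stream, before tokenization,
    // which is the point of a CharFilter (vs. the existing ICUTransformFilter).
    return new ICUTransformCharFilter(reader,
        Transliterator.getInstance("Traditional-Simplified"));
  }
}
{code}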
