[jira] [Commented] (LUCENE-10671) Lucene

2022-08-01 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573919#comment-17573919
 ] 

Uwe Schindler commented on LUCENE-10671:


We can delete the whole issue.

> Lucene
> --
>
> Key: LUCENE-10671
> URL: https://issues.apache.org/jira/browse/LUCENE-10671
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/hnsw
>Affects Versions: 8.11.2
>Reporter: allnewcracksoftwares
>Priority: Minor
>
> [link title|http://example.com]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Deleted] (LUCENE-10671) Lucene

2022-08-01 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler deleted LUCENE-10671:
---


> Lucene
> --
>
> Key: LUCENE-10671
> URL: https://issues.apache.org/jira/browse/LUCENE-10671
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: allnewcracksoftwares
>Priority: Minor
>
> [link title|http://example.com]






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563840#comment-17563840
 ] 

Uwe Schindler commented on LUCENE-10643:


bq. The timeout is caused by a hard limit in jenkins that should be 
configurable via system properties

I raised this setting on Policeman Jenkins to 30.000.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563836#comment-17563836
 ] 

Uwe Schindler commented on LUCENE-10643:


Hi [~Nayana],
the Java 19 (OpenJDK Project Panama) run to support Lucene's MMapDirectory v2 
(see PR https://github.com/apache/lucene/pull/912) was working fine on this big 
endian platform. I will also report this to the OpenJDK community, as it is an 
important thing for them to know! It looks like all byte-swap instructions in 
Java's MemorySegment API are inserted at the correct places when reading and 
writing Lucene's little-endian file format.

The MMap v2 job is here: 
https://ci-builds.apache.org/job/Lucene/job/Lucene-MMAPv2-Linux%20(s390x%20big%20endian)/
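For illustration, the byte-order handling described above can be sketched with 
plain ByteBuffer (MemorySegment offers the same explicit-ByteOrder reads on JDK 
19; this standalone snippet is a simplified stand-in, not Lucene's actual code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianRead {
    public static void main(String[] args) {
        // Lucene's index format is little endian. Requesting an explicit
        // ByteOrder makes the decode correct on both little-endian (x86)
        // and big-endian (s390x) hosts; on big-endian hardware the JVM
        // inserts the byte-swap operations automatically.
        byte[] bytes = {0x01, 0x02, 0x00, 0x00};  // little-endian encoding of 513
        int v = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
        System.out.println(v);                     // 513, regardless of host byte order
        System.out.println(ByteOrder.nativeOrder());
    }
}
```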

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563763#comment-17563763
 ] 

Uwe Schindler commented on LUCENE-10643:


It is configured to be {{@daily}}. Normal Lucene builds run {{@hourly}} on our 
special "lucene"-tagged nodes, so that constantly running builds do not occupy 
nodes used by other projects.

The reason for this is how Lucene tests work: they check with random data, so 
whenever you see a failure, it is something new (often JVM bugs):
- https://www.youtube.com/watch?v=-uVE_w8flIU
- 
https://2019.berlinbuzzwords.de/sites/2019.berlinbuzzwords.de/files/media/documents/dawidweiss-randomizedtesting-pub.pdf
- https://www.youtube.com/watch?v=PVRdLyQGUxE
- 
https://2013.berlinbuzzwords.de/sites/2013.berlinbuzzwords.de/files/slides/Schindler-BugsBugsBugs.pdf
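The randomized-testing idea behind those talks can be sketched in a few lines. 
The property name `tests.seed` mirrors the convention of the randomizedtesting 
framework, but this standalone version is only an illustration, not the real 
test runner:

```java
import java.util.Random;

public class RandomizedSketch {
    // Property under test: integer -> string -> integer round-trips losslessly.
    static boolean roundTrips(int n) {
        return Integer.parseInt(Integer.toString(n)) == n;
    }

    public static void main(String[] args) {
        // Pick a fresh seed per run, but print it so any failure can be
        // reproduced exactly by re-running with -Dtests.seed=<value>.
        long seed = Long.getLong("tests.seed", new Random().nextLong());
        Random random = new Random(seed);
        for (int i = 0; i < 1_000; i++) {
            int n = random.nextInt();
            if (!roundTrips(n)) {
                throw new AssertionError(
                    "failed for " + n + "; reproduce with -Dtests.seed=" + seed);
            }
        }
        System.out.println("ok (seed=" + seed + ")");
    }
}
```

Because every run uses new data, a green build today can still surface a fresh 
JVM or library bug tomorrow, which is why these jobs run so frequently.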

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563742#comment-17563742
 ] 

Uwe Schindler commented on LUCENE-10643:


See this: https://www.mail-archive.com/dev@lucene.apache.org/msg314005.html

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Comment Edited] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563729#comment-17563729
 ] 

Uwe Schindler edited comment on LUCENE-10643 at 7/7/22 11:41 AM:
-

[~Nayana]: This is not a problem. It appears on all builds and has to do with 
some bug in Jenkins. Cannot be prevented, sorry. As long as builds succeed all 
is fine. [~dweiss] has some hints about the bug.


was (Author: thetaphi):
[~Nayana]: This is not a problem. It appears on all builds and has to do with 
some bug in Jenkins. Cannot be prevented, sorry. As long as builds succeed all 
is fine. [~dweiss] has some hints about the bug.

It happens on all our builds.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563729#comment-17563729
 ] 

Uwe Schindler commented on LUCENE-10643:


[~Nayana]: This is not a problem. It appears on all builds and has to do with 
some bug in Jenkins. Cannot be prevented, sorry. As long as builds succeed all 
is fine. [~dweiss] has some hints about the bug.

It happens on all our builds.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563719#comment-17563719
 ] 

Uwe Schindler commented on LUCENE-10643:


Great, thanks; I will set up a job for that. It looks like it is recent enough.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Comment Edited] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563427#comment-17563427
 ] 

Uwe Schindler edited comment on LUCENE-10643 at 7/6/22 8:51 PM:


[~Nayana]: Would it be possible to get an EA release of JDK-19?

This would be important to test our project Panama Branch 
(https://github.com/apache/lucene/pull/912) with a big endian architecture. It 
is enough if you could install a JDK 19 preview release (>= JDK-19-ea+23) at 
any location you like (we have a job only running on a non-ASF server at the 
moment, but as we'd like to support Java 19 soon, it might be good to test 
beforehand).

It is enough to give us the path to this EA installation; no need to set it up 
in Jenkins as a Java version. We only need JDK-19 for compiling, so we pass a 
special env var with the Java 19 path as RUNTIME_JAVA_HOME. I'd set up the job 
like that.

From what I found out, Oracle does not offer JDK 19 EA releases for s390x, but 
maybe IBM or Eclipse Adoptium has a preview build?


was (Author: thetaphi):
[~Nayana]: Would it be possible to get an EA release of JDK-19.

This would be important to test our project Panama Branch 
(https://github.com/apache/lucene/pull/912) with a big endian architecture. It 
is enough if you could install a JDK 19 preview release (>= JDK-19-ea+23) to 
set up a job. We have a job only running on a non-ASF server at the moment, 
but as we'd like to support Java 19 soon, it might be good to test.

It is enough to give us the path to this EA installation, no need to setup in 
Jenkins as Java version. We only need JDK-19 for compiling, so we pass a 
special env var with the Java 19 as RUNTIME_JAVA_HOME. I'd setup the job like 
that.

From what I found out, Oracle does not offer JDK 19 EA releases for s390x, but 
maybe IBM or Eclipse Adoptium has a preview build?

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563427#comment-17563427
 ] 

Uwe Schindler commented on LUCENE-10643:


[~Nayana]: Would it be possible to get an EA release of JDK-19?

This would be important to test our project Panama Branch 
(https://github.com/apache/lucene/pull/912) with a big endian architecture. It 
is enough if you could install a JDK 19 preview release (>= JDK-19-ea+23) to 
set up a job. We have a job only running on a non-ASF server at the moment, 
but as we'd like to support Java 19 soon, it might be good to test.

It is enough to give us the path to this EA installation; no need to set it up 
in Jenkins as a Java version. We only need JDK-19 for compiling, so we pass a 
special env var with the Java 19 path as RUNTIME_JAVA_HOME. I'd set up the job 
like that.

From what I found out, Oracle does not offer JDK 19 EA releases for s390x, but 
maybe IBM or Eclipse Adoptium has a preview build?

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Comment Edited] (LUCENE-10642) Regexp query: escape sequences are treated as character classes

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563382#comment-17563382
 ] 

Uwe Schindler edited comment on LUCENE-10642 at 7/6/22 6:36 PM:


bq. From the user perspective, is it non-intuitive why the character classes 
should be denoted with two slashes

That's only in Java code (the usual stupidness) and possibly JSON. The problem 
is that if you write "\n", the Java compiler creates a newline out of it, and 
there is never a \n in the regular expression.

Actually, it is a problem if you can't write {{\\n}}, as this would be seen by 
the parser as \n.


was (Author: thetaphi):
bq. From the user perspective, is it non-intuitive why the character classes 
should be denoted with two slashes

That's only in Java code (the usual stupidness) and possibly JSON. The problem 
is that if you write "\n", the Java compiler creates a newline out of it, and 
there is never a \n in the regular expression.

Actually, it is a problem if you can't write \\n, as this would be seen by the 
parser as \n.

> Regexp query: escape sequences are treated as character classes
> ---
>
> Key: LUCENE-10642
> URL: https://issues.apache.org/jira/browse/LUCENE-10642
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.0, 9.1, 9.2, 9.3
>Reporter: Andriy Redko
>Priority: Major
>
> An interesting issue has been reported to the OpenSearch project [1], caused 
> by [2], [3]. In a nutshell, the regression causes escape sequences (like \n, 
> \r, \t, ...) to be treated as character classes (specifically, 
> https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs).
> The problematic function is RegExp::matchPredefinedCharacterClass, which does 
> not consider characters that denote an escaped construct. A simple test to 
> reproduce, which fails with IllegalArgumentException("invalid character 
> class"):
>  
> {noformat}
> public class TestRegexpQuery extends LuceneTestCase {
>   public void testEscapeSequences() throws IOException {
>     assertEquals(1, regexQueryNrHits("\\n"));
>     assertEquals(1, regexQueryNrHits("[\\n]"));
>   }
> }
> {noformat}
>  
> [1] [https://github.com/opensearch-project/OpenSearch/issues/3781]
> [2] 
> [https://github.com/apache/lucene/commit/1efce5444dd40142c55c5a3a30eeebc7b86796c3]
> [3] 
> [https://github.com/apache/lucene/commit/819e668ce2fcfcf86b652a191cdbe0fad0a8ffce]






[jira] [Comment Edited] (LUCENE-10642) Regexp query: escape sequences are treated as character classes

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563382#comment-17563382
 ] 

Uwe Schindler edited comment on LUCENE-10642 at 7/6/22 6:36 PM:


bq. From the user perspective, is it non-intuitive why the character classes 
should be denoted with two slashes

That's only in Java code (the usual stupidness) and possibly JSON. The problem 
is that if you write "\n", the Java compiler creates a newline out of it, and 
there is never a \n in the regular expression.

Actually, it is a problem if you can't write \\n, as this would be seen by the 
parser as \n.


was (Author: thetaphi):
bq. From the user perspective, is it non-intuitive why the character classes 
should be denoted with two slashes

That's only in Java code (the usual stupidness) and possibly JSON.

> Regexp query: escape sequences are treated as character classes
> ---
>
> Key: LUCENE-10642
> URL: https://issues.apache.org/jira/browse/LUCENE-10642
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.0, 9.1, 9.2, 9.3
>Reporter: Andriy Redko
>Priority: Major
>
> An interesting issue has been reported to the OpenSearch project [1], caused 
> by [2], [3]. In a nutshell, the regression causes escape sequences (like \n, 
> \r, \t, ...) to be treated as character classes (specifically, 
> https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs).
> The problematic function is RegExp::matchPredefinedCharacterClass, which does 
> not consider characters that denote an escaped construct. A simple test to 
> reproduce, which fails with IllegalArgumentException("invalid character 
> class"):
>  
> {noformat}
> public class TestRegexpQuery extends LuceneTestCase {
>   public void testEscapeSequences() throws IOException {
>     assertEquals(1, regexQueryNrHits("\\n"));
>     assertEquals(1, regexQueryNrHits("[\\n]"));
>   }
> }
> {noformat}
>  
> [1] [https://github.com/opensearch-project/OpenSearch/issues/3781]
> [2] 
> [https://github.com/apache/lucene/commit/1efce5444dd40142c55c5a3a30eeebc7b86796c3]
> [3] 
> [https://github.com/apache/lucene/commit/819e668ce2fcfcf86b652a191cdbe0fad0a8ffce]






[jira] [Commented] (LUCENE-10642) Regexp query: escape sequences are treated as character classes

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563382#comment-17563382
 ] 

Uwe Schindler commented on LUCENE-10642:


bq. From the user perspective, is it non-intuitive why the character classes 
should be denoted with two slashes

That's only in Java code (the usual stupidness) and possibly JSON.
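The double-escaping point can be illustrated with plain Java string literals (a 
standalone sketch using java.util.regex, not the Lucene RegExp parser itself):

```java
public class EscapeLevels {
    public static void main(String[] args) {
        // In Java source, "\n" is already a newline character by the time
        // the string reaches any regexp parser:
        String newline = "\n";
        System.out.println(newline.length());   // 1 -- a single newline char

        // "\\n" is two characters, backslash + 'n'; only this form delivers
        // a literal \n escape sequence to the regexp parser:
        String escaped = "\\n";
        System.out.println(escaped.length());   // 2

        // java.util.regex then interprets the two-char sequence \n as a newline:
        System.out.println("a\nb".matches("a" + escaped + "b"));  // true
    }
}
```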

> Regexp query: escape sequences are treated as character classes
> ---
>
> Key: LUCENE-10642
> URL: https://issues.apache.org/jira/browse/LUCENE-10642
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.0, 9.1, 9.2, 9.3
>Reporter: Andriy Redko
>Priority: Major
>
> An interesting issue has been reported to the OpenSearch project [1], caused 
> by [2], [3]. In a nutshell, the regression causes escape sequences (like \n, 
> \r, \t, ...) to be treated as character classes (specifically, 
> https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs).
> The problematic function is RegExp::matchPredefinedCharacterClass, which does 
> not consider characters that denote an escaped construct. A simple test to 
> reproduce, which fails with IllegalArgumentException("invalid character 
> class"):
>  
> {noformat}
> public class TestRegexpQuery extends LuceneTestCase {
>   public void testEscapeSequences() throws IOException {
>     assertEquals(1, regexQueryNrHits("\\n"));
>     assertEquals(1, regexQueryNrHits("[\\n]"));
>   }
> }
> {noformat}
>  
> [1] [https://github.com/opensearch-project/OpenSearch/issues/3781]
> [2] 
> [https://github.com/apache/lucene/commit/1efce5444dd40142c55c5a3a30eeebc7b86796c3]
> [3] 
> [https://github.com/apache/lucene/commit/819e668ce2fcfcf86b652a191cdbe0fad0a8ffce]






[jira] [Resolved] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-06 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-10643.

Resolution: Resolved

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Comment Edited] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563374#comment-17563374
 ] 

Uwe Schindler edited comment on LUCENE-10643 at 7/6/22 6:23 PM:


Job is here: 
https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main%20(s390x%20big%20endian)/


was (Author: thetaphi):
Job is here: 
https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main%20(s390x,%20big%20endian)/

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Updated] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-06 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-10643:
---
Labels: jenkins  (was: jenkins solr)

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563374#comment-17563374
 ] 

Uwe Schindler commented on LUCENE-10643:


Job is here: 
https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main%20(s390x,%20big%20endian)/

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins, solr
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563372#comment-17563372
 ] 

Uwe Schindler commented on LUCENE-10643:


I opened a new issue for Solr: SOLR-16284

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins, solr
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Updated] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-06 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-10643:
---
Description: 
This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
endian).
 

  was:
About Apache Solr Jenkins CI: I can see there are various jobs running for 
Solr (https://ci-builds.apache.org/job/Solr/), like 
https://ci-builds.apache.org/job/Solr/job/Solr-Artifacts-main
Would like to know how these jobs are set up. Are Jenkinsfiles maintained on 
GitHub for those jobs, or are build/test commands executed directly from the 
Execute shell option in the job configuration? (I can't find any Jenkinsfiles 
in the Solr GitHub repo.)
The purpose of asking for this information is that we would like to add one 
more job which will be executed on s390x nodes/labels.

s390x nodes are already available: https://jenkins-ccos.apache.org/view/all/

 

Summary: Lucene Jenkins CI - s390x support   (was: Solr Jenkins CI - 
s390x support )

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins, solr
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Moved] (LUCENE-10643) Solr Jenkins CI - s390x support

2022-07-06 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler moved SOLR-16221 to LUCENE-10643:
---

  Key: LUCENE-10643  (was: SOLR-16221)
Lucene Fields: New
  Project: Lucene - Core  (was: Solr)

> Solr Jenkins CI - s390x support 
> 
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins, solr
>
> About Apache Solr Jenkins CI: I can see there are various jobs running for 
> Solr (https://ci-builds.apache.org/job/Solr/), like 
> https://ci-builds.apache.org/job/Solr/job/Solr-Artifacts-main
> Would like to know how these jobs are set up. Are Jenkinsfiles maintained on 
> GitHub for those jobs, or are build/test commands executed directly from the 
> Execute shell option in the job configuration? (I can't find any Jenkinsfiles 
> in the Solr GitHub repo.)
> The purpose of asking for this information is that we would like to add one 
> more job which will be executed on s390x nodes/labels.
> s390x nodes are already available: https://jenkins-ccos.apache.org/view/all/
>  






[jira] [Commented] (LUCENE-10642) Regexp query: escape sequences are treated as character classes

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563351#comment-17563351
 ] 

Uwe Schindler commented on LUCENE-10642:


The question is: how was this handled before? We never supported escapes; I 
think \n never matched anything before the above change. Now it throws an 
error, but it was never functional, so it is not really a regression.

> Regexp query: escape sequences are treated as character classes
> ---
>
> Key: LUCENE-10642
> URL: https://issues.apache.org/jira/browse/LUCENE-10642
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.0, 9.1, 9.2, 9.3
>Reporter: Andriy Redko
>Priority: Major
>
> An interesting issue has been reported to the OpenSearch project [1], caused 
> by [2], [3]. In a nutshell, the regression causes escape sequences (like \n, 
> \r, \t, ...) to be treated as character classes (specifically, 
> https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs).
> The problematic function is RegExp::matchPredefinedCharacterClass, which does 
> not consider characters that denote an escaped construct. A simple test to 
> reproduce, which fails with IllegalArgumentException("invalid character 
> class"):
>  
> {noformat}
> public class TestRegexpQuery extends LuceneTestCase {
>   public void testEscapeSequences() throws IOException {
>     assertEquals(1, regexQueryNrHits("\\n"));
>     assertEquals(1, regexQueryNrHits("[\\n]"));
>   }
> }
> {noformat}
>  
> [1] [https://github.com/opensearch-project/OpenSearch/issues/3781]
> [2] 
> [https://github.com/apache/lucene/commit/1efce5444dd40142c55c5a3a30eeebc7b86796c3]
> [3] 
> [https://github.com/apache/lucene/commit/819e668ce2fcfcf86b652a191cdbe0fad0a8ffce]






[jira] [Commented] (LUCENE-10642) Regexp query: escape sequences are treated as character classes

2022-07-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563350#comment-17563350
 ] 

Uwe Schindler commented on LUCENE-10642:


I think the common escape sequences should be rewritten to their literal 
characters, like \t, \n, \r, and a few others. Each could be handled like a 
character class containing a single character.
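
A minimal sketch of that suggestion (not the actual Lucene patch; class and method names are made up): common control escapes are mapped to their literal code points, so "\n" can be parsed as a one-character class instead of tripping the predefined-class check.

```java
// Hypothetical helper: translate a simple control escape ('n', 't', ...)
// into its literal code point, so the caller can treat it as a
// single-character class. Returns -1 for anything that should fall
// through to the existing predefined-character-class handling.
public class EscapeRewrite {
  public static int toLiteral(char escaped) {
    switch (escaped) {
      case 'n': return '\n';
      case 't': return '\t';
      case 'r': return '\r';
      case 'f': return '\f';
      default: return -1; // not a common escape; use existing handling
    }
  }
}
```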

> Regexp query: escape sequences are treated as character classes
> ---
>
> Key: LUCENE-10642
> URL: https://issues.apache.org/jira/browse/LUCENE-10642
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.0, 9.1, 9.2, 9.3
>Reporter: Andriy Redko
>Priority: Major
>
> An interesting issue has been reported to the OpenSearch project [1], which 
> was caused by [2], [3]. In a nutshell, the regression causes escape 
> sequences (like \n, \r, \t, ...) to be treated as character classes 
> (specifically, 
> [https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs]).
> The problematic function is RegExp::matchPredefinedCharacterClass, which does 
> not consider characters that denote an escaped construct. A simple test to 
> reproduce, which fails with IllegalArgumentException("{color:#0451a5}invalid 
> character class{color}"):
>  
> {noformat}
> public class TestRegexpQuery extends LuceneTestCase {
>   public void testEscapeSequences() throws IOException {
>     assertEquals(1, regexQueryNrHits("\\n"));
>     assertEquals(1, regexQueryNrHits("[\\n]"));
>   }
> }
> {noformat}
>  
> [1] [https://github.com/opensearch-project/OpenSearch/issues/3781]
> [2] 
> [https://github.com/apache/lucene/commit/1efce5444dd40142c55c5a3a30eeebc7b86796c3]
> [3] 
> [https://github.com/apache/lucene/commit/819e668ce2fcfcf86b652a191cdbe0fad0a8ffce]






[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler edited a comment on LUCENE-10557:

Spring also has a cool redirector in their webserver. It only redirects if you don't pass a special param: https://jira.spring.io/browse/SPR-17639?redirect=false And they also added a comment at the end of all their issues (also by the bot).



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

Spring also has a cool redirector in their webserver. It only redirects if you don't pass a special param: https://jira.spring.io/browse/SPR-17649?redirect=false



[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler edited a comment on LUCENE-10557:

Spring also has a cool redirector in their webserver. It only redirects if you don't pass a special param: https://jira.spring.io/browse/SPR-17649?redirect=false And they also added a comment at the end of all their issues (also by the bot).



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

I know from the past that INFRA had some video conferences with GitHub representatives, so the ASF is not just "some arbitrary customer". I think there were a lot of discussions going on. The LUCENE.NET import was long before they had close contact with GitHub. I would really prefer to keep all original contributors; the change of names to some private account is a real blocker to me. If we can't modify the comment/issue creator mail address to use the person's official ASF one, or use some generic bot account, I would now vote -1 on the migration.

P.S.: Spring used a generic user for the import, "spring-projects-issues": https://github.com/spring-projects/spring-framework/issues/created_by/spring-projects-issues



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

Maybe we can ask them to manually disable notifications during the import.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

noreply@lao did not work. This time it gave an error message!



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

bq. We can't run the migration job ourselves (and I don't want to use my account for it). The actual migration will be done by an INFRA account. See the Lucene.NET project: https://github.com/apache/lucenenet/issues/280 - it seems that is still a personal account.

Chris Lambertus (fluxo) is his private account. I don't like to use that account either. I would prefer some generic "bot" account.
   



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

Maybe a solution to silence all mails during the migration would be to use a fake address under @lucene.apache.org, like nore...@lucene.apache.org. The automation at INFRA is possibly limited to the mailing list domain, and mocob...@apache.org has the wrong mail domain.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

When we do the migration, should we use some "generic" user / bot account? Otherwise we have "mocobeta" linked on all issues. Maybe there's an account at INFRA for doing this: they have tokens and some bot user on GitHub that could be used for the migration. We should ask them if they can give us a token (maybe they can create a token just for Lucene). I'd really recommend talking with them on Slack; discussing such ad hoc solutions through interfaces is a bit slow.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

Tomoko Uchida, this came to private@lao:

{quote}
Subject: Notification schemes for lucene-jira-archive.git updated
Date: Wed, 29 Jun 2022 14:28:15 -
From: GitBox
Reply-To: priv...@lucene.apache.org
To: priv...@lucene.apache.org

The following notification schemes have been changed on lucene-jira-archive by tomoko:

adding new scheme (commits): 'comm...@lucene.apache.org'
adding new scheme (issues): 'issues@lucene.apache.org'
adding new scheme (pullrequests): 'issues@lucene.apache.org'
adding new scheme (jira_options): 'link label worklog'

With regards, ASF Infra.
{quote}



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

bq. Looks like all updates in the repository are still noticed on d...@lucene.apache.org (initial setting when creating the repo?). Could anybody mute this?

d...@lucene.apache.org and comm...@lucene.apache.org were selected as defaults while creating the repo (see my screenshot above). Actually the PR/issue list should have been issues@lucene.apache.org, but in this case it should be completely silent. There seems to be some delay; maybe ask in Slack's infra channel.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

LOL. I got a message that it already exists.



[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler edited a comment on LUCENE-10557:

Hi Tomoko, I am able to create repos:
!screenshot-1.png|width=720!
Will now create the repo.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

Hi Tomoko, I am able to create repos:
!screenshot-1.png|width=720!
Will now create the issue.



[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Uwe Schindler (Jira)

Uwe Schindler updated an issue

Lucene - Core / LUCENE-10557
Migrate to GitHub issue from Jira

Change By: Uwe Schindler
Attachment: screenshot-1.png



[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-28 Thread Uwe Schindler (Jira)

Uwe Schindler edited a comment on LUCENE-10557:

bq. We'll be able to save multiple attachments with the same filename, but I think the reference disambiguation is almost impossible (filename is the unique key for an attachment in an issue).

The older versions have a different database id, so the link is different. The filename is just for human consumption. Where it is ambiguous is inside comments: comments referring to a file always point to the latest version.

This was also a long-standing issue in JIRA, and thousands of people complained (I maintain a huuuge JIRA instance). They "solved" it in later versions by appending numbers to the filename - but only during upload. Internally, the filename is still not a unique key. In addition you can still create duplicates using JIRA's API.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-28 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

Yes, that's exactly the problem of JIRA: https://jira.atlassian.com/browse/JRASERVER-2169



[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-28 Thread Uwe Schindler (Jira)

Uwe Schindler edited a comment on LUCENE-10557:

bq. We'll be able to save multiple attachments with the same filename, but I think the reference disambiguation is almost impossible (filename is the unique key for an attachment in an issue).

The older versions have a different database id, so the link is different. The filename is just for human consumption. Where it is ambiguous is inside comments: comments referring to a file always point to the latest version.

This was also a long-standing issue in JIRA, and thousands of people complained (I maintain a huuuge JIRA instance). They "solved" it in later versions by appending numbers to the filename - but only during upload. Internally, the filename is still not a unique key.



[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-28 Thread Uwe Schindler (Jira)

Uwe Schindler edited a comment on LUCENE-10557:

bq. We'll be able to save multiple attachments with the same filename, but I think the reference disambiguation is almost impossible (filename is the unique key for an attachment in an issue).

The older versions have a different database id, so the link is different. The filename is just for human consumption. Where it is ambiguous is inside comments: comments referring to a file always point to the latest version.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-28 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

bq. We'll be able to save multiple attachments with the same filename, but I think the reference disambiguation is almost impossible (filename is the unique key for an attachment in an issue).

The older versions have a different database id, so the link is different. The filename is just for human consumption.



[jira] [Commented] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-06-28 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10627:

bq. I wonder if we could reduce this complexity by reusing some existing abstractions like ByteBuffersDataInput instead of this new CompositeByteBuf, and have a single Compressor#compress API instead of two.

+1



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-28 Thread Uwe Schindler (Jira)

Uwe Schindler commented on LUCENE-10557:

bq. Once we have done this: Should we rewrite CHANGES.txt and replace all LUCENE- links with GITHUB# links?

bq. I'm not sure if it should be done. Just for your information, the current changes2html.pl supports only pull requests, so the script would have to be changed if we want to mention GitHub issues in CHANGES. (I have little experience with Perl, but I'll take a look if it's needed. Maybe we should also support issues in the near future.)

Actually we do not need extra code for this. Since GitHub issue numbers and pull requests share the same increasing integer space, GITHUB#1234 is always either a PR or an issue, so there is no problem in converting those. If we decide to move all historic issues to GitHub, we should also update the file. GitHub actually does the right thing when you create a link to an issue or PR: it redirects to the canonical URL. I would prefer to use "issues/" in the URL; if it is a PR, GitHub redirects.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17559285#comment-17559285
 ] 

Uwe Schindler commented on LUCENE-10557:


Once we have done this: Should we rewrite CHANGES.txt and replace all 
LUCENE- links with GITHUB# links?
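
Because GitHub issue and PR numbers share one number space, such a one-off conversion of CHANGES.txt would only need a Jira-key-to-GitHub-number map. A hedged sketch of what it could look like (class name and issue numbers are invented for illustration):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rewrite "LUCENE-NNNN" references to "GITHUB#MMMM"
// using a precomputed mapping from Jira issue numbers to GitHub numbers.
public class ChangesLinkRewriter {
  private static final Pattern JIRA_KEY = Pattern.compile("LUCENE-(\\d+)");

  public static String rewrite(String line, Map<Integer, Integer> jiraToGithub) {
    Matcher m = JIRA_KEY.matcher(line);
    StringBuilder sb = new StringBuilder();
    while (m.find()) {
      Integer gh = jiraToGithub.get(Integer.valueOf(m.group(1)));
      // Keep the original Jira key if no mapping exists for it.
      m.appendReplacement(sb, gh != null ? "GITHUB#" + gh : m.group(0));
    }
    m.appendTail(sb);
    return sb.toString();
  }
}
```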

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * Choose issues that should be moved to GitHub
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses. 
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
>  * Build the convention for issue label/milestone management
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)






[jira] [Commented] (LUCENE-9500) Did we hit a DEFLATE bug?

2022-06-24 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558427#comment-17558427
 ] 

Uwe Schindler commented on LUCENE-9500:
---

I opened a PR to remove the fix on main branch: 
https://github.com/apache/lucene/pull/977

> Did we hit a DEFLATE bug?
> -
>
> Key: LUCENE-9500
> URL: https://issues.apache.org/jira/browse/LUCENE-9500
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 8.x, 9.0, 8.7
>Reporter: Adrien Grand
>Assignee: Uwe Schindler
>Priority: Critical
>  Labels: Java13, Java14, Java15, java11, jdk11, jdk13, jdk14, 
> jdk15
> Fix For: 8.x, 9.0, 8.7
>
> Attachments: PresetDictTest.java, test_data.txt
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I've been digging 
> [https://ci-builds.apache.org/job/Lucene/job/Lucene-Solr-NightlyTests-master/23/]
>  all day and managed to isolate a simple reproduction that shows the problem. 
> I've been starring at it all day and can't find what we are doing wrong, 
> which makes me wonder whether we're calling DEFLATE the wrong way or whether 
> we hit a DEFLATE bug. I've looked at it so much that I may be missing the 
> most obvious stuff.






[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553131#comment-17553131
 ] 

Uwe Schindler edited comment on LUCENE-10610 at 6/11/22 5:51 PM:
-

I looked at the code again:
- The Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide whether equals or hashCode is needed. I would 
just make the already existing hashCode bug free. hashCode should take the same 
fields to calculate the hash code that are also used by equals. This would make 
the query cache work fine; that's all that is needed.

I do not think we need to discuss whether equals/hashCode ensures that two 
automatons are semantically equal (describe a state machine with the same behaviour). 
For the query cache we only need to make sure that a query that's created with the 
same input has a RunAutomaton that equals the one of the other query (I think 
that's given, except for hashCode). We don't need to cache cases where the automaton 
looks different because the regex was different but functionally the same.

If we need it for the query cache, I think maybe the RunAutomaton should not be 
used at all by the query, and only the direct query inputs used for the Query's 
equals/hashCode (like the regex string or prefix/wildcard or fuzzy term).
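
A minimal sketch of the narrow fix described above, assuming the simplest change is to fold the contents of the points array into the existing hash (hypothetical code, not the actual Lucene patch):

```java
import java.util.Arrays;

// Hypothetical fix sketch: cover the same fields that equals() compares,
// including the *contents* of the points array (via Arrays.hashCode),
// so automatons with equal alphabetSize/size but different points no
// longer collide on the same hash value.
public class RunAutomatonHash {
  public static int hashCode(int alphabetSize, int[] points, int size) {
    final int prime = 31;
    int result = 1;
    result = prime * result + alphabetSize;
    result = prime * result + Arrays.hashCode(points); // contents, not just length
    result = prime * result + size;
    return result;
  }
}
```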


was (Author: thetaphi):
I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide if equals or hashCode is needed. I would 
just make the already existing hashCode bug free. hashCode should take the same 
fields to calculate the hashcode that are also used by equals. This would make 
query cache work fine, that's all needed.

I do not think we need to discuss if equals/hashCode ensures that two 
automatons are semantically equal (describe state machine with same behaviour). 
For query cache we only need to make sure that a query thats created with the 
same input has a RunAutomaton that equals the one of other query (I think 
that's given, only hashCode). We don't need to cache cases where the automaton 
looks different because the regex was different but functionally same.

If we need it for query cache, i think maybe the RunAutomaton should not be 
used at all by the query and only the direct query inputs be cached (like regex 
string or prefix/wildcard or fuzzy term).

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.






[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553131#comment-17553131
 ] 

Uwe Schindler edited comment on LUCENE-10610 at 6/11/22 5:49 PM:
-

I looked at the code again:
- The Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide whether equals or hashCode is needed. I would 
just make the already existing hashCode bug free. hashCode should take the same 
fields to calculate the hash code that are also used by equals. This would make 
the query cache work fine; that's all that is needed.

I do not think we need to discuss whether equals/hashCode ensures that two 
automatons are semantically equal (describe a state machine with the same behaviour). 
For the query cache we only need to make sure that a query that's created with the 
same input has a RunAutomaton that equals the one of the other query (I think 
that's given, except for hashCode). We don't need to cache cases where the automaton 
looks different because the regex was different but functionally the same.

If we need it for the query cache, I think maybe the RunAutomaton should not be 
used at all by the query, and only the direct query inputs be cached (like the regex 
string or prefix/wildcard or fuzzy term).


was (Author: thetaphi):
I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide if equals or hashCode is needed. I would 
just make the already existing hashCode bug free. hashCode should take the same 
fields to calculate the hashcode that are also used by equals. This would make 
query cache work fine, that's all needed.

I do not think we need to discuss if equals/hashCode ensures that two 
automatons are semantically equal (describe state machine with same behaviour). 
For query cache we only need to make sure that a query thats created with the 
same input has the RunAutomaton. We don't need to cache cases where the 
automaton looks different because the regex was different but functionally same.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553131#comment-17553131
 ] 

Uwe Schindler commented on LUCENE-10610:


I looked at the code again:
- Automaton class has no equals and no hashCode
- RunAutomaton has equals and hashCode

I don't want to refactor or decide whether equals or hashCode is needed. I would 
just make the already existing hashCode bug-free. hashCode should use the same 
fields to calculate the hash code that are also used by equals. This would make 
the query cache work fine; that's all that's needed.

I do not think we need to discuss whether equals/hashCode ensures that two 
automatons are semantically equal (describe state machines with the same 
behaviour). For the query cache we only need to make sure that a query that's 
created with the same input has the same RunAutomaton. We don't need to cache 
cases where the automaton looks different because the regex was different but 
functionally the same.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.






[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-11 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553063#comment-17553063
 ] 

Uwe Schindler commented on LUCENE-10610:


I think the problem may be query caching. This would mean all automatons cannot 
be cached. Equals is currently implemented for RunAutomaton; it is just that 
hashCode is inconsistent. That is unfortunate, but not a bug. You just get 
more hash collisions.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.






[jira] [Comment Edited] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552733#comment-17552733
 ] 

Uwe Schindler edited comment on LUCENE-10610 at 6/10/22 12:49 PM:
--

Thanks for finding this. The solution is:
- Make equals and hashCode symmetric in what they include
- Cache the hashCode for performance by either calculating it in the constructor 
or doing lazy init using a transient field. An Integer object (initially null) may 
also be a good candidate for this. No synchronization is needed, as different 
threads may create the same cached value in parallel, which won't hurt


was (Author: thetaphi):
Thanks for finding this. The solution is:
- Make equals and hashCode symmetric in what they include
- Cache the hashCode for performance by either calculating it in the constructor 
or doing lazy init using a transient field. A SetOnce<Integer> may also be a good 
candidate for this. No synchronization is needed, as different threads may create 
the same cached value in parallel, which won't hurt

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.






[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552733#comment-17552733
 ] 

Uwe Schindler commented on LUCENE-10610:


Thanks for finding this. The solution is:
- Make equals and hashCode symmetric in what they include
- Cache the hashCode for performance by either calculating it in the constructor 
or doing lazy init using a transient field. A SetOnce<Integer> may also be a good 
candidate for this. No synchronization is needed, as different threads may create 
the same cached value in parallel, which won't hurt

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.






[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552727#comment-17552727
 ] 

Uwe Schindler commented on LUCENE-10610:


I checked the code. If you look at equals you see the automaton is never part 
of the picture; it is only referenced in the constructor and only used for some 
calculations in derived objects.

The equals of RunAutomaton only compares the local arrays. Therefore the 
hashCode must also do this. As creating hash codes of arrays is expensive, those 
should be cached.

Actually it is quite easy: either create the integer hashCode in the ctor and 
just return it, or do lazy init. The arrays and bitsets used in equals don't 
change anymore.
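As an illustration of the constructor-time approach, here is a self-contained 
sketch. The class and field names are hypothetical stand-ins for RunAutomaton 
(this is not the actual Lucene class): the hash is computed once, over exactly 
the fields that equals() compares.

```java
import java.util.Arrays;

// Simplified stand-in for RunAutomaton with a precomputed hash code.
// All fields are immutable after construction, so caching is safe.
final class CachedHashAutomaton {
    private final int alphabetSize;
    private final int[] points;      // compared by equals(), so hashed too
    private final boolean[] accept;
    private final int hashCode;      // computed once in the constructor

    CachedHashAutomaton(int alphabetSize, int[] points, boolean[] accept) {
        this.alphabetSize = alphabetSize;
        this.points = points.clone();
        this.accept = accept.clone();
        // Hash exactly the fields that equals() compares, including
        // the array *contents* via Arrays.hashCode, not just lengths.
        int h = 31 * alphabetSize + Arrays.hashCode(this.points);
        this.hashCode = 31 * h + Arrays.hashCode(this.accept);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedHashAutomaton)) return false;
        CachedHashAutomaton other = (CachedHashAutomaton) o;
        return alphabetSize == other.alphabetSize
            && Arrays.equals(points, other.points)
            && Arrays.equals(accept, other.accept);
    }

    @Override
    public int hashCode() {
        return hashCode; // O(1): the arrays never change after construction
    }
}
```

With this, two automatons built from different points arrays (like the "aba" 
vs. "fee" prefix example in the issue) no longer share a hash code just 
because their sizes match.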

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.






[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552725#comment-17552725
 ] 

Uwe Schindler commented on LUCENE-10610:


But the RunAutomaton is not modifiable, right? Then maybe cache the hashCode only 
there. If it's only an array, that's easy using Arrays#hashCode() :-)

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.






[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552710#comment-17552710
 ] 

Uwe Schindler commented on LUCENE-10610:


Yes, but Automaton should cache the hashcode.

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.






[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

2022-06-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552702#comment-17552702
 ] 

Uwe Schindler commented on LUCENE-10610:


The hashCode does not need to be unique. If two objects have the same hash code, 
the equals relation must be used.

But in general the hash code should also contain the points, so yes, it is a kind 
of bug - but it only makes lookups inefficient. The problem is that it is 
expensive to create that from huge FSAs. Normally you should have a separate 
private field in the class marked "private transient int hashCode" and 
lazily cache the calculated hash code there (works if the object is unmodifiable). 
The Java String class has this so it does not recreate the hashCode on every call. 
The method does not need to be synchronized (although it can be updated 
concurrently), because the hash code is of atomic size (32 bits) and it does not 
matter if two threads recreate the hash code at the same time.
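The String-style lazy-caching idiom described above can be sketched in a 
self-contained way; the class and fields below are hypothetical, not Lucene 
code. The unsynchronized write is a benign race: concurrent threads can only 
ever store the same 32-bit value.

```java
import java.util.Arrays;

// Sketch of the lazy hashCode cache described above (the java.lang.String
// idiom). Safe without synchronization because the fields hashed are
// immutable and an int write is atomic in Java.
final class LazyHashExample {
    private final int[] points;
    private transient int hash; // 0 means "not computed yet"

    LazyHashExample(int[] points) {
        this.points = points.clone();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {              // first call (or a genuine hash of 0)
            h = Arrays.hashCode(points);
            if (h == 0) h = 1;     // avoid recomputing forever when hash is 0
            hash = h;              // benign race: all writers store the same value
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof LazyHashExample
            && Arrays.equals(points, ((LazyHashExample) o).points);
    }
}
```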

> RunAutomaton#hashCode() can easily cause hash collision for different 
> Automatons
> 
>
> Key: LUCENE-10610
> URL: https://issues.apache.org/jira/browse/LUCENE-10610
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
> final int prime = 31;
> int result = 1;
> result = prime * result + alphabetSize;
> result = prime * result + points.length;
> result = prime * result + size;
> return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this 
> returns the same value for different automatons when their alphabet size and 
> state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
> PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
> PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
> assert q1.compiled.runAutomaton.hashCode() == 
> q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} 
> take account of additional information when calculating their own hash value, 
> it seems there is no substantial impact on higher-level APIs.






[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing

2022-06-08 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551720#comment-17551720
 ] 

Uwe Schindler commented on LUCENE-10602:


Theoretically you could also serialize huge bitsets to disk (temp directory) 
and, with MMapDirectory (and/or Java 19 MemorySegments -> next week at BBUZZ), 
make an off-heap cache for large shards.

I think there is a lot of flexibility in that small interface.

> Dynamic Index Cache Sizing
> --
>
> Key: LUCENE-10602
> URL: https://issues.apache.org/jira/browse/LUCENE-10602
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Chris Earle
>Priority: Major
>
> Working with Lucene's filter cache, it has become apparent that it can be an 
> enormous drain on the heap and therefore the JVM. After extensive usage of an 
> index, it is not uncommon to tune performance by shrinking or altogether 
> removing the filter cache.
> Lucene tracks hit/miss stats of the filter cache, but it does nothing with 
> the data other than inform an interested user about the effectiveness of 
> their index's caching.
> It would be interesting if Lucene would be able to tune the index filter 
> cache heuristically based on actual usage (age, frequency, and value).
> This could ultimately be used to give GBs of heap back to an individual 
> Lucene instance instead of burning it on cache storage that's not effectively 
> used (or useful).






[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing

2022-06-08 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551719#comment-17551719
 ] 

Uwe Schindler commented on LUCENE-10602:


Elasticsearch could, for example, subclass LRUQueryCache and disallow caching of 
results for some (huge) shards, or make sure that smaller entries are not expunged.

> Dynamic Index Cache Sizing
> --
>
> Key: LUCENE-10602
> URL: https://issues.apache.org/jira/browse/LUCENE-10602
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Chris Earle
>Priority: Major
>
> Working with Lucene's filter cache, it has become apparent that it can be an 
> enormous drain on the heap and therefore the JVM. After extensive usage of an 
> index, it is not uncommon to tune performance by shrinking or altogether 
> removing the filter cache.
> Lucene tracks hit/miss stats of the filter cache, but it does nothing with 
> the data other than inform an interested user about the effectiveness of 
> their index's caching.
> It would be interesting if Lucene would be able to tune the index filter 
> cache heuristically based on actual usage (age, frequency, and value).
> This could ultimately be used to give GBs of heap back to an individual 
> Lucene instance instead of burning it on cache storage that's not effectively 
> used (or useful).






[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing

2022-06-08 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551717#comment-17551717
 ] 

Uwe Schindler commented on LUCENE-10602:


If all this is needed, why not implement your own QueryCache instance for ES? 
Then you have more control over when and how something is cached or evicted. 
There's no need to use IndexSearcher#DEFAULT_QUERY_CACHE; you can set your 
own (globally as the default per node, or per searcher).

Maybe it would be good to get more context, but actually the caching policy is 
more up to the application. IMHO, it should be enough to implement a 
sophisticated 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/QueryCache.java

The LRUQueryCache in Lucene is just an example.
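As a loose illustration of the kind of eviction policy such a custom cache 
controls, here is a plain-JDK access-ordered LRU map (this is not the Lucene 
QueryCache API; it only demonstrates the "what stays, what gets evicted" 
decision an implementation owns):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache built on LinkedHashMap's access ordering.
// A custom QueryCache implementation makes analogous decisions:
// which entries to admit and which to evict under memory pressure.
final class TinyLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    TinyLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true -> LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict least-recently-used beyond the cap
    }
}
```

A real QueryCache would additionally weigh entries by memory footprint and 
segment size rather than by entry count alone.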

> Dynamic Index Cache Sizing
> --
>
> Key: LUCENE-10602
> URL: https://issues.apache.org/jira/browse/LUCENE-10602
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Chris Earle
>Priority: Major
>
> Working with Lucene's filter cache, it has become apparent that it can be an 
> enormous drain on the heap and therefore the JVM. After extensive usage of an 
> index, it is not uncommon to tune performance by shrinking or altogether 
> removing the filter cache.
> Lucene tracks hit/miss stats of the filter cache, but it does nothing with 
> the data other than inform an interested user about the effectiveness of 
> their index's caching.
> It would be interesting if Lucene would be able to tune the index filter 
> cache heuristically based on actual usage (age, frequency, and value).
> This could ultimately be used to give GBs of heap back to an individual 
> Lucene instance instead of burning it on cache storage that's not effectively 
> used (or useful).






[jira] [Commented] (LUCENE-10578) Make minimum required Java version for build more specific

2022-06-03 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17546055#comment-17546055
 ] 

Uwe Schindler commented on LUCENE-10578:


Please let's start with the versions deployed on Jenkins where all tests pass. 
Updating after Berlin Buzzwords, please. I am too busy at the moment, sorry.

> Make minimum required Java version for build more specific
> --
>
> Key: LUCENE-10578
> URL: https://issues.apache.org/jira/browse/LUCENE-10578
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> See this mail thread for background: 
> [https://lists.apache.org/thread/6md5k94pqdkkwg0f66hor2sonm2t77jo]
> To prevent developers (especially, release managers) from using too old java 
> versions, we could (should?) elaborate the minimum required java versions for 
> the build.
> Possible questions in my mind:
>  * should we stop the build with an error or emit a warning and continue?
>  * do minor versions depend on the vendor? if yes, should we also specify the 
> vendor?
>  * how do we determine/maintain the minimum version?
>  
>  






[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-26 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542440#comment-17542440
 ] 

Uwe Schindler edited comment on LUCENE-10562 at 5/26/22 10:58 AM:
--

Hi [~zhuming],
this question is better asked on the user mailing list.

As a short answer: If you use {{TopTermsScoringBooleanQueryRewrite}} you have to 
live with the consequences. As said several times in this issue: If you need to 
use wildcard queries, think about changing your analysis so you can do the same 
queries in a performant way (e.g., by using ngrams in the analysis). It is 
impossible to implement wildcard queries efficiently in an inverted index, as 
the expansion is always done before the query and it can't use any other query 
clauses: there's no way to select only terms in the first query that would also 
produce a hit for the second query (your filter), as there is no relationship 
between them at all.

In addition: scoring wildcard queries like that - "hoping for something" - 
does not look like the right way to solve your problem.


was (Author: thetaphi):
Hi [~zhuming],
this question is better asked on the user mailing list.

As a short answer: If you use {{TopTermsScoringBooleanQueryRewrite}} you have to 
live with the consequences. As said several times in this issue: If you need to 
use wildcard queries, think about changing your analysis so you can do the same 
queries in a performant way (e.g., by using ngrams in the analysis). It is 
impossible to implement wildcard queries efficiently in an inverted index, as 
the expansion is always done before the query and it can't use any other query 
clauses: there's no way to select only terms in the first query that would also 
produce a hit for the second query (your filter), as there is no relationship 
between them at all.

In addition: scoring wildcard queries like that is not the right way to 
solve your problem.

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.






[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-26 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542440#comment-17542440
 ] 

Uwe Schindler commented on LUCENE-10562:


Hi [~zhuming],
this question is better asked on the user mailing list.

As a short answer: If you use {{TopTermsScoringBooleanQueryRewrite}} you have to 
live with the consequences. As said several times in this issue: If you need to 
use wildcard queries, think about changing your analysis so you can do the same 
queries in a performant way (e.g., by using ngrams in the analysis). It is 
impossible to implement wildcard queries efficiently in an inverted index, as 
the expansion is always done before the query and it can't use any other query 
clauses: there's no way to select only terms in the first query that would also 
produce a hit for the second query (your filter), as there is no relationship 
between them at all.

In addition: scoring wildcard queries like that is not the right way to 
solve your problem.
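To illustrate why the ngram approach helps, here is a plain-Java sketch (not a 
Lucene analysis chain; the class and method are hypothetical): if each term is 
indexed as its character n-grams, an infix pattern like *searchvalue* becomes 
exact lookups of the query string's own n-grams, which an inverted index 
answers efficiently instead of expanding a leading-wildcard term.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Demonstrates the n-gram decomposition that an analysis chain would
// perform at index and query time. Any document term containing the
// query's n-grams as a contiguous run is an infix-match candidate.
final class NgramDemo {
    // All contiguous n-grams of the given length within the term.
    static Set<String> ngrams(String term, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }
}
```

For example, a term indexed with 3-grams can be found for *searchvalue* by 
looking up grams such as "sea", "ear", or "alu" as ordinary terms.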

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-18 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538598#comment-17538598
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. The stopwords are going to skew everything. If someone is removing them, 
the distribution of tokens will look much different.

If wikipedia has so many stopwords, this would explain what Mike is seeing. 
Every stop word produces a hash that's already known. So the Arrays.equals() 
code runs on each stopword every time it is seen over and over.

Maybe let's just change the analyzer that Mike uses to remove those stopwords? 
Or are there many stopwords we do not know about?

Nevertheless, this is a valid use case: text without stopwords and text with 
stopwords (especially because we recommend to users not to remove stopwords 
anymore).

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-17 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538198#comment-17538198
 ] 

Uwe Schindler commented on LUCENE-10572:


If we have 2 hash tables, we could have one for short terms up to 255 bytes 
(for sure we could also make the limit smaller, but 255 is the limit to get the 
1-byte length encoding), and all longer ones in a separate hash (where the 
comparisons are also more expensive).

I am not sure if the additional complexity is worth it.

About changing the hash algorithm: we may add a counter into the hash table to 
actually measure how many collisions we have during indexing Wikipedia. But 
actually when inserting a term already in the hash table, we get a hash 
collision and have to confirm with Arrays.equals() that the term is already 
there. I tend to think that the smaller terms are more often duplicates than 
larger ones, so having them in a separate table may be a good idea.

Maybe we should gather some statistics during Wikipedia indexing:

- how many hash collisions do we have (where term is actually not already in 
table)? => this ratio should be low. We can compare hash algorithms for that.
- how many hash collisions do we get because the term is already in table? => 
this is most expensive memory-wise, because hash AND equals have to be 
calculated.
- how many inserts of new terms without a collision do we get?
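
A toy open-addressing table can tally exactly these three cases. Everything here is illustrative (this is not BytesRefHash, and the table never grows), just the counting scheme:

```java
public class CollisionStats {
  final String[] slots = new String[1 << 10];
  long cleanInserts;    // new term landed without meeting another term
  long duplicateHits;   // term already in table: hash AND equals both paid
  long missCollisions;  // probed a slot occupied by a *different* term

  void add(String term) {
    int slot = (term.hashCode() & 0x7fffffff) % slots.length;
    while (slots[slot] != null) {
      if (slots[slot].equals(term)) {
        duplicateHits++;
        return;
      }
      missCollisions++;
      slot = (slot + 1) % slots.length; // linear probing
    }
    slots[slot] = term;
    cleanInserts++;
  }

  public static void main(String[] args) {
    CollisionStats s = new CollisionStats();
    for (String t : new String[] {"the", "quick", "the", "fox", "the"}) {
      s.add(t);
    }
    // 3 distinct terms inserted, 2 stopword-style duplicate hits
    System.out.println(s.cleanInserts + " " + s.duplicateHits + " " + s.missCollisions);
  }
}
```

The ratio of missCollisions to total adds is what would differ between hash algorithms; duplicateHits is inherent to the input (e.g. stopword-heavy text) and unavoidable.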

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537477#comment-17537477
 ] 

Uwe Schindler commented on LUCENE-10572:


BytesRefHash has this field already: {{int[] bytesStart;}} The problem is that 
it also encodes the block number, so the difference between two offsets in 
different blocks is not the size.
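
A minimal sketch of why the difference is not the length, using made-up small constants (only the encoding scheme, not Lucene's code; Lucene's actual byte blocks are 32 KB):

```java
public class PoolOffsets {
  static final int BLOCK_SIZE = 16; // illustrative; entries never span a block boundary

  // bytesStart-style global offset: block number folded into the value
  static int globalOffset(int block, int offsetInBlock) {
    return block * BLOCK_SIZE + offsetInBlock;
  }

  public static void main(String[] args) {
    // A 2-byte entry starts at in-block offset 10 of block 0 (ends at 12).
    // The next entry needs 5 bytes, which do not fit in the remaining 4,
    // so it is allocated at the start of block 1.
    int prev = globalOffset(0, 10);
    int next = globalOffset(1, 0);
    // prints 6, not 2: the difference also counts the unused tail of block 0
    System.out.println(next - prev);
  }
}
```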

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537405#comment-17537405
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/16/22 8:56 AM:
-

Hi Dawid,
that's a nice issue. When looking at the BytesRefHash/ByteBlockPool 
implementation I already thought: Why the hell do we need the redundant length 
at all? We have a lookup from the index to the offset inside the block pool 
anyways. Instead of storing the length, we can look up "offset of next entry - 
offset of entry to be looked up". The only special case is the very last item, 
but we just need to keep a slot for the next entry anyways.

So I'd also look into improving this. Nevertheless, the main limiting factor of 
the BytesRefHash is the equals (although vectorized) because it always needs to 
be verified. This is costly for long terms (long comparison) and it is very 
CPU-cache unfriendly.
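
The offset-difference idea can be sketched like this. It is a simplified model (contiguous packing, ignoring the block-boundary caveat discussed elsewhere in this issue; names are illustrative):

```java
public class OffsetLengths {
  // starts[i] is where entry i begins; starts[count] is the extra sentinel
  // slot holding the "next free" offset, so no per-entry length is stored.
  final int[] starts;
  final int count;

  OffsetLengths(int[] startsWithSentinel) {
    this.starts = startsWithSentinel;
    this.count = startsWithSentinel.length - 1;
  }

  int length(int i) {
    return starts[i + 1] - starts[i]; // length recovered from consecutive offsets
  }

  public static void main(String[] args) {
    // three entries of lengths 3, 5 and 2, packed back to back
    OffsetLengths o = new OffsetLengths(new int[] {0, 3, 8, 10});
    // prints "3 5 2"
    System.out.println(o.length(0) + " " + o.length(1) + " " + o.length(2));
  }
}
```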


was (Author: thetaphi):
Hi Dawid,
thats a nice issue. When looking at the BytesRefHash/ByteBlockPool 
implementation I already thought: Why the hell do we need redundant the length 
at all? We have a lookup from the index to the offset inside the block pool 
anyways. Instead of storing the length, we can lookup offset of next entry - 
offset of entry to be looked up. The only special case is the very last item.

So I'd also look into improving this. Nevertheless, the main limiting factor of 
the BytesRefHash is the equals (although vectorized) because it always needs to 
be verified. This is costly for long terms (long comparison) and it is very cpu 
cache unfriendly.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537405#comment-17537405
 ] 

Uwe Schindler commented on LUCENE-10572:


Hi Dawid,
that's a nice issue. When looking at the BytesRefHash/ByteBlockPool 
implementation I already thought: Why the hell do we need the redundant length 
at all? We have a lookup from the index to the offset inside the block pool 
anyways. Instead of storing the length, we can look up offset of next entry - 
offset of entry to be looked up. The only special case is the very last item.

So I'd also look into improving this. Nevertheless, the main limiting factor of 
the BytesRefHash is the equals (although vectorized) because it always needs to 
be verified. This is costly for long terms (long comparison) and it is very 
CPU-cache unfriendly.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-14 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537013#comment-17537013
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/14/22 10:29 AM:
--

Mike, could you test how much memory increases with the PR and whether there's 
a speed improvement at all?

If memory due to the 2-byte length (this can sum up very fast if you know that 
most terms in your index are < 128 bytes) is increasing and there's no speed 
improvement, let's throw away my PR. This would be the confirmation that equals 
is memory bound (for the reasons I told you before). Arrays.equals() is using 
the intrinsic "ArraysSupport#vectorizedMismatch()" (using SIMD) in JDK 17 when 
the array is long enough (>= 8 bytes; if smaller it does not do any 
vectorization) -> this should answer your question:

- Our code: 
[BytesRefHash.java#L178|https://github.com/apache/lucene/blob/c1b626c0636821f4d7c085895359489e7dfa330f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L178]
- Arrays#equals calling ArraysSupport#mismatch: 
[Arrays.java#L2710-L2712|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/Arrays.java#L2710-L2712]
- ArraysSupport#mismatch calling the vectorizedMismatch method: 
[ArraysSupport.java#L275-L303|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L275-L303]
- Here is the method called at the end, which is {{@IntrinsicCandidate}}: 
[ArraysSupport.java#L111-L161|https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L111-L161]
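
The range-based {{Arrays.equals}}/{{Arrays.mismatch}} overloads (JDK 9+) are the public entry points into the ArraysSupport#mismatch path linked above; a quick standalone demonstration:

```java
import java.util.Arrays;

public class RangeEquals {
  public static void main(String[] args) {
    byte[] pool = "xxsearchvalueyy".getBytes(); // term bytes embedded in a larger buffer
    byte[] term = "searchvalue".getBytes();
    // compare term against pool[2..13) without copying a slice out of the buffer
    boolean same = Arrays.equals(pool, 2, 2 + term.length, term, 0, term.length);
    // index of the first differing byte ('v' vs 'X')
    int firstDiff = Arrays.mismatch(term, "searchXalue".getBytes());
    System.out.println(same + " " + firstDiff); // prints "true 6"
  }
}
```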


was (Author: thetaphi):
Mike, could you make a test on how much memory increaes by the PR and if 
there's a speed improvement at all?

If memory due to the 2-byte length (this can sum up very fast if you know that 
most terms in your index are < 128 bytes) is increasing and there's no speed 
imporvement, let's throw away my PR. This would be the confirmation that equals 
is memory bound (for the reasons I told you before). Arrays.equals() is using 
intrinsic "ArraySupport#vectorizedMismatch()" (using SIMD) in JDK 17 when the 
array is long enough (around 9 bytes, if smaller it does not do any 
vectorization) -> this should answer your question:

- Our code: 
[BytesRefHash.java#L178|https://github.com/apache/lucene/blob/c1b626c0636821f4d7c085895359489e7dfa330f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L178]
- Arrays#equals calling ArraySupport#mismatch: 
[Arrays.java#L2710-L2712|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/Arrays.java#L2710-L2712]
- ArraySupport#mismatch calling the vectorizedMismatch method: 
[ArraysSupport.java#L275-L303|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L275-L303]
- Here is the method called at end, which is {{@IntrinsicCandidate}}: 
[ArraysSupport.java#L111-L161|https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L111-L161]

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's 

[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-14 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537013#comment-17537013
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/14/22 10:28 AM:
--

Mike, could you test how much memory increases with the PR and whether there's 
a speed improvement at all?

If memory due to the 2-byte length (this can sum up very fast if you know that 
most terms in your index are < 128 bytes) is increasing and there's no speed 
improvement, let's throw away my PR. This would be the confirmation that equals 
is memory bound (for the reasons I told you before). Arrays.equals() is using 
the intrinsic "ArraysSupport#vectorizedMismatch()" (using SIMD) in JDK 17 when 
the array is long enough (around 9 bytes; if smaller it does not do any 
vectorization) -> this should answer your question:

- Our code: 
[BytesRefHash.java#L178|https://github.com/apache/lucene/blob/c1b626c0636821f4d7c085895359489e7dfa330f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L178]
- Arrays#equals calling ArraySupport#mismatch: 
[Arrays.java#L2710-L2712|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/Arrays.java#L2710-L2712]
- ArraySupport#mismatch calling the vectorizedMismatch method: 
[ArraysSupport.java#L275-L303|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L275-L303]
- Here is the method called at end, which is {{@IntrinsicCandidate}}: 
[ArraysSupport.java#L111-L161|https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L111-L161]


was (Author: thetaphi):
Mike, could you make a test on how much memory increaes by the PR and if 
there's a speed improvement at all?

If memory due to the 2-byte length (this can sum up very fast if you know that 
most terms in your index are < 128 bytes) is increasing and there's no speed 
imporvement, let's throw away my PR. This would be the confirmation that equals 
is memory bound (for the reasons I told you before). Arrays.equals() is using 
intrinsic "ArraySupport#vectorizedMismatch()" (using SIMD) in JDK 17 -> this 
should answer your question:


- Our code: 
[BytesRefHash.java#L178|https://github.com/apache/lucene/blob/c1b626c0636821f4d7c085895359489e7dfa330f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L178]
- Arrays#equals calling ArraySupport#mismatch: 
[Arrays.java#L2710-L2712|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/Arrays.java#L2710-L2712]
- ArraySupport#mismatch calling the vectorizedMismatch method: 
[ArraysSupport.java#L275-L303|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L275-L303]
- Here is the method called at end, which is {{@IntrinsicCandidate}}: 
[ArraysSupport.java#L111-L161|https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L111-L161]

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an 

[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-14 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537013#comment-17537013
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/14/22 10:27 AM:
--

Mike, could you test how much memory increases with the PR and whether there's 
a speed improvement at all?

If memory due to the 2-byte length (this can sum up very fast if you know that 
most terms in your index are < 128 bytes) is increasing and there's no speed 
improvement, let's throw away my PR. This would be the confirmation that equals 
is memory bound (for the reasons I told you before). Arrays.equals() is using 
the intrinsic "ArraysSupport#vectorizedMismatch()" (using SIMD) in JDK 17 -> 
this should answer your question:


- Our code: 
[BytesRefHash.java#L178|https://github.com/apache/lucene/blob/c1b626c0636821f4d7c085895359489e7dfa330f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L178]
- Arrays#equals calling ArraySupport#mismatch: 
[Arrays.java#L2710-L2712|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/Arrays.java#L2710-L2712]
- ArraySupport#mismatch calling the vectorizedMismatch method: 
[ArraysSupport.java#L275-L303|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L275-L303]
- Here is the method called at end, which is {{@IntrinsicCandidate}}: 
[ArraysSupport.java#L111-L161|https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L111-L161]


was (Author: thetaphi):
Mike, could you test how much memory increases with the PR, and whether there's
a speed improvement at all?

If memory is increasing due to the 2-byte length (this can add up very fast if
you know that most terms in your index are < 128 bytes) and there's no speed
improvement, let's throw away my PR. This would be the confirmation that equals
is memory bound (for the reasons I told you before). Arrays.equals() uses the
intrinsic "ArraysSupport#vectorizedMismatch()" (using SIMD) in JDK 17 -> this
should answer your question:


- Our code: 
[BytesRefHash.java#L178|https://github.com/apache/lucene/blob/c1b626c0636821f4d7c085895359489e7dfa330f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L178]
- Arrays#equals calling ArraysSupport#mismatch: 
[Arrays.java#L2710-L2712|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/Arrays.java#L2710-L2712]
- ArraysSupport#mismatch calling the vectorizedMismatch method: 
[ArraysSupport.java#L275-L303|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L275-L303]
- Here is the method called at end, which is {{@IntrinsicCandidate}}: 
[ArraysSupport.java#L111-L135|https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L111-L135]

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using 

[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-14 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537013#comment-17537013
 ] 

Uwe Schindler commented on LUCENE-10572:


Mike, could you test how much memory increases with the PR, and whether there's
a speed improvement at all?

If memory is increasing due to the 2-byte length (this can add up very fast if
you know that most terms in your index are < 128 bytes) and there's no speed
improvement, let's throw away my PR. This would be the confirmation that equals
is memory bound (for the reasons I told you before). Arrays.equals() uses the
intrinsic "ArraysSupport#vectorizedMismatch()" (using SIMD) in JDK 17 -> this
should answer your question:


- Our code: 
[BytesRefHash.java#L178|https://github.com/apache/lucene/blob/c1b626c0636821f4d7c085895359489e7dfa330f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L178]
- Arrays#equals calling ArraysSupport#mismatch: 
[Arrays.java#L2710-L2712|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/Arrays.java#L2710-L2712]
- ArraysSupport#mismatch calling the vectorizedMismatch method: 
[ArraysSupport.java#L275-L303|https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L275-L303]
- Here is the method called at end, which is {{@IntrinsicCandidate}}: 
[ArraysSupport.java#L111-L135|https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L111-L135]

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-14 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537009#comment-17537009
 ] 

Uwe Schindler commented on LUCENE-10572:


Hi,
actually, the reason why BE encoding was used in ByteBlockPool and BytesRefHash
(same for PagedBytes) was to emulate the vInt encoding. For the data structure
itself it makes no difference, so I am fine with any decision. The only thing
that needs to be done is to remove the vInt encoding, because it relies on BE
to work (see the code that I removed). It could be fixed to also work LE, but
it's not universal. If you have many short terms (the default for most texts),
the special case with 1-byte encoding for lengths < 128 is a good idea to avoid
space problems during indexing. It just brings problems when you encode/decode
it all the time (e.g. when searching the hash table with equals).

From looking at the code: I doubt that the problem really comes from the vInt
encoding in BE format. Most terms are shorter than 128 bytes in normal
indexes. I am quite sure the problem with the hotspot at equals is simply that
you need to do a full comparison using Arrays.equals(). This happens mostly in
the case that the term is already there (hash collisions also happen on the
same term). Simply put: if you insert a BytesRef into the hash and the term
already exists (a common case during text indexing), it will get a hashcode
that already exists in the table. To verify that the term is really the same,
the bytes have to be compared. This is why we see the hotspot on equals. A
better hashing algorithm does not help here either, as the case "term already
in table" always needs verification, no matter how good the hashing algorithm
is.
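The "term already in table" argument can be sketched with a toy open-addressing
hash (assumed names and a fixed-size table, not Lucene's actual BytesRefHash):
no matter how good the hash function is, re-adding an existing term forces a
full byte comparison to prove it is the same term.

```java
import java.util.Arrays;

// Toy open-addressing hash over byte[] terms. When the same term is added
// again -- the common case while indexing text -- the probe finds an occupied
// slot and MUST verify the bytes before returning the existing slot.
public class TinyBytesHash {
  final byte[][] terms = new byte[64][]; // power-of-two table; toy: no resize
  int equalsCalls = 0;                   // counts full byte comparisons

  int add(byte[] term) {
    int slot = hash(term) & (terms.length - 1);
    while (terms[slot] != null) {
      equalsCalls++;                          // full byte comparison
      if (Arrays.equals(terms[slot], term))   // term already present
        return slot;
      slot = (slot + 1) & (terms.length - 1); // linear probe on collision
    }
    terms[slot] = term.clone();
    return slot;
  }

  static int hash(byte[] b) { return Arrays.hashCode(b); }

  public static void main(String[] args) {
    TinyBytesHash h = new TinyBytesHash();
    h.add("the".getBytes());
    h.add("the".getBytes());
    System.out.println(h.equalsCalls); // 1: only the duplicate add compared bytes
  }
}
```

Adding a frequent term N times triggers N-1 such comparisons even with a
collision-free hash, which matches the profiler picture described above.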


[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536937#comment-17536937
 ] 

Uwe Schindler commented on LUCENE-10572:


I removed the vInt-like encoding in ByteBlockPool and BytesRefHash. After that 
I was able to switch to native shorts.

I did not touch PagedBytes, although the same thing could be done there.




[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536933#comment-17536933
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. Playing with BytesRefHash and ByteBlockPool crushed most tests around 
docvalues and blockterms. So Robert is right: Some of those are BE just because 
of sorting the blocks as byte arrays (so it must be BE).

This is because of the vInt-like encoding, not sorting: you need to remove the
vInt encoding so you don't need the first byte to switch between 1 or 2 bytes.
Once we remove the vInt-like encoding, we can use the native order in
BytesRefHash and ByteBlockPool, and possibly PagedBytes, too.
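For illustration, here is a sketch of the kind of 1-or-2-byte length prefix
being removed (hypothetical helpers, not Lucene's exact scheme): lengths < 128
take one byte, otherwise the high bit of the first byte flags a 2-byte form, so
decoding must branch on the first byte. A fixed 2-byte length read in a single
native-order short access avoids that branch at the cost of one extra byte for
short terms.

```java
// Sketch of a vInt-like length prefix. Max length 0x7FFF fits the ~32K
// term-length limit mentioned in the issue description.
public class LengthPrefix {
  // Write len as 1 byte (len < 128) or 2 bytes (high bit of first byte set).
  static int writeVIntLike(byte[] buf, int off, int len) {
    if (len < 128) { buf[off] = (byte) len; return 1; }
    buf[off] = (byte) (0x80 | (len >>> 8));
    buf[off + 1] = (byte) len;
    return 2;
  }

  // Decoding must inspect the first byte before it knows the width.
  static int readVIntLike(byte[] buf, int off) {
    int b = buf[off] & 0xFF;
    return b < 128 ? b : ((b & 0x7F) << 8) | (buf[off + 1] & 0xFF);
  }

  public static void main(String[] args) {
    byte[] buf = new byte[2];
    writeVIntLike(buf, 0, 300);
    System.out.println(readVIntLike(buf, 0)); // 300
  }
}
```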




[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536931#comment-17536931
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. Today its swapping bytes on the intel and ARM machines and the AIX machine 
is "fast" instead. in quotes because we know its still not 

I know that OpenJDK is heavily tested on big-endian ARM machines (I think they 
can be switched). So wouldn't those users of modern ARM Macs be good candidates?




[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536927#comment-17536927
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/13/22 11:00 PM:
--

Here is a draft PR about the idea: https://github.com/apache/lucene/pull/888

I just changed LZ4 to use the native order (as it is explicitly allowed and 
also documented in the algorithm).

Playing with BytesRefHash and ByteBlockPool crushed most tests around docvalues 
and blockterms. So Robert is right: Some of those are BE just because of 
sorting the blocks as byte arrays (so it must be BE).

If this is not going to work, throw it away. I just started to make it testable.


was (Author: thetaphi):
Here is a draft PR about the idea. I just changed LZ4 to use the native order 
(as it is explicitly allowed and also documented in the algorithm).

Playing with BytesRefHash and ByteBlockPool crushed most tests around docvalues 
and blockterms. So Robert is right: Some of those are BE just because of 
sorting the blocks as byte arrays (so it must be BE).

If this is not going to work, throw it away. I just started to make it testable.




[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536927#comment-17536927
 ] 

Uwe Schindler commented on LUCENE-10572:


Here is a draft PR about the idea. I just changed LZ4 to use the native order 
(as it is explicitly allowed and also documented in the algorithm).

Playing with BytesRefHash and ByteBlockPool crushed most tests around docvalues 
and blockterms. So Robert is right: Some of those are BE just because of 
sorting the blocks as byte arrays (so it must be BE).

If this is not going to work, throw it away. I just started to make it testable.




[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536921#comment-17536921
 ] 

Uwe Schindler commented on LUCENE-10572:


Yeah exactly, sometimes BE is used to allow sorting the terms as a plain byte 
sequence.
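Why big-endian preserves sort order can be shown directly: under an unsigned
byte-wise comparison, BE-encoded integers sort like their numeric values, while
LE-encoded ones do not. A small sketch (illustrative only):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// BE bytes of 255 = {00, FF} sort below BE bytes of 256 = {01, 00};
// LE bytes of 255 = {FF, 00} sort ABOVE LE bytes of 256 = {00, 01}.
public class SortOrder {
  static byte[] enc(short v, ByteOrder order) {
    return ByteBuffer.allocate(2).order(order).putShort(v).array();
  }

  // true if encoding under `order` makes an unsigned byte-wise comparison
  // agree with numeric order for the pair (255, 256)
  static boolean sortsLikeNumbers(ByteOrder order) {
    return Arrays.compareUnsigned(enc((short) 255, order),
                                  enc((short) 256, order)) < 0;
  }

  public static void main(String[] args) {
    System.out.println(sortsLikeNumbers(ByteOrder.BIG_ENDIAN));    // true
    System.out.println(sortsLikeNumbers(ByteOrder.LITTLE_ENDIAN)); // false
  }
}
```

This is why blocks that get sorted as raw byte arrays must stay BE, while
purely in-memory scratch data is free to use the native order.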

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!
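The first bullet point above (a fixed two-byte length prefix instead of a 1-or-2-byte vInt) can be sketched as follows. This is a hypothetical illustration, not Lucene's actual implementation; the `VarHandle` is similar in spirit to `BitUtil.VH_BE_SHORT`, and the fixed width is valid because IndexWriter caps term length at 32 K, well below the 64 K an unsigned short can cover.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class FixedLengthPrefixDemo {
    // VarHandle view over a byte[] that reads/writes a short in one step,
    // avoiding the branch a vInt decoder needs to decide 1 vs 2 bytes.
    static final VarHandle SHORT =
        MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.BIG_ENDIAN);

    // Always write exactly two bytes for the term length.
    static void writeLength(byte[] block, int offset, int len) {
        SHORT.set(block, offset, (short) len);
    }

    static int readLength(byte[] block, int offset) {
        return Short.toUnsignedInt((short) SHORT.get(block, offset));
    }

    public static void main(String[] args) {
        byte[] block = new byte[16];
        writeLength(block, 0, 32766); // near IndexWriter's max term length
        System.out.println(readLength(block, 0)); // 32766
    }
}
```

The term bytes that follow the prefix can then be located with plain arithmetic (`offset + 2`) instead of first decoding a variable-width header.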



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536916#comment-17536916
 ] 

Uwe Schindler commented on LUCENE-10572:


Hi,
I have a PR almost ready. In my comment above I confused the native-order issue 
we discussed with LZ4. In my patch, I replaced the order there with NATIVE, too.

In BytesRefHash we have one big-endian variant, but in ByteBlockPool we also 
have little-endian writes. It's too late for me now; I changed the big-endian 
ones in BytesRefHash for now, but we should check one thing for sure: those 
blocks should never be written to disk, so maybe somebody with more knowledge 
should look into it.




[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536907#comment-17536907
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. We just use our own constant, but then we can change it for testing. I 
would even ban the nativeOrder() in forbidden apis, too. It is just like Locale 
or anything else, same thing. Let's be specific but then test all of them.

I disagree with that. We also do not forbid {{Locale#getDefault}}, because if 
somebody uses that method, they explicitly want the default locale.

Actually, using ByteOrder#nativeOrder() is also an explicit vote to do that. And 
the coming MMapDirectory v2 and further Panama features, like accessing native 
APIs (locking pages in MMapDirectory, or using madvise/fadvise based on 
IOContext), need native order to talk to those APIs.
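As a sketch of the point being made here (illustrative, not Lucene's code): requesting `ByteOrder.nativeOrder()` is just as explicit a choice as any fixed order, and a write/read pair through such a `VarHandle` is self-consistent on every platform.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class NativeOrderDemo {
    // Explicitly request the platform's native order; the name VH_NATIVE_INT
    // is illustrative, analogous to the BitUtil handles discussed here.
    static final VarHandle VH_NATIVE_INT =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.nativeOrder());

    public static void main(String[] args) {
        byte[] buf = new byte[8];
        VH_NATIVE_INT.set(buf, 0, 0xCAFEBABE);
        int roundTrip = (int) VH_NATIVE_INT.get(buf, 0);
        // Whatever the platform's endianness, write/read through the same
        // handle round-trips; only the in-memory byte layout differs.
        System.out.println(Integer.toHexString(roundTrip)); // cafebabe
    }
}
```

The layout only becomes observable if the bytes are serialized or compared byte-wise, which is exactly why this is safe for a private, in-memory hash seed.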




[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536905#comment-17536905
 ] 

Uwe Schindler commented on LUCENE-10572:


bq. ok, i like your suggestion actually. it solves my issue, it would make this 
testable. Then native order could be used freely without this concern.

We can do this the same way as the initialization of StringUtils: there we 
check the system property {{tests.seed}} and then initialize some randomness. In 
BitUtil we could have similar code (or maybe share that in Constants: read the 
random seed once and save it as a Long value, or null if not given; this would 
prevent us from doing the lookup multiple times). In BitUtil we could then have 
VarHandles like {{BitUtil.VH_NATIVE_SHORT}} that are native in production 
environments, but randomized in test environments.

I can make a PR as a starting point tomorrow; it's too late now. We can then 
improve from there, [~mikemccand]'s other ideas included.
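A minimal sketch of that proposal (class and field names are hypothetical, and the seed-to-order mapping is only illustrative): the static initializer reads {{tests.seed}} once, picks native order in production, and a deterministic, seed-derived order under the test runner, so the final `VarHandle` stays `static final` either way.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class RandomizedOrderDemo {
    static final VarHandle VH_NATIVE_SHORT;

    static {
        String seed = System.getProperty("tests.seed");
        final ByteOrder order;
        if (seed == null) {
            // Production: real native order, fastest on this platform.
            order = ByteOrder.nativeOrder();
        } else {
            // Test run: derive the order deterministically from the seed so
            // a failing run can be reproduced with the same seed.
            order = (seed.hashCode() & 1) == 0
                ? ByteOrder.LITTLE_ENDIAN
                : ByteOrder.BIG_ENDIAN;
        }
        VH_NATIVE_SHORT = MethodHandles.byteArrayViewVarHandle(short[].class, order);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[2];
        VH_NATIVE_SHORT.set(buf, 0, (short) 0x1234);
        System.out.println((short) VH_NATIVE_SHORT.get(buf, 0) == 0x1234); // true
    }
}
```

Because the handle is assigned exactly once in the static initializer, the JIT can still constant-fold it, which is the whole point of keeping it `static final`.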




[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536883#comment-17536883
 ] 

Uwe Schindler commented on LUCENE-10572:


Hey,
I agree with the length encoding. This is indeed used at other places, too.

My argument was meant primarily for the hash seed, where we already use a 
VarHandle. This one is (like the hash algorithm) private to BytesRefHash.

If we want to test all platforms, we could default to the platform order in the 
initializer when not in test mode. In test mode we use a random byte order. This 
could be controlled by a sysprop in the static initializer of the class (it 
cannot be fully dynamic, as the VarHandle MUST be declared static final).




[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536855#comment-17536855
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/13/22 7:55 PM:
-

With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the VarHandle. The byte order does not matter; we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms, use the native order.

The reason for this decision was that Robert was arguing about reproducibility. 
He did not like platform dependencies (he argued that we can't test the 
algorithm with different platforms, so if we use the default byte order of the 
platform we run our code on (e.g. x86), somebody could see bugs on ARM).

I don't think that's an issue, just the typical way Robert argues. In that case 
let's get a VarHandle with platform order in the static ctor and not use the 
one from BitUtil.

We may also use a VarHandle in the same way to save the length, also in platform 
order (as this encoding is private to the class and is never serialized to 
disk).


was (Author: thetaphi):
With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length, also in platform 
order (as this encoding is private to the class and is never serialized to 
disk).


[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536855#comment-17536855
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/13/22 7:53 PM:
-

With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length, also in platform 
order (as this encoding is private to the class and is never serialized to 
disk).


was (Author: thetaphi):
With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length.


[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536855#comment-17536855
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/13/22 7:51 PM:
-

With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.nativeOrder() here to get 
the varhandle. The byte order does not matter, we just need to get 2 bytes to 
seed the hash. So to be fast on all platforms use the native order.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length.


was (Author: thetaphi):
With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.getDefault() here to get 
the varhandle. The byte order does not mapper, we just need to get 2 bytes.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length.



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536855#comment-17536855
 ] 

Uwe Schindler commented on LUCENE-10572:


With the swapping of bytes I remember the other issue where this was discussed. 
In my personal opinion we should just use ByteOrder.getDefault() here to get 
the varhandle. The byte order does not mapper, we just need to get 2 bytes.

The reason for this decision was that Robert was arguing about reproducibility. 
And he did not like platform dependencies (he was arguing that we can't test 
the algorithm with different platforms, so if we use default byte order of the 
platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that 
case let's get a varhandle with platform order in the static ctor and not use 
the one from BitUtil.

We may also use a var handle in same way to save the length.




[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-05-11 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534927#comment-17534927
 ] 

Uwe Schindler commented on LUCENE-10551:


I think you should also open a bug report in GraalVM.

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
> Attachments: LUCENE-10551-test.patch
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> 

[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533857#comment-17533857
 ] 

Uwe Schindler edited comment on LUCENE-10562 at 5/9/22 3:38 PM:


As an explanation of why this is slow: it has nothing to do with filters or 
query-first execution. The problem occurs before that: wildcard queries are 
expanded to filter bitsets / large OR queries during query preprocessing (the 
rewrite mode). This happens before the actual query is executed. So as soon as 
you have a wildcard with many matching terms, the preprocessing takes a 
significant amount of time. The actual query execution is fast and can be 
optimized. Due to the way an inverted index is built, there's no way to use 
another query to limit the amount of preprocessing work. The preprocessing 
time is linear in the total number of terms in the field, not the size of the 
index or the number of documents.


was (Author: thetaphi):
As an explanation of why this is slow: it has nothing to do with filters or 
query-first execution. The problem occurs before that: wildcard queries are 
expanded to filter bitsets / large OR queries during query preprocessing (the 
rewrite mode). This happens before the actual query is executed. So as soon as 
you have a wildcard with many matching terms, the preprocessing takes a 
significant amount of time. The actual query execution is fast and can be 
optimized. Due to the way an inverted index is built, there's no way to use 
another query to limit the amount of preprocessing work.

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene=content_t:*searchvalue*=metadataitemids_is:20950=id=50=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533857#comment-17533857
 ] 

Uwe Schindler commented on LUCENE-10562:


As an explanation of why this is slow: it has nothing to do with filters or 
query-first execution. The problem occurs before that: wildcard queries are 
expanded to filter bitsets / large OR queries during query preprocessing (the 
rewrite mode). This happens before the actual query is executed. So as soon as 
you have a wildcard with many matching terms, the preprocessing takes a 
significant amount of time. The actual query execution is fast and can be 
optimized. Due to the way an inverted index is built, there's no way to use 
another query to limit the amount of preprocessing work.
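The linear preprocessing cost can be illustrated with a standalone sketch 
(plain Java, not Lucene internals; names are made up): with a leading 
wildcard there is no prefix to seek to in the sorted term dictionary, so the 
rewrite step must test every term of the field.

```java
import java.util.List;
import java.util.regex.Pattern;

// Toy sketch (not Lucene internals) of why "*searchvalue*" is expensive:
// the rewrite must visit every single term of the field -- linear in the
// number of terms, independent of any filter query.
public class WildcardScan {

  static long countMatchingTerms(List<String> termDictionary, String wildcard) {
    // naive wildcard-to-regex translation (assumes no other regex metachars)
    Pattern p = Pattern.compile(wildcard.replace("*", ".*"));
    long hits = 0;
    for (String term : termDictionary) { // visits ALL terms of the field
      if (p.matcher(term).matches()) {
        hits++;
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<String> terms = List.of("abc", "searchvalue", "xsearchvaluey", "zzz");
    System.out.println(countMatchingTerms(terms, "*searchvalue*")); // prints 2
  }
}
```

A trailing-only wildcard ("searchvalue*") avoids this full scan in a real 
index, because the sorted dictionary allows seeking directly to the prefix 
range.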

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene=content_t:*searchvalue*=metadataitemids_is:20950=id=50=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533856#comment-17533856
 ] 

Uwe Schindler edited comment on LUCENE-10562 at 5/9/22 3:31 PM:


Hi,
I think those questions do not relate to Lucene and are not issues at all. I 
think those questions should be asked on the Solr mailing list: 
us...@solr.apache.org.

This is not a bug and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
term
- Consider disabling wildcards for end-users in your case (the flexible or 
dismax query parser in Solr can do this)
- Decompounding may be needed (see below)

In general, using wildcards in a full-text search engine is a sign that text 
analysis is not working correctly. Based on your name and profile, this looks 
like a typical "German language problem". In German, compounds are common 
("Donaudampfschifffahrtskapitän", the captain of a steam-powered ship on the 
German river Donau), and users reaching for wildcards is usually a sign of 
missing decompounding. This can be done with the hyphenation-compound token 
filter in combination with dictionaries. An example and minimized data files 
for the German language are here: 
https://github.com/uschindler/german-decompounder

When you do decompounding, wildcards should not be needed.


was (Author: thetaphi):
Hi,
I think those questions do not relate to Lucene and are not issues at all. I 
think those questions should be asked on the Solr mailing list: 
us...@solr.apache.org.

This is not a bug and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
term
- Consider disabling wildcards for end-users in your case (the flexible or 
dismax query parser in Solr can do this)
- Decompounding may be needed (see below)

In general, using wildcards in a full-text search engine is a sign that text 
analysis is not working correctly. Based on your name and profile, this looks 
like a typical "German language problem". In German, compounds are common 
("Donaudampfschifffahrtskapitän", the captain of a steam-powered ship on the 
German river Donau), and users reaching for wildcards is usually a sign of 
missing decompounding. This can be done with the hyphenation-compound token 
filter in combination with dictionaries. An example and minimized data files 
for the German language are here: 
https://github.com/uschindler/german-decompounder

When you do decompounding, wildcards should not be needed.

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene=content_t:*searchvalue*=metadataitemids_is:20950=id=50=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533856#comment-17533856
 ] 

Uwe Schindler edited comment on LUCENE-10562 at 5/9/22 3:30 PM:


Hi,
I think those questions do not relate to Lucene and are not issues at all. I 
think those questions should be asked on the Solr mailing list: 
us...@solr.apache.org.

This is not a bug and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
term
- Consider disabling wildcards for end-users in your case (the flexible or 
dismax query parser in Solr can do this)
- Decompounding may be needed (see below)

In general, using wildcards in a full-text search engine is a sign that text 
analysis is not working correctly. Based on your name and profile, this looks 
like a typical "German language problem". In German, compounds are common 
("Donaudampfschifffahrtskapitän", the captain of a steam-powered ship on the 
German river Donau), and users reaching for wildcards is usually a sign of 
missing decompounding. This can be done with the hyphenation-compound token 
filter in combination with dictionaries. An example and minimized data files 
for the German language are here: 
https://github.com/uschindler/german-decompounder

When you do decompounding, wildcards should not be needed.


was (Author: thetaphi):
Hi,
I think those questions do not relate to Lucene and are not issues at all.

I think those questions should be asked on the Solr mailing list: 
us...@solr.apache.org.

This is not a bug and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
term
- Consider disabling wildcards for end-users in your case (the flexible or 
dismax query parser in Solr can do this)

In general, using wildcards in a full-text search engine is a sign that text 
analysis is not working correctly. Based on your name and profile, this looks 
like a typical "German language problem". In German, compounds are common 
("Donaudampfschifffahrtskapitän", the captain of a steam-powered ship on the 
German river Donau), and users reaching for wildcards is usually a sign of 
missing decompounding. This can be done with the hyphenation-compound token 
filter in combination with dictionaries. An example and minimized data files 
for the German language are here: 
https://github.com/uschindler/german-decompounder

When you do decompounding, wildcards should not be needed.

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene=content_t:*searchvalue*=metadataitemids_is:20950=id=50=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-10562.

Resolution: Won't Fix

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene=content_t:*searchvalue*=metadataitemids_is:20950=id=50=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query

2022-05-09 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533856#comment-17533856
 ] 

Uwe Schindler commented on LUCENE-10562:


Hi,
I think those questions do not relate to Lucene and are not issues at all.

I think those questions should be asked on the Solr mailing list: 
us...@solr.apache.org.

This is not a bug and there is no way to improve this situation inside Lucene. 
Some additional hints:
- Consider using the reverse wildcard filter in Solr (there's documentation 
about this). But this won't help if you need a wildcard on both sides of the 
term
- Consider disabling wildcards for end-users in your case (the flexible or 
dismax query parser in Solr can do this)

In general, using wildcards in a full-text search engine is a sign that text 
analysis is not working correctly. Based on your name and profile, this looks 
like a typical "German language problem". In German, compounds are common 
("Donaudampfschifffahrtskapitän", the captain of a steam-powered ship on the 
German river Donau), and users reaching for wildcards is usually a sign of 
missing decompounding. This can be done with the hyphenation-compound token 
filter in combination with dictionaries. An example and minimized data files 
for the German language are here: 
https://github.com/uschindler/german-decompounder

When you do decompounding, wildcards should not be needed.
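To illustrate the decompounding idea in isolation (a toy sketch in plain 
Java; a real setup uses the hyphenation-compound token filter with the 
dictionaries linked above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration of decompounding, not the actual Lucene token filter:
// find dictionary words inside a compound so each part can be indexed as its
// own token (then plain term queries work and wildcards are unnecessary).
public class Decompounder {

  static List<String> decompound(String token, Set<String> dictionary) {
    List<String> parts = new ArrayList<>();
    String lower = token.toLowerCase();
    // brute-force: emit every dictionary word occurring as a substring
    for (int start = 0; start < lower.length(); start++) {
      for (int end = start + 1; end <= lower.length(); end++) {
        String candidate = lower.substring(start, end);
        if (dictionary.contains(candidate)) {
          parts.add(candidate);
        }
      }
    }
    return parts;
  }

  public static void main(String[] args) {
    Set<String> dict = Set.of("donau", "dampf", "schiff", "fahrt", "kapitän");
    System.out.println(decompound("Donaudampfschifffahrtskapitän", dict));
    // prints [donau, dampf, schiff, fahrt, kapitän]
  }
}
```

Once the parts are indexed as separate tokens, a query for "schiff" matches 
directly and no "*schiff*" wildcard is needed.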

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.11.1
>Reporter: Henrik Hertel
>Priority: Major
>  Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds)  as soon as I use wildcards e.g. 
> {code:java}
> *searchvalue*
> {code}
> , even though I put a filter query in front of it that reduces to less than 
> 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene=content_t:*searchvalue*=metadataitemids_is:20950=id=50=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950, so search the entire inventory, then it also takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10558) Add URL constructors for classpath/module usage as complement to Path ctors in Kuromoji and Nori

2022-05-06 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-10558.

Resolution: Fixed

Thanks to all who helped/commented/complained!

> Add URL constructors for classpath/module usage as complement to Path ctors 
> in Kuromoji and Nori
> 
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 9.1
>Reporter: Michael Sokolov
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 16h 20m
>  Remaining Estimate: 0h
>
> When we refactored the constructors for  these resource objects used by the 
> kuromoji JapaneseTokenizer,  we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones.  I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resources streams loaded from the classpath by the 
> caller.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10558) Add URL constructors for classpath/module usage as complement to Path ctors in Kuromoji and Nori

2022-05-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532904#comment-17532904
 ] 

Uwe Schindler commented on LUCENE-10558:


This was fixed in 9.x (9.2 is the next release) with the new APIs, but the 
old (deprecated) APIs were also revised:
- loading from the classpath works again and no longer silently loads the 
default files (the bug reported here)
- the previous (inconsistent) behaviour of loading from the classpath was 
restored for backwards compatibility (path format, replacement of . by / in 
ConnectionCosts)

> Add URL constructors for classpath/module usage as complement to Path ctors 
> in Kuromoji and Nori
> 
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 9.1
>Reporter: Michael Sokolov
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 16h 10m
>  Remaining Estimate: 0h
>
> When we refactored the constructors for  these resource objects used by the 
> kuromoji JapaneseTokenizer,  we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones.  I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resources streams loaded from the classpath by the 
> caller.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10558) Add URL constructors for classpath/module usage as complement to Path ctors in Kuromoji and Nori

2022-05-06 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532903#comment-17532903
 ] 

Uwe Schindler commented on LUCENE-10558:


Forward port to main branch: https://github.com/apache/lucene/pull/871

> Add URL constructors for classpath/module usage as complement to Path ctors 
> in Kuromoji and Nori
> 
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 9.1
>Reporter: Michael Sokolov
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 16h 10m
>  Remaining Estimate: 0h
>
> When we refactored the constructors for  these resource objects used by the 
> kuromoji JapaneseTokenizer,  we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones.  I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resources streams loaded from the classpath by the 
> caller.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10558) Add URL constructors for classpath/module usage as complement to Path ctors in Kuromoji and Nori

2022-05-06 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-10558:
---
Component/s: modules/analysis

> Add URL constructors for classpath/module usage as complement to Path ctors 
> in Kuromoji and Nori
> 
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 9.1
>Reporter: Michael Sokolov
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 16h 10m
>  Remaining Estimate: 0h
>
> When we refactored the constructors for  these resource objects used by the 
> kuromoji JapaneseTokenizer,  we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones.  I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resources streams loaded from the classpath by the 
> caller.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10558) Add URL constructors for classpath/module usage as complement to Path ctors in Kuromoji and Nori

2022-05-06 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-10558:
---
Affects Version/s: 9.1

> Add URL constructors for classpath/module usage as complement to Path ctors 
> in Kuromoji and Nori
> 
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.1
>Reporter: Michael Sokolov
>Assignee: Uwe Schindler
>Priority: Major
>  Time Spent: 16h 10m
>  Remaining Estimate: 0h
>
> When we refactored the constructors for these resource objects used by the 
> kuromoji JapaneseTokenizer, we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones. I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resource streams loaded from the classpath by the 
> caller.






[jira] [Updated] (LUCENE-10558) Add URL constructors for classpath/module usage as complement to Path ctors in Kuromoji and Nori

2022-05-06 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-10558:
---
Fix Version/s: 10.0 (main), 9.2

> Add URL constructors for classpath/module usage as complement to Path ctors 
> in Kuromoji and Nori
> 
>
> Key: LUCENE-10558
> URL: https://issues.apache.org/jira/browse/LUCENE-10558
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.1
>Reporter: Michael Sokolov
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 16h 10m
>  Remaining Estimate: 0h
>
> When we refactored the constructors for these resource objects used by the 
> kuromoji JapaneseTokenizer, we (inadvertently, I expect) changed the 
> behavior for consumers that were supplying these resources on the classpath. 
> In that case, we silently replaced the custom resources with the Lucene 
> built-in ones. I think we cannot support the old API because of Java Module 
> system restrictions, but we didn't provide any usable replacement or notice 
> either.
>  
> This issue is for exposing the new (private) constructors that accept 
> streams, and adding a notice to Migration.md to point users at them, since 
> they can be used with resource streams loaded from the classpath by the 
> caller.





