[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-13 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Fix Version/s: 8.3
   master (9.0)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]
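
To illustrate the rule, here is a minimal sketch (not the actual patch; the 
unknownWordLength helper and its arguments are hypothetical) of stopping the 
unknown-word grouping at a digit boundary:

{code:java}
// Group characters of an unknown word, but stop as soon as a digit is
// followed (or preceded) by a character of another type, so that
// "44사이즈" yields "44" + "사이즈" instead of a single token.
static int unknownWordLength(char[] buffer, int offset, int limit) {
  int length = 1;
  for (int i = offset + 1; i < limit; i++) {
    if (Character.isDigit(buffer[i - 1]) != Character.isDigit(buffer[i])) {
      break;
    }
    length++;
  }
  return length;
}
{code}

The real tokenizer also applies the script-based grouping from LUCENE-8548; 
the sketch only shows the digit boundary check.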






[jira] [Commented] (LUCENE-8977) Handle punctuation characters in KoreanTokenizer

2019-09-13 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929027#comment-16929027
 ] 

Jim Ferenczi commented on LUCENE-8977:
--

I wonder why you think that this is an issue. Punctuation is removed by 
default, so this is only an issue if you want to use the Korean number filter?
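
For reference, a quick way to observe this (a minimal sketch, assuming the 
default nori KoreanAnalyzer, which discards punctuation):

{code:java}
// With punctuation discarded (the default), "...사이즈..." only yields [사이즈].
try (Analyzer analyzer = new KoreanAnalyzer();
     TokenStream ts = analyzer.tokenStream("field", "...사이즈...")) {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println(term); // 사이즈
  }
  ts.end();
}
{code}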

> Handle punctuation characters in KoreanTokenizer
> 
>
> Key: LUCENE-8977
> URL: https://issues.apache.org/jira/browse/LUCENE-8977
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Namgyu Kim
>Priority: Minor
>
> As we discussed on LUCENE-8966, KoreanTokenizer now always divides continuous 
> punctuation marks into one and the others.
>  (사이즈... => [사이즈] [.] [..])
>  But KoreanTokenizer doesn't divide when the first character is punctuation.
>  (...사이즈 => [...] [사이즈])
> It looks like the result of the Viterbi path, but users may find the 
> following cases odd:
>  ("사이즈" means "size" in Korean)
> ||Case #1||Case #2||
> |Input : "...사이즈..."|Input : "...4..4사이즈"|
> |Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.] [4] [사이즈]|
> From what I checked, Nori has punctuation characters (like '.' and ',') in 
> its dictionary but Kuromoji does not.
>  ("サイズ" means "size" in Japanese)
> ||Case #1||Case #2||
> |Input : "...サイズ..."|Input : "...4..4サイズ"|
> |Result : [...] [サイズ] [...]|Result : [...] [4] [..] [4] [サイズ]|
> There are some ways to resolve it, like hard-coding punctuation handling, but 
> none of them seem good.
>  So I think we need to discuss it.






[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-09 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925588#comment-16925588
 ] 

Jim Ferenczi commented on LUCENE-8966:
--

I don't think it's a bug [~danmuzi], or at least that it's related to this 
issue. In your example the first dot ('.' is in the word dictionary) is 
considered a better path than grouping all dots eagerly. We process the unknown 
words greedily, so we compare the path "[4], [.], [.]" with the alternative 
paths that group more of the dots into a single unknown token, up to 
"[4], [..]" where all the dots are grouped together. Keeping the first dot 
separated from the rest indicates that a number followed by a dot is a better 
splitting path than multiple dots in our model. We can discuss this behavior in 
a new issue if you think this should be configurable (for instance the 
JapaneseTokenizer processes unknown words greedily only in search mode)?

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923394#comment-16923394
 ] 

Jim Ferenczi commented on LUCENE-8966:
--

{quote}

Would you consider grouping numbers and (at least some) punctuation together so 
that we can preserve decimals and fractions?

{quote}

For complex number grouping and normalization, [~danmuzi] added a 
KoreanNumberFilter in https://issues.apache.org/jira/browse/LUCENE-8812

It is identical to the JapaneseNumberFilter except that it only detects Korean 
hangul numbers. I don't think it handles fractions, though this could be added 
if needed.
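
For reference, a sketch of how the filter would be wired (the constructor 
arguments are an assumption based on the nori module; the tokenizer must keep 
punctuation so the filter can see decimal points):

{code:java}
// KoreanNumberFilter normalizes hangul numerals, e.g. "삼천" -> "3000".
Tokenizer tokenizer = new KoreanTokenizer(KoreanTokenizer.DEFAULT_TOKEN_ATTRIBUTE_FACTORY,
    null, KoreanTokenizer.DecompoundMode.DISCARD, false, /* discardPunctuation */ false);
TokenStream stream = new KoreanNumberFilter(tokenizer);
{code}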

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Attachment: LUCENE-8966.patch
Status: Patch Available  (was: Patch Available)

New patch without the dead code

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923357#comment-16923357
 ] 

Jim Ferenczi commented on LUCENE-8966:
--

Thanks for looking [~thetaphi]. These two private static functions are dead 
code that I forgot to remove. The other places use Character.isDigit() 
consistently.

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923222#comment-16923222
 ] 

Jim Ferenczi commented on LUCENE-8966:
--

Here is a patch that breaks unknown words on digits instead of grouping them 
with other types.

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Status: Patch Available  (was: Open)

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Attachment: LUCENE-8966.patch

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Description: 
Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
groups characters of unknown words if they belong to the same script or an 
inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
rest in Latin) but this rule doesn't work well on digits since they are 
considered common with other scripts. For instance the input "44사이즈" is kept as 
is even though "사이즈" is part of the dictionary. We should restore the original 
behavior and split any unknown words if a digit is followed by another type.

This issue was first discovered in 
[https://github.com/elastic/elasticsearch/issues/46365]

  was:
Since LUCENE-XXX the Korean tokenizer groups characters of unknown words if 
they belong to the same script or an inherited one. This is ok for inputs like 
Мoscow (with a Cyrillic М and the rest in Latin) but this rule doesn't work 
well on digits since they are considered common with other scripts. For 
instance the input "44사이즈" is kept as is even though "사이즈" is part of the 
dictionary. We should restore the original behavior and split any unknown 
words if a digit is followed by another type.

This issue was first discovered in 
[https://github.com/elastic/elasticsearch/issues/46365]


> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Created] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)
Jim Ferenczi created LUCENE-8966:


 Summary: KoreanTokenizer should split unknown words on digits
 Key: LUCENE-8966
 URL: https://issues.apache.org/jira/browse/LUCENE-8966
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Since LUCENE-XXX the Korean tokenizer groups characters of unknown words if 
they belong to the same script or an inherited one. This is ok for inputs like 
Мoscow (with a Cyrillic М and the rest in Latin) but this rule doesn't work 
well on digits since they are considered common with other scripts. For 
instance the input "44사이즈" is kept as is even though "사이즈" is part of the 
dictionary. We should restore the original behavior and split any unknown 
words if a digit is followed by another type.

This issue was first discovered in 
[https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Updated] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers

2019-08-29 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8959:
-
Description: Today the JapaneseNumberFilter tries to concatenate numbers 
even if they are separated by whitespaces. So for instance "10 100" is 
rewritten into "10100" -even if the tokenizer doesn't discard punctuations-. In 
practice this is not an issue but this can lead to a giant number of tokens if 
there are a lot of numbers separated by spaces. The number of concatenations 
should be configurable with a sane default limit in order to avoid creating 
giant tokens that slow down the analysis if the tokenizer is not correctly 
configured.  (was: Today the JapaneseNumberFilter tries to concatenate numbers 
even if they are separated by whitespaces. So for instance "10 100" is 
rewritten into "10100" even if the tokenizer doesn't discard punctuations. In 
practice this is not an issue but this can lead to a giant number of tokens if 
there are a lot of numbers separated by spaces. The number of concatenations 
should be configurable with a sane default limit in order to avoid creating big 
tokens that slow down the analysis.)

> JapaneseNumberFilter does not take whitespaces into account when 
> concatenating numbers
> --
>
> Key: LUCENE-8959
> URL: https://issues.apache.org/jira/browse/LUCENE-8959
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
> separated by whitespaces. So for instance "10 100" is rewritten into "10100" 
> -even if the tokenizer doesn't discard punctuations-. In practice this is not 
> an issue but this can lead to a giant number of tokens if there are a lot of 
> numbers separated by spaces. The number of concatenations should be 
> configurable with a sane default limit in order to avoid creating giant 
> tokens that slow down the analysis if the tokenizer is not correctly 
> configured.






[jira] [Commented] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers

2019-08-29 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918532#comment-16918532
 ] 

Jim Ferenczi commented on LUCENE-8959:
--

*Update:* Whitespaces were removed in my tests because I was using the default 
JapanesePartOfSpeechStopFilter before the JapaneseNumberFilter. The behavior is 
correct when discardPunctuation is correctly set and the JapaneseNumberFilter 
runs before the JapanesePartOfSpeechStopFilter in the chain. We could protect 
against the rabbit hole for users who forget to set discardPunctuation to false 
or remove the whitespaces in a preceding filter, but the behavior is correct. 
Sorry for the false alarm.
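
A sketch of the chain this implies (assuming the kuromoji classes named above): 
keep punctuation in the tokenizer and run the number filter before any filter 
that could remove whitespace tokens:

{code:java}
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // discardPunctuation=false keeps whitespace tokens, so "10 100" stays two numbers
    Tokenizer tokenizer = new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
    TokenStream stream = new JapaneseNumberFilter(tokenizer);
    // part-of-speech filtering happens after number normalization
    stream = new JapanesePartOfSpeechStopFilter(stream, JapaneseAnalyzer.getDefaultStopTags());
    return new TokenStreamComponents(tokenizer, stream);
  }
};
{code}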

> JapaneseNumberFilter does not take whitespaces into account when 
> concatenating numbers
> --
>
> Key: LUCENE-8959
> URL: https://issues.apache.org/jira/browse/LUCENE-8959
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
> separated by whitespaces. So for instance "10 100" is rewritten into "10100" 
> even if the tokenizer doesn't discard punctuations. In practice this is not 
> an issue but this can lead to a giant number of tokens if there are a lot of 
> numbers separated by spaces. The number of concatenations should be 
> configurable with a sane default limit in order to avoid creating big tokens 
> that slow down the analysis.






[jira] [Created] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers

2019-08-29 Thread Jim Ferenczi (Jira)
Jim Ferenczi created LUCENE-8959:


 Summary: JapaneseNumberFilter does not take whitespaces into 
account when concatenating numbers
 Key: LUCENE-8959
 URL: https://issues.apache.org/jira/browse/LUCENE-8959
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
separated by whitespaces. So for instance "10 100" is rewritten into "10100" 
even if the tokenizer doesn't discard punctuations. In practice this is not an 
issue but this can lead to a giant number of tokens if there are a lot of numbers 
separated by spaces. The number of concatenations should be configurable with a 
sane default limit in order to avoid creating big tokens that slow down the 
analysis.






[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-12 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905332#comment-16905332
 ] 

Jim Ferenczi commented on LUCENE-8943:
--

{quote}

Your post made me think of the problem in another way. If we had something like 
MultiWordsSynonymQuery, we could have even more control. Similar to 
SynonymQuery we could use one IDF value for all synonyms. Synonym boost would 
work much more reliably.

{quote}

 

Yes, that's what I tried to explain in my post. It is a specific issue with 
multi-word synonyms, so we should have a dedicated query.

 

{quote}

Usually the values for pseudoStats would be computed bottom up (SpanWeight, 
PhraseWeight) from the subqueries. But we could implement a general 
MultiWordsSynonymQuery as subclass of BooleanQuery (only allowing disjunction) 
which would set (adapt) pseudoStats in all its subweights (docFreq as max 
docFreq of all synonyms just as SynonymQuery currently does).

{quote}

 

+1, that's how I'd start with this. We don't need to handle all types of queries 
though; only Term queries (e.g. body:ny), conjunctions of Term queries (e.g. 
body:new AND body:york) and phrase queries (e.g. "new york") should be accepted.

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -
>
> Key: LUCENE-8943
> URL: https://issues.apache.org/jira/browse/LUCENE-8943
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring
>Affects Versions: 8.0
>Reporter: Christoph Goller
>Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilitis and since IDFs are logarithmic we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.
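
For intuition, the additive combination in idfExplain follows from the 
independence assumption described above; spelled out for a two-term phrase:

{noformat}
idf("new york") = log(1 / P(new AND york))
                ≈ log(1 / (P(new) * P(york)))    (independence assumption)
                = log(1 / P(new)) + log(1 / P(york))
                = idf(new) + idf(york)
{noformat}

For alternatives at the same position (MultiPhraseQuery, SpanOrQuery) the event 
is a disjunction, not a conjunction, so P(t1 OR t2) >= max(P(t1), P(t2)) and 
adding the IDFs overestimates the rarity.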






[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-09 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903968#comment-16903968
 ] 

Jim Ferenczi commented on LUCENE-8943:
--

I don't think we can realistically approximate the doc freq of phrases, 
especially if you consider more than 2 terms. The issue with the score 
difference of "wifi" (single term) vs "wi fi" (multiple terms) is more a 
synonym issue where the association between these terms is made at search time. 
Currently the BM25 similarity sums the idf values but this was done to limit 
the difference with the classic (tf-idf) similarity. The other similarities 
take a simpler approach and just sum the score of each term that appears in the 
query, like a boolean query would do (see MultiSimilarity). It's difficult to 
pick one approach over the other here but the context is important. For 
single-term synonyms (terms that appear at the same position) we have the 
SynonymQuery that is used to blend the score of such terms. I tend to agree 
that the MultiPhraseQuery should take the same approach so that each position 
can score once instead of per term. However it is difficult to expand this 
strategy to variable-length multi-word synonyms. We could try with a 
specialized MultiWordsSynonymQuery that would apply some strategy 
(approximation of the doc count like you propose, or anything that makes sense 
here ;) ) to make sure that all variations are scored the same. Does this make 
sense?

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -
>
> Key: LUCENE-8943
> URL: https://issues.apache.org/jira/browse/LUCENE-8943
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring
>Affects Versions: 8.0
>Reporter: Christoph Goller
>Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities and since IDFs are logarithmic we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.






[jira] [Commented] (LUCENE-8747) Allow access to submatches from Matches instances

2019-08-06 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901091#comment-16901091
 ] 

Jim Ferenczi commented on LUCENE-8747:
--

Can we return a list of Matches in findNamedMatches? The set of strings is 
useful for testing purposes but it should be easy to extract any named Matches 
from a global Matches object.

> Allow access to submatches from Matches instances
> -
>
> Key: LUCENE-8747
> URL: https://issues.apache.org/jira/browse/LUCENE-8747
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8747.patch, LUCENE-8747.patch, LUCENE-8747.patch, 
> LUCENE-8747.patch
>
>
> A Matches object currently allows access to all matching terms from a query, 
> but the structure of the matching query is flattened out, so if you want to 
> find which subqueries have matched you need to iterate over all matches, 
> collecting queries as you go.  It should be easier to get this information 
> from the parent Matches object.






[jira] [Commented] (LUCENE-8941) Build wildcard matches more lazily

2019-08-01 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898316#comment-16898316
 ] 

Jim Ferenczi commented on LUCENE-8941:
--

+1 the patch looks good. Can you add an assert in the additional test that 
checks that the number of segments in the reader is 1? This is required since 
the test makes some assumptions regarding the distribution of the terms and 
this could be broken easily if we change the setUp for the entire test suite 
later.
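
For instance (the names for the test's reader are illustrative):

{code:java}
// the test's term-distribution assumptions only hold for a single-segment reader
assertEquals(1, reader.leaves().size());
{code}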

> Build wildcard matches more lazily
> --
>
> Key: LUCENE-8941
> URL: https://issues.apache.org/jira/browse/LUCENE-8941
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8941.patch
>
>
> When retrieving a Matches object from a multi-term query, such as an 
> AutomatonQuery or TermInSetQuery, we currently find all matching term 
> iterators up-front, to return a disjunction over all of them.  This can be 
> inefficient if we're only interested in finding out if anything matched, and 
> are iterating over a different field to retrieve offsets.
> We can improve this by returning immediately when the first matching term is 
> found, and only collecting other matching terms when we start iterating.






[jira] [Resolved] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-29 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8935.
--
   Resolution: Fixed
Fix Version/s: 8.3
   master (9.0)

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However since all documents have a score of 0 it should be possible to 
> early terminate the query as soon as we collected enough top hits. Wrapping 
> the resulting boolean scorer in a constant score scorer should allow early 
> termination in this case and would speed up the retrieval of top hits case 
> considerably if the total hit count is not requested.






[jira] [Commented] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893713#comment-16893713
 ] 

Jim Ferenczi commented on LUCENE-8935:
--

Sorry, I misunderstood the logic: the number of scoring clauses is already 
computed from the pruned list of scorers, so the actual patch works. It's the 
scorer supplier that can be null, but in such a case it would not appear in 
Boolean2ScorerSupplier.

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However since all documents have a score of 0 it should be possible to 
> early terminate the query as soon as we collected enough top hits. Wrapping 
> the resulting boolean scorer in a constant score scorer should allow early 
> termination in this case and would speed up the retrieval of top hits case 
> considerably if the total hit count is not requested.






[jira] [Commented] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893708#comment-16893708
 ] 

Jim Ferenczi commented on LUCENE-8935:
--

The logic is already at the bottom of Boolean2ScorerSupplier#get but good call 
on the SHOULD clause that can produce a null scorer.

We can check the number of scoring clauses after the build instead of checking 
the number of scorer suppliers. I'll work on a fix.

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However since all documents have a score of 0 it should be possible to 
> early terminate the query as soon as we collected enough top hits. Wrapping 
> the resulting boolean scorer in a constant score scorer should allow early 
> termination in this case and would speed up the retrieval of top hits case 
> considerably if the total hit count is not requested.






[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-26 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893677#comment-16893677
 ] 

Jim Ferenczi commented on LUCENE-8933:
--

{quote}

Should we go further and check that the concatenation of the segments is equal 
to the surface form?

{quote}

 

+1 too, the user dictionary should be used for segmentation purposes only. This 
would be a breaking change though, since users seem to abuse this functionality 
to normalize input (see example). Maybe we can check the length in 8x and the 
content in master only?

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}






[jira] [Updated] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8935:
-
Attachment: LUCENE-8935.patch
Status: Open  (was: Open)

Here is a patch that wraps the boolean scorer in a constant score scorer when 
there is no scoring clause and the score mode is TOP_SCORES.
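
The core of the idea as a minimal sketch (names such as numScoringClauses are 
assumptions, not the actual patch):

{code:java}
// If nothing contributes a score, every hit scores 0, so wrapping in a
// ConstantScoreScorer lets TOP_SCORES collectors skip non-competitive documents.
if (scoreMode == ScoreMode.TOP_SCORES && numScoringClauses == 0) {
  return new ConstantScoreScorer(weight, 0f, scoreMode, scorer.iterator());
}
return scorer;
{code}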

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However since all documents have a score of 0 it should be possible to 
> early terminate the query as soon as we collected enough top hits. Wrapping 
> the resulting boolean scorer in a constant score scorer should allow early 
> termination in this case and would speed up the retrieval of top hits case 
> considerably if the total hit count is not requested.






[jira] [Created] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8935:


 Summary: BooleanQuery with no scoring clauses cannot skip 
documents when running TOP_SCORES mode
 Key: LUCENE-8935
 URL: https://issues.apache.org/jira/browse/LUCENE-8935
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Today a boolean query that is composed of filtering clauses only (more than 
one) cannot skip documents when the search is executed with the TOP_SCORES 
mode. However since all documents have a score of 0 it should be possible to 
early terminate the query as soon as we collected enough top hits. Wrapping the 
resulting boolean scorer in a constant score scorer should allow early 
termination in this case and would speed up the retrieval of top hits case 
considerably if the total hit count is not requested.






[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-26 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893478#comment-16893478
 ] 

Jim Ferenczi commented on LUCENE-8933:
--

{quote}
If there are no other opinions or objections, I'd like to create a patch that 
add a validation rule to the UserDictionary.
{quote}

Thanks [~tomoko]!

{quote}
For purpose of format validation, I think it would be better that we check if 
the sum of length of segments is equal to the length of its surface form.
i.e., we also should not allow such entry "aabbcc,a b c,aa bb cc,pos_tag" even 
if this does not cause any exceptions.
{quote}

+1



> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}






[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892671#comment-16892671
 ] 

Jim Ferenczi commented on LUCENE-8933:
--

The first argument of the dictionary rule is the original block to detect and 
the second argument is the segmentation for the block. So the rule "aaa,aa a,," 
will split the input "aaa" into two tokens "aa" and "a". When computing the 
offsets of the split terms in the user dictionary we assume that the 
segmentation has the same characters as the input minus the whitespaces. We 
don't check that this is the case, so rules with broken offsets are only 
detected when they match in a token stream. You don't need emojis or surrogate 
pairs to break this, just provide a rule where the length of the segmentation 
is greater than the input minus the whitespaces:
{code:java}
// the segmentation ("aaaa") is longer than the surface form ("aaa")
UserDictionary dict = UserDictionary.open(new StringReader("aaa,aaaa,,"));
JapaneseTokenizer tok = new JapaneseTokenizer(dict, true, Mode.NORMAL);
tok.setReader(new StringReader("aaa"));
tok.reset();
tok.incrementToken(); // throws ArrayIndexOutOfBoundsException
{code}

I think we just need to validate the input and throw an exception if the 
assumptions are not met at build time.
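
Something along these lines (a sketch of the validation, assuming the rule line 
is already split with CSVUtil; not the final patch):

{code:java}
String[] values = CSVUtil.parse(line); // surface, segmentation, readings, part-of-speech
String surface = values[0].replaceAll("\\s+", "");
String segmentation = values[1].replaceAll("\\s+", "");
if (surface.length() != segmentation.length()) {
  throw new RuntimeException("Illegal user dictionary entry: the segmentation ["
      + values[1] + "] does not match the surface form [" + values[0] + "]");
}
{code}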



> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}






[jira] [Commented] (LUCENE-8889) Remove Dead Code From PointRangeQuery

2019-06-27 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873992#comment-16873992
 ] 

Jim Ferenczi commented on LUCENE-8889:
--

Why is it an issue? We have some use cases in Elasticsearch that require access 
to these points and I guess there are other cases outside of Lucene. It's a 
library, so we should not expect that all accessors are used directly. We can 
add a test case if you're concerned by the fact that they are never used 
internally.

> Remove Dead Code From PointRangeQuery
> -
>
> Key: LUCENE-8889
> URL: https://issues.apache.org/jira/browse/LUCENE-8889
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Minor
>
> PointRangeQuery has accessors for the underlying points in the query but 
> those are never accessed. We should remove them






[jira] [Resolved] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-26 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8859.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8859.patch
>
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I haven't run 
> any benchmarks yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.






[jira] [Resolved] (LUCENE-7714) Optimize range queries for the sorted case

2019-06-26 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-7714.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

Thanks [~jtibshirani]!

> Optimize range queries for the sorted case
> --
>
> Key: LUCENE-7714
> URL: https://issues.apache.org/jira/browse/LUCENE-7714
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (9.0), 8.2
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> It feels like we could make range queries faster when the index is sorted, 
> maybe by running on doc values, figuring out the first and last matching 
> documents with a binary search and returning a doc id set iterator that 
> iterates through this range of documents?
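
A rough sketch of that binary search (assuming an ascending index sort on a 
numeric field; a fresh doc-values iterator is opened per lookup since these 
iterators are forward-only):

{code:java}
// Find the first document whose sort value is >= lowerBound.
static int firstDocAtLeast(LeafReaderContext ctx, String field, long lowerBound)
    throws IOException {
  int lo = 0, hi = ctx.reader().maxDoc() - 1, first = ctx.reader().maxDoc();
  while (lo <= hi) {
    int mid = (lo + hi) >>> 1;
    NumericDocValues values = DocValues.getNumeric(ctx.reader(), field);
    long value = values.advanceExact(mid) ? values.longValue() : Long.MIN_VALUE;
    if (value >= lowerBound) {
      first = mid; // candidate; keep searching the left half
      hi = mid - 1;
    } else {
      lo = mid + 1;
    }
  }
  // a symmetric lastDocAtMost search bounds the range from above; the matching
  // docs then form a contiguous docid range that a simple iterator can cover
  return first;
}
{code}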






[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator

2019-06-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872213#comment-16872213
 ] 

Jim Ferenczi commented on LUCENE-8806:
--

I am testing with wikimediumall

> WANDScorer should support two-phase iterator
> 
>
> Key: LUCENE-8806
> URL: https://issues.apache.org/jira/browse/LUCENE-8806
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
> should leverage two-phase iterators in order to be faster when used in 
> conjunctions.






[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator

2019-06-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872169#comment-16872169
 ] 

Jim Ferenczi commented on LUCENE-8806:
--

{quote}
FYI we have an issue for phrases already LUCENE-8311.
{quote}

I forgot about this one, thanks! 

{quote}
I was thinking this could only get faster than before since we would now 
leverage two-phase iterators instead of using iterators naively.
{quote}

That was my assumption too, but not checking the two-phase iterator when 
looking for candidates forces the second clause (the one with the lowest score) 
to advance even when the first clause is a false positive. So it might be 
related to the fact that checking for a match on high-frequency phrases is 
faster than advancing the other clause.
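
For context, the two-phase pattern in question, sketched against the Scorer 
API:

{code:java}
// approximation() advances a cheap superset iterator; matches() runs the costly
// per-document verification (e.g. checking phrase positions).
TwoPhaseIterator tpi = scorer.twoPhaseIterator();
DocIdSetIterator approximation = tpi != null ? tpi.approximation() : scorer.iterator();
for (int doc = approximation.nextDoc();
     doc != DocIdSetIterator.NO_MORE_DOCS;
     doc = approximation.nextDoc()) {
  if (tpi == null || tpi.matches()) {
    // doc is a verified match
  }
}
{code}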

> WANDScorer should support two-phase iterator
> 
>
> Key: LUCENE-8806
> URL: https://issues.apache.org/jira/browse/LUCENE-8806
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
> should leverage two-phase iterators in order to be faster when used in 
> conjunctions.






[jira] [Commented] (LUCENE-8848) UnifiedHighlighter should highlight all Query types that implement Weight.matches

2019-06-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872150#comment-16872150
 ] 

Jim Ferenczi commented on LUCENE-8848:
--

The RandomIndexWriter is created but not closed if the condition at line 1367 
matches. I'll push a fix.
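A minimal sketch of that kind of fix, assuming a LuceneTestCase context;
try-with-resources guarantees the writer is closed on the early-return path too:

{code:java}
try (RandomIndexWriter iw = new RandomIndexWriter(random(), dir)) {
  // ... index the test documents ...
} // closed even when the test bails out early
{code}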

> UnifiedHighlighter should highlight all Query types that implement 
> Weight.matches
> -
>
> Key: LUCENE-8848
> URL: https://issues.apache.org/jira/browse/LUCENE-8848
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.2
>
> Attachments: LUCENE-8848.patch
>
>
> The UnifiedHighlighter internally extracts terms and automata from the query. 
>  Usually this works perfectly but it's possible a Query might be of a type it 
> doesn't know -- a leaf query that is perhaps in effect similar to a 
> MultiTermQuery yet it might not even be a subclass of this or it does but the 
> UH doesn't know how to extract an automaton from it.  The UH is oblivious to 
> this and probably won't highlight this query.  If re-analysis of the text is 
> necessary, the UH will pre-filter all terms to only those it _thinks_ are 
> pertinent.  Or if offsets are in the postings then the UH could perform very 
> poorly by unleashing this query on the index for each highlighted document 
> without recognizing re-analysis is a more appropriate path.
> I think to solve this, the UnifiedHighlighter.getFieldHighlighter needs to 
> inspect the query (using a QueryVisitor) to see if it can find a leaf query 
> that is not one it knows how to pull automata from, and is otherwise not in a 
> special list (like MatchAllDocsQuery).  If we find one, we avoid choosing 
> OffsetSource.POSTINGS or OffsetSource.NONE_NEEDED since we might in effect 
> have an MTQ like query.  If a MemoryIndex is needed then we don't pre-filter 
> the terms since we can't assume we know precisely which terms are pertinent.
> We needn't bother extracting terms & automata in this case either; it's 
> wasted effort which can involve building a CharacterRunAutomaton (see 
> MultiTermHighlighting.binaryToCharRunAutomaton).  Speaking of which, it'd be 
> nice to avoid that in other cases as well, like for WEIGHT_MATCHES when we 
> aren't using MemoryIndex (thus no term pre-filtering).
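A hedged sketch of that inspection, assuming a QueryVisitor-based walk; the set
of "known" leaf types is illustrative only:

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;
import org.apache.lucene.search.TermQuery;

class UnknownLeafDetector {
  private static final Set<Class<?>> KNOWN =
      new HashSet<>(Arrays.asList(TermQuery.class, PhraseQuery.class)); // plus MTQs the UH can pull automata from

  static boolean hasUnknownLeaf(Query query) {
    boolean[] unknown = new boolean[1];
    query.visit(new QueryVisitor() {
      @Override
      public void visitLeaf(Query leaf) {
        if (!KNOWN.contains(leaf.getClass())) {
          unknown[0] = true; // be conservative: avoid POSTINGS/NONE_NEEDED offset sources
        }
      }
    });
    return unknown[0];
  }
}
{code}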



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator

2019-06-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872142#comment-16872142
 ] 

Jim Ferenczi commented on LUCENE-8806:
--

I ran luceneutil with some disjunctions of phrase and term queries:
{noformat}
Task                   QPS baseline  StdDev     QPS patch  StdDev        Pct diff
HighPhraseHighTerm         8.47  (1.6%)      4.78  (2.6%)   -43.6% ( -47% -  -40%)
MedPhraseHighTerm         15.54  (1.2%)      9.41  (2.5%)   -39.5% ( -42% -  -36%)
HighPhraseHighPhrase       5.99  (1.4%)      3.65  (3.0%)   -39.0% ( -42% -  -35%)
HighPhraseLowPhrase       15.57  (1.2%)     14.26  (3.6%)    -8.4% ( -13% -   -3%)
LowPhraseLowPhrase        27.25  (2.0%)     31.75  (4.5%)    16.5% (   9% -   23%)
HighPhraseLowTerm         26.31  (0.9%)     31.42  (3.4%)    19.4% (  14% -   24%)
HighPhraseMedTerm         12.95  (1.0%)     15.74  (3.8%)    21.6% (  16% -   26%)
MedPhraseMedPhrase         9.21  (2.4%)     11.50  (8.3%)    24.9% (  13% -   36%)
MedPhraseLowTerm          24.85  (1.6%)     31.52  (5.5%)    26.8% (  19% -   34%)
MedPhraseLowPhrase        11.64  (2.3%)     15.06  (7.1%)    29.3% (  19% -   39%)
HighPhraseMedPhrase        8.27  (2.0%)     10.77  (7.2%)    30.2% (  20% -   40%)
MedPhraseMedTerm          14.53  (1.7%)     19.33  (5.6%)    33.0% (  25% -   40%)
{noformat}

While the change speeds up some cases, it also shows a non-negligible regression 
with high and med frequencies.
Currently the phrase scorer doesn't check impacts to compute the max score per 
block, so I tried to hack a simple patch that merges the impacts of the terms 
that appear in the phrase query. The patch keeps the minimum frequency per norm 
value in order to compute an upper bound of the score of the phrase query. I 
ran luceneutil again with the modified patch and the results are much better:
{noformat}
Task                   QPS baseline  StdDev     QPS patch  StdDev        Pct diff
HighPhraseHighTerm         8.22  (3.3%)      8.83  (1.9%)     7.4% (   2% -   12%)
LowPhraseLowPhrase        26.57  (0.7%)     28.55  (5.5%)     7.4% (   1% -   13%)
HighPhraseMedPhrase        7.98  (0.8%)      9.01  (5.0%)    12.9% (   7% -   18%)
MedPhraseMedPhrase         8.95  (1.4%)     10.11  (6.6%)    12.9% (   4% -   21%)
MedPhraseHighTerm         15.10  (1.1%)     17.69  (4.6%)    17.2% (  11% -   23%)
MedPhraseLowPhrase        11.17  (1.1%)     13.11  (4.9%)    17.4% (  11% -   23%)
HighPhraseLowPhrase       15.09  (1.5%)     18.85  (7.4%)    24.9% (  15% -   34%)
HighPhraseHighPhrase       5.75  (2.3%)      7.26  (4.5%)    26.2% (  18% -   33%)
HighPhraseLowTerm         25.68  (0.7%)     34.46  (2.4%)    34.2% (  30% -   37%)
MedPhraseMedTerm          14.23  (0.1%)     20.71  (2.3%)    45.5% (  43% -   47%)
MedPhraseLowTerm          24.30  (0.6%)     38.47  (2.4%)    58.3% (  55% -   61%)
HighPhraseMedTerm         12.77  (0.6%)     22.21  (3.1%)    73.9% (  69% -   77%)
{noformat}

However simple phrase queries (without disjunctions) seem to be slower with the 
merging of impacts:
{noformat}
Task          QPS baseline  StdDev     QPS patch  StdDev        Pct diff
HighPhrase        10.48  (0.0%)      9.74  (0.0%)    -7.1% (  -7% -   -7%)
MedPhrase         20.92  (0.0%)     20.25  (0.0%)    -3.2% (  -3% -   -3%)
LowPhrase         24.07  (0.0%)     23.33  (0.0%)    -3.1% (  -3% -   -3%)
{noformat}

I am not sure that the merging of impacts is correct so far, so I'll add some 
tests. It's also unrelated to this change (even if it helps for performance), so 
I'll open a separate issue to discuss the merging of impacts for phrase queries 
separately.
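To make the idea concrete, a toy sketch of the merging rule, under the
simplifying assumption that each term contributes one (norm -> freq) map per
block; real impacts are sorted lists and the actual patch is more involved:

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PhraseImpactMerger {
  /** Upper bound for a phrase: for each norm, the phrase frequency cannot
   *  exceed the frequency of its rarest term. */
  static Map<Long, Integer> mergeImpacts(List<Map<Long, Integer>> perTermImpacts) {
    Map<Long, Integer> merged = new HashMap<>(perTermImpacts.get(0));
    for (Map<Long, Integer> term : perTermImpacts.subList(1, perTermImpacts.size())) {
      merged.keySet().retainAll(term.keySet()); // keep norms present for every term
      merged.replaceAll((norm, freq) -> Math.min(freq, term.get(norm)));
    }
    return merged;
  }
}
{code}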
Considering the results of this change alone (the two-phase iterator for the 
WAND), I will not merge it yet since it doesn't improve queries with lots of 
matches, but we can revisit when/if the merging of impacts for phrase queries is 
implemented. WDYT?

> WANDScorer should support two-phase iterator
> 
>
> Key: LUCENE-8806
> URL: https://issues.apache.org/jira/browse/LUCENE-8806
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
> should leverage two-phase iterators in order to be faster when used in 
> conjunctions.




[jira] [Commented] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-18 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866367#comment-16866367
 ] 

Jim Ferenczi commented on LUCENE-8859:
--

Thanks for looking, Adrien. Currently users can add the file extension (.lkp) to 
the list of files to preload, but I agree that it could be simplified. Are you 
concerned about the fact that we could preload a file even when it is consumed 
only once (i.e. if the postings format loads the FST on-heap)?

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8859.patch
>
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I didn't run 
> any benchmark yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-14 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8859:
-
Attachment: LUCENE-8859.patch

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8859.patch
>
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I didn't run 
> any benchmark yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-14 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863810#comment-16863810
 ] 

Jim Ferenczi commented on LUCENE-8859:
--

Here is a patch that exposes an option to force the load on-heap or off-heap, or 
to pick automatically (making the decision based on the type of directory that 
is used, e.g. mmap vs. others). [^LUCENE-8859.patch] 
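A hedged sketch of the "auto" decision, with a hypothetical FSTLoadMode enum;
checking for a ByteBufferIndexInput is one way to detect that the input is
mmap-backed and supports cheap random access:

{code:java}
import org.apache.lucene.store.ByteBufferIndexInput;
import org.apache.lucene.store.IndexInput;

enum FSTLoadMode { ON_HEAP, OFF_HEAP, AUTO }

class FSTLoadDecision {
  static boolean loadOffHeap(FSTLoadMode mode, IndexInput input) {
    switch (mode) {
      case ON_HEAP:
        return false;
      case OFF_HEAP:
        return true;
      case AUTO:
      default:
        // only go off-heap when reads against the input are cheap (mmap-backed)
        return input instanceof ByteBufferIndexInput;
    }
  }
}
{code}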

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8859.patch
>
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I didn't run 
> any benchmark yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-14 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8859:
-
Priority: Minor  (was: Major)

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I didn't run 
> any benchmark yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-06-14 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8635:
-
Priority: Major  (was: Minor)

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used the below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Fix For: 8.0, 8.x, master (9.0)
>
> Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, 
> offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, the FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the term size gets big. A better way of 
> doing this would be to lazily load the FST using mmap. That ensures only the 
> required terms get loaded into memory.
>  
> Lucene can expose an API for providing a list of fields to load terms off-heap. 
> I'm planning to take the following approach for this:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass the list of off-heap fields to Lucene during index open (ALL can be a 
> special keyword for loading ALL fields off-heap)
>  # Initialize the fstOffHeap property during Lucene index open
>  # FieldReader invokes the default FST constructor or the OffHeap constructor 
> based on the fstOffHeap field
>  
> I created a patch (that loads all fields off-heap), did some benchmarks using 
> es_rally, and the results look good.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-14 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8859:


 Summary: Add an option to load the completion suggester's FST 
off-heap
 Key: LUCENE-8859
 URL: https://issues.apache.org/jira/browse/LUCENE-8859
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Now that FSTs can be loaded off-heap 
(https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to expose 
this option in the completion suggester postings format. I didn't run any 
benchmark yet so I can't say if this really makes sense or not, but I wanted to 
get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-06-14 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8635:
-
Priority: Minor  (was: Major)

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used the below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Minor
> Fix For: 8.0, 8.x, master (9.0)
>
> Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, 
> offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, the FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the term size gets big. A better way of 
> doing this would be to lazily load the FST using mmap. That ensures only the 
> required terms get loaded into memory.
>  
> Lucene can expose an API for providing a list of fields to load terms off-heap. 
> I'm planning to take the following approach for this:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass the list of off-heap fields to Lucene during index open (ALL can be a 
> special keyword for loading ALL fields off-heap)
>  # Initialize the fstOffHeap property during Lucene index open
>  # FieldReader invokes the default FST constructor or the OffHeap constructor 
> based on the fstOffHeap field
>  
> I created a patch (that loads all fields off-heap), did some benchmarks using 
> es_rally, and the results look good.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8845) Allow maxExpansions to be set on multi-term Intervals

2019-06-11 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860693#comment-16860693
 ] 

Jim Ferenczi commented on LUCENE-8845:
--

+1

> Allow maxExpansions to be set on multi-term Intervals
> -
>
> Key: LUCENE-8845
> URL: https://issues.apache.org/jira/browse/LUCENE-8845
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 8.2
>
> Attachments: LUCENE-8845.patch
>
>
> MultiTermIntervalsSource has a maxExpansions parameter which is always set to 
> 128 by the factory methods Intervals.prefix() and Intervals.wildcard().  We 
> should keep 128 as the default, but also add additional methods that take a 
> configurable maximum.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8845) Allow maxExpansions to be set on multi-term Intervals

2019-06-10 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860252#comment-16860252
 ] 

Jim Ferenczi commented on LUCENE-8845:
--

{quote}
2) I think this is covered by the javadocs and the 'expert' marking. Some users 
really do need to see all expansions, and if they're aware of the trade-offs 
involved then I don't think we need any further hard caps.
{quote}

I think we should try to prevent users from shooting themselves in the foot. IMO 
this is more important than for other queries because reaching the limit throws 
an error, so I expect that users will raise the limit until they find a number 
that works for all queries. Can we add a hard limit equal to max_boolean_clause? 
This would be consistent with the discussion in 
https://issues.apache.org/jira/browse/LUCENE-8811, which should also check the 
number of sources in an interval query?

> Allow maxExpansions to be set on multi-term Intervals
> -
>
> Key: LUCENE-8845
> URL: https://issues.apache.org/jira/browse/LUCENE-8845
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 8.2
>
> Attachments: LUCENE-8845.patch
>
>
> MultiTermIntervalsSource has a maxExpansions parameter which is always set to 
> 128 by the factory methods Intervals.prefix() and Intervals.wildcard().  We 
> should keep 128 as the default, but also add additional methods that take a 
> configurable maximum.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-10 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859778#comment-16859778
 ] 

Jim Ferenczi commented on LUCENE-8812:
--

Thanks [~danmuzi]!

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8840) TopTermsBlendedFreqScoringRewrite should use SynonymQuery

2019-06-07 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858600#comment-16858600
 ] 

Jim Ferenczi commented on LUCENE-8840:
--

{quote}
I am curious to understand how including doc frequencies can be better than the 
overall score. IMO, including BM25 scores gives us some additional advantages, 
such as defending against cases where the overall non matching token count in a 
document is significantly high. Did you see any scenarios that had relevance 
troubles due to inclusion of entire BM25 scores?
{quote}

The idea of the SynonymQuery is to score the terms as if they were indexed as a 
single term. I think this fits nicely with the fuzzy query. For instance, 
imagine a fuzzy query with the terms "bad" and "baz". With the current solution, 
if a document contains both terms it will rank significantly higher than 
documents that contain only one of them. This can change depending on the inner 
doc frequencies, but this doesn't seem right IMO. On the contrary, the synonym 
query would give the same score to a document containing "baz" with a frequency 
of 4 as to another document containing "bad" and "baz" 2 times each. This feels 
more natural to me because we shouldn't favor documents that contain multiple 
variations of the same fuzzy term. 
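For illustration, a sketch contrasting the two behaviors on a hypothetical 
"body" field, assuming the varargs Term constructor of SynonymQuery:

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SynonymQuery;
import org.apache.lucene.search.TermQuery;

class FuzzyRewriteExample {
  // current behavior (roughly): each matching variation adds its own BM25 score
  static Query asDisjunction() {
    return new BooleanQuery.Builder()
        .add(new TermQuery(new Term("body", "bad")), BooleanClause.Occur.SHOULD)
        .add(new TermQuery(new Term("body", "baz")), BooleanClause.Occur.SHOULD)
        .build();
  }

  // proposed behavior: both variations scored as one pseudo-term
  static Query asSynonym() {
    return new SynonymQuery(new Term("body", "bad"), new Term("body", "baz"));
  }
}
{code}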

{quote}
On a different note, I am also wondering if we should devise relevance tests 
which allow us to measure the relevance impact of a change. Something added to 
luceneutil should be nice. Thoughts?
{quote}

That would be great, but this doesn't look like low-hanging fruit. Maybe open 
a separate issue to discuss?

{quote}
IMO if we want to restrict the contribution of each term to the blended query's 
final score, then we could think of a blended scorer step which utilizes 
something on the lines of BM25's term frequency saturation when merging scores 
from different blended terms. WDYT?
{quote}

I am not sure I fully understand but the SynonymQuery kind of does that. It 
sums the inner doc frequencies of all matching terms to ensure that the 
contribution of each term to the final score is bounded. 

> TopTermsBlendedFreqScoringRewrite should use SynonymQuery
> -
>
> Key: LUCENE-8840
> URL: https://issues.apache.org/jira/browse/LUCENE-8840
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8840.patch
>
>
> Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite 
> method for Fuzzy queries, uses the BlendedTermQuery to score documents that 
> match the fuzzy terms. This query blends the frequencies used for scoring 
> across the terms and creates a disjunction of all the blended terms. This 
> means that each fuzzy term that match in a document will add their BM25 score 
> contribution. We already have a query that can blend the statistics of 
> multiple terms in a single scorer that sums the doc frequencies rather than 
> the entire BM25 score: the SynonymQuery. Since 
> https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles 
> boost between 0 and 1 so it should be easy to change the default rewrite 
> method for Fuzzy queries to use it instead of the BlendedTermQuery. This 
> would bound the contribution of each term to the final score which seems a 
> better alternative in terms of relevancy than the current solution. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8840) TopTermsBlendedFreqScoringRewrite should use SynonymQuery

2019-06-07 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8840:


 Summary: TopTermsBlendedFreqScoringRewrite should use SynonymQuery
 Key: LUCENE-8840
 URL: https://issues.apache.org/jira/browse/LUCENE-8840
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite 
method for Fuzzy queries, uses the BlendedTermQuery to score documents that 
match the fuzzy terms. This query blends the frequencies used for scoring 
across the terms and creates a disjunction of all the blended terms. This means 
that each fuzzy term that matches in a document will add its BM25 score 
contribution. We already have a query that can blend the statistics of multiple 
terms in a single scorer that sums the doc frequencies rather than the entire 
BM25 score: the SynonymQuery. Since 
https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles boost 
between 0 and 1 so it should be easy to change the default rewrite method for 
Fuzzy queries to use it instead of the BlendedTermQuery. This would bound the 
contribution of each term to the final score which seems a better alternative 
in terms of relevancy than the current solution. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-06 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857911#comment-16857911
 ] 

Jim Ferenczi commented on LUCENE-8812:
--

Sorry, I didn't see your reply. I agree with you that it is ambiguous to put it 
in analysis-common, so +1 to adding it to the nori module for now and revisiting 
if/when we create a separate module for the mecab tokenizer. 

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-05-30 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851778#comment-16851778
 ] 

Jim Ferenczi commented on LUCENE-8812:
--

The patch looks good [~danmuzi]. I wonder if it would be difficult to have a 
base class for the Japanese and Korean number filters since they share a large 
amount of code. However, I think it's ok to merge this first and we can tackle 
the merge in a follow-up, wdyt?

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851574#comment-16851574
 ] 

Jim Ferenczi commented on LUCENE-8816:
--

This sounds like a great plan [~tomoko]. Decoupling the system dictionary 
should help with the merge of the Korean tokenizer, but I agree that this merge 
is out of scope for this issue.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, the default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for many 
> years. While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts have been made in Japan.
> However the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulties doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's a good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (it's up to users and should be outside 
> the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849778#comment-16849778
 ] 

Jim Ferenczi commented on LUCENE-8816:
--

We discussed this when we added the Korean module and said that we could have a 
separate module to handle "mecab-like" tokenization and one module per 
dictionary (ipadic, mecab-ko-dic, ...). There are some assertions in the 
JapaneseTokenizer that check some invariants of the ipadic (leftId == rightId 
for instance) but I guess we could move them into the dictionary module. This 
could be a nice cleanup if the goal is to handle multiple mecab dictionaries 
(in different languages).

 

{quote}

While it has been slowly obsoleted, well-maintained and/or extended 
dictionaries risen up in recent years (e.g. 
[mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
[UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
attempts/projects/efforts are made in Japan.

{quote}

 

While allowing more flexibility would be nice, I wonder if there are that many 
different dictionaries. If the ipadic is obsolete we could also adapt the main 
distribution (kuromoji) to use the UniDic instead. Even if we handle multiple 
dictionaries we'll still need to provide a way for users to add custom entries. 
Mecab has an option to compute the leftId, rightId and cost automatically from 
a partial user entry, so I wonder if this could help users avoid having to 
reimplement a dictionary from scratch?

 

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, the default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for many 
> years. While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts have been made in Japan.
> However the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulties doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's a good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (it's up to users and should be outside 
> the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-27 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8784.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

Thanks [~danmuzi]! I pushed to master and branch_8x. I also removed the 
discardPunctuations option from the KoreanAnalyzer in order to be consistent 
with the JapaneseAnalyzer. It's an advanced option that should be used with a 
specific token filter in mind (KoreanNumberFilter for instance).

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
>  In this case, it is difficult to index keywords that include a decimal 
> point.
> It would be nice if there were an option for whether to keep the decimal point 
> or not.
> Like the Japanese tokenizer does, Nori needs an option to preserve the decimal 
> point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847695#comment-16847695
 ] 

Jim Ferenczi commented on LUCENE-8784:
--

The last patch for this issue looks good to me. I'll test locally and merge if 
all tests pass. 

Thanks for opening LUCENE-8812, I'll take a look when this issue gets merged.

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
>  In this case, it is difficult to index keywords that include a decimal 
> point.
> It would be nice if there were an option for whether to keep the decimal point 
> or not.
> Like the Japanese tokenizer does, Nori needs an option to preserve the decimal 
> point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8788) Order LeafReaderContexts by Estimated Number Of Hits

2019-05-24 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847455#comment-16847455
 ] 

Jim Ferenczi commented on LUCENE-8788:
--

{quote}

I like the idea [~jim.ferenczi] proposed. I can open a Jira for that and work 
on a patch for it as well, unless Jim wants to do it himself?

{quote}

Something is needed for the search side, and this issue is the right place to 
add such functionality. I wonder if we need an issue for the merge side, 
though, since it's already possible to change the order of segments in a custom 
FilterMergePolicy. I tried to do it in a POC and the change is trivial, so I am 
not sure that we need to do anything in core.

> Order LeafReaderContexts by Estimated Number Of Hits
> 
>
> Key: LUCENE-8788
> URL: https://issues.apache.org/jira/browse/LUCENE-8788
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We offer no guarantee on the order in which an IndexSearcher will look at 
> segments during a search operation. This can be improved for use cases where 
> an engine using Lucene invokes early termination and uses the partially 
> collected hits. A better model would be if we sorted segments by the 
> estimated number of hits, thus increasing the probability of the overall 
> relevance of the returned partial results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847381#comment-16847381
 ] 

Jim Ferenczi commented on LUCENE-8784:
--

{quote}

By the way, would not it be better to leave the constructors that do not use 
discardPunctuation parameters?
(Existing Nori users have to modify the code after uploading)

{quote}

Yes we should do that, otherwise it's a breaking change and we cannot push to 
8x.

{quote}

I also added Javadoc for discardPunctuation in your patch. (KoreanAnalyzer, 
KoreanTokenizerFactory)

{quote}

thanks!

{quote}

I developed KoreanNumberFilter by referring to JapaneseNumberFilter.
Please check my patch :D

{quote}

The patch looks good, but we should iterate on this in a new issue. We try to do 
one feature at a time in a single issue, so let's add discardPunctuation in this 
one and we can open a new one as a follow-up to add the KoreanNumberFilter?
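A minimal sketch of what keeping the old constructors could look like, assuming 
the new flag defaults to the previous behavior (punctuation discarded):

{code:java}
// hypothetical back-compat overload inside KoreanTokenizer: delegate to the new
// constructor with discardPunctuation=true, the behavior existing users rely on
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary,
                       DecompoundMode mode, boolean outputUnknownUnigrams) {
  this(factory, userDictionary, mode, outputUnknownUnigrams, true);
}
{code}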

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
>  In this case, it is difficult to index keywords that include a decimal 
> point.
> It would be nice if there were an option for whether to keep the decimal point 
> or not.
> Like the Japanese tokenizer does, Nori needs an option to preserve the decimal 
> point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-22 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845856#comment-16845856
 ] 

Jim Ferenczi commented on LUCENE-8784:
--

Hi [~danmuzi],
I don't think we should have one option for every punctuation type, and the 
current check in the patch based on Character.OTHER_PUNCTUATION would match 
more than just the full stop character. If we want to preserve punctuation we 
can add the same option as for Kuromoji (discardPunctuation) and output a 
token for each punctuation group. So for an input like "10.1?" we would output 
4 tokens: "10", ".", "1", "?". Then if you need to "regroup" tokens based on 
additional rules you can add another filter to do this, like the 
JapaneseNumberFilter does. The other option would be to detect numbers with 
decimal points accurately like the standard tokenizer does, but we don't want 
to reinvent the wheel either. If we want the same grouping for unknown words in 
this tokenizer we should probably implement it on top of the standard or ICU 
tokenizer directly.
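For instance, a hedged sketch of that pipeline once both pieces exist; the 
constructor shape (discardPunctuation as the last flag) and the filter name are 
the ones discussed in this issue and its follow-up, not a committed API:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanNumberFilter;
import org.apache.lucene.analysis.ko.KoreanTokenizer;

class DecimalPreservingKorean extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer(
        TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY,
        null,                                    // no user dictionary
        KoreanTokenizer.DecompoundMode.DISCARD,
        false,                                   // outputUnknownUnigrams
        false);                                  // discardPunctuation=false keeps "." and ","
    // a follow-up filter can then regroup "10", ".", "1" into "10.1"
    return new TokenStreamComponents(tokenizer, new KoreanNumberFilter(tokenizer));
  }
}
{code}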

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
>  In this case, it is difficult to index keywords that include a decimal 
> point.
> It would be nice if there were an option for whether to keep the decimal point 
> or not.
> Like the Japanese tokenizer does, Nori needs an option to preserve the decimal 
> point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8770) BlockMaxConjunctionScorer should support two-phase scorers

2019-05-21 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8770.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

Thanks [~jpountz]!

> BlockMaxConjunctionScorer should support two-phase scorers
> --
>
> Key: LUCENE-8770
> URL: https://issues.apache.org/jira/browse/LUCENE-8770
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8770.patch, LUCENE-8770.patch
>
>
> The support for two-phase scorers in BlockMaxConjunctionScorer is missing. 
> This can slow down some queries that need to execute costly second phase on 
> more documents.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8806) WANDScorer should support two-phase iterator

2019-05-21 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8806:


 Summary: WANDScorer should support two-phase iterator
 Key: LUCENE-8806
 URL: https://issues.apache.org/jira/browse/LUCENE-8806
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
should leverage two-phase iterators in order to be faster when used in 
conjunctions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8770) BlockMaxConjunctionScorer should support two-phase scorers

2019-05-21 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844612#comment-16844612
 ] 

Jim Ferenczi commented on LUCENE-8770:
--

{quote}
 I wonder how useful computing the score in the two-phase and the iterator help 
now, can we get rid of it or would it hurt?
{quote}

I think we can. I ran the benchmark without the score check in the iterator, 
and here's the result:


{noformat}
Task          QPS baseline  StdDev     QPS patch  StdDev        Pct diff
AndHighMed        58.08  (4.9%)     54.60 (11.0%)    -6.0% ( -20% -   10%)
AndHighHigh       23.08  (7.5%)     22.64 (12.1%)    -1.9% ( -19% -   19%)
AndHighLow       427.13  (4.7%)    434.58 (10.5%)     1.7% ( -12% -   17%)
{noformat}


> BlockMaxConjunctionScorer should support two-phase scorers
> --
>
> Key: LUCENE-8770
> URL: https://issues.apache.org/jira/browse/LUCENE-8770
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8770.patch
>
>
> The support for two-phase scorers in BlockMaxConjunctionScorer is missing. 
> This can slow down some queries that need to execute costly second phase on 
> more documents.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8770) BlockMaxConjunctionScorer should support two-phase scorers

2019-05-20 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8770:
-
Attachment: (was: LUCENE-8770.patch)

> BlockMaxConjunctionScorer should support two-phase scorers
> --
>
> Key: LUCENE-8770
> URL: https://issues.apache.org/jira/browse/LUCENE-8770
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> The support for two-phase scorers in BlockMaxConjunctionScorer is missing. 
> This can slow down some queries that need to execute costly second phase on 
> more documents.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-09 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-7840.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

Thanks [~atris]!

> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-7840.patch, LUCENE-7840.patch, LUCENE-7840.patch
>
>
> I haven't thought this through completely, let alone write up a patch / test 
> case, but IIUC...
> We should be able to optimize  {{ BooleanQuery rewriteNoScoring() }} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-07 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834756#comment-16834756
 ] 

Jim Ferenczi commented on LUCENE-7840:
--

Thanks [~atris], it looks good to me too. I'll commit shortly.

> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-7840.patch, LUCENE-7840.patch
>
>
> I haven't thought this through completely, let alone write up a patch / test 
> case, but IIUC...
> We should be able to optimize  {{ BooleanQuery rewriteNoScoring() }} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-07 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834417#comment-16834417
 ] 

Jim Ferenczi commented on LUCENE-7840:
--

Can you build the new query in a single pass? You could check the condition for 
adding the SHOULD clauses before the loop with:
{code:java}
boolean keepShould = getMinimumNumberShouldMatch() > 0 || 
(clauseSets.get(Occur.MUST).size() + clauseSets.get(Occur.FILTER).size() == 0);
{code}
and then add the SHOULD clauses in the main loop if keepShould is true? 
Something like the sketch below:
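A rough sketch of that loop, assuming it runs inside BooleanQuery where 
clauses(), getMinimumNumberShouldMatch() and the keepShould flag above are in 
scope:

{code:java}
BooleanQuery.Builder newQuery = new BooleanQuery.Builder();
newQuery.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
for (BooleanClause clause : clauses()) {
  switch (clause.getOccur()) {
    case MUST:
      newQuery.add(clause.getQuery(), BooleanClause.Occur.FILTER); // scores are not needed
      break;
    case SHOULD:
      if (keepShould) {
        newQuery.add(clause);
      }
      break; // otherwise drop the clause entirely
    default:
      newQuery.add(clause);
      break;
  }
}
{code}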

> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-7840.patch
>
>
> I haven't thought this through completely, let alone write up a patch / test 
> case, but IIUC...
> We should be able to optimize  {{ BooleanQuery rewriteNoScoring() }} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-06 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833727#comment-16833727
 ] 

Jim Ferenczi commented on LUCENE-7840:
--

I think so, yes. We don't need to build the scorer supplier for the SHOULD 
clauses, so it makes sense to move the logic there.

> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
>
> I haven't thought this through completely, let alone write up a patch / test 
> case, but IIUC...
> We should be able to optimize  {{ BooleanQuery rewriteNoScoring() }} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-06 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833621#comment-16833621
 ] 

Jim Ferenczi commented on LUCENE-7840:
--

Note that the logic to remove SHOULD clauses is already implemented when we 
build the Scorer:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java#L391
Moving the logic to rewriteNoScoring makes sense to me but this won't optimize 
anything since the removal is already in place.


> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
>
> I haven't thought this through completely, let alone write up a patch / test 
> case, but IIUC...
> We should be able to optimize  {{ BooleanQuery rewriteNoScoring() }} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8772) [nori] A word that is registered in advance, but the words are not separated and recognized as 'UNKNOWN'

2019-04-19 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821810#comment-16821810
 ] 

Jim Ferenczi commented on LUCENE-8772:
--

That's expected since the unknown word heuristic is to group characters of the 
same class together. In this case `갊수학` is considered a single word and `갊` 
is unknown, so we jump to the end of the unknown word to find new entries. You 
can add `갊` to the user dict or a special rule `갊수학 갊 수학` that will decompose 
the terms. We could also change the heuristic to add unknown words of length 1 
in order to be able to detect user words inside unknown blocks, but I wonder if 
the cost of doing that would not be prohibitive.
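A hedged sketch of the user-dictionary workaround, assuming Nori's plain 
"surface segment segment ..." rule format and the pre-discardPunctuation 
constructor shape:

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.ko.dict.UserDictionary;

class UserDictWorkaround {
  static Tokenizer build() throws Exception {
    // one rule per line: the surface form followed by its decomposition
    UserDictionary userDict = UserDictionary.open(new StringReader("갊수학 갊 수학"));
    return new KoreanTokenizer(
        TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY,
        userDict,
        KoreanTokenizer.DecompoundMode.NONE,
        false); // outputUnknownUnigrams
  }
}
{code}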

> [nori]  A word that is registered in advance, but the words are not separated 
> and recognized as 'UNKNOWN'
> -
>
> Key: LUCENE-8772
> URL: https://issues.apache.org/jira/browse/LUCENE-8772
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.5, 7.6, 7.7, 7.7.1, 8.0
>Reporter: YOO JEONGIN
>Priority: Major
> Attachments: image-2019-04-19-11-32-56-310.png
>
>
> hello,
> With 'nori', if no dictionary word matches starting from the left, the whole 
> run is analyzed as 'UNKNOWN' even if a registered word appears in the middle.
>  So here is the question.
>  Does nori analyze only from the left side and not from the right 
> side?
>  Could this be solved?
>  
> ex)
> input => 갊수학
> Condition
> dictionary registered : 수학
>  dictionary Unregistered : 갊
> result => 갊수학
> !image-2019-04-19-11-32-56-310.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8770) BlockMaxConjunctionScorer should support two-phase scorers

2019-04-18 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8770:


 Summary: BlockMaxConjunctionScorer should support two-phase scorers
 Key: LUCENE-8770
 URL: https://issues.apache.org/jira/browse/LUCENE-8770
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


The support for two-phase scorers in BlockMaxConjunctionScorer is missing. This 
can slow down queries that then have to execute a costly second phase on more 
documents than necessary.
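
For context, this is the general two-phase pattern, shown here as a free-standing sketch rather than the actual scorer code (assumes the usual org.apache.lucene.search and java.util imports):

{code:java}
// Illustrative sketch of two-phase iteration for a conjunction: iterate the
// cheap approximation first and run the costly matches() check as late as
// possible, so fewer documents pay for the second phase.
static int nextMatch(DocIdSetIterator approximation,
                     List<TwoPhaseIterator> twoPhases) throws IOException {
  for (int doc = approximation.nextDoc();
       doc != DocIdSetIterator.NO_MORE_DOCS;
       doc = approximation.nextDoc()) {
    boolean matches = true;
    for (TwoPhaseIterator tpi : twoPhases) {
      if (tpi.matches() == false) { // costly per-document check
        matches = false;
        break;
      }
    }
    if (matches) {
      return doc;
    }
  }
  return DocIdSetIterator.NO_MORE_DOCS;
}
{code}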



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8759) BlockMaxConjunctionScorer's simplified way of computing max scores hurts performance

2019-04-16 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819179#comment-16819179
 ] 

Jim Ferenczi commented on LUCENE-8759:
--

Thanks, that looks cleaner. I think I got confused because I was not 
subtracting 1 from the floatBits. I tested your solution on all positive floats 
and it works, so I'll commit soon unless you have objections.
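
For readers following along, here is one way to express the property being tested, finding the smallest double that casts to a given positive float (an illustration, not the bit-twiddling version discussed here):

{code:java}
// Illustrative: the smallest positive double d such that (float) d == f.
// Doubles below the halfway point between Math.nextDown(f) and f round down
// to the previous float; the halfway point itself is a round-to-nearest-even
// tie, so it must be checked explicitly.
static double minDoubleThatCastsTo(float f) {
  assert f > 0 && Float.isFinite(f);
  double previous = Math.nextDown(f); // previous float, widened to a double
  double halfway = (previous + (double) f) / 2; // exactly representable
  return ((float) halfway == f) ? halfway : Math.nextUp(halfway);
}
{code}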

> BlockMaxConjunctionScorer's simplified way of computing max scores hurts 
> performance
> 
>
> Key: LUCENE-8759
> URL: https://issues.apache.org/jira/browse/LUCENE-8759
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8759.patch
>
>
> BlockMaxConjunctionScorer computes the minimum value that the score should 
> have after each scorer in order to be able to interrupt scorer as soon as 
> possible. For instance say scorers A, B and C produce maximum scores that are 
> equal to 4, 2 and 1. If the minimum competitive score is X, then the score 
> after scoring A, B and C must be at least X, the score after scoring A and B 
> must be at least X-1 and the score after scoring A must be at least X-1-2.
> However this is made a bit more complex than that due to floating-point 
> numbers and the fact that intermediate score values are doubles which only 
> get casted to a float after all values have been summed up. In order to keep 
> things simple, BlockMaxConjunctionScore has the following comment and code
> {code}
> // Also compute the minimum required scores for a hit to be 
> competitive
> // A double that is less than 'score' might still be converted to 
> 'score'
> // when casted to a float, so we go to the previous float to avoid 
> this issue
> minScores[minScores.length - 1] = minScore > 0 ? 
> Math.nextDown(minScore) : 0;
> {code}
> It simplifies the problem by calling Math.nextDown(minScore). However this is 
> problematic because it defeats the fact that TopScoreDocCollector calls 
> setMinCompetitiveScore on the float value that is immediately greater than 
> the k-th greatest hit so far.
> nextDown(minScore) is not the value that we need. The value that we need is 
> the smallest double that converts to minScore when casted to a float, which 
> would be half-way between nextDown(minScore) and minScore. In some cases this 
> would help get better performance out of conjunctions, especially if some 
> clauses produce constant scores.
> MaxScoreSumPropagator#setMinCompetitiveScore has the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8759) BlockMaxConjunctionScorer's simplified way of computing max scores hurts performance

2019-04-16 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819049#comment-16819049
 ] 

Jim Ferenczi commented on LUCENE-8759:
--

I tried this approach, but the shift for denormalized numbers is computed from 
the float bits, so I started from the same bits for normalized numbers too.

> BlockMaxConjunctionScorer's simplified way of computing max scores hurts 
> performance
> 
>
> Key: LUCENE-8759
> URL: https://issues.apache.org/jira/browse/LUCENE-8759
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8759.patch
>
>
> BlockMaxConjunctionScorer computes the minimum value that the score should 
> have after each scorer in order to be able to interrupt scorer as soon as 
> possible. For instance say scorers A, B and C produce maximum scores that are 
> equal to 4, 2 and 1. If the minimum competitive score is X, then the score 
> after scoring A, B and C must be at least X, the score after scoring A and B 
> must be at least X-1 and the score after scoring A must be at least X-1-2.
> However this is made a bit more complex than that due to floating-point 
> numbers and the fact that intermediate score values are doubles which only 
> get casted to a float after all values have been summed up. In order to keep 
> things simple, BlockMaxConjunctionScore has the following comment and code
> {code}
> // Also compute the minimum required scores for a hit to be 
> competitive
> // A double that is less than 'score' might still be converted to 
> 'score'
> // when casted to a float, so we go to the previous float to avoid 
> this issue
> minScores[minScores.length - 1] = minScore > 0 ? 
> Math.nextDown(minScore) : 0;
> {code}
> It simplifies the problem by calling Math.nextDown(minScore). However this is 
> problematic because it defeats the fact that TopScoreDocCollector calls 
> setMinCompetitiveScore on the float value that is immediately greater than 
> the k-th greatest hit so far.
> nextDown(minScore) is not the value that we need. The value that we need is 
> the smallest double that converts to minScore when casted to a float, which 
> would be half-way between nextDown(minScore) and minScore. In some cases this 
> would help get better performance out of conjunctions, especially if some 
> clauses produce constant scores.
> MaxScoreSumPropagator#setMinCompetitiveScore has the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8751) Weight#matches should use the scorerSupplier to create scorers

2019-04-10 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8751.
--
   Resolution: Fixed
Fix Version/s: master (9.0)
   8.1

> Weight#matches should use the scorerSupplier to create scorers
> --
>
> Key: LUCENE-8751
> URL: https://issues.apache.org/jira/browse/LUCENE-8751
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: 8.1, master (9.0)
>
> Attachments: LUCENE-8751.patch
>
>
> The default implementation for Weight#matches creates a scorer to check if 
> the document matches. Since this API is per document it would be more 
> efficient to create a ScorerSupplier and then create the scorer with a 
> leadCost of 1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7386) Flatten nested disjunctions

2019-04-09 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813236#comment-16813236
 ] 

Jim Ferenczi commented on LUCENE-7386:
--

+1, I also find it easier to read when the simplification is done at the 
rewrite level.

> Flatten nested disjunctions
> ---
>
> Key: LUCENE-7386
> URL: https://issues.apache.org/jira/browse/LUCENE-7386
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7386.patch, LUCENE-7386.patch, LUCENE-7386.patch
>
>
> Now that coords are gone it became easier to flatten nested disjunctions. It 
> might sound weird to write nested disjunctions in the first place, but 
> disjunctions can be created implicitly by other queries such as 
> more-like-this, LatLonPoint.newBoxQuery, non-scoring synonym queries, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8701) Speed up ToParentBlockJoinQuery when total hit count is not needed

2019-04-05 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8701.
--
   Resolution: Fixed
Fix Version/s: master (9.0)
   8.1

Thanks Adrien, I pushed a patch that also rewrites the constant score query.

> Speed up ToParentBlockJoinQuery when total hit count is not needed
> --
>
> Key: LUCENE-8701
> URL: https://issues.apache.org/jira/browse/LUCENE-8701
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: 8.1, master (9.0)
>
> Attachments: LUCENE-8701.patch, LUCENE-8701.patch
>
>
> We spotted a regression on nested queries in the Elasticsearch nightly track:
> https://elasticsearch-benchmarks.elastic.co/index.html#tracks/nested/nightly/30d
> It seems related to the fact that we propagate the TOP_SCORES score mode to 
> the child query even though we don't compute a max score in the 
> BlockJoinScorer and don't propagate the minimum score either. Since it is not 
> possible to compute a max score for a document that depends on other 
> documents (the children) we should probably force the score mode to COMPLETE 
> to build the child scorer. This should avoid the overhead of loading and 
> reading the impacts. It should also be possible to early terminate queries 
> that use the ScoreMode.None mode since in this case the score of each parent 
> document is the same.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8751) Weight#matches should use the scorerSupplier to create scorers

2019-04-03 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8751:


 Summary: Weight#matches should use the scorerSupplier to create 
scorers
 Key: LUCENE-8751
 URL: https://issues.apache.org/jira/browse/LUCENE-8751
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


The default implementation for Weight#matches creates a scorer to check if the 
document matches. Since this API is per document it would be more efficient to 
create a ScorerSupplier and then create the scorer with a leadCost of 1. 
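
A rough sketch of the idea (illustrative, not necessarily the final patch; MatchesUtils.MATCH_WITH_NO_TERMS is the marker used when no term-level match information is available):

{code:java}
// Illustrative sketch of the proposal: ask the ScorerSupplier for a scorer
// with leadCost = 1, since only a single document needs to be verified.
@Override
public Matches matches(LeafReaderContext context, int doc) throws IOException {
  ScorerSupplier scorerSupplier = scorerSupplier(context);
  if (scorerSupplier == null) {
    return null;
  }
  Scorer scorer = scorerSupplier.get(1); // leadCost of 1: one doc to check
  TwoPhaseIterator twoPhase = scorer.twoPhaseIterator();
  if (twoPhase == null) {
    if (scorer.iterator().advance(doc) != doc) {
      return null;
    }
  } else if (twoPhase.approximation().advance(doc) != doc
      || twoPhase.matches() == false) {
    return null;
  }
  return MatchesUtils.MATCH_WITH_NO_TERMS;
}
{code}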



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8730) Ensure WordDelimiterGraphFilter always emits its original token first

2019-04-01 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806975#comment-16806975
 ] 

Jim Ferenczi commented on LUCENE-8730:
--

+1, thanks Alan

> Ensure WordDelimiterGraphFilter always emits its original token first
> -
>
> Key: LUCENE-8730
> URL: https://issues.apache.org/jira/browse/LUCENE-8730
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8730.patch, LUCENE-8730.patch
>
>
> WordDelimiterFilter and WordDelimiterGraphFilter behave almost identically 
> outside setting position length; the only difference being that WDGF can 
> sometimes emit its original token as the second output token rather than the 
> first.  We should change this to conform to the behaviour of the older filter 
> - this will make it much easier to remove WDF entirely and cut over tests 
> that use it incidentally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8730) Ensure WordDelimiterGraphFilter always emits its original token first

2019-04-01 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806642#comment-16806642
 ] 

Jim Ferenczi commented on LUCENE-8730:
--

+1 to output the original token first. Is it possible to set the original token 
offset (savedTermLength) once, since the value doesn't change? I also wonder if 
the first value in the buffer should be excluded from the sort entirely (e.g. 
call sorter.sort(1, bufferedLen)) to ensure correctness?
 

> Ensure WordDelimiterGraphFilter always emits its original token first
> -
>
> Key: LUCENE-8730
> URL: https://issues.apache.org/jira/browse/LUCENE-8730
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8730.patch
>
>
> WordDelimiterFilter and WordDelimiterGraphFilter behave almost identically 
> outside setting position length; the only difference being that WDGF can 
> sometimes emit its original token as the second output token rather than the 
> first.  We should change this to conform to the behaviour of the older filter 
> - this will make it much easier to remove WDF entirely and cut over tests 
> that use it incidentally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8732) Allow ConstantScoreQuery to skip counting hits

2019-03-27 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8732.
--
   Resolution: Fixed
Fix Version/s: master (9.0)
   8.1

> Allow ConstantScoreQuery to skip counting hits
> --
>
> Key: LUCENE-8732
> URL: https://issues.apache.org/jira/browse/LUCENE-8732
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: 8.1, master (9.0)
>
> Attachments: LUCENE-8732.patch
>
>
> We already have a ConstantScoreScorer that knows how to early terminate the 
> collection but the ConstantScoreQuery uses a private scorer that doesn't take 
> advantage of setMinCompetitiveScore. This issue is about reusing the 
> ConstantScoreScorer in the ConstantScoreQuery in order to early terminate 
> queries that don't need to compute the total number of hits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8732) Allow ConstantScoreQuery to skip counting hits

2019-03-22 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798997#comment-16798997
 ] 

Jim Ferenczi commented on LUCENE-8732:
--

{noformat}
Should we keep returning the sub scorer via getChildren()?
{noformat}
I removed it because we build the inner weight with COMPLETE_NO_SCORES, so we 
shouldn't allow calling score() on the inner scorer?

> Allow ConstantScoreQuery to skip counting hits
> --
>
> Key: LUCENE-8732
> URL: https://issues.apache.org/jira/browse/LUCENE-8732
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8732.patch
>
>
> We already have a ConstantScoreScorer that knows how to early terminate the 
> collection but the ConstantScoreQuery uses a private scorer that doesn't take 
> advantage of setMinCompetitiveScore. This issue is about reusing the 
> ConstantScoreScorer in the ConstantScoreQuery in order to early terminate 
> queries that don't need to compute the total number of hits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8477) Improve handling of inner disjunctions in intervals

2019-03-20 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797618#comment-16797618
 ] 

Jim Ferenczi commented on LUCENE-8477:
--

The patch fixes disjunctions that share a common prefix, but the same problem 
can arise for disjunctions that share suffixes. For instance the query or(york, 
BLOCK(new, york)) has the same minimum interval semantics as "york". So a 
query like BLOCK(in, or(york, BLOCK(new, york))) will not match "in new york" 
because "new york" is discarded by the minimum interval "york". We could apply 
the same logic and rewrite the query automatically, but I am sure we can find 
other pathological cases due to minimum interval semantics. IMO we should 
document this unintuitive behavior rather than rewriting all queries in a 
non-optimal form. 
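
For concreteness, the pathological query above expressed with the Intervals API from the queries module (the field name "body" is just a placeholder):

{code:java}
// Illustrative: build BLOCK(in, or(york, BLOCK(new, york))).
IntervalsSource york = Intervals.term("york");
IntervalsSource newYork = Intervals.phrase(Intervals.term("new"), Intervals.term("york"));
// or(york, BLOCK(new, york)) has the same minimal intervals as plain "york",
// so the enclosing block will not match "in new york".
IntervalsSource block = Intervals.phrase(Intervals.term("in"), Intervals.or(york, newYork));
Query query = new IntervalQuery("body", block);
{code}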

> Improve handling of inner disjunctions in intervals
> ---
>
> Key: LUCENE-8477
> URL: https://issues.apache.org/jira/browse/LUCENE-8477
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8477.patch, LUCENE-8477.patch, LUCENE-8477.patch, 
> LUCENE-8477.patch
>
>
> The current implementation of the disjunction interval produced by 
> {{Intervals.or}} is a direct implementation of the OR operator from the Vigna 
> paper.  This produces minimal intervals, meaning that (a) is preferred over 
> (a b), and (b) also over (a b).  This has advantages when it comes to 
> counting intervals for scoring, but also has drawbacks when it comes to 
> matching.  For example, a phrase query for ((a OR (a b)) BLOCK (c)) will not 
> match the document (a b c), because (a) will be preferred over (a b), and (a 
> c) does not match.
> This ticket is to discuss the best way of dealing with disjunctions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8732) Allow ConstantScoreQuery to skip counting hits

2019-03-20 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8732:


 Summary: Allow ConstantScoreQuery to skip counting hits
 Key: LUCENE-8732
 URL: https://issues.apache.org/jira/browse/LUCENE-8732
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


We already have a ConstantScoreScorer that knows how to early terminate the 
collection but the ConstantScoreQuery uses a private scorer that doesn't take 
advantage of setMinCompetitiveScore. This issue is about reusing the 
ConstantScoreScorer in the ConstantScoreQuery in order to early terminate 
queries that don't need to compute the total number of hits.
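
A minimal sketch of why ConstantScoreScorer can terminate early (illustrative; `score` and `it` stand in for the scorer's constant score and its iterator):

{code:java}
// Illustrative: every hit has the same constant score, so once the minimum
// competitive score exceeds it, no remaining document can be competitive.
@Override
public void setMinCompetitiveScore(float minCompetitiveScore) {
  if (minCompetitiveScore > score) {
    it = DocIdSetIterator.empty(); // exhausted: collection can stop
  }
}
{code}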



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8631) How Nori Tokenizer can deal with Longest-Matching

2019-03-12 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8631.
--
   Resolution: Fixed
Fix Version/s: 8.1
   master (9.0)

Thanks [~gritmind]!

> How Nori Tokenizer can deal with Longest-Matching
> -
>
> Key: LUCENE-8631
> URL: https://issues.apache.org/jira/browse/LUCENE-8631
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Yeongsu Kim
>Priority: Major
> Fix For: master (9.0), 8.1
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I think... Nori tokenizer has one issue. 
> I don’t understand why “Longest-Matching” is NOT working in the Nori tokenizer 
> via config mode (config mode: 
> https://www.elastic.co/guide/en/elasticsearch/plugins/6.x/analysis-nori-tokenizer.html).
>  
> Here is an example for explaining what is longest-matching.
> Let's assume we have a `userdict_ko.txt` containing only three Korean words, 
> ‘골드’, ‘브라운’, and ‘골드브라운’, and load it into the Nori analyzer. After the 
> update, we can see that it outputs two tokens, ‘골드’ and ‘브라운’, when the 
> input is ‘골드브라운’. (In English: ‘골드’ means ‘gold’, ‘브라운’ means ‘brown’, and 
> ‘골드브라운’ means ‘goldbrown’)
>  
> With this result, we recognize that “Longest-Matching” is NOT working. If 
> “Longest-Matching” were working, the output would be ‘골드브라운’, which is the 
> longest matching word in the user dictionary.
>  
> Curiously enough, when we add the user dictionary via custom mode (custom 
> mode: https://github.com/jimczi/nori/blob/master/how-to-custom-dict.asciidoc),
>  we found the result is ‘골드브라운’, where ‘Longest-Matching’ is applied. We 
> think the reason is that the trained Mecab engine automatically generates word 
> costs by its own criteria. We hope this mechanism is also applied to config 
> mode.
>  
> Would you tell me how to get “Longest-Matching” via config mode (not custom), 
> or give me some hints (e.g. where to modify the source code) to solve this 
> problem?
>  
> P.S
> Recently, I mailed [~jim.ferenczi], who is a developer of Nori, and 
> received his suggestions:
>    - Add a way to set a score to each new rule (this way you could set up a 
> negative cost for the compound word that is less than the sum of the two 
> single words).
>    - Same as above but the cost is computed from the statistics of the 
> training (like the custom dictionary does when you recompile entirely).
>    - Implement longest-match first in the dictionary.
>  
> Thanks for your support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8652) Add boosting support in the SynonymQuery

2019-03-11 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8652.
--
   Resolution: Fixed
Fix Version/s: master (9.0)
   8.1

I've deprecated the Term[] constructor in branch_8x. Thanks [~jpountz] and 
[~romseygeek]!

> Add boosting support in the SynonymQuery
> 
>
> Key: LUCENE-8652
> URL: https://issues.apache.org/jira/browse/LUCENE-8652
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: 8.1, master (9.0)
>
> Attachments: LUCENE-8652.patch, LUCENE-8652.patch
>
>
> The SynonymQuery tries to score multiple terms as if you had indexed them as 
> one term.
> This is good for "true" synonyms where each term should have the same 
> contribution to the final score but this doesn't handle the case where terms 
> have different weights. For scoring purposes it would be nice to be able to 
> assign a boost per term that we could multiply with the term's document 
> frequency in order to take into account the importance of the term within the 
> synonym list.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8717) Handle stop words that appear at articulation points

2019-03-08 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787785#comment-16787785
 ] 

Jim Ferenczi commented on LUCENE-8717:
--

While I understand why we'd need the deleted attribute, I am a bit reluctant to 
add another attribute just to handle deletion in graphs correctly. We'd need to 
change all consumers of TokenStreams (which you partially do in the patch) to 
handle the new attribute, and IMO it feels weird to introduce a deleted 
attribute if we don't use it all the time (for non-articulation points). I 
wonder if we should try to resurrect https://issues.apache.org/jira/browse/LUCENE-5012 
to make graph token streams a first-class citizen of the analysis stream? It 
could be a nice project for Lucene 9.

> Handle stop words that appear at articulation points
> 
>
> Key: LUCENE-8717
> URL: https://issues.apache.org/jira/browse/LUCENE-8717
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8717.patch
>
>
> Our set of TokenFilters currently cannot handle the case where a multi-term 
> synonym starts with a stopword.  This means that given a synonym file 
> containing the mapping "the walking dead => twd" and a standard english 
> stopword filter, QueryBuilder will produce incorrect queries.
> The tricky part here is that our standard way of dealing with stopwords, 
> which is to just remove them entirely from the token stream and use a larger 
> position increment on subsequent tokens, doesn't work when the removed token 
> also has a position length greater than 1.  There are various tricks you can 
> do to increment position length on the previous token, but this doesn't work 
> if the stopword is the first token in the token stream, or if there are 
> multiple stopwords in the side path.
> Instead, I'd like to propose adding a new TermDeletedAttribute, which we only 
> use on tokens that should be removed from the stream but which hold necessary 
> information about the structure of the token graph.  These tokens can then be 
> removed by GraphTokenStreamFiniteStrings at query time, and by 
> FlattenGraphFilter at index time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13299) Solr 8 RC2 does not start on Windows with SSL/TLS enabled on Java 8

2019-03-07 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved SOLR-13299.
-
   Resolution: Fixed
Fix Version/s: master (9.0)

Thanks [~thetaphi]!

> Solr 8 RC2 does not start on Windows with SSL/TLS enabled on Java 8
> ---
>
> Key: SOLR-13299
> URL: https://issues.apache.org/jira/browse/SOLR-13299
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Server
>Affects Versions: 8.0
> Environment: Windows 10
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Blocker
>  Labels: Java8
> Fix For: 8.0, master (9.0)
>
> Attachments: SOLR-13299.patch
>
>
> When trying to start the Solr 8 release candidate with Java 8 having SSL/TLS 
> enabled, the Windows solr.cmd startup script fails with ALPN not found, 
> complaining about the ALPN JAR file which only works with Java 9+:
> {noformat}
> Waiting up to 30 to see Solr running on port 8983
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.eclipse.jetty.start.Main.invokeMain(Main.java:220)
> at org.eclipse.jetty.start.Main.start(Main.java:490)
> at org.eclipse.jetty.start.Main.main(Main.java:77)
> Caused by: java.security.PrivilegedActionException: 
> java.lang.reflect.InvocationTargetException
> at java.security.AccessController.doPrivileged(Native Method)
> at 
> org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1511)
> ... 7 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.eclipse.jetty.util.TypeUtil.construct(TypeUtil.java:663)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.newObj(XmlConfiguration.java:858)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.itemValue(XmlConfiguration.java:1309)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.value(XmlConfiguration.java:1214)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.newArray(XmlConfiguration.java:936)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.itemValue(XmlConfiguration.java:1313)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.value(XmlConfiguration.java:1214)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.newObj(XmlConfiguration.java:842)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.itemValue(XmlConfiguration.java:1309)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.value(XmlConfiguration.java:1214)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.access$500(XmlConfiguration.java:326)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration$AttrOrElementNode.getList(XmlConfiguration.java:1442)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration$AttrOrElementNode.getList(XmlConfiguration.java:1417)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.call(XmlConfiguration.java:780)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.configure(XmlConfiguration.java:472)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$JettyXmlConfiguration.configure(XmlConfiguration.java:413)
> at 
> org.eclipse.jetty.xml.XmlConfiguration.configure(XmlConfiguration.java:311)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1558)
> at 
> org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1512)
> ... 9 more
> Caused by: java.lang.IllegalStateException: No Server ALPNProcessors!
> at 
> org.eclipse.jetty.alpn.server.ALPNServerConnectionFactory.(ALPNServerConnectionFactory.java:53)
>... 32 more
> Suppressed: java.lang.UnsupportedClassVersionError: 
> org/eclipse/jetty/alpn/java/server/JDK9ServerALPNProcessor has been compiled 
> 

[jira] [Resolved] (LUCENE-8686) TestTaxonomySumValueSource.testRandom Failure

2019-02-20 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8686.
--
   Resolution: Fixed
Fix Version/s: master (9.0)
   8.0
   7.7.1

> TestTaxonomySumValueSource.testRandom Failure
> -
>
> Key: LUCENE-8686
> URL: https://issues.apache.org/jira/browse/LUCENE-8686
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.x
>Reporter: Nicholas Knize
>Priority: Major
> Fix For: 7.7.1, 8.0, master (9.0)
>
> Attachments: LUCENE-8686.patch
>
>
> Reproducible test failure:
> {{NOTE: reproduce with: ant test  -Dtestcase=TestTaxonomyFacetSumValueSource 
> -Dtests.method=testRandom -Dtests.seed=7F625DA1A252DF8E -Dtests.slow=true 
> -Dtests.badapples=true -Dtests.locale=hu -Dtests.timezone=America/Boa_Vista 
> -Dtests.asserts=true -Dtests.file.encoding=UTF8}}
> Stacktrace:
> {code:java}
> 10:45:46[junit4] FAILURE 0.24s J3 | 
> TestTaxonomyFacetSumValueSource.testRandom <<<
> 10:45:46[junit4]> Throwable #1: java.lang.AssertionError: 
> expected:<4> but was:<3>
> 10:45:46[junit4]> at 
> __randomizedtesting.SeedInfo.seed([7F625DA1A252DF8E:D2E78AE133269FD]:0)
> 10:45:46[junit4]> at 
> org.apache.lucene.facet.FacetTestCase.assertFloatValuesEquals(FacetTestCase.java:200)
> 10:45:46[junit4]> at 
> org.apache.lucene.facet.FacetTestCase.assertFloatValuesEquals(FacetTestCase.java:193)
> 10:45:46[junit4]> at 
> org.apache.lucene.facet.taxonomy.TestTaxonomyFacetSumValueSource.testRandom(TestTaxonomyFacetSumValueSource.java:477)
> 10:45:46[junit4]> at java.lang.Thread.run(Thread.java:748)
> 10:45:46[junit4]   2> NOTE: test params are: codec=Asserting(Lucene80): 
> {$full_path$=PostingsFormat(name=Direct), 
> $facets=PostingsFormat(name=LuceneVarGapDocFreqInterval), 
> $payloads$=PostingsFormat(name=Direct), 
> f=PostingsFormat(name=LuceneFixedGap), 
> $facets2=PostingsFormat(name=LuceneFixedGap), 
> $b=PostingsFormat(name=LuceneFixedGap), 
> content=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128)))},
>  docValues:{$facets=DocValuesFormat(name=Asserting), 
> price=DocValuesFormat(name=Lucene80), num=DocValuesFormat(name=Direct), 
> $facets2=DocValuesFormat(name=Direct), value=DocValuesFormat(name=Lucene80), 
> $b=DocValuesFormat(name=Direct)}, maxPointsInLeafNode=1902, 
> maxMBSortInHeap=6.164841106101889, 
> sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@27ff3281),
>  locale=hu, timezone=America/Boa_Vista
> 10:45:46[junit4]   2> NOTE: Linux 4.15.0-1027-gcp amd64/Oracle 
> Corporation 1.8.0_191 
> (64-bit)/cpus=16,threads=1,free=394275096,total=523239424
> 10:45:46[junit4]   2> NOTE: All tests run in this JVM: [TestFacetsConfig, 
> TestRandomSamplingFacetsCollector, TestTaxonomyFacetSumValueSource]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8701) Speed up ToParentBlockJoinQuery when total hit count is not needed

2019-02-19 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8701:


 Summary: Speed up ToParentBlockJoinQuery when total hit count is 
not needed
 Key: LUCENE-8701
 URL: https://issues.apache.org/jira/browse/LUCENE-8701
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


We spotted a regression on nested queries in the Elasticsearch nightly track:
https://elasticsearch-benchmarks.elastic.co/index.html#tracks/nested/nightly/30d
It seems related to the fact that we propagate the TOP_SCORES score mode to the 
child query even though we don't compute a max score in the BlockJoinScorer and 
don't propagate the minimum score either. Since it is not possible to compute a 
max score for a document that depends on other documents (the children) we 
should probably force the score mode to COMPLETE to build the child scorer. 
This should avoid the overhead of loading and reading the impacts. It should 
also be possible to early terminate queries that use the ScoreMode.None mode 
since in this case the score of each parent document is the same.
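
A rough sketch of what forcing the score mode could look like when building the child weight (illustrative; `childQuery`, `searcher`, `boost` and `scoreMode` stand in for the surrounding ToParentBlockJoinQuery code):

{code:java}
// Illustrative: a parent's score depends on its children, so a per-block max
// score cannot be computed; force COMPLETE (or COMPLETE_NO_SCORES) instead of
// propagating TOP_SCORES, avoiding the cost of loading and reading impacts.
ScoreMode childScoreMode = scoreMode.needsScores()
    ? ScoreMode.COMPLETE
    : ScoreMode.COMPLETE_NO_SCORES;
Weight childWeight = childQuery.createWeight(searcher, childScoreMode, boost);
{code}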



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8697) GraphTokenStreamFiniteStrings does not correctly handle gaps in the token graph

2019-02-18 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771386#comment-16771386
 ] 

Jim Ferenczi commented on LUCENE-8697:
--

The patch looks good. Note that while the patch fixes a real issue, it doesn't 
solve the bug reported in https://issues.apache.org/jira/browse/LUCENE-8250. 
Since the issues are different, I am +1 to push this patch as is and to work on 
LUCENE-8250 in a follow-up.

> GraphTokenStreamFiniteStrings does not correctly handle gaps in the token 
> graph
> ---
>
> Key: LUCENE-8697
> URL: https://issues.apache.org/jira/browse/LUCENE-8697
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8697.patch
>
>
> Currently, side-paths with gaps in can end up being missed entirely when 
> iterating through token streams.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8693) nori > special(symbol) characters issue

2019-02-13 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766963#comment-16766963
 ] 

Jim Ferenczi commented on LUCENE-8693:
--

Can you give an example and explain why you need to keep these "special" 
characters? The tokenizer removes punctuation, so it's expected that some 
characters are filtered, but they shouldn't be needed in the index. I'd like 
to understand the issue better.

> nori > special(symbol) characters issue
> ---
>
> Key: LUCENE-8693
> URL: https://issues.apache.org/jira/browse/LUCENE-8693
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.6
>Reporter: YOO JEONGIN
>Priority: Major
>
> Hi
> I'm using the "nori" analyzer.
> Whether it's an error or an intentional question.
> All special characters are filtered.
> Special characters stored in the dictionary are also filtered.
> How do I print special characters?
> thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8681) Prorated early termination

2019-02-06 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761583#comment-16761583
 ] 

Jim Ferenczi commented on LUCENE-8681:
--

For the normal case where a single top docs collector is used and segments are 
consumed sequentially, we can early terminate any segment on the first 
non-competitive hit once the queue is already full. So most of the time we 
don't need to collect N documents per segment in the index-sorted case. I guess 
this optimization can be useful when each segment is consumed in a different 
thread and uses a different priority queue. 
However, I wonder if this could be implemented directly in a custom 
CollectorManager: currently the CollectorManager creates a Collector for each 
leaf if the executor is not null and merges all the collectors at the end. 
Since each collector is independent, you could set a different topN size based 
on the segment statistics and the heuristic you described here. The only change 
that is missing to achieve this is to provide the reader context when the 
CollectorManager creates a collector (CollectorManager#newCollector); with this 
information you could create a different top docs collector per leaf and merge 
them at the end with the expected topN size. A rough sketch of that idea 
follows.
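
The sketch below assumes the hypothetical leaf-aware newCollector(LeafReaderContext) signature discussed above; the per-leaf topN formula (a proportional share plus a slack term) is only illustrative, not the error bound from the patch:

{code:java}
// Illustrative sketch: prorate the per-leaf topN by segment size, with a
// hypothetical slack term standing in for the proposal's error bound.
class ProratedTopDocsManager {
  private final Sort sort;
  private final int globalTopN;
  private final int indexMaxDoc; // total number of docs in the index

  ProratedTopDocsManager(Sort sort, int globalTopN, int indexMaxDoc) {
    this.sort = sort;
    this.globalTopN = globalTopN;
    this.indexMaxDoc = indexMaxDoc;
  }

  // hypothetical leaf-aware variant of CollectorManager#newCollector
  TopFieldCollector newCollector(LeafReaderContext context) {
    int leafDocs = context.reader().maxDoc();
    int leafTopN = (int) Math.ceil((double) globalTopN * leafDocs / indexMaxDoc
        + 3 * Math.sqrt(globalTopN)); // illustrative error bound
    return TopFieldCollector.create(sort,
        Math.min(globalTopN, Math.max(1, leafTopN)), Integer.MAX_VALUE);
  }

  TopFieldDocs reduce(Collection<TopFieldCollector> collectors) {
    TopFieldDocs[] perLeaf = collectors.stream()
        .map(c -> (TopFieldDocs) c.topDocs())
        .toArray(TopFieldDocs[]::new);
    return TopDocs.merge(sort, globalTopN, perLeaf);
  }
}
{code}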


> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCOllector.create(Sort, int, FieldDoc, int 
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8676) TestKoreanTokenizer#testRandomHugeStrings failure

2019-02-01 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8676.
--
   Resolution: Fixed
Fix Version/s: 7.7
   8.0

> TestKoreanTokenizer#testRandomHugeStrings failure
> -
>
> Key: LUCENE-8676
> URL: https://issues.apache.org/jira/browse/LUCENE-8676
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jim Ferenczi
>Priority: Major
> Fix For: 8.0, 7.7
>
> Attachments: LUCENE-8676.patch
>
>
> TestKoreanTokenizer#testRandomHugeStrings failed in CI with the following 
> exception:
> {noformat}
>   [junit4]> Throwable #1: java.lang.AssertionError
>[junit4]>at 
> __randomizedtesting.SeedInfo.seed([8C5E2BE10F581CB:90E6857D4E833D83]:0)
>[junit4]>at 
> org.apache.lucene.analysis.ko.KoreanTokenizer.add(KoreanTokenizer.java:334)
>[junit4]>at 
> org.apache.lucene.analysis.ko.KoreanTokenizer.parse(KoreanTokenizer.java:707)
>[junit4]>at 
> org.apache.lucene.analysis.ko.KoreanTokenizer.incrementToken(KoreanTokenizer.java:377)
>[junit4]>at 
> org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:748)
>[junit4]>at 
> org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:659)
>[junit4]>at 
> org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:561)
>[junit4]>at 
> org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:474)
>[junit4]>at 
> org.apache.lucene.analysis.ko.TestKoreanTokenizer.testRandomHugeStrings(TestKoreanTokenizer.java:313)
>[junit4]>at java.lang.Thread.run(Thread.java:748)
>[junit4]   2> NOTE: leaving temporary files
> {noformat}
> I am able to reproduce locally with:
> {noformat}
> ant test  -Dtestcase=TestKoreanTokenizer -Dtests.method=testRandomHugeStrings 
> -Dtests.seed=8C5E2BE10F581CB -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true 
> -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-7.7/test-data/enwiki.random.lines.txt
>  -Dtests.locale=uk-UA -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
> {noformat}
> After some investigation I found out that the position of the buffer is not 
> updated when the maximum backtrace size is reached (1024).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8676) TestKoreanTokenizer#testRandomHugeStrings failure

2019-01-31 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8676:


 Summary: TestKoreanTokenizer#testRandomHugeStrings failure
 Key: LUCENE-8676
 URL: https://issues.apache.org/jira/browse/LUCENE-8676
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Jim Ferenczi


TestKoreanTokenizer#testRandomHugeStrings failed in CI with the following exception:

{noformat}
  [junit4]> Throwable #1: java.lang.AssertionError
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([8C5E2BE10F581CB:90E6857D4E833D83]:0)
   [junit4]>at 
org.apache.lucene.analysis.ko.KoreanTokenizer.add(KoreanTokenizer.java:334)
   [junit4]>at 
org.apache.lucene.analysis.ko.KoreanTokenizer.parse(KoreanTokenizer.java:707)
   [junit4]>at 
org.apache.lucene.analysis.ko.KoreanTokenizer.incrementToken(KoreanTokenizer.java:377)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:748)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:659)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:561)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:474)
   [junit4]>at 
org.apache.lucene.analysis.ko.TestKoreanTokenizer.testRandomHugeStrings(TestKoreanTokenizer.java:313)
   [junit4]>at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> NOTE: leaving temporary files
{noformat}

I am able to reproduce locally with:

{noformat}
ant test  -Dtestcase=TestKoreanTokenizer -Dtests.method=testRandomHugeStrings 
-Dtests.seed=8C5E2BE10F581CB -Dtests.multiplier=2 -Dtests.nightly=true 
-Dtests.slow=true 
-Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-7.7/test-data/enwiki.random.lines.txt
 -Dtests.locale=uk-UA -Dtests.timezone=Europe/Istanbul -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
{noformat}

After some investigation I found out that the position of the buffer is not 
updated when the maximum backtrace size is reached (1024).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8665) Add temporary code in TestBackwardsCompatibility to handle two concurrent releases

2019-01-29 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8665.
--
Resolution: Won't Fix

Fair enough. I reverted the version bump on branch_8x and will focus on 
releasing 7.7 first

> Add temporary code in TestBackwardsCompatibility to handle two concurrent 
> releases
> --
>
> Key: LUCENE-8665
> URL: https://issues.apache.org/jira/browse/LUCENE-8665
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jim Ferenczi
>Priority: Major
>
> Today TestBackwardsCompatibility can handle a single release at a time 
> because TestBackwardsCompatibility#testAllVersionsTested is lenient on the 
> latest version only (the one that is released). However, since we want to 
> release two versions simultaneously (7.7 and 8.0), this test is failing on 
> branch_8x. This means that we need to do one release at a time or add more 
> leniency in the test to handle this special case. We could for instance add 
> something like:
> {noformat}
> // NORELEASE: we have two releases in progress (7.7.0 and 8.0.0) so we 
> could be missing
> // 2 files, 1 for 7.7.0 and one for 8.0.0. This should be removed when 
> 7.7.0 is released.
> if (extraFiles.isEmpty() && missingFiles.size() == 2 && 
> missingFiles.contains("7.7.0-cfs") && missingFiles.contains("8.0.0-cfs")) {
>   // success
>   return;
> }
> {noformat}
> and remove the code when 7.7.0 is released ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8665) Add temporary code in TestBackwardsCompatibility to handle two concurrent releases

2019-01-29 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8665:
-
Description: 
Today TestBackwardsCompatibility can handle a single release at a time because 
TestBackwardsCompatibility#testAllVersionsTested is lenient on the latest 
version only (the one that is released). However, since we want to release 
two versions simultaneously (7.7 and 8.0), this test is failing on branch_8x. 
This means that we need to do one release at a time or add more leniency in the 
test to handle this special case. We could for instance add something like:

{noformat}
// NORELEASE: we have two releases in progress (7.7.0 and 8.0.0) so we 
could be missing
// 2 files, 1 for 7.7.0 and one for 8.0.0. This should be removed when 
7.7.0 is released.
if (extraFiles.isEmpty() && missingFiles.size() == 2 && 
missingFiles.contains("7.7.0-cfs") && missingFiles.contains("8.0.0-cfs")) {
  // success
  return;
}
{noformat}

and remove the code when 7.7.0 is released ?

  was:
Today TestBackwardsCompatibility can handle a single release at a time because 
TestBackwardsCompatibility#testAllVersionsTested is lenient on the latest 
version only (the one that is released). However, since we want to release 
two versions simultaneously (7.7 and 8.0), this test is failing on branch_8x. 
This means that we need to do one release at a time or add more leniency in the 
test to handle this special case. We could for instance add something like:

{noformat}
// NORELEASE: we have two releases in progress (7.7.0 and 8.0.0) so we 
could be missing
// 2 files, 1 for 7.7.0 and one for 8.0.0. This should be removed when 
7.7.0 is released.
if (extraFiles.isEmpty() && missingFiles.size() == 2 && 
missingFiles.contains("7.7.0-cfs") && missingFiles.contains("8.0.0-cfs")) {
  // success
  return;
}
{noformat}

and remove the code when 7.6.0 is released ?


> Add temporary code in TestBackwardsCompatibility to handle two concurrent 
> releases
> --
>
> Key: LUCENE-8665
> URL: https://issues.apache.org/jira/browse/LUCENE-8665
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jim Ferenczi
>Priority: Major
>
> Today TestBackwardsCompatibility can handle a single release at a time 
> because TestBackwardsCompatibility#testAllVersionsTested is lenient on the 
> latest version only (the one that is released). However, since we want to 
> release two versions simultaneously (7.7 and 8.0), this test is failing on 
> branch_8x. This means that we need to do one release at a time or add more 
> leniency in the test to handle this special case. We could for instance add 
> something like:
> {noformat}
> // NORELEASE: we have two releases in progress (7.7.0 and 8.0.0) so we 
> could be missing
> // 2 files, 1 for 7.7.0 and one for 8.0.0. This should be removed when 
> 7.7.0 is released.
> if (extraFiles.isEmpty() && missingFiles.size() == 2 && 
> missingFiles.contains("7.7.0-cfs") && missingFiles.contains("8.0.0-cfs")) {
>   // success
>   return;
> }
> {noformat}
> and remove the code when 7.7.0 is released ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8665) Add temporary code in TestBackwardsCompatibility to handle two concurrent releases

2019-01-29 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8665:


 Summary: Add temporary code in TestBackwardsCompatibility to 
handle two concurrent releases
 Key: LUCENE-8665
 URL: https://issues.apache.org/jira/browse/LUCENE-8665
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Jim Ferenczi


Today TestBackwardsCompatibility can handle a single release at a time because 
TestBackwardsCompatibility#testAllVersionsTested is lenient on the latest 
version only (the one that is released). However, since we want to release 
two versions simultaneously (7.7 and 8.0), this test is failing on branch_8x. 
This means that we need to do one release at a time or add more leniency in the 
test to handle this special case. We could for instance add something like:

{noformat}
// NORELEASE: we have two releases in progress (7.7.0 and 8.0.0) so we 
could be missing
// 2 files, 1 for 7.7.0 and one for 8.0.0. This should be removed when 
7.7.0 is released.
if (extraFiles.isEmpty() && missingFiles.size() == 2 && 
missingFiles.contains("7.7.0-cfs") && missingFiles.contains("8.0.0-cfs")) {
  // success
  return;
}
{noformat}

and remove the code when 7.6.0 is released ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8660) Include totalHitsThreshold when tracking total hits in TopDocsCollector

2019-01-29 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8660.
--
   Resolution: Fixed
Fix Version/s: 8.0

> Include totalHitsThreshold when tracking total hits in TopDocsCollector
> ---
>
> Key: LUCENE-8660
> URL: https://issues.apache.org/jira/browse/LUCENE-8660
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: 8.0
>
> Attachments: LUCENE-8660.patch, LUCENE-8660.patch
>
>
> Today the total hits threshold in the top docs collector is not inclusive; 
> this means that total hits are tracked up to totalHitsThreshold-1. After 
> discussing with @jpountz we agreed that it is not intuitive to return a 
> lower bound that is equal to totalHitsThreshold even if the count is accurate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8660) Include totalHitsThreshold when tracking total hits in TopDocsCollector

2019-01-28 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8660:
-
Description: Today the total hits threshold in the top docs collector is 
not inclusive; this means that total hits are tracked up to 
totalHitsThreshold-1. After discussing with @jpountz we agreed that it is not 
intuitive to return a lower bound that is equal to totalHitsThreshold even if 
the count is accurate.  (was: Today the total hits threshold in the top docs 
collector is not inclusive; this means that total hits are tracked up to 
totalHitsThreshold-1. After discussing with @jpountz we agreed that it is not 
intuitive to return a lower bound that is equal to totalHitsThreshold even if 
the count is accurate.
This also means that a threshold of Integer.MAX_VALUE will always return 
accurate total hits even if the number of matching docs is equal to 
(Integer.MAX_VALUE-1).)

> Include totalHitsThreshold when tracking total hits in TopDocsCollector
> ---
>
> Key: LUCENE-8660
> URL: https://issues.apache.org/jira/browse/LUCENE-8660
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8660.patch
>
>
> Today the total hits threshold in the top docs collector is not inclusive, 
> which means that total hits are tracked up to totalHitsThreshold-1. After 
> discussing with @jpountz we agreed that it is not intuitive to return a 
> lower bound equal to totalHitsThreshold when the count is in fact accurate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8660) Include totalHitsThreshold when tracking total hits in TopDocsCollector

2019-01-28 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8660:


 Summary: Include totalHitsThreshold when tracking total hits in 
TopDocsCollector
 Key: LUCENE-8660
 URL: https://issues.apache.org/jira/browse/LUCENE-8660
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Jim Ferenczi


Today the total hits threshold in the top docs collector is not inclusive, 
which means that total hits are tracked up to totalHitsThreshold-1. After 
discussing with @jpountz we agreed that it is not intuitive to return a lower 
bound equal to totalHitsThreshold when the count is in fact accurate.
This also means that a threshold of Integer.MAX_VALUE will always return 
accurate total hits even if the number of matching docs is equal to 
Integer.MAX_VALUE-1.
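
For illustration, here is a minimal sketch of how this behaves for a caller, 
assuming the Lucene 8.x TopScoreDocCollector and TotalHits APIs (the searcher 
and query setup are left out and hypothetical):

{noformat}
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.TotalHits;

static TopDocs searchWithThreshold(IndexSearcher searcher, Query query)
    throws IOException {
  // Collect the top 10 docs and count hits exactly up to the threshold
  // (inclusive, with this change); beyond it only a lower bound is kept.
  TopScoreDocCollector collector = TopScoreDocCollector.create(10, 1000);
  searcher.search(query, collector);
  TopDocs topDocs = collector.topDocs();
  if (topDocs.totalHits.relation == TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO) {
    // totalHits.value is a lower bound, not an exact count
  }
  return topDocs;
}
{noformat}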



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8658) Illegal assertion in WANDScorer

2019-01-24 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751485#comment-16751485
 ] 

Jim Ferenczi commented on LUCENE-8658:
--

+1, thanks for fixing Adrien!

> Illegal assertion in WANDScorer
> ---
>
> Key: LUCENE-8658
> URL: https://issues.apache.org/jira/browse/LUCENE-8658
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8658.patch
>
>
> [~jim.ferenczi] told me about an assertion error that he ran into while 
> playing with WANDScorer.
> WANDScorer tries to avoid accuracy issues with floating-point numbers. In 
> order to do this, it turns all scores into integers by 
> multiplying them by a scaling factor, and then rounding minimum competitive 
> scores down and rounding maximum scores up. This scaling factor is computed 
> in the constructor in such a way that scores end up in the 0..65536 range. 
> Sub scorers that have a maximum score of +Infty are ignored.
> The assertion is triggered in the rare case that a Scorer returns +Infty for 
> its maximum score when computing the scaling factor but then returns finite 
> values that are greater than the maximum scores of other clauses when asked 
> for the maximum score over smaller ranges of doc ids.
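
A hedged sketch of the scaling idea described above (the real WANDScorer code 
differs in its details; these helpers are illustrative only):

{noformat}
// Pick a power-of-two scaling factor so that the largest finite clause max
// score lands in [2^15, 2^16) once scaled; clauses whose max score is
// +Infinity are ignored when computing the factor.
static int scalingFactor(float maxScore) {
  assert maxScore > 0 && Float.isFinite(maxScore);
  return 15 - Math.getExponent(maxScore);
}

// Round max scores up so the scaled maximum never underestimates the truth.
static long scaleMaxScore(float score, int factor) {
  return (long) Math.ceil(score * Math.pow(2, factor));
}

// Round minimum competitive scores down so no competitive hit is skipped.
static long scaleMinCompetitiveScore(float score, int factor) {
  return (long) Math.floor(score * Math.pow(2, factor));
}
{noformat}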



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8216) Better cross-field scoring

2019-01-21 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8216:
-
Attachment: (was: LUCENE-8652.patch)

> Better cross-field scoring
> --
>
> Key: LUCENE-8216
> URL: https://issues.apache.org/jira/browse/LUCENE-8216
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Major
> Attachments: LUCENE-8216.patch, LUCENE-8216.patch
>
>
> I'd like Lucene to have better support for scoring across multiple fields. 
> Today we have BlendedTermQuery, which tries to help here but probably does 
> too much on some aspects (handling cross-field term queries AND synonyms) 
> and too little on others (it tries to merge index-level statistics, but not 
> per-document statistics like tf and norm).
> Maybe we could implement something like BM25F so that queries across multiple 
> fields would retain the benefits of BM25, such as the fact that the impact of 
> term frequency saturates quickly, which is not the case with BlendedTermQuery 
> if you have occurrences across many fields.
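
For readers unfamiliar with BM25F, a hedged sketch of the idea mentioned above 
(illustrative only, not Lucene's implementation): per-field term frequencies 
are blended into a single pseudo-frequency before the BM25 saturation is 
applied once, so occurrences spread across many fields cannot inflate the 
score without bound.

{noformat}
// Simplified BM25F: fieldLengthNorms is assumed to already fold in the
// per-field length normalization (the "b" parameter of BM25).
static float bm25f(float[] fieldFreqs, float[] fieldBoosts,
                   float[] fieldLengthNorms, float idf, float k1) {
  float blendedFreq = 0;
  for (int i = 0; i < fieldFreqs.length; i++) {
    // boost and length-normalize each field's tf, then sum across fields
    blendedFreq += fieldBoosts[i] * fieldFreqs[i] / fieldLengthNorms[i];
  }
  // a single BM25 saturation over the blended frequency
  return idf * blendedFreq / (blendedFreq + k1);
}
{noformat}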



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8216) Better cross-field scoring

2019-01-21 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8216:
-
Fix Version/s: 8.0

> Better cross-field scoring
> --
>
> Key: LUCENE-8216
> URL: https://issues.apache.org/jira/browse/LUCENE-8216
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Major
> Fix For: 8.0
>
> Attachments: LUCENE-8216.patch, LUCENE-8216.patch
>
>
> I'd like Lucene to have better support for scoring across multiple fields. 
> Today we have BlendedTermQuery, which tries to help here but probably does 
> too much on some aspects (handling cross-field term queries AND synonyms) 
> and too little on others (it tries to merge index-level statistics, but not 
> per-document statistics like tf and norm).
> Maybe we could implement something like BM25F so that queries across multiple 
> fields would retain the benefits of BM25, such as the fact that the impact of 
> term frequency saturates quickly, which is not the case with BlendedTermQuery 
> if you have occurrences across many fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Deleted] (LUCENE-8216) Better cross-field scoring

2019-01-21 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8216:
-
Comment: was deleted

(was: Here is a patch that adds a way to boost each term individually in the 
synonym query.
The sole constructor has been replaced with a SynonymQuery.Builder that can 
add boosted terms. The boost must be greater than 0 and at most 1 in order to 
ensure that document term frequencies do not go beyond the total term 
frequency. )

> Better cross-field scoring
> --
>
> Key: LUCENE-8216
> URL: https://issues.apache.org/jira/browse/LUCENE-8216
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Major
> Attachments: LUCENE-8216.patch, LUCENE-8216.patch
>
>
> I'd like Lucene to have better support for scoring across multiple fields. 
> Today we have BlendedTermQuery, which tries to help here but probably does 
> too much on some aspects (handling cross-field term queries AND synonyms) 
> and too little on others (it tries to merge index-level statistics, but not 
> per-document statistics like tf and norm).
> Maybe we could implement something like BM25F so that queries across multiple 
> fields would retain the benefits of BM25, such as the fact that the impact of 
> term frequency saturates quickly, which is not the case with BlendedTermQuery 
> if you have occurrences across many fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8652) Add boosting support in the SynonymQuery

2019-01-21 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8652:


 Summary: Add boosting support in the SynonymQuery
 Key: LUCENE-8652
 URL: https://issues.apache.org/jira/browse/LUCENE-8652
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Jim Ferenczi


The SynonymQuery tries to score multiple terms as if you had indexed them as 
one term.
This is good for "true" synonyms, where each term should make the same 
contribution to the final score, but it doesn't handle the case where terms 
have different weights. For scoring purposes it would be nice to be able to 
assign a boost per term that we could multiply with the term's document 
frequency in order to take into account the importance of the term within the 
synonym list.
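
A minimal sketch of what such an API could look like, assuming a hypothetical 
SynonymQuery.Builder with per-term boosts (this builder and its addTerm 
signature are assumptions for illustration, not part of the issue text):

{noformat}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.SynonymQuery;

// "tv" is treated as a weaker synonym of "television". Boosts must stay in
// (0, 1] so that the blended document frequency never exceeds the frequency
// of the most important term.
SynonymQuery query = new SynonymQuery.Builder("body")
    .addTerm(new Term("body", "television"), 1.0f)
    .addTerm(new Term("body", "tv"), 0.6f)
    .build();
{noformat}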



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8633) Remove term weighting from interval scoring

2019-01-16 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743886#comment-16743886
 ] 

Jim Ferenczi commented on LUCENE-8633:
--

Thanks Alan, +1 the patch looks good.

> Remove term weighting from interval scoring
> ---
>
> Key: LUCENE-8633
> URL: https://issues.apache.org/jira/browse/LUCENE-8633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8633.patch, LUCENE-8633.patch
>
>
> IntervalScorer currently uses the same scoring mechanism as SpanScorer, 
> summing the IDF of all possibly matching terms from its parent 
> IntervalsSource and using that in conjunction with a sloppy frequency to 
> produce a similarity-based score.  This doesn't really make sense, however, 
> as it means that terms that don't appear in a document can still contribute 
> to the score, and appears to make scores from interval queries comparable 
> with scores from term or phrase queries when they really aren't.
> I'd like to explore a different scoring mechanism for intervals, based purely 
> on sloppy frequency and ignoring term weighting.  This should make the scores 
> easier to reason about, as well as making them useful for things like 
> proximity boosting on boolean queries.
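
A hedged sketch of the frequency-only idea proposed above (illustrative only, 
not the actual IntervalScorer code):

{noformat}
// Each matching interval contributes a sloppy frequency that decays with the
// interval's width; no term weight such as IDF enters the score.
static float sloppyFreq(int intervalStart, int intervalEnd) {
  return 1f / (1f + (intervalEnd - intervalStart));
}

// The document score is purely a function of the summed sloppy frequency.
static float score(int[][] intervals) { // each row is {start, end}
  float freq = 0;
  for (int[] interval : intervals) {
    freq += sloppyFreq(interval[0], interval[1]);
  }
  return freq;
}
{noformat}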



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


