subject:"\[jira\] \[Updated\] \(LUCENE\-5030\) FuzzySuggester has to operate FSTs of Unicode\-letters, not UTF\-8, to work correctly for 1\-byte \(like English\) and multi\-byte \(non\-Latin\) letters"

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-07-17 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: LUCENE-5030.patch

Moved the parameter from AnalyzingLookupFactory to FuzzyLookupFactory

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
Assignee: Michael McCandless
 Fix For: 5.0, 4.4

 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, LUCENE-5030.patch, LUCENE-5030.patch, 
 LUCENE-5030.patch, LUCENE-5030.patch, LUCENE-5030.patch, 
 nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch, 
 nonlatin_fuzzySuggester3.patch, nonlatin_fuzzySuggester4.patch, 
 nonlatin_fuzzySuggester_combo1.patch, nonlatin_fuzzySuggester_combo2.patch, 
 nonlatin_fuzzySuggester_combo.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-07-17 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: LUCENE-5030.patch

The code is refactored not to touch AnalyzingSuggester. Please, review.

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
Assignee: Michael McCandless
 Fix For: 5.0, 4.4

 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, LUCENE-5030.patch, LUCENE-5030.patch, 
 LUCENE-5030.patch, LUCENE-5030.patch, LUCENE-5030.patch, LUCENE-5030.patch, 
 nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch, 
 nonlatin_fuzzySuggester3.patch, nonlatin_fuzzySuggester4.patch, 
 nonlatin_fuzzySuggester_combo1.patch, nonlatin_fuzzySuggester_combo2.patch, 
 nonlatin_fuzzySuggester_combo.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-07-04 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: LUCENE-5030.patch

I have renamed the variables in comments and tests for consistency

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
Assignee: Michael McCandless
 Fix For: 5.0, 4.4

 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, LUCENE-5030.patch, LUCENE-5030.patch, 
 LUCENE-5030.patch, LUCENE-5030.patch, nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester_combo1.patch, 
 nonlatin_fuzzySuggester_combo2.patch, nonlatin_fuzzySuggester_combo.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-07-03 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-5030:
---

Attachment: LUCENE-5030.patch

New patch, fixing the linter error, renaming UNICODE_AWARE - 
FUZZY_UNICODE_AWARE, and fixing one compilation warning ... I think it's ready.

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
Assignee: Michael McCandless
 Fix For: 5.0, 4.4

 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, LUCENE-5030.patch, LUCENE-5030.patch, 
 LUCENE-5030.patch, nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester_combo1.patch, 
 nonlatin_fuzzySuggester_combo2.patch, nonlatin_fuzzySuggester_combo.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-07-02 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: LUCENE-5030.patch

The javadocs are fixed.

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
Assignee: Michael McCandless
 Fix For: 5.0, 4.4

 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, LUCENE-5030.patch, LUCENE-5030.patch, 
 nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch, 
 nonlatin_fuzzySuggester3.patch, nonlatin_fuzzySuggester4.patch, 
 nonlatin_fuzzySuggester_combo1.patch, nonlatin_fuzzySuggester_combo2.patch, 
 nonlatin_fuzzySuggester_combo.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-07-01 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-5030:
---

Fix Version/s: 4.4
   5.0

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
Assignee: Michael McCandless
 Fix For: 5.0, 4.4

 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, LUCENE-5030.patch, 
 nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch, 
 nonlatin_fuzzySuggester3.patch, nonlatin_fuzzySuggester4.patch, 
 nonlatin_fuzzySuggester_combo1.patch, nonlatin_fuzzySuggester_combo2.patch, 
 nonlatin_fuzzySuggester_combo.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-27 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: LUCENE-5030.patch

Done. Please, review LUCENE-5030.patch

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, LUCENE-5030.patch, 
 nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch, 
 nonlatin_fuzzySuggester3.patch, nonlatin_fuzzySuggester4.patch, 
 nonlatin_fuzzySuggester_combo1.patch, nonlatin_fuzzySuggester_combo2.patch, 
 nonlatin_fuzzySuggester_combo.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-26 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester_combo1.patch

Sorry, I don't understand, why testStolenBytes worked before. I have restored 
it and now it fails. Can you please suggest, what wrong I did?

As I understood, if we do not preserve the separator, 1 token with a separator 
and 2 tokens (which is actually 1 string with a separator) equals after 
removing the separator in  replaceSep, so we should get 2 results instead of 1 
when we do a lookup. No?

I've added a test for IllegalArgumentException.

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester_combo1.patch, 
 nonlatin_fuzzySuggester_combo.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-26 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester_combo2.patch

I have restored testStolenBytes completely and now all the tests pass.

But I'm not sure, what did you mean by 0xff byte in {code}token(new 
BytesRef(new byte[] {0x61, (byte) 0xff, 0x61})){code}? Letter ÿ or SEP_LABEL?

Now it is treated as letter ÿ, but in the previous modification of the test I 
treated it as SEP_LABEL.

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester_combo1.patch, 
 nonlatin_fuzzySuggester_combo2.patch, nonlatin_fuzzySuggester_combo.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-24 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester_combo.patch

I have uploaded a lucene/solr combo patch with new UNICODE_AWARE option

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester_combo.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-21 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: benchmark-wo_convertion.txt
benchmark-old.txt
benchmark-INFO_SEP.txt

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
 benchmark-wo_convertion.txt, nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-20 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester.patch

I used INFO_SEP and INFO_SEP2 for separators and holes. All the tests pass (I 
have fixed AnalyzingSuggesterTest.testStolenBytes). The benchmark is improved:

{code}[junit4:junit4] Suite: 
org.apache.lucene.search.suggest.LookupBenchmarkTest
[junit4:junit4] Completed in 0.04s, 0 tests
[junit4:junit4] 
[junit4:junit4] JVM J0: 1.64 .. 2.34 = 0.71s
[junit4:junit4] Execution time total: 2.36 sec.
[junit4:junit4] Tests summary: 1 suite, 0 tests
 [echo] 5 slowest tests:
[junit4:tophints]  22.95s | org.apache.lucene.search.spell.TestSpellChecker
[junit4:tophints]  15.08s | org.apache.lucene.search.suggest.fst.TestSort
[junit4:tophints]  13.41s | 
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggesterTest
[junit4:tophints]  11.84s | 
org.apache.lucene.search.suggest.fst.FSTCompletionTest
[junit4:tophints]  10.78s | 
org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest
{code}

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-20 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-5030:
---

Attachment: run-suggest-benchmark.patch

Hi Artem,

Sorry, running the LookupBenchmarkTest is tricky ... you need to make temporary 
changes in 3 places.  I'm attaching a patch that should let you run it by just 
doing ant test -Dtestcase=LookupBenchmarkTest.

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
 run-suggest-benchmark.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-19 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester.patch

now tests in FuzzySuggesterTest and AnalyzingSuggesterTest pass, except for 
AnalyzingSuggesterTest.testRandom (when preserveSep = true).

If I enable VERBOSE, I see, that suggestions are correct. I guess, there is a 
bug in the test, but I cannot find it.

Can you please review?

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester.patch, 
 nonlatin_fuzzySuggester.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-18 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester4.patch

I have fixed testRandom, which repeats the logic of FuzzySuggester.
Now all the tests pass.
Please, review.

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-17 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester1.patch

Now all the tests pass except testRandom when preserveSep is true.

Michael, can you explain me, how this preserve separator feature works?

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-17 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester2.patch

the patch without autoformatting

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-17 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester3.patch

with untouched trailing spaces

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: nonlatin_fuzzySuggester1.patch, 
 nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
 nonlatin_fuzzySuggester.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-13 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Attachment: nonlatin_fuzzySuggester.patch

I've added a test, which demonstrates the bug. I have fixed 
TokenStreamToAutomaton, but I have no idea, how to update AnalyzingSuggester, 
which wants bytes instead of chars (ints, which cannot fit a byte).

 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin
 Attachments: nonlatin_fuzzySuggester.patch


 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

2013-06-03 Thread Artem Lukanin (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-5030:
--

Description: 
There is a limitation in the current FuzzySuggester implementation: it computes 
edits in UTF-8 space instead of Unicode character (code point) space. 

This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
Unicode character space, then fix FuzzySuggester to do the same steps that 
FuzzyQuery does: do the LevN expansion in Unicode character space, then convert 
that automaton to UTF-8, then intersect with the suggest FST.

See the discussion here: 
http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

  was:
There is a limitation in the current FuzzySuggester 
implementation: it computes edits in UTF-8 space instead of Unicode 
character (code point) space. 

This should be fixable: we'd need to fix TokenStreamToAutomaton to 
work in Unicode character space, then fix FuzzySuggester to do the 
same steps that FuzzyQuery does: do the LevN expansion in Unicode 
character space, then convert that automaton to UTF-8, then intersect 
with the suggest FST.

See the discussion here: 
http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none


 FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
 correctly for 1-byte (like English) and multi-byte (non-Latin) letters
 

 Key: LUCENE-5030
 URL: https://issues.apache.org/jira/browse/LUCENE-5030
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3
Reporter: Artem Lukanin

 There is a limitation in the current FuzzySuggester implementation: it 
 computes edits in UTF-8 space instead of Unicode character (code point) 
 space. 
 This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
 Unicode character space, then fix FuzzySuggester to do the same steps that 
 FuzzyQuery does: do the LevN expansion in Unicode character space, then 
 convert that automaton to UTF-8, then intersect with the suggest FST.
 See the discussion here: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

20 matches

Site Navigation

Mail list logo

Footer information