[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: LUCENE-5030.patch

Moved the parameter from AnalyzingLookupFactory to FuzzyLookupFactory.

FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

    Key: LUCENE-5030
    URL: https://issues.apache.org/jira/browse/LUCENE-5030
    Project: Lucene - Core
    Issue Type: Bug
    Affects Versions: 4.3
    Reporter: Artem Lukanin
    Assignee: Michael McCandless
    Fix For: 5.0, 4.4
    Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, benchmark-wo_convertion.txt, LUCENE-5030.patch, LUCENE-5030.patch, LUCENE-5030.patch, LUCENE-5030.patch, LUCENE-5030.patch, nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester_combo1.patch, nonlatin_fuzzySuggester_combo2.patch, nonlatin_fuzzySuggester_combo.patch, nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, run-suggest-benchmark.patch

There is a limitation in the current FuzzySuggester implementation: it computes edits in UTF-8 space instead of Unicode character (code point) space.

This should be fixable: we'd need to fix TokenStreamToAutomaton to work in Unicode character space, then fix FuzzySuggester to do the same steps that FuzzyQuery does: do the LevN expansion in Unicode character space, then convert that automaton to UTF-8, then intersect with the suggest FST.

See the discussion here: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
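
To make the limitation described in the issue concrete, here is a minimal Java sketch that compares Levenshtein distance counted over UTF-8 bytes with the same distance counted over Unicode code points. It does not use Lucene at all: the class `EditSpaceDemo` and its methods are hypothetical names chosen for illustration, and the plain dynamic-programming distance stands in for the Levenshtein automaton that the suggester actually builds.

```java
import java.nio.charset.StandardCharsets;

class EditSpaceDemo {
    // Classic two-row dynamic-programming Levenshtein distance over int sequences.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Distance where one symbol is one Unicode code point.
    static int codePointDistance(String a, String b) {
        return levenshtein(a.codePoints().toArray(), b.codePoints().toArray());
    }

    // Distance where one symbol is one UTF-8 byte.
    static int utf8ByteDistance(String a, String b) {
        return levenshtein(toInts(a.getBytes(StandardCharsets.UTF_8)),
                           toInts(b.getBytes(StandardCharsets.UTF_8)));
    }

    // Widen bytes to unsigned ints so both distances share one implementation.
    static int[] toInts(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xff;
        }
        return out;
    }

    public static void main(String[] args) {
        String ru1 = "\u043f\u043e\u0442"; // Cyrillic, transliterated "pot"
        String ru2 = "\u0440\u043e\u0442"; // Cyrillic, transliterated "rot"
        System.out.println("Cyrillic, code points: " + codePointDistance(ru1, ru2));
        System.out.println("Cyrillic, UTF-8 bytes: " + utf8ByteDistance(ru1, ru2));
        System.out.println("Latin, code points:    " + codePointDistance("pot", "rot"));
        System.out.println("Latin, UTF-8 bytes:    " + utf8ByteDistance("pot", "rot"));
    }
}
```

Replacing the first letter of "пот" with "р" is one edit in code point space but two edits in UTF-8 space, because U+043F encodes as D0 BF and U+0440 as D1 80, so both bytes change. The Latin pair "pot" and "rot" differs by one edit either way. The same doubling applies to lengths (three Cyrillic letters occupy six bytes), which is presumably why a length threshold such as minFuzzyLength behaves differently for English and Russian input when it is applied to bytes.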
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: LUCENE-5030.patch

The code is refactored not to touch AnalyzingSuggester. Please review.
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: LUCENE-5030.patch

I have renamed the variables in comments and tests for consistency.
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-5030:
    Attachment: LUCENE-5030.patch

New patch, fixing the linter error, renaming UNICODE_AWARE to FUZZY_UNICODE_AWARE, and fixing one compilation warning ... I think it's ready.
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: LUCENE-5030.patch

The javadocs are fixed.
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-5030:
    Fix Version/s: 4.4
                   5.0
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: LUCENE-5030.patch

Done. Please review LUCENE-5030.patch.
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: nonlatin_fuzzySuggester_combo1.patch

Sorry, I don't understand why testStolenBytes worked before. I have restored it and now it fails. Can you please suggest what I did wrong? As I understand it, if we do not preserve the separator, 1 token containing a separator and 2 tokens (which is actually 1 string with a separator) become equal after replaceSep removes the separator, so we should get 2 results instead of 1 when we do a lookup. No?

I've added a test for IllegalArgumentException.
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: nonlatin_fuzzySuggester_combo2.patch

I have restored testStolenBytes completely and now all the tests pass. But I'm not sure what you meant by the 0xff byte in {code}token(new BytesRef(new byte[] {0x61, (byte) 0xff, 0x61})){code}. Letter ÿ or SEP_LABEL? Now it is treated as letter ÿ, but in the previous modification of the test I treated it as SEP_LABEL.
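
The ambiguity in the question above can be illustrated with the JDK alone. The sketch below is an illustration under stated assumptions, not Lucene code: `ByteFfDemo` and its helper methods are hypothetical names. It shows that a raw 0xff byte is never part of well-formed UTF-8, that the letter ÿ (U+00FF) is encoded in UTF-8 as the two bytes C3 BF, and that 0xff only reads as ÿ when the byte is interpreted as a Latin-1 value or directly as a code point.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ByteFfDemo {
    // Strict check: true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // UTF-8 encoding of a single code point.
    static byte[] utf8Of(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    // Code point obtained when a single byte is read as ISO-8859-1 (Latin-1).
    static int latin1CodePoint(byte b) {
        return new String(new byte[] {b}, StandardCharsets.ISO_8859_1).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] token = {0x61, (byte) 0xff, 0x61}; // the bytes from the test
        System.out.println("valid UTF-8: " + isValidUtf8(token));
        byte[] y = utf8Of(0xFF);
        System.out.printf("U+00FF in UTF-8: %02X %02X%n", y[0] & 0xff, y[1] & 0xff);
        System.out.printf("0xff as Latin-1: U+%04X%n", latin1CodePoint((byte) 0xff));
    }
}
```

So whether 0xff means "ÿ" or "a reserved label" depends on whether the automaton labels are treated as bytes or as code points at that stage, which is the distinction this issue is about.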
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: nonlatin_fuzzySuggester_combo.patch

I have uploaded a lucene/solr combo patch with the new UNICODE_AWARE option.
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: benchmark-wo_convertion.txt
                benchmark-old.txt
                benchmark-INFO_SEP.txt
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: nonlatin_fuzzySuggester.patch

I used INFO_SEP and INFO_SEP2 for separators and holes. All the tests pass (I have fixed AnalyzingSuggesterTest.testStolenBytes). The benchmark is improved:

{code}
[junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest
[junit4:junit4] Completed in 0.04s, 0 tests
[junit4:junit4]
[junit4:junit4] JVM J0: 1.64 .. 2.34 = 0.71s
[junit4:junit4] Execution time total: 2.36 sec.
[junit4:junit4] Tests summary: 1 suite, 0 tests
[echo] 5 slowest tests:
[junit4:tophints] 22.95s | org.apache.lucene.search.spell.TestSpellChecker
[junit4:tophints] 15.08s | org.apache.lucene.search.suggest.fst.TestSort
[junit4:tophints] 13.41s | org.apache.lucene.search.suggest.analyzing.AnalyzingSuggesterTest
[junit4:tophints] 11.84s | org.apache.lucene.search.suggest.fst.FSTCompletionTest
[junit4:tophints] 10.78s | org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest
{code}
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-5030:
    Attachment: run-suggest-benchmark.patch

Hi Artem,

Sorry, running the LookupBenchmarkTest is tricky ... you need to make temporary changes in 3 places. I'm attaching a patch that should let you run it by just doing ant test -Dtestcase=LookupBenchmarkTest.
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: nonlatin_fuzzySuggester.patch

Now the tests in FuzzySuggesterTest and AnalyzingSuggesterTest pass, except for AnalyzingSuggesterTest.testRandom (when preserveSep = true). If I enable VERBOSE, I see that the suggestions are correct. I guess there is a bug in the test, but I cannot find it. Can you please review?
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
    Attachment: nonlatin_fuzzySuggester4.patch

I have fixed testRandom, which repeats the logic of FuzzySuggester. Now all the tests pass. Please review.
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Lukanin updated LUCENE-5030: -- Attachment: nonlatin_fuzzySuggester1.patch Now all the tests pass except testRandom when preserveSep is true. Michael, can you explain to me how the preserve-separator feature works? Key: LUCENE-5030 URL: https://issues.apache.org/jira/browse/LUCENE-5030 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.3 Reporter: Artem Lukanin Attachments: nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester.patch
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Lukanin updated LUCENE-5030: -- Attachment: nonlatin_fuzzySuggester2.patch The patch without autoformatting. Key: LUCENE-5030 URL: https://issues.apache.org/jira/browse/LUCENE-5030 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.3 Reporter: Artem Lukanin Attachments: nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester.patch
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Lukanin updated LUCENE-5030: -- Attachment: nonlatin_fuzzySuggester3.patch The patch with trailing spaces left untouched. Key: LUCENE-5030 URL: https://issues.apache.org/jira/browse/LUCENE-5030 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.3 Reporter: Artem Lukanin Attachments: nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, nonlatin_fuzzySuggester.patch
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Lukanin updated LUCENE-5030: -- Attachment: nonlatin_fuzzySuggester.patch I've added a test that demonstrates the bug. I have fixed TokenStreamToAutomaton, but I have no idea how to update AnalyzingSuggester, which expects bytes rather than chars (ints that do not fit in a byte). Key: LUCENE-5030 URL: https://issues.apache.org/jira/browse/LUCENE-5030 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.3 Reporter: Artem Lukanin Attachments: nonlatin_fuzzySuggester.patch
[jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
[ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Lukanin updated LUCENE-5030: -- Description: There is a limitation in the current FuzzySuggester implementation: it computes edits in UTF-8 space instead of Unicode character (code point) space. This should be fixable: we'd need to fix TokenStreamToAutomaton to work in Unicode character space, then fix FuzzySuggester to do the same steps that FuzzyQuery does: do the LevN expansion in Unicode character space, then convert that automaton to UTF-8, then intersect with the suggest FST. See the discussion here: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none was: There is a limitation in the current FuzzySuggester implementation: it computes edits in UTF-8 space instead of Unicode character (code point) space. This should be fixable: we'd need to fix TokenStreamToAutomaton to work in Unicode character space, then fix FuzzySuggester to do the same steps that FuzzyQuery does: do the LevN expansion in Unicode character space, then convert that automaton to UTF-8, then intersect with the suggest FST. See the discussion here: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none Key: LUCENE-5030 URL: https://issues.apache.org/jira/browse/LUCENE-5030 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.3 Reporter: Artem Lukanin
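To illustrate why computing edits in UTF-8 space behaves differently for English and Russian, here is a minimal standalone sketch. It does not use any Lucene classes; the class name EditSpaceDemo and its helper methods are hypothetical and exist only for this illustration. It computes a plain Levenshtein distance twice, once over Unicode code points and once over UTF-8 bytes, for one English and one Russian substitution.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene or of any attached patch.
public class EditSpaceDemo {

    // Classic dynamic-programming Levenshtein distance over two int sequences.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Edit distance where one symbol is one Unicode code point.
    static int codePointDistance(String a, String b) {
        return levenshtein(a.codePoints().toArray(), b.codePoints().toArray());
    }

    // Edit distance where one symbol is one UTF-8 byte.
    static int utf8Distance(String a, String b) {
        return levenshtein(toUnsigned(a.getBytes(StandardCharsets.UTF_8)),
                           toUnsigned(b.getBytes(StandardCharsets.UTF_8)));
    }

    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        // ASCII letters are one byte each, so both spaces agree.
        System.out.println(codePointDistance("cat", "cut")); // 1
        System.out.println(utf8Distance("cat", "cut"));      // 1

        // Cyrillic "kot" -> "kut": U+043E (bytes D0 BE) becomes U+0443 (bytes D1 83).
        String kot = "\u043A\u043E\u0442";
        String kut = "\u043A\u0443\u0442";
        System.out.println(codePointDistance(kot, kut));     // 1
        System.out.println(utf8Distance(kot, kut));          // 2
    }
}
```

In this sketch a single-letter substitution costs one edit in code point space for both languages, but two edits in UTF-8 space for the Russian pair, which is consistent with the limitation described in the issue.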