Re: fsa.dict.speller.replacement-pairs slow down spell check
On 2014-01-02 14:26, Marcin Miłkowski wrote:

> I did look at them again and it turns out that most only slowed down the process but did not help in creating good suggestions as short fragments were not put together the way they should.

As your changes have become quite complex now, I'd suggest releasing 2.4.1 without those, i.e. with only the shortened list of pairs for en-US. Is that okay? Can I use the current version of the pairs file, or does that depend on your code changes?

Regards
Daniel

___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
Re: fsa.dict.speller.replacement-pairs slow down spell check
On 2014-01-04 11:15, Daniel Naber wrote:

> On 2014-01-02 14:26, Marcin Miłkowski wrote:
>
>> I did look at them again and it turns out that most only slowed down the process but did not help in creating good suggestions as short fragments were not put together the way they should.
>
> As your changes have become quite complex now, I'd suggest to release 2.4.1 without those, i.e. with only the shortened list of pairs for en-US. Is that okay? Can I use the current version of the pairs file or does that depend on your code changes?

Well, I don't know if there are really any users of the standalone version 2.4 that would benefit from this fix. I'd rather release a properly fixed version, but if you want, you can simply substitute the .info file and release this as 2.4.1.

Best,
MM
Re: fsa.dict.speller.replacement-pairs slow down spell check
On 2014-01-02 18:30, Jaume Ortolà i Font wrote:

> 2014/1/2 Marcin Miłkowski list-addr...@wp.pl
>
>> I did look at them again and it turns out that most only slowed down the process but did not help in creating good suggestions as short fragments were not put together the way they should. So I went ahead and removed a fair number of the replacement pairs but we still need to find ways to limit search if we have too many replacement pairs.
>
> That's easy, but the search will not be exhaustive...
>
>> I don't think traversing the automaton directly would be faster because we stop searching the automaton early anyway (as soon as the letter is not found).
>
> I don't think so. Precisely when the algorithm is looking at the first letters (Mi...) there are many words to search in the dictionary (with distance=1: ma..., me..., mi...; ai..., bi..., ci..., ... zi...). As you progress (Milkow...), there are fewer and fewer words. So every time we start the search again with a new candidate, I think we lose a lot of time. Traversing the automaton should be considerably faster.

Yes, you are right. I did some profiling and indeed, our main problem is that findRepl on line 306 in Speller.java is run zillions of times. One easy but brutal fix would be to run findRepl only on the original word and use replacement pairs only to generate complete candidates. This would save a lot of time, but at the cost of quality. Maybe it would be better until we have a proper traversal routine? The only way to find out whether the quality of suggestions really drops is to run it on common misspellings with and without findRepl on the wordsToCheck list. Could you run this experiment on Catalan data?

> The difficulty in traversing the automaton appears when replacing pairs with different lengths (one to two, one to three, two to three...). The management of the matrix where the distances are stored becomes tricky.
I can see the problem, and this is why there was an upper limit of replacement lengths in the original algorithm (in fsa_spell by Jan Daciuk). This is actually quite tricky to solve.

> When I have some time, I could try again...
>
>> I'm afraid the truth is that we have to be very careful when adding replacement pairs, as completely ignoring one-letter replacements is not a good idea (at least not for Polish, in which people confuse h and ch).
>
> If the replacement h-ch can happen in any context or in most contexts, then the one-letter replacement is unavoidable.

Exactly.

Regards,
Marcin
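Editorial sketch of the point debated above: when the dictionary is walked once as a tree, all candidates that share a prefix share the distance computation for that prefix, and a whole branch is abandoned as soon as it cannot stay within the distance limit. The code below is a minimal illustration under stated assumptions, not the Morfologik `Speller` code: the class `TrieFuzzySearch`, its plain trie (standing in for the FSA), and the method names are all hypothetical, and it handles only single-character edits. It does not attempt replacement pairs of different lengths, which is exactly the part described above as tricky.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: a plain trie stands in for the dictionary automaton.
final class TrieFuzzySearch {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns all stored words within maxDistance single-character edits of query. */
    List<String> search(String query, int maxDistance) {
        List<String> results = new ArrayList<>();
        // Row for the empty prefix: distance to the first i query letters is i.
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        StringBuilder prefix = new StringBuilder();
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            prefix.append(e.getKey());
            walk(e.getValue(), e.getKey(), prefix, query, firstRow, maxDistance, results);
            prefix.setLength(prefix.length() - 1);
        }
        return results;
    }

    private void walk(Node node, char letter, StringBuilder prefix, String query,
                      int[] prevRow, int maxDistance, List<String> results) {
        // Extend the Levenshtein matrix by one row for the letter just consumed.
        int[] row = new int[prevRow.length];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int i = 1; i < row.length; i++) {
            int substitute = prevRow[i - 1] + (query.charAt(i - 1) == letter ? 0 : 1);
            int insert = row[i - 1] + 1;
            int delete = prevRow[i] + 1;
            row[i] = Math.min(substitute, Math.min(insert, delete));
            rowMin = Math.min(rowMin, row[i]);
        }
        if (node.isWord && row[row.length - 1] <= maxDistance) {
            results.add(prefix.toString());
        }
        if (rowMin > maxDistance) {
            return; // no extension of this prefix can get back under the limit
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            walk(e.getValue(), e.getKey(), prefix, query, row, maxDistance, results);
            prefix.setLength(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        TrieFuzzySearch dict = new TrieFuzzySearch();
        for (String w : new String[] {"milk", "mild", "silk", "milkman", "cat"}) {
            dict.add(w);
        }
        System.out.println(dict.search("milk", 1)); // [mild, milk, silk]
    }
}
```

The matrix row carried down the tree is the "matrix where the distances are stored" in its simplest form. A replacement pair that maps one letter to two or three would need to look back more than one row, which is where the bookkeeping becomes harder.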
Re: fsa.dict.speller.replacement-pairs slow down spell check
Hi,

In the current implementation the number of possible suggestions grows exponentially with the replacement pairs, which is not a good thing... For Milkowski you get 6144 possible suggestions in American English.

I fixed a limit of 7 possible simultaneous replacements in a word, which (if the replacements are one to two) gives you 2^7=128 possible suggestions. But in the case of Milkowski in American English, almost all letters have 3 or 4 possible replacements (which I didn't foresee), so the limit is about 4^7=16384.

What can we do?

1. Limit the number of replacement pairs. In particular, avoid one-letter replacements. Add more context.
2. Limit the number of possible replacements in a single word. This will make the search less exhaustive.
3. Change the approach. We should try again to traverse the dictionary directly with the replacement pairs. But the implementation is difficult. I tried some time ago and got some results, but there were details that were difficult to solve, and I gave up. Perhaps Marcin or Dawid Weiss can provide some insight...

Regards,
Jaume Ortolà

2013/12/31 Daniel Naber daniel.na...@languagetool.org

> Hi,
>
> a large number of pairs for 'fsa.dict.speller.replacement-pairs' (in the *.info file) makes creating suggestions very slow. You can reproduce that if your text contains only the word Milkowski and you check it with en-US (sorry your name is triggering a bug, Marcin :-). It takes several seconds to finish, while it's fast with en-GB, which has fewer replacement pairs. I guess it's so slow that we're back to hunspell performance, which we originally tried to avoid when switching to Morfologik.
>
> Is this a bug? Any better idea than just trimming down the number of replacement pairs?
>
> Regards
> Daniel
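Editorial note on the arithmetic in the message above: if each of up to 7 positions in a word can independently stay as it is or take one of several replacements, the number of candidate strings is the product of the number of forms per position. The sketch below only checks the two bounds quoted (2^7 and 4^7). The class `ReplacementGrowth` and its method names are hypothetical, and the per-position counts in the last example are invented to show how a product of small factors reaches 6144; they are not taken from the en-US .info file.

```java
// Hypothetical helper for checking the candidate counts quoted in the thread.
final class ReplacementGrowth {

    /** Candidates when each of `positions` slots independently takes one of `forms` forms. */
    static long uniformBound(int forms, int positions) {
        long total = 1;
        for (int i = 0; i < positions; i++) {
            total = Math.multiplyExact(total, (long) forms);
        }
        return total;
    }

    /** Candidates when slot i has formsPerPosition[i] forms (the original letter counts as one). */
    static long product(int[] formsPerPosition) {
        long total = 1;
        for (int forms : formsPerPosition) {
            total = Math.multiplyExact(total, (long) forms);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(uniformBound(2, 7)); // 128
        System.out.println(uniformBound(4, 7)); // 16384

        // Invented split, for illustration only: not the real en-US pairs.
        int[] invented = {3, 4, 4, 4, 4, 4, 2};
        System.out.println(product(invented)); // 6144
    }
}
```

Because every candidate is then searched separately, trimming even one form from each position shrinks the product sharply: 3^7 is 2187 against 4^7 = 16384.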
Re: fsa.dict.speller.replacement-pairs slow down spell check
Hi,

On 2013-12-31 13:20, Jaume Ortolà i Font wrote:

> Hi,
>
> In the current implementation the number of possible suggestions grows exponentially with the replacement pairs, which is not a good thing... For Milkowski you get 6144 possible suggestions in American English.
>
> I fixed a limit of 7 possible simultaneous replacements in a word, which (if the replacements are one to two) gives you 2^7=128 possible suggestions. But in the case of Milkowski in American English, almost all letters have 3 or 4 possible replacements (which I didn't foresee), so the limit is about 4^7=16384.

I'd get rid of these multiple replacements. They are spurious anyway, I guess. Some of them seem to be repeated (I just copied these from hunspell). I will look at them again.

> What can we do?
>
> 1. Limit the number of replacement pairs. In particular, avoid one-letter replacements. Add more context.
> 2. Limit the number of possible replacements in a single word. This will make the search less exhaustive.
> 3. Change the approach. We should try again to traverse the dictionary directly with the replacement pairs. But the implementation is difficult. I tried some time ago and got some results, but there were details that were difficult to solve, and I gave up. Perhaps Marcin or Dawid Weiss can provide some insight...

This is indeed a bit tricky...

Best,
Marcin