Re: fsa.dict.speller.replacement-pairs slow down spell check

2014-01-04 Thread Daniel Naber
On 2014-01-02 14:26, Marcin Miłkowski wrote:

 I did look at them again and it turns out that most only slowed down the
 process but did not help in creating good suggestions as short fragments
 were not put together the way they should.

As your changes have become quite complex now, I'd suggest releasing 
2.4.1 without them, i.e. with only the shortened list of pairs for 
en-US. Is that okay? Can I use the current version of the pairs file, or 
does it depend on your code changes?

Regards
  Daniel


--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349831iu=/4140/ostg.clktrk
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: fsa.dict.speller.replacement-pairs slow down spell check

2014-01-04 Thread Marcin Miłkowski
On 2014-01-04 11:15, Daniel Naber wrote:
 On 2014-01-02 14:26, Marcin Miłkowski wrote:

 I did look at them again and it turns out that most only slowed down the
 process but did not help in creating good suggestions as short fragments
 were not put together the way they should.

 As your changes have become quite complex now, I'd suggest releasing
 2.4.1 without them, i.e. with only the shortened list of pairs for
 en-US. Is that okay? Can I use the current version of the pairs file, or
 does it depend on your code changes?

Well, I don't know whether there are really any users of the standalone 
version 2.4 who would benefit from this fix.

I'd rather release a properly fixed version, but if you want, you can 
simply replace the .info file and release that as 2.4.1.

Best,
MM



Re: fsa.dict.speller.replacement-pairs slow down spell check

2014-01-03 Thread Marcin Miłkowski
On 2014-01-02 18:30, Jaume Ortolà i Font wrote:
 2014/1/2 Marcin Miłkowski list-addr...@wp.pl

 I did look at them again and it turns out that most only slowed down the
 process but did not help in creating good suggestions as short fragments
 were not put together the way they should. So I went ahead and removed a
 fair number of the replacement pairs but we still need to find ways to
 limit search if we have too many replacement pairs.


 That's easy, but the search will not be exhaustive...

 I don't think traversing the automaton directly would be faster because
 we stop searching the automaton early anyway (as soon as the letter is
 not found).


 I don't think so. Precisely when the algorithm is looking at the first
 letters (Mi...) there are many words to search in the dictionary (with
 distance=1: ma..., me..., mi...; ai..., bi..., ci..., ... zi...). As you
 progress (Milkow...), there are fewer and fewer words. So every time we
 start the search again with a new candidate, I think we lose a lot of
 time. Traversing the automaton should be considerably faster.

Yes, you are right. I did some profiling and indeed, our main problem is 
that findRepl on line 306 in Speller.java is run zillions of times.

One easy but brutal fix would be to run findRepl only on the original 
word and use replacement pairs only to generate complete candidates. 
This would save a lot of time, but at the cost of quality. Maybe it would 
be good enough until we have a proper traversal routine?

The only way to find out whether the quality of suggestions really drops 
is to run it on common misspellings with and without findRepl on the 
wordsToCheck list. Could you run this experiment on Catalan data?
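
A comparison like the one proposed above could be scored with a small harness such as the following. This is only a sketch in Python, not LanguageTool or Morfologik code: `hit_rate`, the `top_n` cutoff, and the toy suggestion tables are all made up for illustration, and a real run would plug in the two speller configurations and a real list of common misspellings.

```python
def hit_rate(suggest, cases, top_n=5):
    """Fraction of (misspelling, expected) cases in which the expected
    word appears among the first top_n suggestions."""
    if not cases:
        return 0.0
    hits = sum(1 for wrong, right in cases if right in suggest(wrong)[:top_n])
    return hits / len(cases)


# Hypothetical stand-ins for the two configurations being compared.
with_repl = {"hleb": ["chleb"], "chala": ["hala"], "mlkeo": ["mleko"]}
without_repl = {"hleb": [], "chala": ["hala"], "mlkeo": ["mleko"]}

cases = [("hleb", "chleb"), ("chala", "hala"), ("mlkeo", "mleko")]

full = hit_rate(lambda w: with_repl.get(w, []), cases)
reduced = hit_rate(lambda w: without_repl.get(w, []), cases)
```

Comparing `full` against `reduced` on the same case list, together with the wall-clock time of each run, would show how much quality is traded for speed.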


 The difficulty in traversing the automaton appears when replacing pairs
 with different lengths (one to two, one to three, two to three...). The
 management of the matrix where the distances are stored becomes tricky.

I can see the problem, and this is why there was an upper limit on 
replacement lengths in the original algorithm (in fsa_spell by Jan 
Daciuk). This is actually quite tricky to solve.
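
To illustrate why pairs of different lengths complicate the distance matrix, here is a sketch of a plain edit-distance table extended with replacement pairs, written for two complete strings. It is an assumption-laden illustration, not the fsa_spell or Morfologik algorithm: the function name `distance`, the `pair_cost` parameter, and the choice of making a pair cheaper than an ordinary edit are all hypothetical. Both sides of every pair are assumed to be non-empty.

```python
def distance(src, dst, pairs, pair_cost=0):
    """Edit distance where replacing a by b, for (a, b) in pairs, costs
    pair_cost instead of the usual per-character edits."""
    n, m = len(src), len(dst)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = float("inf")
            if i > 0:
                best = min(best, d[i - 1][j] + 1)      # deletion
            if j > 0:
                best = min(best, d[i][j - 1] + 1)      # insertion
            if i > 0 and j > 0:                        # match or substitution
                best = min(best, d[i - 1][j - 1] + (src[i - 1] != dst[j - 1]))
            for a, b in pairs:
                # A pair reaches back len(a) rows and len(b) columns,
                # not just to the neighbouring cell.
                if src.endswith(a, 0, i) and dst.endswith(b, 0, j):
                    best = min(best, d[i - len(a)][j - len(b)] + pair_cost)
            d[i][j] = best
    return d[n][m]
```

With the full matrix in memory the look-back to cell `(i - len(a), j - len(b))` is easy. During an automaton traversal the matrix is built incrementally along the current path, so every look-back distance has to be kept available, which is presumably where a cap on replacement length helps.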

 When I have some time, I could try again...

 I'm afraid the truth is that we have to be very careful when
 adding replacement pairs, as completely ignoring one-letter replacements
 is not a good idea (at least not for Polish, in which people confuse h
 and ch).


   If the replacement h-ch can happen in any context or in most
 contexts, then the one-letter replacement is unavoidable.

Exactly.

Regards,
Marcin



Re: fsa.dict.speller.replacement-pairs slow down spell check

2013-12-31 Thread Jaume Ortolà i Font
Hi,

In the current implementation the number of possible suggestions grows
exponentially with the replacement pairs, which is not a good thing...
For Milkowski you get 6144 possible suggestions in American English. I
set a limit of 7 possible simultaneous replacements in a word, which (if
the replacements are one to two) gives you 2^7=128 possible suggestions.
But in the case of Milkowski in American English, almost all letters have
3 or 4 possible replacements (which I didn't foresee), so the limit is about
4^7=16384.
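
The growth described above can be reproduced with a toy enumerator. This is a sketch only, not the Speller code: `expand`, its `max_replacements` parameter, and the artificial word and pairs are invented for illustration. At every position it either keeps the letter or applies any pair whose left side matches there, so each position contributes a factor of one plus the number of matching pairs.

```python
def expand(word, pairs, max_replacements=7):
    """Enumerate every string obtained by applying at most
    max_replacements replacement pairs, scanning left to right."""
    results = []

    def walk(pos, prefix, used):
        if pos == len(word):
            results.append(prefix)
            return
        # Option 1: keep the current character unchanged.
        walk(pos + 1, prefix + word[pos], used)
        # Option 2: apply any pair whose left side matches at this position.
        if used < max_replacements:
            for src, dst in pairs:
                if word.startswith(src, pos):
                    walk(pos + len(src), prefix + dst, used + 1)

    walk(0, "", 0)
    return results


one_alt = expand("aaaaaaa", [("a", "b")])
three_alt = expand("aaaaaaa", [("a", "b"), ("a", "c"), ("a", "d")])
```

For the seven-letter toy word, `len(one_alt)` is 2^7 = 128 and `len(three_alt)` is 4^7 = 16384, matching the two figures in the message. Each of these strings then has to be looked up in the dictionary separately.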

What can we do?
1. Limit the number of replacement pairs. In particular, avoid one-letter
replacements. Add more context.
2. Limit the number of possible replacements in a single word. This will
make the search less exhaustive.
3. Change the approach. We should try again to traverse the dictionary
directly with the replacement pairs. But the implementation is difficult.
I tried some time ago and got some results, but some details were
difficult to solve, and I gave up. Perhaps Marcin or Dawid Weiss can
provide some insight...
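
Option 3 can be sketched on a plain trie, which stands in here for the FSA dictionary. This is a simplified illustration under stated assumptions, not the implementation attempted in Morfologik: it applies replacement pairs only (no ordinary edit distance), the helper names `build_trie`, `follow` and `suggest` are hypothetical, and both sides of every pair are assumed to be non-empty.

```python
END = ""  # end-of-word marker; never collides with a single-character edge


def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def follow(node, text):
    """Walk text from node; return the node reached, or None if the path is missing."""
    for ch in text:
        node = node.get(ch)
        if node is None:
            return None
    return node


def suggest(word, trie, pairs):
    found = set()

    def walk(pos, node, prefix):
        if pos == len(word):
            if END in node:
                found.add(prefix)
            return
        nxt = node.get(word[pos])          # follow the letter as typed
        if nxt is not None:
            walk(pos + 1, nxt, prefix + word[pos])
        for src, dst in pairs:             # or apply a replacement pair here
            if word.startswith(src, pos):
                nxt = follow(node, dst)
                if nxt is not None:        # pruned when the dictionary has no such path
                    walk(pos + len(src), nxt, prefix + dst)

    walk(0, trie, "")
    return sorted(found)


trie = build_trie(["chleb", "hala", "mleko"])
pairs = [("h", "ch"), ("ch", "h")]
```

Here `suggest("hleb", trie, pairs)` returns `["chleb"]` and `suggest("chala", trie, pairs)` returns `["hala"]`. A branch is abandoned as soon as the dictionary has no matching path, so candidates that share a dead prefix are never generated at all, instead of being built in full and looked up one by one.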

Regards,
Jaume Ortolà



2013/12/31 Daniel Naber daniel.na...@languagetool.org

 Hi,

 a large number of pairs for 'fsa.dict.speller.replacement-pairs' (in the
 *.info file) makes creating suggestions very slow. You can reproduce
 that if your text contains only the word Milkowski and you check it
 with en-US (sorry your name is triggering a bug, Marcin :-). It takes
 several seconds to finish, while it's fast with en-GB, which has fewer
 replacement pairs.

 I guess it's so slow that we're back to hunspell performance, which we
 originally tried to avoid when switching to Morfologik. Is this a bug?
 Any better idea than just trimming down the number of replacement pairs?

 Regards
   Daniel





Re: fsa.dict.speller.replacement-pairs slow down spell check

2013-12-31 Thread Marcin Miłkowski
Hi,

On 2013-12-31 13:20, Jaume Ortolà i Font wrote:
 Hi,

 In the current implementation the number of possible suggestions grows
 exponentially with the replacement pairs, which is not a good thing...
 For Milkowski you get 6144 possible suggestions in American English. I
 set a limit of 7 possible simultaneous replacements in a word, which
 (if the replacements are one to two) gives you 2^7=128 possible
 suggestions. But in the case of Milkowski in American English, almost
 all letters have 3 or 4 possible replacements (which I didn't foresee), so
 the limit is about 4^7=16384.

I'd get rid of these multiple replacements. They are spurious anyway, I 
guess. Some of them seem to be repeated (I just copied these from 
hunspell). I will look at them again.


 What can we do?
 1. Limit the number of replacement pairs. In particular, avoid one-letter
 replacements. Add more context.
 2. Limit the number of possible replacements in a single word. This will
 make the search less exhaustive.
 3. Change the approach. We should try again to traverse the dictionary
 directly with the replacement pairs. But the implementation is
 difficult. I tried some time ago and got some results, but some
 details were difficult to solve, and I gave up. Perhaps Marcin or Dawid
 Weiss can provide some insight...

This is indeed a bit tricky...

Best,
Marcin




 