Re: Plain English rules

2014-12-22 Thread Daniel Naber
On 2014-12-20 11:32, Heikki Lehvaslaiho wrote:
Heikki,
> I've set up a gist with 80 English rules that (mostly) expand
> redundant/wordy rules in LanguageTool 2.7. Testrules script passes
> these, but it would be good for someone to go through them before
> inclusion to the main rules file.

Re: Plain English rules

2014-12-22 Thread Marcin Miłkowski
On 2014-12-22 at 11:33, Daniel Naber wrote:
> On 2014-12-20 11:32, Heikki Lehvaslaiho wrote:
>
> Heikki,
>
>> I've set up a gist with 80 English rules that (mostly) expand
>> redundant/wordy rules in LanguageTool 2.7. Testrules script passes
>> these, but it would be good for someone to go thou

Re: Improving spelling suggestions with frequency dictionaries

2014-12-22 Thread Andriy Rysin
Thanks, so I ran this (f-uk-UA.dict is the file with frequencies):
java -jar ./mfl/morfologik-tools-*-standalone.jar fsa_dump -x -d f-uk-UA.dict | wc -l
1013148
while (uk-UA.dict is the old one without frequencies)
java -jar ./mfl/morfologik-tools-*-standalone.jar fsa_dump -d uk-UA.dict | wc -l
1623646
i

Re: Improving spelling suggestions with frequency dictionaries

2014-12-22 Thread Dawid Weiss
I honestly don't know -- I'm not into LT development, I know the FST package. :) Try "head" instead of "wc -l" and see what the output is?
Dawid
On Mon, Dec 22, 2014 at 2:09 PM, Andriy Rysin wrote:
> Thanks, so I ran this (f-uk-UA.dict is file with frequencies):
> java -jar ./mfl/morfologik-tools

Re: Improving spelling suggestions with frequency dictionaries

2014-12-22 Thread Andriy Rysin
I just double-checked:
if I encode my dictionary with fsa_build, both cfsa2 and fsa5 give me ~1.6 million rows
if I encode using org.languagetool.dev.SpellDictionaryBuilder with frequency information I get ~1 million rows
if I encode using org.languagetool.dev.SpellDictionaryBuilder without fre

Re: Improving spelling suggestions with frequency dictionaries

2014-12-22 Thread Dawid Weiss
> Here, head output for 1 million row generated by SpellDictionaryBuilder
Not the metadata headers, what's the actual raw data in the output?
D.

Re: Improving spelling suggestions with frequency dictionaries

2014-12-22 Thread Andriy Rysin
fsa_dump -x -d uk_UA_f.dict gives me this (words seem to have apostrophe in front for some reason and many words are garbled): Decoded FSA data (in the encoding above) '▒ '▒ 'ю 'ю 'ю-▒'ю-▒ 'южн'южн 'южна 'южна 'южнові 'южнові 'южном 'ю

Re: Improving spelling suggestions with frequency dictionaries

2014-12-22 Thread Dawid Weiss
It would be easier if you could set up a small GitHub project and:
1) include a 100-line source file with your data;
2) include the commands you're using to compress this data through the various pipelines you mentioned.
I suspect SpellDictionaryBuilder preprocesses the input somehow before it p

added.txt activated for most languages

2014-12-22 Thread Daniel Naber
Hi,
I made a change to the BaseTagger. From the change log:
- The part-of-speech tagger for most languages can now be extended by adding entries to the file org/languagetool/resource/XX/added.txt (XX being the language code). The format is "fullform baseform postags", three columns separated
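
A minimal sketch of reading entries in the three-column "fullform baseform postags" layout mentioned in the change log. The separator is cut off in the excerpt, so splitting on runs of whitespace is an assumption here, as are the helper name `parse_added_entries`, the `TaggerEntry` type, and the skipping of `#` comment lines. This is not LanguageTool's own loader.

```python
from typing import Iterable, List, NamedTuple


class TaggerEntry(NamedTuple):
    """One row of the three-column format (hypothetical type)."""
    fullform: str
    baseform: str
    postag: str


def parse_added_entries(lines: Iterable[str]) -> List[TaggerEntry]:
    """Parse 'fullform baseform postags' lines (hypothetical helper)."""
    entries = []
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines are skipped.
        if not line or line.startswith("#"):
            continue
        # Assumption: columns are separated by runs of whitespace.
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got {len(parts)}: {raw!r}")
        entries.append(TaggerEntry(*parts))
    return entries
```

Rejecting rows that do not have exactly three columns makes a malformed entry fail loudly instead of silently producing a wrong tag.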

Re: Improving spelling suggestions with frequency dictionaries

2014-12-22 Thread Andriy Rysin
Ok, I've debugged it a bit and found the problem: I was assuming it expects the input dictionary in UTF-8 (like the frequency list), but it wants it in my output encoding. I converted my input to cp1251 and now I get the 1.6 million states I expect.
Thanks,
Andriy
FSA properties -- FS
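
The encoding mismatch described in this message can be illustrated with a short sketch. It only shows the general effect of treating UTF-8 bytes as cp1251 and how to re-encode them; the helper name `reencode` is hypothetical, and nothing here reflects how SpellDictionaryBuilder reads its input internally.

```python
def reencode(data: bytes, src: str = "utf-8", dst: str = "cp1251") -> bytes:
    """Decode bytes from src and encode them in dst (hypothetical helper)."""
    return data.decode(src).encode(dst)


word = "южна"  # sample word taken from the fsa_dump output quoted in the thread
utf8_bytes = word.encode("utf-8")
cp1251_bytes = reencode(utf8_bytes)

# Cyrillic letters take two bytes each in UTF-8 but one byte each in cp1251.
print(len(utf8_bytes), len(cp1251_bytes))

# Reading the UTF-8 bytes as if they were cp1251 does not give the word back.
misread = utf8_bytes.decode("cp1251", errors="replace")
print(misread == word)
```

A tool that expects single-byte cp1251 input but receives UTF-8 would therefore see different byte sequences for the same words, which is consistent with the garbled dump reported earlier.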

English Rule Additions

2014-12-22 Thread Nick Hough
I have devised some rules for common English mistakes for the letter ‘A’, which you can see here: https://gist.github.com/howlinghuffy/d25d3d6b43c7a9b485cb I plan on doing many more submissions like this over the coming months; let me

Re: added.txt activated for most languages

2014-12-22 Thread Jaume Ortolà i Font
Hi Daniel, I use the manual-tagger not only as a way to add new words and tags, but also as a means of fixing tags temporarily until the next dictionary update. So if there is a manual tag, the dictionary tag is ignored. I think that makes sense. Could we do it likewise in the CombiningTagger? We
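
The override behaviour asked for here (a manual tag replaces the dictionary tag instead of being added to it) can be sketched as follows. The function name `combine_tags`, the `overwrite` flag, and the dictionary-of-lists representation are assumptions made for illustration; they are not the CombiningTagger API.

```python
from typing import Dict, List


def combine_tags(
    dictionary: Dict[str, List[str]],
    manual: Dict[str, List[str]],
    overwrite: bool = True,
) -> Dict[str, List[str]]:
    """Combine dictionary tags with manual tags (hypothetical helper).

    With overwrite=True a manual entry replaces the dictionary entry for
    the same word; with overwrite=False the two tag lists are merged.
    """
    combined = {word: list(tags) for word, tags in dictionary.items()}
    for word, tags in manual.items():
        if overwrite or word not in combined:
            combined[word] = list(tags)
        else:
            # Merge while keeping order and dropping duplicate tags.
            for tag in tags:
                if tag not in combined[word]:
                    combined[word].append(tag)
    return combined
```

With `overwrite=True`, a wrong dictionary tag can be corrected temporarily through the manual file; with `overwrite=False`, the manual file can only add readings.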

Re: Improving spelling suggestions with frequency dictionaries

2014-12-22 Thread Andriy Rysin
So I've made frequency work for spelling suggestions, but I have two questions:
1) before I added frequency my spelling .dict was ~600k; with frequency added (I added frequency to all 1.6 million words) it's now 1.8M. Maybe that's OK, but I just wanted to double-check if I am doing everything right
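
As a rough illustration of the general idea behind frequency-aware suggestions, the sketch below reorders candidate words so that more frequent ones come first. The function name `rank_by_frequency` and the plain dictionary of counts are assumptions; how the speller discussed in this thread actually stores frequencies and weighs candidates is not shown in the excerpt.

```python
from typing import Dict, List


def rank_by_frequency(candidates: List[str], frequencies: Dict[str, int]) -> List[str]:
    """Order candidates by descending frequency (hypothetical helper).

    Words missing from the frequency table count as 0. Python's sort is
    stable, so candidates with equal frequency keep their original order.
    """
    return sorted(candidates, key=lambda word: -frequencies.get(word, 0))
```

Keeping ties in their original order means any earlier ranking (for example by edit distance) is preserved among equally frequent words.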