Re: [Languagetool] Possible way of speeding up LanguageTool

2012-01-14 Thread Marcin Miłkowski
W dniu 2012-01-14 01:39, Jimmy O'Regan pisze: 2012/1/13 Dominique Pellé dominique.pe...@gmail.com: I see that the Ukrainian tokenizer uses a MySpell (UkrainianMyspellTagger class) which reads and parses a text file dist/resource/uk/ukrainian.dict of 1,841,900 bytes, so that's not fast. It is

Re: [Languagetool] Ignore a warning if a multi-word term is in a sentence

2012-01-28 Thread Marcin Miłkowski
W dniu 2012-01-26 19:02, Mike Unwalla pisze: Hello All, I want to ignore a warning if a particular multi-word term is in a sentence. For example, I want to find 'deploy', except if the term 'drogue chute' is in the sentence. 'Warn for misused word, except in correct context'
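The "warn for a misused word, except in the correct context" logic from this request can be sketched in plain Java. This is only an illustration of the decision itself, not LanguageTool's rule syntax or API: the class `ContextException` and the method `shouldWarn` are hypothetical names, and the word match is exact (it will not catch inflected forms such as "deployed").

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of LanguageTool. */
final class ContextException {

    /**
     * Returns true when the sentence contains flaggedWord as a whole word
     * and does NOT contain the multi-word term that makes the usage acceptable.
     */
    static boolean shouldWarn(String sentence, String flaggedWord, String allowedTerm) {
        // Whole-word, case-insensitive match for the flagged word.
        Pattern word = Pattern.compile("\\b" + Pattern.quote(flaggedWord) + "\\b",
                Pattern.CASE_INSENSITIVE);
        boolean hasFlaggedWord = word.matcher(sentence).find();
        // Plain substring check for the multi-word term, ignoring case.
        boolean hasAllowedTerm = sentence.toLowerCase(Locale.ROOT)
                .contains(allowedTerm.toLowerCase(Locale.ROOT));
        return hasFlaggedWord && !hasAllowedTerm;
    }
}
```

With the example from the message, `shouldWarn(sentence, "deploy", "drogue chute")` flags a sentence that uses 'deploy' on its own and stays silent when 'drogue chute' occurs anywhere in the same sentence.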

Re: [Languagetool] Spelling rule in Polish and (possibly) other languages

2012-02-05 Thread Marcin Miłkowski
W dniu 2012-02-04 16:14, Wojciech Górnaś pisze: Hello! Looking for a programmer to help me create a simple spelling rule for Polish. There is a set of words that are supposed to be spelled together but are often misspelled as separate words. Such mistakes are relatively common and standard

Re: [Languagetool] Spelling rule in Polish and (possibly) other languages

2012-02-05 Thread Marcin Miłkowski
W dniu 2012-02-04 19:54, Daniel Naber pisze: On Samstag, 4. Februar 2012, Wojciech Górnaś wrote: I thought about a tool that would allow adding new pairs of words to one general rule. Do you think that's possible? You could write a Java rule, i.e. a class that extends the Rule class and

Re: [Languagetool] LibreOffice 3.5 and LT

2012-02-19 Thread Marcin Miłkowski
W dniu 2012-02-19 15:30, Daniel Naber pisze: Hi, I just tried using LT with LibreOffice 3.5 and it didn't work for English. This seems to be caused by the embedded grammar checker (Lightproof). Does anybody know whether it's still possible to activate LT for English? Interesting, it seems

Re: [Languagetool] LibreOffice 3.5 and LT

2012-02-19 Thread Marcin Miłkowski
W dniu 2012-02-19 21:07, Daniel Naber pisze: On Sonntag, 19. Februar 2012, Daniel Naber wrote: Interesting, it seems that LO 3.5rc3 works with LT here in English, even when Lightproof is activated. Weird. Did you check the pattern rules? A lot of the Lightproof rules come from LT so it

Re: [Languagetool] Idea to avoid duplicate errors withing a rulegroup

2012-02-26 Thread Marcin Miłkowski
W dniu 2012-02-26 17:34, Dominique Pellé pisze: Daniel Naber wrote: On Sonntag, 26. Februar 2012, Dominique Pellé wrote: Is there a way to tell LanguageTool that all the rules in a rulegroup are mutually exclusive? I don't think so. But why limit this to rulegroups? What about iterating

Re: [Languagetool] Hunspell support

2012-04-24 Thread Marcin Miłkowski
W dniu 2012-04-24 21:13, Serkan Kaba pisze: There's also http://kenai.com/projects/jmyspell which is a pure Java implementation though it seems to be discontinued. This one does not include the features of hunspell that we need for languages

Re: [Languagetool] Hunspell support

2012-04-25 Thread Marcin Miłkowski
W dniu 2012-04-24 18:53, Daniel Naber pisze: On Dienstag, 24. April 2012, Yakov Reztsov wrote: Or use our fsa dictionary? That won't work well for languages with compounds, like German. I'm all for including the JNI version of hunspell, as the license seems to be okay. +1 For converters

Re: [Languagetool] Hunspell support

2012-04-28 Thread Marcin Miłkowski
Hi again, W dniu 2012-04-24 17:50, Yakov Reztsov pisze: Maybe try dren-dk / HunspellJNA? http://dren.dk/hunspell.html https://github.com/dren-dk/HunspellJNA This code is licensed under LGPL/GPL/MPL. I tried it yesterday. It works quite nicely but hunspell seems to perform very slowly

Re: [Languagetool] Hunspell support

2012-04-28 Thread Marcin Miłkowski
W dniu 2012-04-28 14:35, Daniel Naber pisze: On Samstag, 28. April 2012, Marcin Miłkowski wrote: I tried it yesterday. It works quite nicely but hunspell seems to perform very slowly compared to our fsa engine. Are you sure it's not a problem of slow initialization or so? Can you post some

Re: [Languagetool] Hunspell support

2012-04-29 Thread Marcin Miłkowski
W dniu 2012-04-29 02:12, Daniel Naber pisze: On Samstag, 28. April 2012, Marcin Miłkowski wrote: My impression was that it's slow because I could see lines showing up on the console. I mean on every call, there was a System.err printout, and I could see lines printing. Could you maybe

Re: [Languagetool] Hunspell support

2012-04-29 Thread Marcin Miłkowski
W dniu 2012-04-29 21:20, Daniel Naber pisze: On Sonntag, 29. April 2012, Marcin Miłkowski wrote: However, checking if the word is misspelled takes virtually no time. It's only generating suggestions that seems to bog us down. We could wait until a user actually requests the suggestions
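The idea in this message (checking a word is cheap, generating suggestions is expensive, so wait until the user asks) can be sketched as lazy evaluation. The class below is hypothetical and not taken from LanguageTool; the `suggester` function stands in for whatever slow call produces suggestions.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical holder for one spelling match; not a LanguageTool class. */
final class LazySpellingMatch {
    private final String word;
    private final Function<String, List<String>> suggester;
    private List<String> cached; // stays null until suggestions are first requested

    LazySpellingMatch(String word, Function<String, List<String>> suggester) {
        this.word = word;
        this.suggester = suggester;
    }

    String getWord() {
        return word;
    }

    /** Runs the expensive suggester at most once, on first request. */
    synchronized List<String> getSuggestions() {
        if (cached == null) {
            cached = suggester.apply(word);
        }
        return cached;
    }
}
```

Creating the match costs nothing beyond the misspelling check itself; the suggester only runs if `getSuggestions()` is actually called, and its result is reused afterwards.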

[Languagetool] TeX story

2012-05-07 Thread Marcin Miłkowski
Hi all, I created a wiki page for people who want to use LT as a grammar checker for (La)TeX: http://languagetool.wikidot.com/checking-la-tex-with-languagetool I could add LyX as a solution but this is much more tricky in using on different platforms. Best regards, Marcin

Re: [Languagetool] TeX story

2012-05-07 Thread Marcin Miłkowski
in TexStudio, so hunspell complains at „OK” (which is in Polish smart quotes). You should add all kinds of quotes to your word tokenizer. Best, Marcin Benito On 05/07/12 19:20, Marcin Miłkowski wrote: Hi all, I created a wiki page for people who want to use LT as a grammar checker for (La)TeX

Re: [Languagetool] TeX story

2012-05-08 Thread Marcin Miłkowski
this construct, you might add support for it at least as an option. All in all, however, it's great to have LT in a major TeX editor! I will add the info to the wiki. Marcin Benito On 05/07/12 23:21, Marcin Miłkowski wrote: Hi Benito, W dniu 2012-05-07 20:23, Benito van der Zander

Re: [Languagetool] TeX story

2012-05-10 Thread Marcin Miłkowski
it might be a good idea to look at how they do it. Marcin Benito On 05/09/12 10:29, Marcin Miłkowski wrote: W dniu 2012-05-09 00:33, Marcin Miłkowski pisze: (but even if \cite is correctly removed, there remains an additional space before the point which LT may detect as error. And if I

Re: [Languagetool] Any advice?

2012-05-16 Thread Marcin Miłkowski
W dniu 2012-05-16 22:28, gulp21 pisze: As much as I hate passing the buck, I'm afraid writing such a rule is beyond my (pretty much non-existent) Java skills. I planned to write a Java rule for it, but I'm rather busy at the moment. Unless somebody is quicker than me, or has a better idea,

[Languagetool] Initial hunspell support

2012-05-18 Thread Marcin Miłkowski
Hi all, I just added preliminary hunspell support to LanguageTool. This is not yet ready for production, as essential parts are missing. But the infrastructure and some files are already there. Now, I didn't add all spelling dictionaries to the source - there's only one for Polish right now

Re: [Languagetool] Initial hunspell support

2012-05-18 Thread Marcin Miłkowski
W dniu 2012-05-18 17:01, Ruud Baars pisze: Marcin, I must have missed some info. What is the Hunspell integration meant to do exactly? Find spelling mistakes. We have a Java rule that uses Hunspell for this purpose now. Marcin Ruud On 18-05-12 15:10, Marcin Miłkowski wrote: Hi all

Re: [Languagetool] Initial hunspell support

2012-05-18 Thread Marcin Miłkowski
for Dutch in a Dutch language Union sponsored project). OK, that means you'd need a special class for Dutch. Marcin Ruud On 18-05-12 17:27, Marcin Miłkowski wrote: W dniu 2012-05-18 17:01, Ruud Baars pisze: Marcin, I must have missed some info. What is the Hunspell integration meant

Re: [Languagetool] Initial hunspell support

2012-05-18 Thread Marcin Miłkowski
the option -r through the library. Marcin Ruud On 18-05-12 19:17, Daniel Naber wrote: On Freitag, 18. Mai 2012, Marcin Miłkowski wrote: Hi Marcin, Now, I didn't add all spelling dictionaries to the source - there's only one for Polish right now as I needed at least one for testing

Re: [Languagetool] Initial hunspell support

2012-05-18 Thread Marcin Miłkowski
W dniu 2012-05-18 19:17, Daniel Naber pisze: On Freitag, 18. Mai 2012, Marcin Miłkowski wrote: Hi Marcin, Now, I didn't add all spelling dictionaries to the source - there's only one for Polish right now as I needed at least one for testing. The dictionaries should go to resources

Re: [Languagetool] Initial hunspell support

2012-05-18 Thread Marcin Miłkowski
W dniu 2012-05-18 22:57, Daniel Naber pisze: On Freitag, 18. Mai 2012, Marcin Miłkowski wrote: thanks for adding Hunspell support! What about moving the hunspell dict to another sub directory hunspell in our sources to make more obvious that this is just copied from a different project? Well

Re: [Languagetool] Initial hunspell support

2012-05-19 Thread Marcin Miłkowski
link to the installed dictionaries in LO/AOO without reinstalling them? lp, m. 2012/5/18 Marcin Miłkowski list-addr...@wp.pl: W dniu 2012-05-18 22:57, Daniel Naber pisze: On Freitag, 18. Mai 2012, Marcin Miłkowski wrote: thanks for adding Hunspell support! What about moving the hunspell dict

Re: [Languagetool] Initial hunspell support

2012-05-19 Thread Marcin Miłkowski
, Marcin Miłkowski wrote: thanks for adding Hunspell support! What about moving the hunspell dict to another sub directory hunspell in our sources to make more obvious that this is just copied from a different project? Well, you mean adding hunspell under resources? I'd suggest using src/resource

Re: [Languagetool] Initial hunspell support

2012-05-19 Thread Marcin Miłkowski
W dniu 2012-05-19 11:13, Dominique Pellé pisze: Marcin Miłkowski list-addr...@wp.pl wrote: Hi all, I just added preliminary hunspell support to LanguageTool. This is not yet ready for production, as essential parts are missing. But the infrastructure and some files are already there. Hi

Re: [Languagetool] Initial hunspell support

2012-05-21 Thread Marcin Miłkowski
Hi again, I've just split the builds, and it should work now out of the box for Polish, also under JNLP (if you care to compile it, the online version on our web page is not up-to-date - we usually wait for a stable release). Daniel - it seems that your script for managing the snapshots

Re: [Languagetool] Initial hunspell support

2012-05-21 Thread Marcin Miłkowski
W dniu 2012-05-21 19:41, Daniel Naber pisze: I could add stemming and analyzing options - I think it's interesting to include it for analyzing compounds in German or Dutch. We already have our own component that can do that for German (jwordsplitter). When reading the GSoC ideas on the wiki,

Re: [Languagetool] Initial hunspell support

2012-05-21 Thread Marcin Miłkowski
W dniu 2012-05-21 20:35, Daniel Naber pisze: But is there anyone that does not want hunspell for a language? I mean maintainers should be able to decide if they want it or not. The problem is that some languages are not really maintained I guess... I also think that all languages should get

Re: [Languagetool] Initial hunspell support

2012-05-21 Thread Marcin Miłkowski
W dniu 2012-05-21 23:41, Daniel Naber pisze: On Montag, 21. Mai 2012, Marcin Miłkowski wrote: Unfortunately, there is no way to disable rules when using the server, and this is a major drawback in this case The reason was probably that there once was only one LT object in the http server

Re: [Languagetool] Initial hunspell support

2012-05-22 Thread Marcin Miłkowski
W dniu 2012-05-22 00:24, Marcin Miłkowski pisze: My preliminary results (I will send them later when they are ready in a formatted form) are that the hunspell rule is around 25 times (!) slower than all other rules. I profiled the rules. The data were correct, and you can see them on a graph

Re: [Languagetool] Initial hunspell support

2012-05-22 Thread Marcin Miłkowski
Hello, W dniu 2012-05-22 09:19, Mike Unwalla pisze: Hello, I don't know the implications of having hunspell, therefore, possibly, my comment is nonsense. I don't want LT to flag a word unless I tell LT to flag that word. If hunspell causes LT to flag a word, then hunspell will cause me

Re: [Languagetool] inflected words from two tokens

2012-05-22 Thread Marcin Miłkowski
W dniu 2012-05-20 11:19, Jaume Ortolà i Font pisze: I am already familiar with the methods for inflecting tokens. Now, in order to make a suggestion, I need the lemma from one token and some information of the POS tag of another token. In fact, this is rarely necessary because this information

[Languagetool] Country variants (was: Initial hunspell support)

2012-05-22 Thread Marcin Miłkowski
W dniu 2012-05-21 20:06, Marcin Miłkowski pisze: I guess if every maintainer needs to add the files, a lot of languages won't have support for spell checking in 1.8... Is this something you could do for all languages? How large will the LT zip become then anyway? I could, of course

Re: [Languagetool] Initial hunspell support

2012-05-23 Thread Marcin Miłkowski
W dniu 2012-05-22 23:43, Daniel Naber pisze: On Dienstag, 22. Mai 2012, Marcin Miłkowski wrote: It depends on how you are using LT: if from the command-line, then you have no problem. If it's used as a web server, you might have a problem, but I'm trying to figure out a good solution. Two

Re: [Languagetool] Configuration redesign

2012-05-24 Thread Marcin Miłkowski
W dniu 2012-05-24 00:29, Daniel Naber pisze: On Mittwoch, 23. Mai 2012, Jan Schreiber wrote: I agree with Daniel that that sounds suspiciously like too many clicks, but otoh the 1300+ German rules are very difficult to navigate through in the GUI without folding. The French ruleset is even

Re: [Languagetool] How to use the url feature with LanguageTool from command line?

2012-05-27 Thread Marcin Miłkowski
W dniu 2012-05-27 11:35, Jan Schreiber pisze: However, when I use LanguageTool in the command line, I don't see anything about the URL information: For all I know, the URL is displayed in LibreOffice only. I just added it to the commandline interface. BTW, we might need to add a new

Re: [Languagetool] multi-word spellcheck?

2012-05-28 Thread Marcin Miłkowski
W dniu 2012-05-28 08:33, Ruud Baars pisze: In Dutch, there are lots of errors because of typing the space too early or late. Spell checking gets most of them, because one of the words is incorrect then. But suggestions are worthless. Even worse, it happens that both words are technically

Re: [Languagetool] Idea to facilitate debugging disambiguation rules

2012-05-28 Thread Marcin Miłkowski
That requires some additions to the AnalyzedToken and multiple other places. But agreed, very useful. Will think of it. 28-05-2012 13:24 użytkownik Jaume Ortolà i Font jaumeort...@gmail.com napisał: I have the same difficulty with the disambiguation rules in Catalan. The proposed solution would

Re: [Languagetool] How to use the url feature with LanguageTool from command line?

2012-05-29 Thread Marcin Miłkowski
W dniu 2012-05-30 00:07, Dominique Pellé pisze: Dominique Pellé dominique.pe...@gmail.com wrote: Marcin Miłkowski wrote: W dniu 2012-05-27 11:35, Jan Schreiber pisze: However, when I use LanguageTool in the command line, I don't see anything about the URL information: For all

Re: [Languagetool] hunspell vs. fsa spell

2012-05-30 Thread Marcin Miłkowski
W dniu 2012-05-30 11:38, Ruud Baars pisze: Sounds good. I think the smart compound detection of Hunspell would make it possible to generate a words list. Hard to get all these words checked and tagged though. I will see. Takes more time than available. I will take care of the process of

Re: [Languagetool] disable ARTICLE_MISSING

2012-06-01 Thread Marcin Miłkowski
Hi again, Most of the false alarms, by the way, on the first page, are not caused by this rule but by inappropriate POS tagging or tokenization. I will have a look. I just (hopefully) fixed most of the false alarms. Daniel, could you re-run the English rules on Wikipedia on our community

Re: [Languagetool] OutOfBoundsException using Unification

2012-06-01 Thread Marcin Miłkowski
Hi, W dniu 2012-06-01 11:47, Jaume Ortolà i Font pisze: Hi, When using 'unification', I sometimes find 'out of bounds' errors, and I think this could be easily prevented. The next rule and other similar rules seem to work fine with the examples. But when I test it with the Wikipedia corpus,

Re: [Languagetool] Country variants (was: Initial hunspell support)

2012-06-01 Thread Marcin Miłkowski
, Marcin Miłkowski pisze: I guess if every maintainer needs to add the files, a lot of languages won't have support for spell checking in 1.8... Is this something you could do for all languages? How large will the LT zip become then anyway? I could, of course. This is almost trivially easy. But I

Re: [Languagetool] disable ARTICLE_MISSING

2012-06-02 Thread Marcin Miłkowski
W dniu 2012-06-01 17:22, Daniel Naber pisze: On Freitag, 1. Juni 2012, Marcin Miłkowski wrote: Daniel, could you re-run the English rules on Wikipedia on our community site for me to see if it really helped? If not, it'll be wiser to disable the rule by default. done: http

Re: [Languagetool] disable ARTICLE_MISSING

2012-06-02 Thread Marcin Miłkowski
W dniu 2012-06-02 13:08, Daniel Naber pisze: On Samstag, 2. Juni 2012, Marcin Miłkowski wrote: I also had a look at the results on the Brown corpus, most of them are false alarms. It's best to disable the rule by default. I just did. Thanks for hunting down the false alarms. I fixed a lot of false

[Languagetool] New feature: suppress misspelled suggestions

2012-06-03 Thread Marcin Miłkowski
Hi all, Now it is possible to suppress misspelled suggestions altogether in XML rules by applying an attribute suppress_misspelled="yes" on the suggestion element, AND on the match element. If only the match element has this attribute set to yes, then the suggestion is displayed, but no content of

Re: [Languagetool] How to enable spellchecking?

2012-06-04 Thread Marcin Miłkowski
W dniu 2012-06-04 11:49, Dominique Pellé pisze: Marcin Miłkowski list-addr...@wp.pl wrote: Bad news: I just tried the Breton version of LanguageTool with Hunspell. Something is wrong: it highlights as mistakes all words containing apostrophes ' or dashes -. Such words are very frequent in

Re: [Languagetool] How to enable spellchecking?

2012-06-04 Thread Marcin Miłkowski
W dniu 2012-06-04 19:16, Daniel Naber pisze: On Montag, 4. Juni 2012, Marcin Miłkowski wrote: There is no command line for hunspell library. It's not executed as standalone program. Ubuntu comes with such a program: hunspell -d src/resource/nl/hunspell/nl_NL /tmp/test.txt It also

Re: [Languagetool] How to enable spellchecking?

2012-06-05 Thread Marcin Miłkowski
W dniu 2012-06-05 12:48, Jaume Ortolà i Font pisze: Hi, In Catalan there is exactly the same problem you described here. 2012/6/4 Marcin Miłkowski list-addr...@wp.pl W dniu 2012-06-03 23:53, Dominique Pellé pisze: Bad news: I just tried the Breton

Re: [Languagetool] How to enable spellchecking?

2012-06-06 Thread Marcin Miłkowski
W dniu 2012-06-06 09:22, Dominique Pellé pisze: Marcin Miłkowski list-addr...@wp.pl wrote: Well, this was my main reason for not adding the dictionary: I did not know whether the US version or the GB version should be the default. If we default to British or American English, we should, IMHO,

Re: [Languagetool] problems with unification

2012-06-06 Thread Marcin Miłkowski
W dniu 2012-06-05 23:01, Daniel Naber pisze: On Montag, 4. Juni 2012, Jaume Ortolà i Font wrote: I don't fully understand the Unifier.java code. The relation between negated and non-negated unification seems straightforward, but the results are bizarre. Any help? I cannot directly help, but

Re: [Languagetool] LibreOffice 3.5.4 released

2012-06-07 Thread Marcin Miłkowski
W dniu 2012-06-07 20:08, Piotr pisze: Thanks Marcin. Can I have both versions of Java on the same computer? After all I will point LO to just one of them. Yes, you can. Regards, Marcin On Thu, Jun 7, 2012 at 7:55 PM, Marcin Miłkowski list-addr...@wp.pl wrote

Re: [Languagetool] Idea to facilitate debugging disambiguation rules

2012-06-13 Thread Marcin Miłkowski
W dniu 2012-06-13 08:29, Juan Martorell pisze: When debugging in command line mode, if the sentence matches a rule then you get no disambiguator log. Is it easy to change that behaviour? You should get it anyway, below POS tags. Are you using -v? Regards, Marcin

[Languagetool] Rule configuration and simplifying interface

2012-06-14 Thread Marcin Miłkowski
Hi all, we need to make configuration simpler. While I like Daniel's idea to suppress the config dialog altogether and display only disabled rules at the bottom of the dialog box, I think we could have something better. Now, if you look at most word processors, they offer a simplified view of

[Languagetool] Hunspell performance

2012-06-14 Thread Marcin Miłkowski
Hi all, I performed some further tests on Hunspell Rule. First, I disabled suggestions for Polish. This gave a huge speedup: from 56 sentences per second, I got around 770 sentences per second. Then, I tested English (American Dictionary). With suggestions, I got 71 sentences per second,

[Languagetool] Hunspell performance and a new speller...

2012-06-14 Thread Marcin Miłkowski
Hi all, I performed some further tests on Hunspell Rule. First, I disabled suggestions for Polish. This gave a huge speedup: from 56 sentences per second, I got around 770 sentences per second. Then, I tested English (American Dictionary). With suggestions, I got 71 sentences per second,

[Languagetool] Hunspell performance and a new speller...

2012-06-14 Thread Marcin Miłkowski
Somehow this did not reach the list twice in a row, so I'm resending it... Hi all, I performed some further tests on Hunspell Rule. First, I disabled suggestions for Polish. This gave a huge speedup: from 56 sentences per second, I got around 770 sentences per second. Then, I

Re: [Languagetool] Rule configuration and simplifying interface

2012-06-14 Thread Marcin Miłkowski
W dniu 2012-06-14 17:19, Daniel Naber pisze: On Donnerstag, 14. Juni 2012, Marcin Miłkowski wrote: we need to make configuration simpler. While I like Daniel's idea to suppress the config dialog altogether and display only disabled rules at the bottom of the dialog box, I think we could have

Re: [Languagetool] Hunspell rule and OO/LibreOffice

2012-06-15 Thread Marcin Miłkowski
W dniu 2012-06-12 22:16, Daniel Naber pisze: Hi, we don't need the hunspell rule in OO/LO and it doesn't seem to be active - so that's okay. Is it only inactive because the language is detected as English as opposed to American English etc? Wouldn't it be more robust to explicitly disable

Re: [Languagetool] Hunspell performance and a new speller...

2012-06-15 Thread Marcin Miłkowski
W dniu 2012-06-15 09:31, Ruud Baars pisze: marcin, I am assuming Hunspell is only called when a word is 'UNKNOWN', right? If not, that might be a change worth trying. No. It's called every time. The problem is that for many languages, the taggers do contain rare forms. Note that the

Re: [Languagetool] cfsa2 dictionary

2012-06-22 Thread Marcin Miłkowski
Hi, when compiling the dictionary using morfologik-tools, you simply use -f cfsa2 (just add it to the fsa_build command call). I will update the wiki soon. Regards, Marcin W dniu 2012-06-21 07:34, Dominique Pellé pisze: Hi I see that some dictionaries have been re-encoded using cfsa2 to

[Languagetool] Wordlists for spellers?

2012-06-22 Thread Marcin Miłkowski
Hi all, considering that hunspell is so slow and that for many languages, we already have wordlists (for example, there are wordlists for all variants of English, and all aspell dictionaries may be converted to word lists), we might replace all hunspell dictionaries with morfologik-speller

Re: [Languagetool] Wordlists for spellers?

2012-06-22 Thread Marcin Miłkowski
Ruud, what about aspell-nl? Is that the same? Or bigger? Marcin W dniu 2012-06-22 12:34, R.J. Baars pisze: There is a words list for Dutch at www.opentaal.org. Relatively small, just 450.000 words, including proper names, but it is the best open one there is. To be exact:

Re: [Languagetool] Wordlists for spellers?

2012-06-22 Thread Marcin Miłkowski
. It works quite fast here. Marcin W dniu 2012-06-22 17:18, Jan Schreiber pisze: We could try my wordlist (1.3 million word forms) for German if that speeds up the process: http://sourceforge.net/projects/germandict/ --Jan Am 22.06.2012 11:09, schrieb Marcin Miłkowski: Hi all

Re: [Languagetool] Wordlists for spellers?

2012-06-22 Thread Marcin Miłkowski
OK, got it. Now, the file contains numerous .txt files and one .csv file. I understand that the complete speller dictionary should contain the words from all .txt files? And - to be sure - do you want me to replace hunspell for NL with the wordlist-based dictionary? I'm not sure if we have any

Re: [Languagetool] Hunspell spellcheck performance

2012-06-22 Thread Marcin Miłkowski
W dniu 2012-06-22 22:43, Daniel Naber pisze: On Freitag, 22. Juni 2012, Dominique Pellé wrote: One way to speed this up is to cache the last N misspelled words and their suggestions. So when we encounter the same misspelled word again, we can quickly find in cache its suggestions without
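The caching idea quoted here (remember the last N misspelled words and their suggestions) maps naturally onto a small least-recently-used cache. A minimal sketch using only the JDK follows; the class name `SuggestionCache` and the capacity are assumptions for illustration, not code from LanguageTool.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical LRU cache: misspelled word -> suggestions. Not thread-safe. */
final class SuggestionCache extends LinkedHashMap<String, List<String>> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    SuggestionCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
        // Evict the least recently used entry once the cache grows past N.
        return size() > maxEntries;
    }
}
```

A caller would look the word up with `get` first and only invoke the slow suggestion routine on a miss, storing the result with `put`. Since a server may check texts from several threads, the cache would need external synchronization there.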

Re: [Languagetool] Hunspell spellcheck performance

2012-06-23 Thread Marcin Miłkowski
it for German, it's irrelevant anyway Regards, Marcin 2012/6/23 Daniel Naber list2...@danielnaber.de On Samstag, 23. Juni 2012, Marcin Miłkowski wrote: Sounds like a bug to me. Do you get any suggestion for, say, ser gutt? I got some. M. I see now: it only works for words with one error

Re: [Languagetool] Hunspell spellcheck performance

2012-06-24 Thread Marcin Miłkowski
the relative word frequency might help in suggesting the words in the correct order. Yes, but for this, I'd need a frequency list. A wikipedia-based one might be skewed because of the high level of occurrence of proper names. Marcin Ruud Op 23-06-12 12:42, Marcin Miłkowski schreef: Yes

Re: [Languagetool] Wordlists for spellers?

2012-06-24 Thread Marcin Miłkowski
W dniu 2012-06-23 20:51, Jan Schreiber pisze: I'm fine with that, if it works fast enough. Hunspell is no doubt the best free spell checker around for German. Just a thought - why not have hunspell for checking and MorfologikSpeller (with edit distance = 2) for suggestions? It's a bit of a
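To make "suggestions with edit distance = 2" concrete, here is a brute-force sketch that computes Levenshtein distance against every entry of a word list. This is a swapped-in illustration only: it is not how MorfologikSpeller works internally (that operates on an automaton-encoded dictionary), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical brute-force suggester over a plain word list. */
final class EditDistanceSuggester {

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns dictionary words within maxDistance edits, in dictionary order. */
    static List<String> suggest(String word, List<String> dictionary, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (String candidate : dictionary) {
            if (levenshtein(word, candidate) <= maxDistance) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

In the split proposed in the message, the checker (hunspell) would decide whether a word is wrong, and only then would a routine like `suggest(word, dictionary, 2)` be asked for candidates.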

Re: [Languagetool] Wordlists for spellers?

2012-06-24 Thread Marcin Miłkowski
W dniu 2012-06-24 10:49, Daniel Naber pisze: On Sonntag, 24. Juni 2012, Marcin Miłkowski wrote: Just a thought - why not have hunspell for checking and MorfologikSpeller (with edit distance = 2) for suggestions? Thought about this too. What would the MorfologikSpeller then use

Re: [Languagetool] Wordlists for spellers?

2012-06-24 Thread Marcin Miłkowski
W dniu 2012-06-24 13:59, Daniel Naber pisze: On Sonntag, 24. Juni 2012, Marcin Miłkowski wrote: Okay. For me, having no suggestions for some languages seems okay for 1.8, but if you want to do that, go ahead. You mean without the trick to do the compounding suggestions as well? Or including

Re: [Languagetool] Wordlists for spellers?

2012-06-24 Thread Marcin Miłkowski
W dniu 2012-06-24 13:59, Daniel Naber pisze: On Sonntag, 24. Juni 2012, Marcin Miłkowski wrote: Okay. For me, having no suggestions for some languages seems okay for 1.8, but if you want to do that, go ahead. You mean without the trick to do the compounding suggestions as well? Or including

[Languagetool] SimpleReplaceRule as spelling rule?

2012-06-29 Thread Marcin Miłkowski
Hi all, shouldn't we also make SimpleReplaceRule a spelling rule and display its results underlined in red? Regards, Marcin

Re: [Languagetool] Release LanguageTool 1.8 in progress

2012-07-01 Thread Marcin Miłkowski
W dniu 2012-06-30 22:01, Daniel Naber pisze: On Samstag, 30. Juni 2012, Marcin Miłkowski wrote: it works for me under linux when I move <jar href="hunspell-linux-i386.jar"/> Works for me, too - please everybody test whether the "Start LanguageTool" link on http://www.languagetool.org works

Re: [Languagetool] Breton speller (and general tokenization issue)

2012-07-01 Thread Marcin Miłkowski
tokenization. You suggested that we use it for word tokenization. I understand there are consequences all over the tool range. No, not at all. Marcin Ruud Op 01-07-12 18:59, Marcin Miłkowski schreef: Hi Dominique, and all, I have started conversion of the Breton speller to the MorfologikSpeller

Re: [Languagetool] Breton speller (and general tokenization issue)

2012-07-01 Thread Marcin Miłkowski
W dniu 2012-07-01 22:53, Dominique Pellé pisze: Marcin Miłkowski list-addr...@wp.pl wrote: Hi Dominique, and all, I have started conversion of the Breton speller to the MorfologikSpeller format, and thanks to the new support of UTF-8, it was successful. But there is one outstanding issue

Re: [Languagetool] ICU and LanguageTool?

2012-07-07 Thread Marcin Miłkowski
Hi Nathan, W dniu 2012-07-04 14:47, Nathan Wells pisze: LibreOffice LibO Master 3.6 incorporates automatic linebreaking for the Khmer language using the latest version of ICU (see: http://bugs.icu-project.org/trac/ticket/8329) I'm confused. Do you mean *line* breaking or *word* breaking?

Re: [Languagetool] filtering XML

2012-08-11 Thread Marcin Miłkowski
W dniu 2012-08-10 21:34, Daniel Naber pisze: Hi, a bug report[1] complains about incorrect position information and concludes that it's caused by filtering XML. Does anybody know why we should filter XML? I tend to agree with the report that this filtering should happen outside of

Re: [Languagetool] What characters cause tokenization?

2012-08-13 Thread Marcin Miłkowski
Hello, W dniu 2012-08-13 12:17, Mike Unwalla pisze: Hello, Examples of characters that cause tokenization: space . ! { } [ / Examples of characters that do not cause tokenization: # $ % ^ _ + = * @ ~ I looked on languagetool.wikidot.com and on http://www.languagetool.org/, but I did

Re: [Languagetool] What characters cause tokenization?

2012-08-13 Thread Marcin Miłkowski
W dniu 2012-08-13 20:57, Daniel Naber pisze: Am Mo 13.08.2012, 20:07:26 schrieb Dominique Pellé: I would also split words with at least the backticks ` and pipe |. I don't really see disadvantages in not splitting at those characters. I don't see a problem with that either, so feel free to make

Re: [Languagetool] path changes in SVN

2012-09-05 Thread Marcin Miłkowski
W dniu 2012-09-04 23:31, Daniel Naber pisze: On 04.09.2012, 10:38:55 Jaume Ortolà i Font wrote: Hi Jaume, After the path changes, I made a clean checkout. I'm using Windows 7 and Eclipse and I have found some problems: thanks, both issues should be fixed now. Please let me know if there's

Re: [Languagetool] vote about logo proposals

2012-09-18 Thread Marcin Miłkowski
W dniu 2012-09-17 22:22, Daniel Naber pisze: Hi, I'm glad to announce that we have three very nice proposals for a new LanguageTool logo! See them at http://www.danielnaber.de/tmp/lt-logo-proposals.png Note: please do not spread or use these logos yet: as long as we haven't made a

Re: [Languagetool] vote about logo proposals

2012-09-19 Thread Marcin Miłkowski
W dniu 2012-09-18 23:06, Daniel Naber pisze: On 18.09.2012, 20:05:24 gulp21 wrote: Proposal #1 reminds me of a map api. Proposal #2B is a bit overcrowded, thus I think that proposal #2A is the best one. But I also agree with Nathan that it looks more like a logo for a translation tool. As

Re: [Languagetool] info request

2012-09-30 Thread Marcin Miłkowski
Hi, W dniu 2012-09-30 20:00, Mauro Condarelli pisze: Hi, thanks. Comments below. On 30/09/2012 11:52, Marcin Miłkowski wrote: W dniu 2012-09-30 05:05, Mauro Condarelli pisze: Hi, I'm coding my test app with LT and I'm almost ready to become productive. I need a deeper insight in two

Re: [Languagetool] are our libraries up-to-date?

2012-10-01 Thread Marcin Miłkowski
On 2012-10-01 00:09, Daniel Naber wrote: On 30.09.2012, 11:31:54 Marcin Miłkowski wrote: I'm not sure about morfologik* libraries. There might be an unreleased version in our code (I fixed a bug with UTF-8 but 1.5.4 was not released yet). Could we release the Maven version of LT with

Re: [Languagetool] info request

2012-10-02 Thread Marcin Miłkowski
On 2012-10-02 12:56, Mauro Condarelli wrote: On Sunday, 30 September 2012 20:30:17, Marcin Miłkowski wrote: I would also like to know if it's possible to add further information to the words (e.g.: pointers to synonyms and antonyms). This is not part-of-speech information, so it does

Re: [Languagetool] getting fsa operational

2012-11-03 Thread Marcin Miłkowski
On 2012-10-22 21:11, Dominique Pellé wrote: Daniel Naber list2...@danielnaber.de wrote: On 16.10.2012, 16:52:26 Ruud Baars wrote: By the way, I read the instruction in the wiki, but these are quite complex (for me). I think you can ignore everything not related to Java - both exporting

Re: [Languagetool] Performance

2012-11-16 Thread Marcin Miłkowski
On 2012-11-15 22:58, Daniel Naber wrote: Hi, the load test I did for our HTTPS service showed that we have kind of a performance problem for languages that have a lot of rules. Testing a random German text with 125 sentences takes 2 seconds on my machine on average. About half of that

Re: [Languagetool] Performance

2012-11-16 Thread Marcin Miłkowski
On 2012-11-16 09:28, R.J. Baars wrote: Hi, the load test I did for our HTTPS service showed that we have kind of a performance problem for languages that have a lot of rules. Testing a random German text with 125 sentences takes 2 seconds on my machine on average. About half of that time

Re: [Languagetool] Performance

2012-11-16 Thread Marcin Miłkowski
On 2012-11-16 10:10, R.J. Baars wrote: Hi, the load test I did for our HTTPS service showed that we have kind of a performance problem for languages that have a lot of rules. Testing a random German text with 125 sentences takes 2 seconds on my machine on average. About half of that time

Re: [Languagetool] Performance

2012-11-16 Thread Marcin Miłkowski
On 2012-11-16 16:17, Dominique Pellé wrote: Marcin Miłkowski list-addr...@wp.pl wrote: Hm, just a stupid question, but did you remember that loading rules is also quite expensive? We have to parse XML and compile regexes, and read from disk. I'm hoping that XML parsing and loading other

Re: [Languagetool] Performance

2012-11-17 Thread Marcin Miłkowski
On 2012-11-16 23:51, Daniel Naber wrote: On 16.11.2012, 15:24:16 Marcin Miłkowski wrote: Hm, just a stupid question, but did you remember that loading rules is also quite expensive? We have to parse XML and compile regexes, and read from disk. Loading rules takes about 130ms once

Re: [Languagetool] Performance

2012-11-18 Thread Marcin Miłkowski
On 2012-11-17 23:03, Daniel Naber wrote: On 17.11.2012, 17:17:56 Marcin Miłkowski wrote: Do you have stats per call of the method? It may be more informative. See attached file, is that what you mean? It's after checking three or so German documents. This is what I mean. The method

Re: [Languagetool] Performance

2012-11-19 Thread Marcin Miłkowski
On 2012-11-19 01:12, Dominique Pellé wrote: Marcin Miłkowski list-addr...@wp.pl wrote: On 2012-11-18 22:48, Daniel Naber wrote: On 18.11.2012, 21:33:32 Marcin Miłkowski wrote: The easiest way to see whether this could speed things up at all is change the line 162

Re: [Languagetool] Performance

2012-11-19 Thread Marcin Miłkowski
On 2012-11-19 11:39, Dominique Pellé wrote: Marcin Miłkowski wrote: In Breton at least, I experience sometimes a combinatorial explosion of rules in order to implement what I want. Of course many rules probably slow down. I don't think that having extra 16 rules changes

Re: [Languagetool] Performance

2012-11-19 Thread Marcin Miłkowski
On 2012-11-19 18:35, Daniel Naber wrote: On 19.11.2012, 10:39:33 Marcin Miłkowski wrote: Yeah, but to reuse the pattern, we would have to build a finite-state machine in memory (or on disk) first. This is far from trivial because we would have to flatly encode all features (token, pos

Re: [Languagetool] Error with French POS dictionary (french.dict) created by morfologik 1.5.2 (OK with morfologik 1.4.0)

2012-11-22 Thread Marcin Miłkowski
On 2012-11-21 23:28, Dominique Pellé wrote: Hi I see that my script to produce the French POS tag dictionary still uses morfologik 1.4.0. I just tried to update it to use morfologik 1.5.2. The script creates the dictionary french.dict without error, but the created dictionary with

Re: [Languagetool] Performance

2012-11-22 Thread Marcin Miłkowski
On 2012-11-22 08:14, Dominique Pellé wrote: Jan Schreiber wrote: Ruud Baars wrote: Suppose there is a slow/expensive rule, that almost never matches to warn about some low-priority issue; maybe it is worth switching it off then. That looks like a good idea