Re: [en] postags for 'haven'

2016-07-29 Thread Marcin Miłkowski
On 29.07.2016 at 13:57, Mike Unwalla wrote:
> Marcin, thanks.
>
> Except for the removal of VBP, I decided to make no changes to the
> disambiguation at this time, for these reasons:
>
> 1. In disambiguation.xml, if I remove the readings MD and VBP from 'haven'
> when it is not MD (that is, when it is not part of "haven't"), my problem is
> not solved. This suggestion:
>  <suggestion><match no="1" postag_regexp="yes" postag="(V.*)"
>  postag_replace="$1">have</match></suggestion>
> shows 'haven' in the list of suggestions.

Right. I adapted the file filter-archaic.txt to remove this (I already 
remove other contractions there). The file is in the resources folder.

>
> (Aside: I could not see how to remove the readings using only 1 token. But
> I made a test rule.)

You can easily do that in many ways. For example, by making the token 
have just one reading. This comes from the Polish file:

 <rule id="MIMO" name="mimo">
   <pattern>
     <token postag="prep:gen">mimo</token>
   </pattern>
   <disambig action="filter" postag="prep:gen"/>
 </rule>

I required "mimo" to have the POS "prep:gen", but of course that was 
part of a larger sequence.


>
> 2. I found a few examples of 'haven' as a verb on the NOW Corpus (News on
> the Web) (http://corpus.byu.edu/now/). Example, "Commodities Traders flocked
> to haven assets Friday, with gold jumping almost five per cent."
>
> The simplest solution to my problem is to use a rulegroup in grammar.xml.
> One rule contains
>   <suggestion>have</suggestion>
> rather than
>   <suggestion><match no="1" postag_regexp="yes" postag="(V.*)"
>   postag_replace="$1">have</match></suggestion>


Yes, but this one should have the base form 'haven', not 'have'. And 
indeed, there's 'haven' as a verb in many dictionaries, and in Keats:

haven
/ˈheɪv(ə)n/
v. LME. [f. the n.]
† v.i. Go into or shelter in a haven. LME–E17.
  v.t. Put (a ship etc.) into a haven. Now chiefly fig., give shelter 
to, protect. E17. Quotation:
  KEATS Blissfully haven'd both from joy and pain.

>

Regards,
Marcin

> Regards,
>
> Mike
>
> -----Original Message-----
> From: Marcin Milkowski [mailto:list-addr...@wp.pl]
> 
>
> I'd use the second method because this is what I did with other similar
> cases. It's mostly because I used to write disambiguation files to
> remove readings rather than to add them. But either way will do.
>
> BTW: VBP is most definitely wrong, as 'have' cannot be negated as a
> normal verb by using a contraction.
>
> Best,
> Marcin
>
>
>




Re: [en] postags for 'haven'

2016-07-27 Thread Marcin Miłkowski
On 25.07.2016 at 16:28, Mike Unwalla wrote:
> Hello,
>
> The word 'haven' has these postags: NN, MD, VBP.
>
> When the word is not a verb (as in "the village is a haven of tranquility"), 
> I want to remove MD and VBP. Postag VBP causes 'haven' to appear as a 
> suggestion when I use postag_replace with the verb 'have'.
>
> I can think of 2 methods to remove the unwanted postags, but I am not sure 
> about the best method:
> 1) Remove MD and VBP using removed.txt. Change the rules in 
> disambiguation.xml to apply MD and VBP when 'haven' is used as part of a verb 
> (as in "I haven't a clue what to do").
> 2) Leave the postags as they are. Add a rule in disambiguation to remove MD 
> and VBP when 'haven' is used as a noun.
>
> I think that the first option is better, because 'haven' is not a verb. It is 
> part of a verb when it occurs in "haven't".
>
> What do you think is the best method, and why?

I'd use the second method because this is what I did with other similar 
cases. It's mostly because I used to write disambiguation files to 
remove readings rather than to add them. But either way will do.

BTW: VBP is most definitely wrong, as 'have' cannot be negated as a 
normal verb by using a contraction.
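
For illustration, the second method could be a disambiguation rule 
roughly like this (a minimal sketch, assuming "haven't" comes out of the 
tokenizer as 'haven' + ''t' and that the filter action accepts a postag 
regex; the rule id is made up):

<rule id="HAVEN_NOUN" name="haven is a noun unless followed by 't">
  <pattern>
    <marker><token>haven</token></marker>
    <token negate="yes">'t</token>
  </pattern>
  <!-- keep only the noun readings; MD and VBP are dropped -->
  <disambig action="filter" postag="NN.*"/>
</rule>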

Best,
Marcin



Names of rule groups are now required and non-empty

2016-04-23 Thread Marcin Miłkowski
Hi all,

I made one change in rule syntax today. The name attribute was optional 
until today, but that was inconsistent with configuration: neither the GUI, 
nor the command line, nor even the API allows disabling just one rule 
from a rule group. You disable the whole group at the same time. But in 
some languages, rules in a group had separate names and were listed 
together, with no warning that disabling one of them in the Options 
window actually also disables all the other rules in the group 
(displayed separately). This pertains to the following languages:

- French,
- Galician,
- Icelandic,
- Romanian.

To make the behavior consistent, I fixed this by requiring that rule 
groups and categories have a non-empty "name" attribute. That also 
required setting some rule names as the rule group name. See the 
changes here:

https://github.com/languagetool-org/languagetool/commit/2fdaa2f61f47d5fef3e4202498042516f468ac95

I don't think this is a controversial change as it was plainly a bug. If 
you want to have a rule that is configurable separately, simply move it 
out of the rulegroup.
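
For illustration, the required shape is now (the id, name, and rules 
below are made up):

<rulegroup id="EXAMPLE_GROUP" name="One name shown for the whole group">
  <!-- disabling this name in the Options dialog disables both rules -->
  <rule>
    <pattern><token>foo</token></pattern>
    <message>Example message.</message>
  </rule>
  <rule>
    <pattern><token>bar</token></pattern>
    <message>Example message.</message>
  </rule>
</rulegroup>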

Best regards,
Marcin



Re: Roadmap for Spanish

2016-04-07 Thread Marcin Miłkowski
On 06.04.2016 at 14:57, Juan Martorell wrote:
> On 5 April 2016 at 16:29, Jaume Ortolà i Font wrote:
>
>
> 2014-06-06 20:45 GMT+02:00 Juan Martorell:
>
>
> *1st and foremost: disambiguator:*
>
> My current strategy for disambiguation is to start with the longer
> constructions and then work down to the two-token
> constructions. Positive and negative examples should be included.
>
>
> I can point out some strategies for disambiguation.  I will try to
> make a summary.
>
>
> That's a great opportunity for Wiki improvement!

I added some today:

http://wiki.languagetool.org/developing-a-disambiguator

Best regards,
Marcin



Re: Roadmap for Spanish

2016-04-06 Thread Marcin Miłkowski
On 06.04.2016 at 14:55, Juan Martorell wrote:
>
>
> On 4 April 2016 at 19:28, Marcin Miłkowski <list-addr...@wp.pl> wrote:
>
> Hi,
>
> On 03.04.2016 at 12:46, Juan Martorell wrote:
> > I realized this because every rule
> > I added introduced a new regression, sometimes making things much
> > worse than before.
>
> I could try to help to avoid this with some tricks with disambiguation.
>
>
> Please update the wiki so everyone can benefit from them.

Heh, the trouble is that it's mostly implicit knowledge. I'll try to 
write up some strategies.

>
>
> > I therefore blame the dictionary for all this; no good
> > disambiguation can be done without decent tagging. I am tired of
> > waiting for someone else to volunteer: every time someone shows up, she
> > seems intimidated by the task and eventually loses interest.
>
> What's the problem with the dictionary?
>
> (1) It assigns too many POS tags, making disambiguation difficult.
>
> (2) It lacks important POS tags, so disambiguation cannot help.
>
> If (1), I really can help by writing up some methods I have found for
> Polish and English. I can read Spanish, so this should be fairly easy.
>
>
> Mostly (2). Freeling tends to keep the lexical roots and calculate the
> inflections, so the dictionary is rather incomplete.

That means you should expand the dictionary a lot, IMHO.


> To transform an adjective into an adverb, in English you use the suffix
> `-ly` and in Spanish you use the suffix `-mente`:
>
> Equal --> equally
> Igual --> igualmente
>
> I found 18340 candidates for suffixation in the Spanish dictionary for
> this particular case.

I'd add them to the dictionary. Why? Because these things might be false 
alarms, and removing them later by hand might be easier.
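
For illustration, generating the entries could mean just emitting 
tab-separated plain-text lines (assuming the three-column form/lemma/tag 
input used for building LT tagger dictionaries, and a Freeling-style RG 
tag for adverbs):

igualmente	igualmente	RG
claramente	claramente	RG
lentamente	lentamente	RG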

>
> Same for diminutives, augmentatives and superlatives. Depending on the
> region these may vary, but if you want to be fully inclusive
> <https://es.wikipedia.org/wiki/Diminutivo>, you have to include 17
> diminutives, both genders; 9 augmentatives, both genders; 1 superlative,
> both genders, excluding the irregular forms
> <https://es.wikipedia.org/wiki/Superlativo>. They apply to the same
> ~18000 candidates. They are widely used in writing, so it is worth
> including them.

Well, I don't really tag diminutives (except for the most frequent ones, 
which are already included in the dictionary) and it doesn't really hurt.

> It is quite common to attach some pronouns to the verb, thus including
> information about the direct and/or indirect object, or passive/impersonal
> voice. The combinations are huge; some examples:
>
> infinitive + pronoun as DO; example: from subir: subirme,
> subirte, subirse, subirlo, subirla, subirnos,
> subiros, subirse, subirlos, subirlas.
> infinitive + pronoun as IO + pronoun as DO; example: from subir:
> subírteme, subírseme, subírmete, subírsete,
> subírsenos, subírnoslos, subíroslas, etc.
> imperative + pronoun as DO; example: from subir: súbeme, subidme,
> súbame, súbanme; súbete, subíos, súbase, súbanse,
> etc.
> imperative + pronoun as IO + pronoun as DO; example: from subir:
> súbemelo, súbetelo, súbeteme, subíoslas, etc.

I'd go for Jaume's strategies with Catalan. They are probably exactly 
suited to your situation.

I would tokenize this internally if it doesn't lead to any ambiguity 
(you don't need a space to tokenize). I don't do this for Polish as we 
have lots of ambiguities: "miałem" might be the past of "mieć" (the past 
being "miał" + "em", where "em" is the agglutinate for the first person 
singular), but it's also the instrumental of the noun "miał", which 
shouldn't be tokenized. We have a stream of tokens, and we would need to 
replace it with a graph (one edge for "miałem", another for "miał" + 
"em"), which is not exactly the nicest thing to play with. So I don't 
tokenize but have a non-tokenized list hardcoded in the dictionary.

>
> Gerund accepts the same derivation.
>
> These derivations are enough on their own to justify some automation:
> so far 18000 adjectives * (1 adverb + 17 * 2 diminutives + 9 * 2
> augmentatives + 1 superlative) =~ 972000 words to include in the
> dictionary.
> If you add all the pronominal derivations: 7654 verbs * (10 IPasDO +
> 6!/(2!(6-2)!) IPasIODO) * 3 verbal tenses = 7654 * (10 + 15) * 3 =~
> 574000 words to include in the dictionary.
>
> It makes a total of approx. 1.5 million words to include, excluding the
> America

Re: Preventing inflections in suggestions

2016-03-12 Thread Marcin Miłkowski
On 12.03.2016 at 05:02, Andriy Rysin wrote:
> Hi all
>
> I have some word forms (colloquial forms of the verbs) I would like to
> tag and recognize but I don't want them to show up in suggestions. I
> found that I can remove the tags I don't want from the list when
> building synthesizer but I am wondering if that's the right way to do
> it (I did google and search our wiki quickly but nothing popped up).

I remove archaic forms for English and Polish words altogether. You're 
right, removing individual forms from the synthesizer is the easiest way 
(not to mention it will be computationally cheap).

I believe I also did this for some tags (not sure right now, maybe I 
forgot, ooops).

Regards,
Marcin



Re: Create rule - comma before word(s)

2016-03-03 Thread Marcin Miłkowski
On 03.03.2016 at 19:06, Marco A.G.Pinto wrote:
> Hello!
>
> The following Portuguese words require a comma before them:
> 1) Eu gosto muito de chocolate, *mas *não posso comer para não engordar.
> 2) Eu gosto muito de chocolate, *porém *não posso comer para não engordar.
> 3) Eu gosto muito de chocolate, *contudo *não posso comer para não engordar.
> 4) Eu gosto muito de chocolate, *no entanto* não posso comer para não
> engordar.
> 5) Eu gosto muito de chocolate, *entretanto *não posso comer para não
> engordar.
> 6) Eu gosto muito de chocolate, *todavia *não posso comer para não engordar.
>
>
> Is there a simple way of adapting the rule Yakov helped me with the
> other day?
>
>  <rule id="OU_SEJA" name="ou seja">
>   <pattern>
>    <token>ou</token>
>    <token>seja</token>
>    <token><exception>,</exception></token>
>   </pattern>
>   <message>Usar vírgula: <suggestion>\1 \2,</suggestion></message>
>   <example type="incorrect">Pensa primeiro, <marker>ou
> seja</marker> escolhe acertadamente.</example>
>  </rule>
>
>
> PS-> Notice that in example 4) it is two words, not one.

Actually, I would write it in a slightly more general way:

<pattern>
 <token negate="yes" regexp="yes">[,;:–—\(]</token>
 <marker>
  <token>no</token>
  <token>entanto</token>
 </marker>
</pattern>

Why? Just because one usually uses an opening parenthesis, a colon, etc. 
instead of a comma in some contexts. And exceptions have a slightly 
different logic than negation, which may become tricky in some 
situations with regular expressions (probably not here, though).
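
For comparison, the exception-based variant of the same idea would be 
something like this (a sketch; note the subtly different matching logic):

<pattern>
 <token><exception regexp="yes">[,;:–—\(]</exception></token>
 <marker>
  <token>no</token>
  <token>entanto</token>
 </marker>
</pattern>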

Best,
Marcin

>
> Thanks!
>
> Kind regards,
> Marco A.G.Pinto
> ---
>
> --
>
>
>




Re: Valency dictionary and attribute [long mail]

2016-02-08 Thread Marcin Miłkowski
Hi Andriy,

On 07.02.2016 at 21:35, Andriy Rysin wrote:
> Hi Marcin
>
> I was actually thinking for something even more abstract. To adjust
> your example:
> 
>   

Re: Valency dictionary and attribute [long mail]

2016-02-02 Thread Marcin Miłkowski
On 02.02.2016 at 18:08, Andriy Rysin wrote:
> Hey Marcin
>
> this is a great addition, though I have one remark. Besides valency
> information, some other types of information could be useful too (if we
> are starting to head in this direction). E.g. I have rules in Ukrainian that
> suggest the superlative form for an adjective when "самий" (very) + base
> form is used. Currently I have the relation between base form and
> comparative/superlative forms encoded in the dictionary, but in general
> this is higher-level information that should be stored outside of the
> tag dictionary.

I would argue that in some languages (at least in Polish and English) 
this is not semantic-level information; it is grammatical, or 
morphosyntactic, information.

>
> I am wondering if we could develop a more generic approach for such
> additional (semantic) information, e.g. split each type of this info
> into a category and allow generic references in the token/exception,
> something like this:
>
> <token semantic_info="<category>:<value>"/>
>
> or even as a subelement (I assume semantic information can get pretty
> long/complicated so a child element may be a better choice and will allow
> us to add new attributes easily on it later)
>
> <semantics>
>  <sem category="<category1>" value="<value1>"/>
>  <sem category="<category2>" value="<value2>"/>
> </semantics>
>
> so in the valency case you described (1st case) it could be:
>
> <token semantic_info="valency:<value>"/>
>
> <sem category="valency" value="<value>"/>

Valency is definitely not a semantic category:

https://en.wikipedia.org/wiki/Valency_(linguistics)

But your approach seems quite elegant. I would argue that valency is one 
kind of information that should be treated as key-value:

<sem category="valency" value="np:acc"/>

This would match a verb that takes an accusative noun phrase (of course, 
the values would be defined per valency lexicon in a language). There 
are free valency lexicons for many languages besides Polish.
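
In a pattern, that could then be queried the way POS tags are queried 
today; a hypothetical sketch of what the proposal would enable (no such 
element exists yet, and the Polish tags are only illustrative):

<pattern>
 <!-- a verb whose valency frame requires an accusative noun phrase... -->
 <token postag="verb"><sem category="valency" value="np:acc"/></token>
 <!-- ...followed by a genitive noun, which would then be suspicious -->
 <marker><token postag="subst:.*:gen.*" postag_regexp="yes"/></marker>
</pattern>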

>
> Thus if we add other semantic information into LT we can use this info
> in the logic without changing the LT core.

The core XML parsing will have to be changed anyway.

Best,
Marcin

>
> Thanks
> Andriy
>
> 2016-01-28 7:30 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>> Hi all,
>>
>> To allow for better disambiguation and have better rules, I need to
>> include a valency dictionary with LT. These are dictionaries that
>> specify which grammatical cases or prepositions go with which verbs etc.
>> There are such resources for many languages that we support. And using
>> these resources, we could enrich POS tag disambiguation a lot (I'm using
>> a horribly long regular expression right now instead of a dictionary,
>> for example), and write up a lot of important rules.
>>
>> The obvious choice for representing the dictionary (which is available
>> for Polish under a fairly liberal license) is to use a finite-state lexicon
>> of the kind we normally use for taggers. The dictionary will be applied after
>> tagging, because the valency dictionary requires POS tag + lexeme
>> information. In Polish, the entries look like this:
>>
>> absurdalny: pewny: : : : {prepnp(dla,gen)}
>> absurdalny: pewny: : : : {prepnp(w,loc)}
>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}
>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(int)}
>> absurdalny: potoczny: : pred: : {prepnp(dla,gen)}+{cp(jak)}
>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(jeśli)}
>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(kiedy)}
>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(że)}
>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(żeby)}
>>
>> But for French (see http://bach.arts.kuleuven.be/dicovalence/) they are
>> paragraph-based:
>>
>> VAL$abaisser: P0 P1
>> VTYPE$  predicator simple
>> VERB$   ABAISSER/abaisser
>> NUM$10
>> EG$ il faudra abaisser la persienne
>> TR_DU$  laten zakken, neerhalen, neerlaten, doen dalen
>> TR_EN$  let down, lower
>> FRAME$  subj:pron|n:[hum], obj:pron|n:[nhum,?abs]
>> P0$ (que), qui, je, nous, elle, il, ils, on, (ça), (ceci), celui-ci, 
>> ceux-ci
>> P1$ que, la, le, les, en Q, ça, ceci, celui-ci, ceux-ci
>> RP$ passif être, se passif
>> AUX$avoir
>>
>>
>> VAL$abaisser: P0 (P1)
>> VTYPE$  predicator simple
>> VERB$   ABAISSER/abaisser
>> NUM$20
>> EG$ il a raconté cette anecdote pour m'abaisser
>> TR_DU$  vernederen, kleineren
>> TR_EN$  humiliate
>> FRAME$  subj:pron|n:[hum], ?obj:pron|n:[hum]
>> P0$ (que), qui, je, nous, elle, il, ils, on, (ça), celui-ci, ceux-ci,
>> (ça(de_inf))
>> P1$ 0, qui, te, vous, la, le, les, se réc., en Q, celui-ci, ceux-ci,
>> l'un l'autre
>> RP$ passif êtr

Re: MS Word add-in for LT

2016-02-02 Thread Marcin Miłkowski
On 30.01.2016 at 15:17, Daniel Naber wrote:
> On 2016-01-30 12:07, Jaume Ortolà i Font wrote:
>
>> My preferred approach would be to write a "configuration program" (in
>> HTML/JavaScript), with a custom form for every language. This program
>> would generate output(s) that can be used everywhere (stand-alone LT,
>> command-line LT, LibreOffice, LT server, any LT client...). What do
>> you think?  Could we plan how to do it?
>
> What about an approach that avoids the configuration dialog entirely?
> For every rule match, make it possible to disable the rule with a single
> click. At the end of the list of all matches, show the turned off rules
> so they can be enabled again with a single click. We do something like
> that in the stand-alone client, but one could do more. For example, have
> an option "turn off all rules of this category" for a match. Then show
> not only which rules/categories are turned off, but also how many
> matches these would have generated. I prefer this over a configuration
> dialog, as only the most advanced users care to open the configuration
> anyway.

This is a nice idea, but there should be a way to undo these changes. An 
inexperienced user may disable a major rule by mistake and then complain 
that it no longer works. There should be at least a "Reset settings" 
button, or a configuration box.

I agree that it doesn't have to be as complex as our original dialog. 
But it might at least enumerate all categories for a language.

One more thing: I think it's important to remember that most users won't 
be savvy enough to start the server by themselves. There should be a 
script that starts LT in the background from MS Word (if no server 
answers a query on the assigned port), and the installer should check 
whether Java is installed on the machine and require that for a local 
installation.

Best regards,
Marcin



Re: MS Word add-in for LT

2016-01-30 Thread Marcin Miłkowski
On 29.01.2016 at 14:38, Jaume Ortolà i Font wrote:
> 2016-01-29 14:07 GMT+01:00 Marcin Miłkowski <list-addr...@wp.pl>:
>
> On 29.01.2016 at 12:27, Jaume Ortolà i Font wrote:
> Just tested and it works in MS Word 2007.
>
> There are some settings that seem to be relevant only for Catalan,
> though, in the Settings dialog box (general, valencia etc.). There's
> also a "Tipography" check box (it should rather by Typography, btw)
> whose meaning is unclear to me. Is it punctuation or spelling?
>
>
> Yes, they are relevant only for Catalan. I have to figure out how to
> make it general for every language (perhaps just hiding it).
>
> I see that "typography" makes little sense in English. It's for some
> punctuation/typographical rules.
>
> Perhaps we could have several check boxes (like spelling, grammar,
> punctuation, style...), so these classes of rules can be
> enabled/disabled. But perhaps this can't be applied easily to all
> languages.

Why not simply port some of the code that we have for listing all 
categories of rules -- or even write a small piece of Java code to 
create a resource file that would be used to build a localized dialog 
for a given language? This seems the easiest option. A more complex 
option is to port the whole settings dialog.

>
> And one more thing: the webpage should clearly warn that you need to
> reboot the machine to install the plugin. I hate rebooting during the
> install, and I imagine most people would be quite angry to see that this
> is a necessary step.
>
>
> I didn't need to reboot (Windows 7, MS Word 2010). Anyway I will warn
> about it.
>
> 2016-01-29 14:12 GMT+01:00 Marcin Miłkowski <list-addr...@wp.pl>:
>
>
> Hm, I think you also need a way to edit the suggestion if the original
> suggestion from LT is not helpful. There's no way to edit the document
> manually, and no way to enter a manual suggestion. MS Word has that
> option in its original spelling / grammar dialog.
>
> The context is indeed editable inside the dialog. (How can we make it
> clear?) There is a minor problem. As it is done now, the replacement of
> the edited context can cause some formatting to be lost. This can be
> improved.

Perfect.

I also noticed that the code looks at the beginning of the paragraph to 
set the language, which was Polish in my case, but the first visible 
character was English, and the whole paragraph was English. In other 
words, something went south when checking. I had to mark the whole page 
as English and it worked perfectly but MS Word checks were fine before...

What about a Mac version? Is it much more difficult? I know that the 
plugin code changed a lot between Word versions, but maybe the latest 
versions could be supported?

Best,
Marcin




Re: New Language - constraint grammar importing tool

2016-01-29 Thread Marcin Miłkowski
On 29.01.2016 at 13:58, Curon Wyn wrote:
> Hi Marcin/Daniel
>
> On 29/01/2016 09:01, Marcin Miłkowski wrote:
>
>> I would try to get this working. The conversion was pretty good, only
>> very special constructions of constraint grammar could not be handled. I
>> guess most of An Gramadóir rules would convert just fine, though there
>> might have been some changes to our XML handling code, so minor manual
>> changes would be needed anyway.
>>
> An Gramadóir has had no development in at least 2 years, if not more, other 
> than minor changes. All modern Celtic languages have some presence 
> (excluding Breton), with some rules for other languages. Converting all 
> content would make sense, but it can sit elsewhere rather than be brought 
> into LanguageTool without native support for further development. For Welsh 
> the rules are fairly limited, and better sources of data are available.
>
>
>
> Converting some content looks relatively straightforward, but some rules, 
> due to the use of a regex over multiple words, would be problematic.

We also have this feature but it's relatively new.

>
>
>
> Not a technical problem, but An Gramadóir is licensed under GPL, which would 
> make things very complicated from a legal perspective.
>
>> What were the problems with compiling it?
> No problem compiling: I checked out tag v3.1 and compiled without issue, but 
> I couldn't work out a way to run the converter. Some filenames have changed 
> from the aforementioned wiki link, and I got a lot of errors about missing 
> libraries when running directly using java. Is there a Maven command or 
> something to run the converter?

Not that I know of…

But your best bet is simply to download an old version of LT, which 
should contain most if not all of the libraries you need.

Best,
Marcin




Re: MS Word add-in for LT

2016-01-29 Thread Marcin Miłkowski
On 29.01.2016 at 14:12, Marcin Miłkowski wrote:
> On 29.01.2016 at 12:27, Jaume Ortolà i Font wrote:
>> Hi,
>>
>> I have released a new version of the plug-in [1].
>>
>> All messages are now in English by default. The translations into other
>> languages have to be put in files like this [2]. I think almost all
>> needed strings can be extracted from the current translations in the
>> LanguageTool project. Only two or three extra strings will be needed.
>>
>
> Hm, I think you also need a way to edit the suggestion if the original
> suggestion from LT is not helpful. There's no way to edit the document
> manually, and no way to enter a manual suggestion. MS Word has that
> option in its original spelling / grammar dialog.

Oh, I missed that the upper part of the dialog is editable. Fine, then 
this works.

Best,
Marcin

>
> Best,
> Marcin
>
>
>




Re: MS Word add-in for LT

2016-01-29 Thread Marcin Miłkowski
On 29.01.2016 at 12:27, Jaume Ortolà i Font wrote:
> Hi,
>
> I have released a new version of the plug-in [1].
>
> All messages are now in English by default. The translations into other
> languages have to be put in files like this [2]. I think almost all
> needed strings can be extracted from the current translations in the
> LanguageTool project. Only two or three extra strings will be needed.
>

Hm, I think you also need a way to edit the suggestion if the original 
suggestion from LT is not helpful. There's no way to edit the document 
manually, and no way to enter a manual suggestion. MS Word has that 
option in its original spelling / grammar dialog.

Best,
Marcin



Re: New Language - constraint grammar importing tool

2016-01-29 Thread Marcin Miłkowski
Hi,

On 29.01.2016 at 00:34, curon@wyn.cymru wrote:
> Hi,
>
> First of all, I must thank you all for developing such a good grammar 
> correction tool under an open source license.
>
> A few years ago I started looking at developing An Gramadóir, as work had 
> already been done for the Welsh language. Unfortunately this project has had 
> no development for some time, and the only proprietary checker is fairly 
> limited. I did have my eye on LanguageTool, but never got around to doing 
> anything until now.
>
> There used to be a tool for converting constraint grammar files in Apertium 
> to the xml used in LanguageTool. As described here:
> http://wiki.languagetool.org/using-the-rule-converter-gui
>
> This was removed in October 2015. I have tried to compile the last release 
> before removal, but I have failed to get it working. Is it worth 
> pursuing this route, or would I be better off converting this manually?

I would try to get this working. The conversion was pretty good; only 
very special constructions of constraint grammar could not be handled. I 
guess most of An Gramadóir's rules would convert just fine, though there 
might have been some changes to our XML handling code, so minor manual 
changes would be needed anyway.

What were the problems with compiling it?

Best regards,
Marcin

>
> I must admit I'm quite impressed by the number of rules implemented for 
> Breton, a closely related language to Welsh. I have wondered how many of the 
> rules apply to Welsh, but unfortunately I'm unable to make much sense of the 
> rules.xml as all the comments are in Breton, which is a little different, 
> particularly in written form!
>
> A few supported and unsupported languages are in the old An Gramadóir svn 
> repository; would converting/importing all of these be of any use? I can't 
> promise anything, but I may attempt to script some of the process.
>
> Curon
>




Re: MS Word add-in for LT

2016-01-26 Thread Marcin Miłkowski
Hi Jaume,

this is very good news!

On 26.01.2016 at 10:47, Jaume Ortolà i Font wrote:
> Hi,
>
> I have made a beta release of a MS Word add-in for LanguageTool [1].
> ("Add-in" is Microsoft terminology for "plug-in").
>
> It has some limitations, but I think it can work fine and be useful. The
> checking is done only in a dialog box, with the usual options of such
> dialogs. Unfortunately the errors are not underlined in the text and
> there is no context menu for suggestions. See further explanations in
> the README.md.
>
> The current implementation is in Catalan, but I plan to make it
> multilingual soon. I will tell you as soon as it is done.
>
> The checking is made automatically in the language defined in the text
> (at the paragraph level). The mapping of language codes has to be
> completed here [2]. Anyway, if the language is not found, then the
> default language (defined in settings) is used.
>
> The release version is not yet signed with a certificate, and this can
> cause problems during the installation. It needs to be tested in
> different versions of MS Word. I only tested it in MS Word 2010.

We have a digital certificate for Java Web Start. Would that do? Let me 
know, and then we could share the certificate for the release.

>
> If you try it, tell me your impressions.
>
> PS. There is another way to provide grammar checking in MS Word using
> the Microsoft grammar API. This would be a DLL written in C/C++. I got
> the API documentation from Microsoft. But under the conditions I got it,
> I cannot share it. The DLL could be implemented and distributed, but it
> can't be open source.

Yeah, that's why we should stick to open source. Though Virastyar seems 
not to care too much.

Regards,
Marcin

>
> Regards,
> Jaume Ortolà
>
> [1] https://github.com/jaumeortola/languagetool-msword10-addin
> [2]
> https://github.com/jaumeortola/languagetool-msword10-addin/blob/master/languagetool-msword10-addin/ThisAddIn.cs#L286
>
>
>




Re: Splitting segment.srx?

2016-01-25 Thread Marcin Miłkowski
On 25.01.2016 at 18:55, Andriy Rysin wrote:
> We do have some segmentation rules for Ukrainian and although I didn't
> spend much time on it specifically, it does a decent job for now.
>
> What I was hoping is that when the expert linguists start working on
> the language module they would approach this on more scientific
> grounds and will introduce some improvements into the srx file. And the
> less complicated the process is (and the less I need to be involved
> :)) the better.

Hm, I would bet there are no scientific materials that would help much. 
I've read quite a few papers on that, and they're not specific enough. 
Just saying.

Regards,
Marcin

>
> But I understand this involves splitting a file that is bound by a
> standard so it's not a win-win situation (otherwise I would have
> already sent a patch for review :)).
>
> Regards,
> Andriy
>
> 2016-01-25 12:32 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>> On 25.01.2016 at 17:08, Andriy Rysin wrote:
>>> Well I am currently trying to involve several linguists who are not
>>> proficient with development tools in developing the Ukrainian module for
>>> LT. So I was hoping that decreasing the number of steps for them to
>>> remember/perform and keeping all language-specific parts in the
>>> language module would increase the chances they would be productive.
>>> IMHO a slight complication in the core module, done once, is far
>>> outweighed by the benefits of a simplified process for each language
>>> developer.
>>>
>>> But this suggestion resulted in much more negativity than I expected
>>> so I guess I'll add more steps when documenting the process for
>>> non-developers.
>>
>> Well, developing the sentence segmentation is usually quite easy and
>> takes just a well-segmented corpus of a language. Don't you have one for
>> Ukrainian? Most languages don't need many tweaks after the first
>> segmenting rules have been created.
>>
>> Sorry if it sounded negative, but really, it would make my life harder:
>> I need several languages in the same file, so we would need to join XML
>> files on the fly, and make sure that nobody tries to override the
>> standard settings for all languages by mistake etc.
>>
>> Regards,
>> Marcin
>>
>>>
>>> Regards,
>>> Andriy
>>>
>>> 2016-01-25 3:50 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>>>> On 25.01.2016 at 03:29, Andriy Rysin wrote:
>>>>> Currently 95% of the language handling is done in language module so
>>>>> when I edit segment.srx I need to remember to recompile/redeploy
>>>>> languagetool-core.
>>>>
>>>> Make a script ;)
>>>>
>>>>>
>>>>> If we're using segment.srx only inside languagetool I don't see how
>>>>> we're breaking the standard if we compose the full segment.srx file
>>>>> from the language modules when we need it. And if somebody wants to
>>>>> have full segment.srx for using outside of LT we could add a target to
>>>>> build in, e.g. in languagetool-tools.
>>>>
>>>> The file is relatively small. Why would we really want it – just to make
>>>> sure that you don't have to remember to recompile languagetool-core? ;)
>>>> I just don't see a need.
>>>>
>>> This would help LT be more modular, which for most software
>>> is a good architectural approach.
>>>>
>>>> Your approach is to build complicated tools just to solve an issue that
>>>> is a matter of taste. This is a waste of time.
>>>>
>>>> It would be much more productive to build more GUI tools.
>>>>
>>>> Regards,
>>>> Marcin
>>>>
>>>>>
>>>>> Regards,
>>>>> Andriy
>>>>>
>>>>>
>>>>>
>>>>> 2016-01-24 17:13 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>>>>>> On 24.01.2016 at 17:15, Andriy Rysin wrote:
>>>>>>> Would it make sense to split segment.srx into language modules (and
>>>>>>> assemble dynamically from available languages)? For now it seems to be
>>>>>>> the only language-specific piece that belongs to core module.
>>>>>>> Was there any attempts at this and if yes what was the obstacle?
>>>>>>
>>>>>> No, there were no attempts because it's against the official
>>>>>> specifications o

Re: Splitting segment.srx?

2016-01-25 Thread Marcin Miłkowski
On 25.01.2016 at 17:08, Andriy Rysin wrote:
> Well I am currently trying to involve several linguists who are not
> proficient with development tools in developing the Ukrainian module for
> LT. So I was hoping that decreasing the number of steps for them to
> remember/perform and keeping all language-specific parts in the
> language module would increase the chances they would be productive.
> IMHO a slight complication in the core module, done once, is far
> outweighed by the benefits of a simplified process for each language
> developer.
>
> But this suggestion resulted in much more negativity than I expected
> so I guess I'll add more steps when documenting the process for
> non-developers.

Well, developing the sentence segmentation is usually quite easy and 
takes just a well-segmented corpus of a language. Don't you have one for 
Ukrainian? Most languages don't need many tweaks after the first 
segmenting rules have been created.

Sorry if it sounded negative, but really, it would make my life harder: 
I need several languages in the same file, so we would need to join XML 
files on the fly, and make sure that nobody tries to override the 
standard settings for all languages by mistake etc.

Regards,
Marcin

>
> Regards,
> Andriy
>
> 2016-01-25 3:50 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>> On 25.01.2016 at 03:29, Andriy Rysin wrote:
>>> Currently 95% of the language handling is done in language module so
>>> when I edit segment.srx I need to remember to recompile/redeploy
>>> languagetool-core.
>>
>> Make a script ;)
>>
>>>
>>> If we're using segment.srx only inside languagetool I don't see how
>>> we're breaking the standard if we compose the full segment.srx file
>>> from the language modules when we need it. And if somebody wants to
>>> have full segment.srx for using outside of LT we could add a target to
>>> build in, e.g. in languagetool-tools.
>>
>> The file is relatively small. Why would we really want it – just to make
>> sure that you don't have to remember to recompile languagetool-core? ;)
>> I just don't see a need.
>>
>>> This would help LT be more modular, which for most software
>>> is a good architectural approach.
>>
>> Your approach is to build complicated tools just to solve an issue that
>> is a matter of taste. This is a waste of time.
>>
>> It would be much more productive to build more GUI tools.
>>
>> Regards,
>> Marcin
>>
>>>
>>> Regards,
>>> Andriy
>>>
>>>
>>>
>>> 2016-01-24 17:13 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>>>> On 24.01.2016 at 17:15, Andriy Rysin wrote:
>>>>> Would it make sense to split segment.srx into language modules (and
>>>>> assemble dynamically from available languages)? For now it seems to be
>>>>> the only language-specific piece that belongs to core module.
>>>>> Was there any attempts at this and if yes what was the obstacle?
>>>>
>>>> No, there were no attempts because it's against the official
>>>> specifications of the SRX standard. The SRX file is supposed to work for
>>>> all languages that a given software application can support. This is how
>>>> SRX files in general look in all computer-aided translation apps, for
>>>> example.
>>>>
>>>> Note also that SRX contains several common parts that all languages
>>>> inherit (such as splitting paragraphs at one or two line breaks), so
>>>> this file is cascading.
>>>>
>>>> I just don't see a point in splitting. Is there an ideological point
>>>> that the core should be language-independent? I don't think we should
>>>> care about it.
>>>>
>>>> Regards,
>>>> Marcin
>>>>
>>>

Re: Splitting segment.srx?

2016-01-25 Thread Marcin Miłkowski
On 25.01.2016 at 03:29, Andriy Rysin wrote:
> Currently 95% of the language handling is done in language module so
> when I edit segment.srx I need to remember to recompile/redeploy
> languagetool-core.

Make a script ;)

>
> If we're using segment.srx only inside languagetool I don't see how
> we're breaking the standard if we compose the full segment.srx file
> from the language modules when we need it. And if somebody wants to
> have full segment.srx for using outside of LT we could add a target to
> build in, e.g. in languagetool-tools.

The file is relatively small. Why would we really want it – just to make 
sure that you don't have to remember to recompile languagetool-core? ;) 
I just don't see a need.

> This would help LT be more modular, which for most software
> is a good architectural approach.

Your approach is to build complicated tools just to solve an issue that 
is a matter of taste. This is a waste of time.

It would be much more productive to build more GUI tools.

Regards,
Marcin

>
> Regards,
> Andriy
>
>
>
> 2016-01-24 17:13 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>> On 24.01.2016 at 17:15, Andriy Rysin wrote:
>>> Would it make sense to split segment.srx into language modules (and
>>> assemble dynamically from available languages)? For now it seems to be
>>> the only language-specific piece that belongs to core module.
>>> Was there any attempts at this and if yes what was the obstacle?
>>
>> No, there were no attempts because it's against the official
>> specifications of the SRX standard. The SRX file is supposed to work for
>> all languages that a given software application can support. This is how
>> SRX files in general look in all computer-aided translation apps, for
>> example.
>>
>> Note also that SRX contains several common parts that all languages
>> inherit (such as splitting paragraphs at one or two line breaks), so
>> this file is cascading.
>>
>> I just don't see a point in splitting. Is there an ideological point
>> that the core should be language-independent? I don't think we should
>> care about it.
>>
>> Regards,
>> Marcin
>>
>
>




Re: Splitting segment.srx?

2016-01-24 Thread Marcin Miłkowski
On 24.01.2016 at 17:15, Andriy Rysin wrote:
> Would it make sense to split segment.srx into language modules (and
> assemble dynamically from available languages)? For now it seems to be
> the only language-specific piece that belongs to core module.
> Was there any attempts at this and if yes what was the obstacle?

No, there were no attempts because it's against the official 
specifications of the SRX standard. The SRX file is supposed to work for 
all languages that a given software application can support. This is how 
SRX files in general look in all computer-aided translation apps, for 
example.

Note also that SRX contains several common parts that all languages 
inherit (such as splitting paragraphs at one or two line breaks), so 
this file is cascading.
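
For reference, a minimal sketch of the shape of such a file (SRX 2.0; 
the Ukrainian abbreviation rule is made up):

<srx version="2.0" xmlns="http://www.lisa.org/srx20">
 <header segmentsubflows="yes" cascade="yes"/>
 <body>
  <languagerules>
   <languagerule languagerulename="Default">
    <!-- shared rule inherited by all languages: break at two line breaks -->
    <rule break="yes"><beforebreak>\n\n</beforebreak><afterbreak></afterbreak></rule>
   </languagerule>
   <languagerule languagerulename="Ukrainian">
    <!-- do not break after an abbreviation followed by whitespace -->
    <rule break="no"><beforebreak>\bпроф\.</beforebreak><afterbreak>\s</afterbreak></rule>
   </languagerule>
  </languagerules>
  <maprules>
   <!-- with cascade="yes", every matching languagemap contributes its rules -->
   <languagemap languagepattern="uk.*" languagerulename="Ukrainian"/>
   <languagemap languagepattern=".*" languagerulename="Default"/>
  </maprules>
 </body>
</srx>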

I just don't see a point in splitting. Is there an ideological point 
that the core should be language-independent? I don't think we should 
care about it.

Regards,
Marcin



Re: Java Webstart commented out

2015-12-29 Thread Marcin Miłkowski
On 29.12.2015 at 14:10, Daniel Naber wrote:
> Hi,
>
> I've commented out the Webstart link at languagetool.org. After updating
> to LT 3.2, Webstart complained that not all JARs are signed with the
> same key. I could "fix" this by manually cleaning the Webstart cache
> (not the browser cache). Then I tried with Chromium and it used IcedTea
> and failed to start LT. Making even the build work at all with LT 3.2
> and Java 8 was 2 hours of work. But only 20 users per day click that
> link, out of about 10,000 visits every day on languagetool.org. So in
> other words, Java Webstart is just too fragile and a lot of work for me
> so it's not worth it. Existing users who downloaded it in the past
> shouldn't have any issues, the Webstart files are still there, just the
> link has been removed.
>

Well, I downloaded the jnlp file using Chrome and used Java 8 (64-bit) 
on Windows 10 and it worked perfectly. It also worked on Firefox and 
(quite old) IcedTea on Ubuntu (64-bit). I'll update IcedTea tomorrow and 
see how this changes things.

Best,
Marcin



Re: LanguageTool in 2015 + the future

2015-12-17 Thread Marcin Miłkowski
On 08.12.2015 at 23:11, Daniel Naber wrote:
> On 2015-12-07 19:30, Marcin Miłkowski wrote:
>
>> I think there's a community that we haven't addressed at all: language
>> professionals, be it proofreaders or translators (and translation
>> agencies). Translators are using suboptimal tools, such as Apsic
>> XBench,
>> for their proofreading tasks. If we could get the interest of technically
>> savvy translators, we could get new contributors. This might also mean
>> some input from commercial companies.
>
> I think that's a good point. Don't you have experience translating whole
> books? I only have experience translating software user interfaces. Am I
> right that both the process and the software used are totally different?

Not anymore. Right now, I use computer-aided translation (CAT) tools for 
all my translation tasks (for which I cannot find time anyway). But 
basically, without CATs modern translation feels like using pen and 
paper instead of a computer.

But you're right, tools for UI translations are not based on word 
processors anymore.

>
> Are these (software UI and books) the two use cases for translation or
> are there more? How does LT need to be changed to support these use
> cases? Is it a change in core, in its UI, or does it "just" mean writing
> more add-ons to integrate LT?

Well, for more technical translations, people use tools such as 
CheckMate (mentioned on our wiki). But the integration isn't perfect, 
and the tool itself is difficult to use if you're not a technically 
savvy user.

In contrast, Apsic XBench is extremely easy to use but not free anymore 
(at least in its more powerful, Unicode-supporting version).

IMHO, we could simply see which open (and then closed) CAT tools are 
currently most popular, and see how we could interface with them. I also 
know a company that would love to use LT as a terminology checker. I 
just don't have time to work with them right now. A terminology checker 
would take an approved translation glossary (these come as CSV files) 
and convert it to a set of LT rules (either on the fly, or compiled to 
XML for further customization). For all morphologically complex 
languages, glossaries cannot be used directly for checking whether an 
approved term has been used or not. So this would fill a very important 
need.
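
For example, a glossary pair mapping a deprecated term to an approved 
one (the Polish terms and the rule id below are invented) could be 
compiled into a rule that synthesizes the approved lemma in the matching 
inflected form:

<rule id="TERM_0001" name="Approved term: przeglądarka">
  <pattern>
    <marker><token inflected="yes">wyszukiwarka</token></marker>
  </pattern>
  <message>Use the approved term:
    <suggestion><match no="1" postag_regexp="yes" postag="(.*)"
     postag_replace="$1">przeglądarka</match></suggestion>.</message>
</rule>

The <match> element copies the POS tag of the matched form, so the 
synthesizer produces the approved term in the same case and number -- 
which is exactly why glossaries alone are not enough.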

We could offer a web service for conversion, and an XML rule file for 
download. But LT has to have good, intuitive support for additional rule 
files (that's why I worked on these issues).

Hope this helps a bit.

>
>> LanguageTool. There will be minority languages with poorer support, and
>> that's always the case.
>
> Dutch, Spanish, and Italian are also among the languages with very few
> commits in the last 6 months.

Which doesn't mean that the support is necessarily bad...

Regards,
Marcin



Re: LanguageTool in 2015 + the future

2015-12-07 Thread Marcin Miłkowski
Hi,

On 07.12.2015 at 14:56, Daniel Naber wrote:
> Hi,
>
> the year is slowly coming to an end, so I thought I'd try to summarize
> what we've achieved this year and how we can move LT forward in the
> future. In 2015, we...
>
> * made three releases so far (2.9, 3.0, 3.1), another one is planned
> * more than doubled the number of visits to languagetool.org (January:
> 156,000, November: 326,000)
> * released a Chrome extension with more than 1,500 users now
> * added support for ngram models to detect confusion of (mostly)
> homophones (English, German)
> * did several things I forgot to list here
> * added and improved many language-specific rules. Specifically, 14
> languages are maintained if you define this as "had at least ten commits
> in its grammar.xml and disambiguation.xml files this year". However,
> this also means 17 languages are not maintained.

This is impressive overall!

>
> This last point of unmaintained languages highlights what I think is an
> important issue: In the last three years, we increased our number of
> users by a factor of 10. At the same time, the number of commits and
> people who regularly contribute didn't grow at all (see attachment).
> Many languages are not maintained, and even those that are often only
> have a single contributor. If that contributor becomes inactive, finding
> a new one seems almost impossible. If we continue like this, LT will
> some day end up with very few languages that are actually maintained. As
> there doesn't seem to be any correlation between number of users and
> number of regular contributors, user growth won't help us.
>
> I have no solution for this problem, but some ideas I'd like to get
> feedback on:
>
> (1) Clean up: throw out all unmaintained languages that also have less
> than 100 rules. This way users don't get the false impression that their
> language is supported when it actually isn't. It might also create some
> motivation to contribute when users notice that "their" language is
> being thrown out.
>

I'm strongly against this. If there's already some initial support, 
getting less technical contributors is much easier. We could exclude 
such initial support from our releases, however, if that's supposed to 
help in getting maintainers. I think throwing languages out would only 
mean wasted time and effort.

> (2) Grow the contributor community: somehow find new contributors to
> revive the unmaintained languages and find contributors to support the
> maintainers of languages that are already doing well. The thing is: I
> have no idea how to do this. For example, we have a text on
> languagetool.org saying we're looking for help with marketing. This text
> has been shown to more than 40,000 visitors and the effect so far has
> been zero (actually four people have contacted me, but three of those
> have already disappeared). What is holding people back from becoming a
> regular contributor?

Well, there's a learning curve, and that's probably what holds people back.

I think there's a community that we haven't addressed at all: language 
professionals, be it proofreaders or translators (and translation 
agencies). Translators are using suboptimal tools, such as ApSIC Xbench, 
for their proofreading tasks. If we could get the interest of technically 
savvy translators, we could gain new contributors. This might also mean 
some input from commercial companies.

>
> (3) Crowdsourcing: give up on finding qualified contributors, instead
> develop tools that allow contribution via very, very simple means, like
> clicking on correct and incorrect sentences. It's not clear how well
> this could work. It might be combined with (4).

For this to work, we might need a really large number of people...

>
> (4) Statistics: give up on finding qualified contributors and find
> errors using ngram statistics etc. With statistics, finding errors is
> language-independent. Quality might be worse than with hand-written
> rules, but for languages that are not maintained anyway there are often
> hardly hand-written rules. Of course, everybody could still contribute
> manually written rules and maybe revive language support that way.
>
> (5) Business: develop a business model and pay people for working on LT.
> This is difficult, developing a business is a full-time job on its own.
> Even if it worked, it would only cover very few mainstream languages.
>
> These are the options I can think of that go beyond "let's just keep
> going". Yes, we could just keep going - for some languages, LT is in
> good health. But to be a sustainable project in the long term, I think
> we need either more than one contributor per language or we need a
> technological approach that works without a maintainer per language.
>
> Please, everybody, let me know what you think and what ideas you have
> about the future of LanguageTool.

I don't think a varied level of support was ever a problem for 
LanguageTool. There will be minority languages with poorer support, and 
that's always the case. Yes, of course. Why worry? Life's too 

Re: Terminology checking

2015-08-14 Thread Marcin Miłkowski
On 14.08.2015 at 08:56, Dmitri Gabinski wrote:
 For reference: you can use Okapi CheckMate for such purposes. CheckMate
 can also engage LanguageTool to check spelling/grammar.

Well, for morphologically-rich languages, you cannot, as it would only 
check the base forms.

Turning a simple CSV into a PatternRule programmatically seems very easy 
and requires simple steps:

- read the CSV and get the terms;

- analyze the term; mostly it's adjective + noun, but sometimes it's a 
noun as a modifier + noun, which changes things a bit for some pairs;

- create a rule on the fly depending on the kind of term we're looking for.

The second step is required because, for some languages, the rule would 
be, in the first case, an adjective whose form agrees with the noun and, 
in the second case, an inflected noun plus a number of uninflected nouns 
in the genitive case.

Basically, the analysis step might differ for different language pairs.
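
For the analysis step itself, the branching could look roughly like this 
(the simplified POS tags and the template names are assumptions for 
illustration):

import java.util.List;

// Sketch: choose a rule template from the shape of the analyzed term.
enum TermShape { ADJ_NOUN, NOUN_MODIFIER_NOUN, OTHER }

class TermShapeAnalyzer {
  static TermShape shapeOf(List<String> posTags) {
    if (posTags.size() == 2) {
      if ("ADJ".equals(posTags.get(0)) && "NOUN".equals(posTags.get(1))) {
        // template 1: the adjective must agree with the inflected noun
        return TermShape.ADJ_NOUN;
      }
      if ("NOUN".equals(posTags.get(0)) && "NOUN".equals(posTags.get(1))) {
        // template 2: inflected head noun plus genitive modifier(s)
        return TermShape.NOUN_MODIFIER_NOUN;
      }
    }
    return TermShape.OTHER;  // would need a manually written rule
  }
}

Each shape would then be expanded into a different token pattern when the 
XML is generated.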

Regards,
Marcin



Re: improvements in Morfologik speller

2015-06-09 Thread Marcin Miłkowski
On 2015-06-08 at 21:27, Jaume Ortolà i Font wrote:
 2015-06-08 9:39 GMT+02:00 Daniel Naber daniel.na...@languagetool.org:

 On 2015-06-02 15:06, Jaume Ortolà i Font wrote:

 Hi Jaume,

 sorry for the late reply.

  There are some failures with the current German LanguageTool tests.
  Could you take a look, Daniel? You need to use replacements in
  lower-case (r rh, rh r). Are the results reasonable?

 This case looks like a regression to me:

 Not found: 'Haus' in: [Hauch, Hau, Haue, Haut, -Au, -Aue, -Aug, -Haus,
 -Haut, Ahaus, Back, Baku, Bank, Bark, Bau, Bau-, Baud, Baum, Baus,
 Chauke]

 As long as there's a suggestion with a distance of 1, shouldn't it be
 preferred over suggestions with a distance of 2?

 For the case Ligafußboll, the suggestion with a distance of 2 seems to
 be lost, I think that shouldn't be the case:

 Expected :[Ligafußball, Ligafußballs]
 Actual   :[Ligafußball]


 You are right. These results are not expected. I will look at them again.

 A question: Ligafußball doesn't exist as a word in the dictionary.
 It's a compound, isn't it?

  If the preferred option in German is convert-case=false, then my
  changes will not affect the German tests in any way.

 Could you describe what exactly convert-case does, I'm not sure I
 completely understand it.


 It is the same for replacement-pairs, convert-case and
 ignore-diacritics. If any of these features is enabled, then these
 differences add a distance of 0 between the original word and the
 possible suggestion.

 Examples:
 If ss ß is in replacement-pairs, the distance between Ligafussball
 (original wrong word) and Ligafußball (suggestion) is zero.
 If convert-case=true, the distance between ligafußball (original word)
 and Ligafußball (suggestion) is zero.
 If ignore-diacritics=true, the distance between horen (original word)
 and hören (suggestion) is zero.
 If ignore-diacritics=true, the distance between horem (original word)
 and hören (suggestion) is one (not two).
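
(Illustration: the behavior described above corresponds roughly to an edit 
distance in which the configured pairs cost zero. A simplified sketch, not 
the actual Morfologik code:

import java.util.List;

// Levenshtein distance where pairs such as "ss" -> "ß" add a distance of
// zero, in either direction; illustrative only.
final class PairAwareDistance {

  static int distance(String a, String b, List<String[]> zeroCostPairs) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) {
      for (int j = 0; j <= b.length(); j++) {
        if (i == 0 || j == 0) {
          d[i][j] = Math.max(i, j);  // only insertions/deletions possible
          continue;
        }
        int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
            d[i - 1][j - 1] + subst);
        for (String[] p : zeroCostPairs) {
          d[i][j] = Math.min(d[i][j], viaPair(d, a, b, i, j, p[0], p[1]));
          d[i][j] = Math.min(d[i][j], viaPair(d, a, b, i, j, p[1], p[0]));
        }
      }
    }
    return d[a.length()][b.length()];
  }

  // cost of reaching (i, j) by consuming 'left' in a and 'right' in b for free
  private static int viaPair(int[][] d, String a, String b, int i, int j,
                             String left, String right) {
    if (i >= left.length() && j >= right.length()
        && a.regionMatches(i - left.length(), left, 0, left.length())
        && b.regionMatches(j - right.length(), right, 0, right.length())) {
      return d[i - left.length()][j - right.length()];
    }
    return Integer.MAX_VALUE;
  }
}

With the pair {"ss", "ß"}, distance("Ligafussball", "Ligafußball") is 0, 
matching the first example above.)
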
 In the file de_DE.info you wrote:
 # ignore-diacritics=false speeds up building the suggestions by a factor
 of about 2:

 Is that true with the current Speller code?

 A question for Marcin:
 As you can see here [1], the condition isConvertingCase() is inside the
 condition isIgnoringDiacritics(), so they are not independent. Was it
 made on purpose? Should we correct it?

Seems like a bug to me. I've probably lost count of parentheses. Please 
correct it.

Regards,
Marcin



Re: Problems with LibreOffice 4.2x on Windows

2015-05-08 Thread Marcin Miłkowski
On 2015-05-07 at 21:54, Daniel Naber wrote:
 On 2015-05-07 20:51, Marcin Miłkowski wrote:

 on Windows machines, LanguageTool 2.9 seems to cause crashes in LO 4.2x
 (and newer versions). See this comment:

 http://en.libreofficeforum.org/node/9867#comment-41082

 Also see here (starting with Kumara's first comment):
 http://languagetool-user-forum.2306527.n4.nabble.com/Announcement-LanguageTool-2-9-td4642477.html

 There's probably more than one issue at the same time. One of them might
 be that LT is growing slowly but steadily and takes more memory on each
 release, simply because we have more rules. LO is available only as 32
 bit software on Windows, so you need to use a 32 bit Java, which means
 that the default maximum memory for the JVM is 256MB. With large
 documents and more than one language per document, this can be too low
 for LT. It can be reproduced on Linux by setting -Xmx256M as a Java
 parameter. Sometimes you get crashes, sometimes only stacktraces printed
 on the shell where you've started LO.

Right, I didn't notice this discussion as I was away recently from most 
of my mail.

Anyway, I just tried LO 5.0 alpha 64-bit and had no problems. I will 
investigate further in a couple of days.

Regards,
Marcin



Re: large file mode for command-line version

2015-04-13 Thread Marcin Miłkowski
On 2015-04-13 at 18:26, Daniel Naber wrote:
 Hi,

 is there any reason we still need the special mode in our command line
 version that gets activated if the input data is more than 64,000 characters
 long? It complicates the code and causes this bug:
 https://github.com/languagetool-org/languagetool/issues/251

Well, without this mode, we cannot check larger files. Also, the older 
mode is simply faster than line-by-line processing in the large-file mode.

But it should be replaced with a proper input buffering class.
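
A rough sketch of what such a buffering class could look like (the 
64,000-character threshold and the blank-line paragraph convention are 
assumptions taken from this thread):

import java.io.BufferedReader;
import java.io.IOException;

// Sketch: return large chunks for checking, but only cut at paragraph
// boundaries (an empty line), so paragraph-level rules keep working.
class ParagraphChunkReader {
  private static final int MIN_CHUNK_SIZE = 64_000;
  private final BufferedReader reader;

  ParagraphChunkReader(BufferedReader reader) {
    this.reader = reader;
  }

  // Returns the next chunk, or null at the end of the input.
  String nextChunk() throws IOException {
    StringBuilder chunk = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {
      chunk.append(line).append('\n');
      if (line.isEmpty() && chunk.length() >= MIN_CHUNK_SIZE) {
        break;  // paragraph boundary reached and the chunk is big enough
      }
    }
    return chunk.length() == 0 ? null : chunk.toString();
  }
}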

Best,
Marcin



Google Doc grammar check

2015-04-05 Thread Marcin Miłkowski
Hi all,

there's a Proofread Bot extension that does some external grammar checks 
in Google Docs:

https://chrome.google.com/webstore/detail/proofread-bot/djancbfmkanmnofhdfindoppiapcgnbf

So basically, we can see it's doable. There's also the VeritySpell extension, 
and so on.

Regards,
Marcin



Re: Rule suggestion

2015-04-01 Thread Marcin Miłkowski
On 2015-03-31 at 23:19, Torsten Wagner wrote:
 Hi,

 as I read on the website, you are looking for suggestions for new rules,
 even if they seem to be trivial.
 I found that LanguageTool returns a lot of false positives for
 abbreviations ending with an 's'. LanguageTool assumes that you might
 have used the plural form of the word wrongly.

 Message: Don't use indefinite articles with plural words. Did you mean
 a lap or simply LAPS? (deactivate) Correction: a lap; LAPS Context:
 ...range and the handling of a LAPS required a good amount of
 experience an...

 The S here stands for 'sensor' and is not a plural s.
 Other examples of abbreviations that come to mind are:

 ADHS (German form of 'attention deficit hyperactivity disorder')
 PS postscript
 FPS frames per second
 LTS long term support
 DDS direct digital synthesis

 Most of the time those abbreviations are written with all capital letters.
 Maybe the above rule can be enhanced to exclude words written in all
 capital letters.

Could you please give some complete examples? I don't see any false 
alarm for the context you quoted above, and that makes fixing the rule 
harder...
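
That said, the suggested exclusion is easy to express as a guard; a minimal 
sketch (illustrative only, not an existing LT rule attribute):

final class AcronymFilter {
  // Treat a token as an all-caps abbreviation ("LAPS", "FPS", "LTS") so
  // that, for example, plural-agreement rules could skip it.
  static boolean looksLikeAcronym(String token) {
    return token.length() >= 2
        && token.chars().allMatch(Character::isUpperCase);
  }
}

Here looksLikeAcronym("LAPS") returns true, so a rule could ignore such 
tokens.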

Regards,
Marcin



Attribute-value pairs for POS tags [Was: Re: German tests]

2015-03-04 Thread Marcin Miłkowski
On 2015-03-04 at 13:42, Daniel Naber wrote:
 On 2015-03-04 08:52, Marcin Miłkowski wrote:

 If we could move the first part of the code to another class, which
 would analyze POS tags to get proper values of attributes, the code
 would be cleaner and faster. The basic attribute-value class could
 contain several default attributes (they probably need to be
 addressable
 by Strings to make them easily extended by subclasses for new languages
 and new tagsets), such as number, case, gender, and tense. Not all
 languages need to have such attribute values in their tagsets, but they
 need to implement a POS tag analyzer if they want to use these
 attributes.

 I'm not sure I understand this last part: does it mean we would just
 keep the old code as long as there are languages that don't have
 switched to the new attribute values?

I don't want to switch. I think it may still be useful to use full POS 
tags and regexes over POS tags, for example in English with its largely 
non-positional Penn tagset. Moreover, there will always be languages 
without a POS tagger. They don't need to implement the new 
attribute-value interface at all. I see this as an additional member 
(precomputed from the POS tag string value) with its own getters and 
setters. If you don't implement this for a language, then the getter 
will simply return an empty map for all tokens, and that's it.
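
Roughly like this (a sketch; the names are illustrative, not the current 
LT API):

import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Sketch of the proposed precomputed attribute-value member.
class AnalyzedTokenSketch {
  // filled in by a language-specific POS tag analyzer, or left null
  private Map<String, Set<String>> attributes;

  // Languages without an analyzer simply see an empty map for all tokens.
  Map<String, Set<String>> getAttributes() {
    if (attributes == null) {
      return Collections.emptyMap();
    }
    return attributes;
  }

  void setAttributes(Map<String, Set<String>> attributes) {
    this.attributes = attributes;
  }
}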


 Also, while I like the idea, it looks similar to what I tried in branch
 readable-pos-tags but had to give up as it became too much:
  http://www.mail-archive.com/search?l=languagetool-devel@lists.sourceforge.net&q=subject:%22readable+POS+tags%22&o=newest&f=1
 The focus in that branch was on having the attributes also in the XML
 files, but other than that your approach is similar, isn't it?

Yes, the idea is similar. I think we could approach this step by step to 
avoid changing too much:

- first encapsulate the attribute-setting code in the Unifier class,
- move it to a separate class,
- reuse that code to the AnalyzedToken class,
- remove code for equivalences from XML files and replace that with 
language-dependent Java classes that would implement appropriate 
attribute-value classes stored as members in the AnalyzedToken.

I think the easiest way would be to subclass a generic POS tag analyzer 
class.

What was the biggest issue in your branch, aside from the complexity?

Regards,
Marcin



Re: German tests

2015-03-03 Thread Marcin Miłkowski
On 2015-03-02 at 23:08, Daniel Naber wrote:
 On 2015-03-02 22:44, Marcin Miłkowski wrote:

 Die[der/ART:DEF:AKK:PLU:FEM*,der/ART:DEF:NOM:PLU:FEM*,der/PRO:DEM:AKK:SIN:FEM*,der/PRO:DEM:NOM:SIN:FEM*,der/PRO:PER:AKK:SIN:FEM*,der/PRO:PER:NOM:SIN:FEM*]

 That's the difference, for me it gets unified to:

 Die[der/ART:DEF:AKK:PLU:FEM*,der/ART:DEF:NOM:PLU:FEM*]

 So only the plural readings are kept.

 Disambiguator rules involved:
 UNIFY_DET_ADJ_SUB
 UNIFY_ADJ_SUB

 Any idea what might be going on? Does using a lowercase die change
 anything for you?

No idea, frankly. Lowercase changes nothing. But I do get the same alarm 
when using yesterday's snapshot that I downloaded from our website, 
so I guess you must be working on some other version of the unification 
rules. Maybe you did not include the changes I made in Unifier.java?

Best,
Marcin



Re: German tests

2015-03-03 Thread Marcin Miłkowski
On 2015-03-03 at 11:05, Daniel Naber wrote:
 On 2015-03-03 09:19, Marcin Miłkowski wrote:

 No idea, frankly. Lowercase changes nothing. But I do get the same
 alarm
 when using the yesterday's snapshot I have downloaded from our website,
 so I guess you must be working on some other version of unification
 rules. Maybe you did not include the changes I made in the
 Unifier.java?

 I've downloaded
 https://languagetool.org/download/snapshots/LanguageTool-20150302-snapshot.zip
 but it works for me.

The same I used…

 I call java -jar languagetool-commandline.jar -v
 -l de test.txt and the "Die" in "Die diplomatischen Beziehungen" gets
 unified to only plural forms. So this must be something about OS, Java,
 or locale settings. Or it's some non-deterministic behavior - I assume
 you have tested it more than once and you always get the same result?

I have tested this under IDEA, on the command line with my own mvn 
build, and on the build downloaded from our website, with various 
command-line options. The results are consistent. The file I tested contains:

Die diplomatischen Beziehungen zwischen Kanada und dem Iran sind 
seitdem abgebrochen.

I also get the same result in the GUI run from IDEA.


 Here's my system:
 Java: 1.7.0_51
 OS: Ubuntu 12.04
 Locale: de_DE.UTF-8 (in shell: export | grep LANG)

Mine is:

Java 1.8.0_31 (64-bit)
OS: Windows 7 64-bit
Locale: Polish

Same with JDK 1.8.0_25

I did some testing on my Virtual Box install of Ubuntu, with the same 
Java version you have, and I don't get any errors there (the locale is 
Polish, so I don't think it matters). I will try to update that Ubuntu 
to see how this changes things.

Regards,
Marcin





Re: German tests

2015-03-03 Thread Marcin Miłkowski
On 2015-03-03 at 22:59, Daniel Naber wrote:
 On 2015-03-03 14:36, Andriy Rysin wrote:

 I installed jdk1.7.0_75 and the German tests pass with it, so it's Java 8
 that makes them fail.

 I did some debugging and the problem is caused by the elements in
 Unifier that we iterate over but that have no guaranteed order, like
 Maps and Sets. By mechanically replacing them with classes that have an
 order (e.g. ConcurrentHashMap -> LinkedHashMap), I could make Java 7 and
 Java 8 behave the same way. Actually the wrong way, because then the
 unification fails under Java 7, too. So we'll need to change the
 algorithm so it doesn't depend on the order of elements. I'll share my
 debugging branch. Warning: it's full of System.out.println and changes
 just for debugging. Run a check on Die diplomatischen Beziehungen with
 Java 7 and Java 8 and that branch and you'll see the differences in the
 output.

Actually, I think I have found something close to the cause of the bug: 
the thing is that some readings are assigned attribute values that they 
don't really have. For example, in Java 8, the reading 
PRO:DEM:NOM:SIN:FEM of "Die" is assigned both singular and plural 
values of the "number" attribute. Unification, as far as I can see, 
works fine afterwards; it's just that in Java 8 the lack of order of 
elements in the Map does not stop the algorithm from being wrong.
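
For illustration, the plain-Java behavior behind this (not LT code):

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// HashMap iteration order is unspecified and has changed between JDK
// releases; LinkedHashMap preserves insertion order. Any algorithm whose
// result depends on that order may behave differently on Java 7 and 8.
public class MapOrderDemo {
  public static void main(String[] args) {
    Map<String, Integer> hash = new HashMap<>();
    Map<String, Integer> linked = new LinkedHashMap<>();
    for (String key : new String[] {"number", "case", "gender", "tense"}) {
      hash.put(key, key.length());
      linked.put(key, key.length());
    }
    System.out.println("HashMap order:       " + hash.keySet());
    System.out.println("LinkedHashMap order: " + linked.keySet());
  }
}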

I think we are getting close to the point where we should add a generic 
attribute-value interface to our AnalyzedTokens. The Unifier is so 
complex because it does two things at the same time:

- checking token attribute values, by using regexes (which is 
computationally costly, and it's computed many times);
- running unification on the values.

If we could move the first part of the code to another class, which 
would analyze POS tags to get proper values of attributes, the code 
would be cleaner and faster. The basic attribute-value class could 
contain several default attributes (they probably need to be addressable 
by Strings to make them easily extended by subclasses for new languages 
and new tagsets), such as number, case, gender, and tense. Not all 
languages need to have such attribute values in their tagsets, but they 
need to implement a POS tag analyzer if they want to use these attributes.

Another advantage of this setup would be that we could easily use 
computationally cheap tests in our grammar rules, for example by having:

<token><attribute id="reflexivity"><value 
id="reflexive"/></attribute></token>

for language-dependent attributes (not defined in our XML schema).

And more terse for default attributes:

<token number="singular"/>

Because these values would be precomputed, no regex would be evaluated, 
just a very quick equal() test on the AnalyzedToken.

All we need for this is:

- an attribute-value class that would be a member of the AnalyzedToken, 
probably a Map to a Set;
- a POS tag analyzer class, which would assign empty attributes in the 
generic version, and would be subclassed by all languages that have 
tagsets; for most positional tagsets, we don't need regexes to parse the 
tags, so this could be really fast (for determiners in German, for 
example, we simply need to split the string by ":", and read the String 
at a given constant position in the array).
- some trivial extensions in the Element class and the PatternRuleLoader.
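
A sketch of such a POS tag analyzer for the German determiner tags seen in 
this thread (the attribute names are my assumptions):

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: read attributes from a positional tag such as ART:DEF:AKK:PLU:FEM
// by constant position; no regular expressions involved.
class GermanDeterminerTagAnalyzer {
  Map<String, String> analyze(String posTag) {
    String[] parts = posTag.split(":");
    if (parts.length < 5 || !"ART".equals(parts[0])) {
      return Collections.emptyMap();  // not a tag this analyzer handles
    }
    Map<String, String> attributes = new LinkedHashMap<>();
    attributes.put("definiteness", parts[1]);  // e.g. DEF
    attributes.put("case", parts[2]);          // e.g. AKK
    attributes.put("number", parts[3]);        // e.g. PLU
    attributes.put("gender", parts[4]);        // e.g. FEM
    return attributes;
  }
}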

Regards,
Marcin



Re: German tests

2015-03-02 Thread Marcin Miłkowski
On 2015-03-02 at 21:20, Daniel Naber wrote:
 On 2015-03-02 18:03, Andriy Rysin wrote:

 I ran
 mvn clean test
 still has same issue, I tried en, ca, and pl and they all pass (and no
 extra output).
 I did mvn clean install in languagetool-core to make sure I get the
 latest core classes.

 Thanks to your log I can see that this happens when the unification in
 the disambiguator doesn't work. Could you run just the phrase Die
 diplomatischen Beziehungen with the command line client and option -v?
 You should get a disambiguator log.

Here's mine:

1973 rules activated for language German
S 
Die[der/ART:DEF:AKK:PLU:FEM,der/ART:DEF:NOM:PLU:FEM,der/PRO:DEM:AKK:SIN:FEM,der/PRO:DEM:NOM:SIN:FEM,der/PRO:PER:AKK:SIN:FEM,der/PRO:PER:NOM:SIN:FEM,B-NP]
 
diplomatischen[diplomatisch/ADJ:AKK:PLU:FEM:GRU:DEF,diplomatisch/ADJ:AKK:PLU:FEM:GRU:IND,diplomatisch/ADJ:NOM:PLU:FEM:GRU:DEF,diplomatisch/ADJ:NOM:PLU:FEM:GRU:IND,I-NP]
 
Beziehungen[Beziehung/SUB:AKK:PLU:FEM,Beziehung/SUB:NOM:PLU:FEM,I-NP] 
zwischen[zwischen/PRP:LOK+TMP:DAT+AKK,zwischen/ZUS,O] 
Kanada[Kanada/EIG:AKK:SIN:NEU:ART:COU,Kanada/EIG:AKK:SIN:NEU:NOA:COU,Kanada/EIG:DAT:SIN:NEU:ART:COU,Kanada/EIG:DAT:SIN:NEU:NOA:COU,Kanada/EIG:GEN:SIN:NEU:ART:COU,Kanada/EIG:NOM:SIN:NEU:ART:COU,Kanada/EIG:NOM:SIN:NEU:NOA:COU,O]
 
und[und/KON:NEB,O] 
dem[der/ART:DEF:DAT:SIN:MAS,der/ART:DEF:DAT:SIN:NEU,der/PRO:DEM:DAT:SIN:MAS,der/PRO:DEM:DAT:SIN:NEU,der/PRO:PER:DAT:SIN:MAS,der/PRO:PER:DAT:SIN:NEU,B-NP|NPS]
 
Iran[Iran/EIG:AKK:SIN:MAS:ART:COU,Iran/EIG:AKK:SIN:MAS:NOA:COU,Iran/EIG:DAT:SIN:MAS:ART:COU,Iran/EIG:DAT:SIN:MAS:NOA:COU,Iran/EIG:GEN:SIN:MAS:ART:COU,Iran/EIG:NOM:SIN:MAS:ART:COU,Iran/EIG:NOM:SIN:MAS:NOA:COU,I-NP|NPS]
 
sind[sein/VER:1:PLU:PRÄ:NON,sein/VER:3:PLU:PRÄ:NON,sein/VER:AUX:1:PLU:PRÄ,sein/VER:AUX:3:PLU:PRÄ,O]
 
seitdem[seitdem/KON:UNT,O] 
abgebrochen[abgebrochen/PA2:PRD:GRU:VER,abbrechen/VER:PA2:NON,O].[/S./PKT,O] 
P/
Disambiguator log:

UNIFY_DET_ADJ_SUB:1 
Die[der/ART:DEF:AKK:PLU:FEM*,der/ART:DEF:AKK:PLU:MAS*,der/ART:DEF:AKK:PLU:NEU*,der/ART:DEF:AKK:SIN:FEM*,der/ART:DEF:NOM:PLU:FEM*,der/ART:DEF:NOM:PLU:MAS*,der/ART:DEF:NOM:PLU:NEU*,der/ART:DEF:NOM:SIN:FEM*,der/PRO:DEM:AKK:PLU:ALG*,der/PRO:DEM:AKK:SIN:FEM*,der/PRO:DEM:NOM:PLU:ALG*,der/PRO:DEM:NOM:SIN:FEM*,der/PRO:PER:AKK:PLU:ALG*,der/PRO:PER:AKK:SIN:FEM*,der/PRO:PER:NOM:PLU:ALG*,der/PRO:PER:NOM:SIN:FEM*]
 
->
Die[der/ART:DEF:AKK:PLU:FEM*,der/ART:DEF:NOM:PLU:FEM*,der/PRO:DEM:AKK:SIN:FEM*,der/PRO:DEM:NOM:SIN:FEM*,der/PRO:PER:AKK:SIN:FEM*,der/PRO:PER:NOM:SIN:FEM*]

UNIFY_DET_ADJ_SUB:1 
diplomatischen[diplomatisch/ADJ:AKK:PLU:FEM:GRU:DEF,diplomatisch/ADJ:AKK:PLU:FEM:GRU:IND,diplomatisch/ADJ:AKK:PLU:MAS:GRU:DEF,diplomatisch/ADJ:AKK:PLU:MAS:GRU:IND,diplomatisch/ADJ:AKK:PLU:NEU:GRU:DEF,diplomatisch/ADJ:AKK:PLU:NEU:GRU:IND,diplomatisch/ADJ:AKK:SIN:MAS:GRU:DEF,diplomatisch/ADJ:AKK:SIN:MAS:GRU:IND,diplomatisch/ADJ:AKK:SIN:MAS:GRU:SOL,diplomatisch/ADJ:DAT:PLU:FEM:GRU:DEF,diplomatisch/ADJ:DAT:PLU:FEM:GRU:IND,diplomatisch/ADJ:DAT:PLU:FEM:GRU:SOL,diplomatisch/ADJ:DAT:PLU:MAS:GRU:DEF,diplomatisch/ADJ:DAT:PLU:MAS:GRU:IND,diplomatisch/ADJ:DAT:PLU:MAS:GRU:SOL,diplomatisch/ADJ:DAT:PLU:NEU:GRU:DEF,diplomatisch/ADJ:DAT:PLU:NEU:GRU:IND,diplomatisch/ADJ:DAT:PLU:NEU:GRU:SOL,diplomatisch/ADJ:DAT:SIN:FEM:GRU:DEF,diplomatisch/ADJ:DAT:SIN:FEM:GRU:IND,diplomatisch/ADJ:DAT:SIN:MAS:GRU:DEF,diplomatisch/ADJ:DAT:SIN:MAS:GRU:IND,diplomatisch/ADJ:DAT:SIN:NEU:GRU:DEF,diplomatisch/ADJ:DAT:SIN:NEU:GRU:IND,diplomatisch/ADJ:GEN:PLU:FEM:GRU:DEF,diplomatisch/ADJ:GEN:PLU:FEM:GRU:IND,diplomatisch/ADJ:GEN
:PLU:MAS:GRU:DEF,diplomatisch/ADJ:GEN:PLU:MAS:GRU:IND,diplomatisch/ADJ:GEN:PLU:NEU:GRU:DEF,diplomatisch/ADJ:GEN:PLU:NEU:GRU:IND,diplomatisch/ADJ:GEN:SIN:FEM:GRU:DEF,diplomatisch/ADJ:GEN:SIN:FEM:GRU:IND,diplomatisch/ADJ:GEN:SIN:MAS:GRU:DEF,diplomatisch/ADJ:GEN:SIN:MAS:GRU:IND,diplomatisch/ADJ:GEN:SIN:MAS:GRU:SOL,diplomatisch/ADJ:GEN:SIN:NEU:GRU:DEF,diplomatisch/ADJ:GEN:SIN:NEU:GRU:IND,diplomatisch/ADJ:GEN:SIN:NEU:GRU:SOL,diplomatisch/ADJ:NOM:PLU:FEM:GRU:DEF,diplomatisch/ADJ:NOM:PLU:FEM:GRU:IND,diplomatisch/ADJ:NOM:PLU:MAS:GRU:DEF,diplomatisch/ADJ:NOM:PLU:MAS:GRU:IND,diplomatisch/ADJ:NOM:PLU:NEU:GRU:DEF,diplomatisch/ADJ:NOM:PLU:NEU:GRU:IND]
 
->
diplomatischen[diplomatisch/ADJ:AKK:PLU:FEM:GRU:DEF,diplomatisch/ADJ:AKK:PLU:FEM:GRU:IND,diplomatisch/ADJ:NOM:PLU:FEM:GRU:DEF,diplomatisch/ADJ:NOM:PLU:FEM:GRU:IND]
UNIFY_ADJ_SUB:1 
diplomatischen[diplomatisch/ADJ:AKK:PLU:FEM:GRU:DEF,diplomatisch/ADJ:AKK:PLU:FEM:GRU:IND,diplomatisch/ADJ:NOM:PLU:FEM:GRU:DEF,diplomatisch/ADJ:NOM:PLU:FEM:GRU:IND]
 
->
diplomatischen[diplomatisch/ADJ:AKK:PLU:FEM:GRU:DEF,diplomatisch/ADJ:AKK:PLU:FEM:GRU:IND,diplomatisch/ADJ:NOM:PLU:FEM:GRU:DEF,diplomatisch/ADJ:NOM:PLU:FEM:GRU:IND]

UNIFY_DET_ADJ_SUB:1 
Beziehungen[Beziehung/SUB:AKK:PLU:FEM,Beziehung/SUB:DAT:PLU:FEM,Beziehung/SUB:GEN:PLU:FEM,Beziehung/SUB:NOM:PLU:FEM]
 
-> Beziehungen[Beziehung/SUB:AKK:PLU:FEM,Beziehung/SUB:NOM:PLU:FEM]
UNIFY_ADJ_SUB:1 
Beziehungen[Beziehung/SUB:AKK:PLU:FEM,Beziehung/SUB:NOM:PLU:FEM] -> 

Re: chunker in disambiguator tests

2015-03-02 Thread Marcin Miłkowski
On 2015-03-02 at 03:26, Andriy Rysin wrote:
 It looks like when we run checks we do run the chunker before we run the
 disambiguator, but when we run disambiguator tests we don't run the
 chunker, so the rules/examples in the disambiguator don't see multiword
 chunks.

 Is this correct or am I missing something, and if yes, was it done on purpose?

Frankly, I don't remember. I would think this is just a mistake on my part.

Best,
Marcin



Re: Disambiguator tests run twice

2015-03-02 Thread Marcin Miłkowski
On 2015-03-02 at 19:00, Andriy Rysin wrote:
 It looks like it's because we have
public void testRules() throws Exception {
  testDisambiguationRulesFromXML();
}
 in the test subclass, and testDisambiguationRulesFromXML() from the parent
 class is run as well.

 We probably should either not create test method in subclass or rename
 base class method so it's not run as test.

Yes, this has not been fixed since we migrated to Maven. I'm not sure if 
this is needed for the commandline test of rules (I'm not even sure if 
disambiguation rules are tested - I don't use this feature).
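
For reference, the renaming option suggested above could look roughly like 
this (hypothetical class names, assuming JUnit 4):

import org.junit.Test;

// Sketch: keep the shared logic in a non-test helper method so that JUnit
// discovers exactly one test method per language subclass.
abstract class DisambiguationRuleTestBase {
  protected void runDisambiguationRulesFromXml() throws Exception {
    // ...load disambiguation.xml and verify its example sentences...
  }
}

class EnglishDisambiguationRuleTest extends DisambiguationRuleTestBase {
  @Test
  public void testRules() throws Exception {
    runDisambiguationRulesFromXml();
  }
}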

Regards,
Marcin




Re: Disabling disambiguator rules

2015-02-27 Thread Marcin Miłkowski
On 2015-02-26 at 21:10, Andriy Rysin wrote:
 Would it make sense to allow to disable disambiguator rules the same
 way we disable checking rules?

They are cascaded; disabling them is like disabling random pieces of 
Java code. It might work, but it's very risky due to the complexity of 
the interrelationships.

 I.e. I have a disambiguator rule that will remove tokens with the :rare
 tag if they overlap with ones without :rare. This produces good results
 for modern texts but does not work as well for books which use older
 or non-standard language features. So when I am running regressions on
 the book texts I could just add another rule id to the -d argument
 to make LT leave :rare tokens in.

Instead of removing those tags, you might simply add a new markup or 
ignore :rare in your checking rules.

Best,
Marcin


 Thanks
 Andriy







Re: MultiThreadedJLanguageTool

2015-02-22 Thread Marcin Miłkowski
On 2015-02-22 at 15:24, Andriy Rysin wrote:
 On 02/22/2015 04:45 AM, Marcin Miłkowski wrote:
 Hi,


 On 2015-02-21 at 19:22, Andriy Rysin wrote:
 So the main problem with this performance improvement is that we read
 across paragraphs. There are two problems with this:
 1) error context shows sentences from another paragraph:
 I almost worked out a solution for that by adjusting ContextTools but
 then I found the next one:
 2) the cross-sentence rules start to work across paragraphs

 and when I was analyzing the code I found that if we read from the
 file and it's smaller than 64k we don't parse it by paragraphs. So the
 cross-sentence rules work across paragraph here too.
 Let me explain why it works like this. This is because we had a 64K
 limit in the past, and I needed to check larger files. So whenever we
 have a larger file, I devised a rough input buffer code.

 But this is a dirty solution. I think we should get a proper solution
 with an Iterator over a file buffer. Now, the iterator class should be
 sensitive to the paragraph setting (\n or \n\n). I guess we could simply
 send an extra special token to the tokenizer or something like that so
 that we get the PARA_END tag whenever we get to the paragraph boundary.
 I understand that the performance is crippled when we wait until we find
 the whole paragraph?
 Not quite, the problem is that we feed the analyze/check worker threads
 with small chunks of data. I did some stats yesterday and realized that
 4 biggest text files I had to run regression on had ~40% of paragraphs
 with 1-3 sentences. Those are printed media archives and I guess a case
 of 1 is for chapter titles, newspaper titles etc, author, date etc. When
 this happens (and you have i.e. 4 cpus) some of the analyze sentence
 worker threads stay idle and also if we invoke check threads on only
 couple of sentences the splitting in threads may produce more overhead
 than benefit (not sure about this, as if you have a very big number of
 rules it may still be faster).
 When I removed the checkpoint at the paragraph level and always send
 64k blocks to worker threads (ignoring some regressions) my cpu idle
 state goes from 40% to 10% (and those 10% are because worker threads
 wait for main to read the file and tokenize - we could theoretically
 optimize that one too).
 I actually have one book which has much longer paragraphs and when I
 test it cpus are much less idle.

 Right now the SRX segmentation handles the line ends as well, so we
 would need to look at the connection to the sentence splitter.

 This can be observed in MainTest.testEnglishFile() which gives 3
 matches vs MainTest.testEnglishStdIn4() which reads the same text but
 using stdin gives 4.
 And it should give 3, right? Paragraph-level rules are doing something
 useful.
 It depends, it should give 3 if paragraph-level rules should not work
 across paragraph boundaries, and it should give 4 if it should work across.

 If we are to fix the small-file case by splitting paragraphs, would it
 make sense to remove the special handling for small files? If it's small
 it would be checked fast anyway, and removing extra if/else blocks
 would clean up the code logic...
 I think we should seal the logic in an iterator, and it would work the
 same way for all cases.
 So to move on with more optimizations (by sending bigger blocks to
 worker threads) we need several things:
 1) agree if paragraph-level rules should work across paragraphs,

They should not, at least they were designed to work in a single 
paragraph only. That was my idea back then.

We could have rules that work on the whole file level if we want to have 
it across paragraphs.

 if yes
 there's not much extra work, if no then we have to make sure paragraph
 boundaries are set by sentence tokenizer rather than file reader, and
 add logic to the paragraph-level rules to stop at paragraph boundary; it
 seems that the sentence tokenizer already adds a newline at the end of the
 last sentence in a paragraph, but I gave up before fully understanding how
 it's set and used.

Actually, as far as I remember, the tokenizer does not add any code for 
paragraphs. It just splits a sentence on one or on two newlines, 
depending on what you set on the command-line (using -b).

I believe the paragraph code is added in the 
JLanguageTool.getAnalyzedSentence(). It just needs to know how many 
newlines make a paragraph. But I didn't read the current code and I 
simply remember what I wanted to code. Maybe I did some dirty hack...

 IMHO splitting text into paragraphs should not be in commandline/file
 reader but in the core logic (e.g. sentence tokenizer)

It's not in the reader, IMHO, right now.


 If we agree on this we can also merge the code for small/large file to
 be the same.

Yes, I guess we should buffer the lines to get a big chunk for checking. 
Actually, I need to design a similar solution for the server I'm calling 
via HTTP in a Python script: I have thousands of small chunks that I 
need

Re: Development tools

2015-02-11 Thread Marcin Miłkowski
On 2015-02-02 at 02:20, Andriy Rysin wrote:
 Sorry if this is obvious, but my friends asked me and I'm away from my
 computer. Is there a way to call parts of sentence analyzer of LT from
 command line?
 I.e. sentence tokenizer, tokenizer, tagger, disambiguator? Or currently
 using Java API is the only way to go?

You can call all of them at once, using the --taggeronly parameter.

http://wiki.languagetool.org/command-line-options
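
For example (using only options that already appear elsewhere in this 
archive; see the wiki page above for the full list):

java -jar languagetool-commandline.jar --taggeronly -l en test.txt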

Sorry for not replying earlier, it totally slipped my mind.

Regards,
Marcin




Re: saving configuration in stand-alone

2015-01-28 Thread Marcin Miłkowski
On 2015-01-27 at 23:47, Daniel Naber wrote:
 Hi,

 can anyone confirm that saving the configuration in the stand-alone
 client is broken? If I disable a rule in the config dialog and restart,
 the rule is enabled again.

 Also, does anybody remember why we have both ~/languagetool.properties
 and ~/.languagetool.cfg as configuration file names?

I think I do. Libre/OpenOffice did not allow using some of the features 
we had in the standalone version, so we used two different configuration 
files. It still does not know how to use our spell checker. So there 
were some minor differences in both files, at least in the beginning. Or 
we may have thought that they would be different…

Regards,
Marcin



Failing tests in Ukrainian

2015-01-25 Thread Marcin Miłkowski
Andriy,

you seem to have failed to include one file in your commits. At least 
the tests fail for me:

java.lang.RuntimeException: java.nio.file.AccessDeniedException: 
compounds-unknown.txt
at 
org.languagetool.tagging.uk.UkrainianTagger.debugCompounds(UkrainianTagger.java:137)
at 
org.languagetool.tagging.uk.UkrainianTagger.&lt;init&gt;(UkrainianTagger.java:127)
at org.languagetool.language.Ukrainian.getTagger(Ukrainian.java:107)
at 
org.languagetool.JLanguageTool.getRawAnalyzedSentence(JLanguageTool.java:693)
at 
org.languagetool.JLanguageTool.getAnalyzedSentence(JLanguageTool.java:678)
at 
org.languagetool.rules.uk.MorfologikUkrainianSpellerRuleTest.testMorfologikSpeller(MorfologikUkrainianSpellerRuleTest.java:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runners.Suite.runChild(Suite.java:127)
at org.junit.runners.Suite.runChild(Suite.java:26)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.nio.file.AccessDeniedException: compounds-unknown.txt
at 
sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83)
at 
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at 
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at 
sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
at 
sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
at java.nio.file.Files.deleteIfExists(Files.java:1165)
at 
org.languagetool.tagging.uk.UkrainianTagger.debugCompounds(UkrainianTagger.java:133)
... 39 more


Best,
Marcin





Re: another small XML cleanup idea

2015-01-14 Thread Marcin Miłkowski
On 2015-01-13 at 23:35, Daniel Naber wrote:
 Hi,

 here's another small XML syntax cleanup idea:

 Old syntax:
 <match no="1" regexp_match="runter" regexp_replace="herunter"/>

 Proposed new syntax:
 <match no="1" regexp="runter -> herunter"/>

 The new syntax is shorter and easier to read. Also, the old syntax isn't
 really clean:  whenever you use 'regexp_match', you also need to use
 'regexp_replace', but this is not expressed in the XSD (I guess that's
 not possible). So why have two attributes anyway? Same for 'postag' and
 'postag_replace'. Does anybody see a problem with this?

Yes, there's one small thing that needs to be taken care of: namely, 
there might be a rule to change "->" to "→" (the proper Unicode arrow). So 
we need a way to escape "->". It might be difficult to escape two 
characters, though.

Best,
Marcin



Re: small XML syntax changes

2015-01-14 Thread Marcin Miłkowski
On 2015-01-13 at 09:37, Daniel Naber wrote:
 On 2015-01-13 09:10, Marcin Miłkowski wrote:

 I've removed about 1000 correct example sentences for German as they
 were redundant, i.e. they just repeated the incorrect example and its
 'correction' attribute. Unless someone objects for their language, I
 will do the same for all languages (the cleanup effect will probably be
 much smaller for most other languages).

 I usually use correct examples as sanity (regression) checks, so please
 keep this in mind.

 How exactly do you do that? Do you extract the sentences from the XML
 first so you have plain text? In that case, we either keep those
 sentences or you would need to change the process a bit that extracts
 the sentences (building correct sentences from incorrect example plus
 its correction). In Polish, only about 60 sentences would be affected.

Ah, now I get what you meant -- if the sentence is already the same as 
the one tested with the correction, then it's redundant. Right.

I just meant I use examples in grammar files to make sure no false 
alarms are fired. But this is obvious.

Regards,
Marcin



Re: small XML syntax changes

2015-01-13 Thread Marcin Miłkowski
On 2015-01-12 at 22:01, Daniel Naber wrote:
 On 2015-01-12 15:28, Daniel Naber wrote:

 2.) A rule can now have only one example sentence as long as there's a
 correction.

 I've removed about 1000 correct example sentences for German as they
 were redundant, i.e. they just repeated the incorrect example and its
 'correction' attribute. Unless someone objects for their language, I
 will do the same for all languages (the cleanup effect will probably be
 much smaller for most other languages).

I usually use correct examples as sanity (regression) checks, so please 
keep this in mind.

Regards,
Marcin



Re: English Rule Additions

2014-12-23 Thread Marcin Miłkowski
On 2014-12-23 at 00:02, Nick Hough wrote:
 I have devised some rules for common English mistakes for the letter
 ‘A’, which you can see here:
 https://gist.github.com/howlinghuffy/d25d3d6b43c7a9b485cb

 I plan on doing many more submissions like this over the coming months;
 let me know what you think.

Looks nice to me. Did you run this over a corpus?

Also, it would be very useful to include the <url> element to provide more 
documentation for the end users. Link to some publicly available 
information on the web about your rules (a good dictionary etc.).

Best,
Marcin



Re: Plain English rules

2014-12-22 Thread Marcin Miłkowski
On 2014-12-22 at 11:33, Daniel Naber wrote:
 On 2014-12-20 11:32, Heikki Lehvaslaiho wrote:

 Heikki,

 I've set up a gist with 80 English rules that (mostly) expand
 redundant/wordy rules in LanguageTool 2.7. The testrules script passes
 these, but it would be good for someone to go through them before
 inclusion in the main rules file.

 https://gist.github.com/heikkil/4efc378102037651f755 [1]

 thanks for those rules! Style rules can cause false alarms, or the
 messages could be considered to be false alarms, so I'm not sure whether
 we should activate these rules by default. What do others think?

I think these rules follow extreme prescriptivism.

I am strongly against the inclusion of such rules as turned on by 
default, because they raise false alarms for perfect English. My rough 
guide is this: if your rules say that Jane Austen and Charles Dickens 
are bad writers, then your rules are simply wrong. And Dickens does use 
the words indicated in the rules; see for example 'accompany':

https://books.google.pl/books?id=INkAes9Y5AYC&pg=PA538&lpg=PA538&dq=accompany+%22charles+dickens%22&source=bl&ots=_lFgWHI48o&sig=X1vs7tIDaTPM9WSA7sGsXCPOwRo&hl=pl&sa=X&ei=zg6YVK6RCMWBU-XEgdAF&ved=0CE4Q6AEwBg#v=onepage&q=accompany%20%22charles%20dickens%22&f=false

(page 223).

This said, they might be useful for technical writing; in such writing, 
linguistic variation is indeed to be limited. But Mike Unwalla would 
know better.

Best regards,
Marcin



Re: English rule: bet regards

2014-12-16 Thread Marcin Miłkowski
On 2014-12-16 at 12:43, Juan Martorell wrote:
 Hi all,

 I came across a rule for a common typo I fall into frequently: writing
 'bet' instead of 'best'. I created an initial rule using the LanguageTool
 rule editor:

 <!-- English rule, 2014-12-16 -->
 <rule id="ID" name="bet_regards">
   <pattern>
     <marker><token>bet</token></marker>
     <token>regards</token>
   </pattern>
   <message>Did you mean <suggestion>best</suggestion>?</message>
   <example type='incorrect'><marker>Bet</marker> regards</example>
   <example type='correct'>Best regards</example>
 </rule>

 My questions are:

 - Shall I add this rule on my own or should it be done by the maintainer?
 I think it is the maintainer, but we still don't have one for English!

That's not exactly true. I do maintain the files for English, and even 
if I'm not a native speaker, I think I can manage ;)

Best,
Marcin





Re: discovering language tool

2014-12-13 Thread Marcin Miłkowski
Hi,

On 2014-12-12 at 18:24, Elie Naulleau wrote:
 Hi all,

 I am just discovering LT and I am getting interested in its possibilities.

 I have been auditing/evaluating correction software for a company
 looking for style correction.
 It is called LELIE and is based on the Dislog language, a layer on top of
 Prolog (Commons licence).
 It is a more powerful approach than LT, but it has its drawbacks
 (complexity, maintenance cost, the need for formal training to maintain
 it; logic programming in Prolog; lexicon, rules, reasoning, everything is
 in Prolog; etc.
 http://www.irit.fr/~Patrick.Saint-Dizier/publi_fichier/manuelV1.pdf )
 Linguistically, it relies on rhetorical structures (RST,
 http://www.sfu.ca/rst/01intro/intro.html )
 It is able to recognize semantic functions like circumstance, concession,
 condition, evaluation, etc.
 Its performance in terms of speed is not spectacular (deep parsing,
 Prolog backtracking), but it is usable.
 Some publications in case you are curious:
 http://www.irit.fr/recherches/ILPL/lelie/accueil.html
 http://dl.acm.org/citation.cfm?id=2388653
 http://anthology.aclweb.org/C/C14/C14-2006.pdf
 https://liris.cnrs.fr/inforsid/sites/default/files/2012_6_1-PatrickSaint-Dizier.pdf


 The reason for this email is that I am looking for an alternative.

 I would like to be able to answer the following questions:

 - Is LT able to recognize complex structures, such as the passive form or
 structures with a gap in the middle? (I assume so, since it seems able to
 apply regexes to patterns of parts of speech.)

Yes, to some extent. We can define discontinuous patterns (with the help 
of skipping).

 - Is LT able to take into account a provided SKOS (or similar) thesaurus
 in order to pre-recognize multi-word terms?

No, but we have some support for tagging multi-word terms. It should be 
quite easy to add another layer of annotation if it's needed.

 - How does LT do part-of-speech tagging (ML models, other approaches,
 TreeTagger, etc.)?

By using a morphosyntactic lexicon and manually created disambiguation 
rules. It uses statistical models for Chinese and Japanese.

 Is it conceivable to plug in one's own POS tagger (for
 instance the Stanford NLP Tools tagger)?

It is but we don't recommend it. These taggers assume grammaticality, 
and they don't show the actual wrong POS tags but the ones that should 
be there. So I really prefer writing rules manually, as they can be 
easily changed.

 - Is it easily extensible? (rule templates for new forms of error
 recognition, complex syntactic patterns that would require their own
 implementation)

I think so.

 - Can it cope with structure information (XML tags)? Here is an example:
 enumerations. One could say that all items of an enumeration should
 begin with the same form (infinitive verb, or noun, whatever). To verify
 this, the structure of the document must be taken into account. If the
 document is available in XML with structure information, is it
 conceivable for LT to process such a document (does its architecture
 allow this, if it is not possible yet)?

Not possible yet as we don't have this layer of information. But in 
principle, it should be easy to add. Our problem was that it's hard to 
have a self-documenting example that checks if it works (we have 
examples for regression testing and for documentation; adding any 
styling or enumeration in pure text is difficult).

But this is not rocket science: probably we can have additional style 
annotations for examples.


 Another topic :

 Do you know BlackLab (based on Lucene)?
 https://github.com/INL/BlackLab/wiki/Features
 It can look for patterns (like LT rules) in very large amounts of text
 (thanks to Lucene) and get almost immediate answers.
 It can process annotated text (part of speech, up to 10 levels or more
 of linguistic information, semantics, tonalities, etc.).
 I have been playing with it and I think it could be of good help to do
 statistics on syntactic patterns from a large corpus, in order, maybe,
 to infer correction rules from a corpus of incorrect sentences.

We use Lucene for regression checks on Wikipedia and large corpora.

Best regards,
Marcin



 Sorry, I have not yet read the full LT documentation, but I thought I
 could save some time by submitting a question on the dev mailing list.

 Thank you,

 Cheers,
 Elie Naulleau








Re: Applying matched token's POS tag to another matched token

2014-10-31 Thread Marcin Miłkowski
Hm, are we talking about suggestions or references in the pattern? 
Because it's certainly possible to do this in suggestions by simply 
using an appropriate match number. Of course, there might be a 
limitation: you cannot use the tag of another token.

If you can think of a clear syntax for this, we could add this. But I'm 
too busy right now, so I'd only add this to my TODO.

Regards,
Marcin

W dniu 2014-10-31 o 10:08, Jaume Ortolà i Font pisze:
 Currently it's not possible. I have needed it too sometimes.

 Regards,
 Jaume Ortolà


 2014-10-30 17:37 GMT+01:00 Linas Valiukas shirshe...@gmail.com:

 Hi there,

 LanguageTool seems to provide an ability to apply POS tag of a match
 to a word, like this (taken from Development Overview page):

 <match no="1" postag="verb:.*perf">kierować</match>

 However, is there a way to apply a POS tag to one of the matched
 tokens? In other words, I want to do this:

 <match no="1" postag="verb:.*perf"><match no="2"/></match>

 Regards,

 --
 Linas Valiukas


 






Re: They all means the same.

2014-10-21 Thread Marcin Miłkowski
Indeed. I just fixed this in the repository.
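
For the curious, the kind of agreement pattern involved looks roughly like
this (a sketch with an invented rule id, not the actual repository change;
the match element re-inflects the matched verb into its plural form):

 <rule id="THEY_ALL_MEANS_SKETCH" name="they all + singular verb (sketch)">
   <pattern>
     <token regexp="yes">they|we|you</token>
     <token>all</token>
     <marker>
       <token postag="VBZ"/>
     </marker>
   </pattern>
   <message>The plural subject needs a plural verb:
     <suggestion><match no="3" postag="VBP"/></suggestion>.</message>
   <example correction="mean">They all <marker>means</marker> the same.</example>
 </rule>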

Best,
MM

On 2014-10-19 at 12:57, Kumara Bhikkhu wrote:
 'Means' should be flagged here:
   They all means the same.

 kb




Re: English disambiguation issue

2014-10-16 Thread Marcin Miłkowski
On 2014-10-16 at 15:58, Jonathon Churchill wrote:
 I'd like to bring attention to a problem with one of the English
 disambiguation rules:

 Incorrect: They are laugh loudly.
 Correct: They are laughing loudly.

 Here the writer may have intended to write 'laughing'; however, the word
 'laugh' is being understood only as 'NN', as shown in the log from a
 text analysis:

 was_is_VB_NN:1
 laugh[laugh/NN,laugh/VB,laugh/VBP,B-NP-singular|E-NP-singular] ->
 laugh[laugh/NN,B-NP-singular|E-NP-singular]
 this means that we cannot detect this using a VB POS tag in a pattern
 rule such as:

 <pattern>
   <token regexp="yes">is|are|be</token>
   <token postag="VBP"/>
   <token postag="RB"/>
 </pattern>

 Would it be possible for someone to adjust this rule so that the verb
 retains the VBP POS tag?

I'll try next week. Note that this is extremely tricky, as there might 
be cases like this:

They are laugh pills.

(= pills that induce laughter)

So basically it's quite difficult to detect whether it's NN or VBP after 
the form of the verb be.

Best,
Marcin



Re: switching from Hunspell to Morfologik

2014-10-12 Thread Marcin Miłkowski
Hi,

We have discussed this several times. Basically, I want to tag more words
than I want to accept as spelled correctly. Keeping dictionaries separate
helps with this. Also, the download size matters less and less, and
morfologik dictionaries are fairly small anyway.

Best
Marcin
On 11 Oct 2014 at 22:00, Jan Schreiber jan.schrei...@languagetool.org wrote:

 Hi,

 I wonder if we could use the switch to Morfologik as an opportunity to
 rethink our general approach to dictionaries.

 Currently we use two dictionaries for all the fully supported languages
 afaik, and those contribute considerably to the large download size of
 LanguageTool.

 Why not use just one dictionary per language and keep all the necessary
 data in one well organised place? This large word database could contain
 everything we need: tags and base form for the grammar checking,
 frequency information for the spelling suggestions, just about
 everything. Even if we want the dictionary to contain incorrect/outdated
 spellings for tagging purposes, all we need is a one-bit flag that tells
 the spell-checking routine if a word is misspelled.

 Cheers,
 Jan


 On 11.10.2014 at 12:00, Daniel Naber wrote:
  Hi,
 
  to provide LT as a 100% pure Java software, I'd like to switch from
  Hunspell (native code) to Morfologik (Java-based).




Re: Morfologik speller

2014-10-03 Thread Marcin Miłkowski
On 2014-10-03 at 13:22, R.J. Baars wrote:
 Marcin,

 would it be possible to use the morfologik speller as a separate program,
 to throw a list of words at, and get the alternatives?

No. It does not tokenize words, and you need a little bit of tooling to 
use the library anyway.


 Is there an example program that does that?

LanguageTool command line version does that if you supply it with the 
Morfologik speller rule name at the command line (-e parameter).

Regards,
Marcin



Re: unexpected ending of a sentence

2014-10-02 Thread Marcin Miłkowski
On 2014-10-02 at 08:25, R.J. Baars wrote:
 I produced a rule signaling an unexpected end of a sentence, i.e. a
 sentence not ending with a character like '.', '!' or '?'.

 But this is quite common to happen inside table cells or in headings.

 LT is not aware of these things, is it? Has anyone found a way to prevent
 false alarms in these header or cell conditions?

There is no way, IMHO.

Marcin


 Ruud




Re: tokenizing numbers

2014-09-30 Thread Marcin Miłkowski
On 2014-09-24 at 21:03, R.J. Baars wrote:
 Maybe we agree to disagree...

 Having them as one token makes detecting patterns easy using regular
 expressions.

But writing suggestions becomes a nightmare, as you have to use groups 
and it becomes complex very soon.
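
To see why: with one-token numbers, a rule that converts a decimal dot into
a comma needs regex groups in the suggestion, along these lines (a sketch
with an invented rule id):

 <rule id="DECIMAL_COMMA_SKETCH" name="Decimal separator (sketch)">
   <pattern>
     <token regexp="yes">\d+\.\d+</token>
   </pattern>
   <message>Use a comma as the decimal separator:
     <suggestion><match no="1" regexp_match="(\d+)\.(\d+)"
       regexp_replace="$1,$2"/></suggestion>.</message>
 </rule>

With the number split into separate tokens (digits, separator, digits), the
suggestion can simply reuse the matched tokens instead.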

Marcin


 Ruud


 For Polish, I actually want to have numbers tokenized. It makes writing
 number format rules easier. For example, we use comma as a decimal
 separator, not a dot.

 Best
 Marcin
 On 24 Sep 2014 at 17:12, Andriy Rysin ary...@gmail.com wrote:

 Hmm, so when you meet 1.001 in the document you would not know if it's
 1001 or 1,001...
 In Ukrainian I have a rule that requires the following noun to be in a
 proper form, and it'll be different for whole and fractional number
 endings...

 And if many documents treat the dot as a comma, would it not make sense to
 create a rule that catches that and proposes the correct format?

 Andriy

 2014-09-24 10:53 GMT-04:00 R.J. Baars r.j.ba...@xs4all.nl:

 Even when the locale would be nl, there are so many documents using the
 English format, we would have to use both.

 But if . and , are treated the same when between digits, it would work
 anyway.

 Ruud

 I did some code for Ukrainian that ignores the decimal separator ,
 within numbers when tokenizing. I didn't address the number group
 separator . yet (looks like this will require an srx file change), but
 . is not used widely, so I didn't consider it as important. But it would
 be nice if this was handled at the common level (taking into account the
 locale of the language).

 Andriy


 2014-09-24 8:03 GMT-04:00 R.J. Baars r.j.ba...@xs4all.nl:
 Numbers like 1.234 or 1,000.00 are tokenized into several tokens, while
 each is one number.

 What do you think about changing the tokenizer to treat them as one
 number? This would maybe affect all languages having rules concerning
 numbers, so this is not the right time, but maybe after releasing 2.7?

 Ruud




Re: tokenizing numbers

2014-09-24 Thread Marcin Miłkowski
For Polish, I actually want to have numbers tokenized. It makes writing
number format rules easier. For example, we use comma as a decimal
separator, not a dot.

Best
Marcin
On 24 Sep 2014 at 17:12, Andriy Rysin ary...@gmail.com wrote:

 Hmm, so when you meet 1.001 in the document you would not know if it's
 1001 or 1,001...
 In Ukrainian I have a rule that requires the following noun to be in a
 proper form, and it'll be different for whole and fractional number
 endings...

 And if many documents treat the dot as a comma, would it not make sense to
 create a rule that catches that and proposes the correct format?

 Andriy

 2014-09-24 10:53 GMT-04:00 R.J. Baars r.j.ba...@xs4all.nl:
 
  Even when the locale would be nl, there are so many documents using the
  English format, we would have to use both.
 
  But if . and , are treated the same when between digits, it would work
  anyway.
 
  Ruud
 
  I did some code for Ukrainian that ignores the decimal separator , within
  numbers when tokenizing. I didn't address the number group separator .
  yet (looks like this will require an srx file change), but . is not used
  widely, so I didn't consider it as important. But it would be nice if
  this was handled at the common level (taking into account the locale of
  the language).
 
  Andriy
 
 
  2014-09-24 8:03 GMT-04:00 R.J. Baars r.j.ba...@xs4all.nl:
  Numbers like 1.234 or 1,000.00 are tokenized into several tokens, while
  each is one number.
 
  What do you think about changing the tokenizer to treat them as one
  number? This would maybe affect all languages having rules concerning
  numbers, so this is not the right time, but maybe after releasing 2.7?
 
  Ruud
 
 
 


Re: spell checker enhancement

2014-09-16 Thread Marcin Miłkowski
On 2014-09-16 at 09:03, R.J. Baars wrote:
 A word like 'Aviv' is not correct unless 'Tel' is before it.
 So it is best to leave Tel and Aviv out of the spell checker.
 That results in spell checking reporting errors for Aviv.

 In the disambiguator, there is the option to block that, by making an
 immunizing rule:

 <!-- Tel Aviv -->
 <rule id="TEL_AVIV" name="Tel Aviv">
   <pattern>
     <token>Tel</token>
     <token>Aviv</token>
   </pattern>
   <disambig action="ignore_spelling"/>
 </rule>

 That works perfectly. But then, there are a lot of these word
 combinations. Wouldn't it be better to have a multi-word ignore list for
 the spell checker?

 (Or even a multi-word spell checker, not just knowing 'correct' and 'not
 in list', but 'correct', 'incorrect' and 'not in list')

It would not be an enhancement, as this would not give new functionality 
but cripple the existing one. Also, the ability to use all XML syntax is 
extremely important to me (I use POS tags and regular expressions), so I 
wouldn't make use of the multi-word spell checker anyway. So we'd have 
to introduce a crippled syntax that would look a little bit different 
for a human being but with no meaningful functional change. I don't 
think it's worth our time.

The spell checker is best for checking individual words. Just like a 
hammer, it's good for nails, and not for screws. For screws, we have a 
screwdriver. For multi-word entities, we have more refined tools, like 
tagging and disambiguation and special attributes.

Best,
Marcin



Re: spell checker enhancement

2014-09-16 Thread Marcin Miłkowski
On 2014-09-16 at 11:21, R.J. Baars wrote:
 Marcin,

 We don't agree. There is a spell checker, but also a single-word ignore
 list for it.

Yes, but for multi-words, we'd have to use the disambiguator code 
internally anyway. You ask for yet another notation of the same thing.

Notice also that no spell checker will propose 'Tel Aviv' for 'Aviv'. 
You need to have an XML rule for that. A simple one, to be sure, but 
still an XML rule. I think it's pretty trivial to go through a list of 
such words and create parallel lists of ignore-spelling rules for 
disambiguation and missing-part grammar rules.

Regards,
Marcin

 There are XML rules, but also a Simplereplace rule, a compounding rule.

 So apart from the hammer and the screwdriver, there are more tools.

 But anyway, adding the most frequent ones to the disambiguator works.

 Getting rid of wrong postags and 10% reported possible spelling errors on
 the entire corpus is a higher priority.
 And fixing false positives. Having almost doubled the amount or rules is
 enough for this month.

 Ruud



 On 2014-09-16 at 09:03, R.J. Baars wrote:
 A word like 'Aviv' is not correct unless 'Tel' is before it.
 So it is best to leave Tel and Aviv out of the spell checker.
 That results in spell checking reporting errors for Aviv.

 In the disambiguator, there is the option to block that, by making an
 immunizing rule:

 <!-- Tel Aviv -->
 <rule id="TEL_AVIV" name="Tel Aviv">
   <pattern>
     <token>Tel</token>
     <token>Aviv</token>
   </pattern>
   <disambig action="ignore_spelling"/>
 </rule>

 That works perfectly. But then, there are a lot of these word
 combinations. Wouldn't it be better to have a multi-word ignore list for
 the spell checker?

 (Or even a multi-word spell checker, not just knowing 'correct' and 'not
 in list', but 'correct', 'incorrect' and 'not in list')

 It would not be an enhancement, as this would not give new functionality
 but cripple the existing one. Also, the ability to use all XML syntax is
 extremely important to me (I use POS tags and regular expressions), so I
 wouldn't make use of the multi-word spell checker anyway. So we'd have
 to introduce a crippled syntax that would look a little bit different
 for a human being but with no meaningful functional change. I don't
 think it's worth our time.

 The spell checker is best for checking individual words. Just like a
 hammer, it's good for nails, and not for screws. For screws, we have a
 screwdriver. For multi-word entities, we have more refined tools, like
 tagging and disambiguation and special attributes.

 Best,
 Marcin



Re: new committer: Ebrahim Byagowi

2014-09-13 Thread Marcin Miłkowski
On 2014-08-27 at 09:24, Daniel Naber wrote:
 Hi,

 I'd like to welcome Ebrahim Byagowi (ebraminio on github) as a new
 committer. Ebrahim has recently helped to add Persian to LT, the first
 right-to-left language we support. We're looking forward to your
 contributions, Ebrahim!

Congratulations from me as well!

And by the way, did you see this paper?

http://llc.oxfordjournals.org/content/early/2014/09/07/llc.fqu043.abstract

It might be useful in developing the rules.

Regards,
Marcin



Re: LT performance optimization

2014-09-11 Thread Marcin Miłkowski
On 2014-09-11 at 03:54, Andriy Rysin wrote:
 I tried to run my test under PerformanceTest and I had to cut down my
 shortest text to 27m characters and it barely made it with -Xmx6g :)
 I ran for about an hour under profiler after which I shut it down.
 The picture here is slightly different:
 1) Speller holds several methods at the top (in my original test I turn
 speller off)

These methods are already quite fast in comparison to other spelling 
engines. See how they are implemented:

https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-speller/src/main/java/morfologik/speller/Speller.java

The slowest part is the case conversion and diacritics in areEqual(). I 
don't think we can speed up ed() – most time is spent in areEqual() anyway.


 2) *.rule.patterns also has some (not sure why these weren't in the top
 for my original test)

You have quite different patterns in comparison to Polish, where patterns 
eat up a lot of time, mostly in regexes.

 3) FSA5 still has several of its *Arc() methods in the top 15

No wonder, it is executed for every token. But it's still very, very fast.


Regards,
Marcin


 Andriy

 On 09/10/2014 12:16 PM, Andriy Rysin wrote:
 This was a full run and it ran for at least several minutes
 (performCheck() was up at fist but then FSA took the lead), this is
 the command I ran

 RULES_TO_IGNORE=MORFOLOGIK_RULE_UK_UA,COMMA_PARENTHESIS_WHITESPACE,WHITESPACE_RULE,EUPHONY,UK_MIXED_ALPHABETS,UK_SIMPLE_REPLACE
 java org.languagetool.commandline.Main -l uk -d $RULES_TO_IGNORE

 Thanks
 Andriy

 2014-09-10 11:20 GMT-04:00 Daniel Naber daniel.na...@languagetool.org:
 On 2014-09-10 16:28, Andriy Rysin wrote:

 Would anybody know if this is something that's specific to my tests,
 or is this something we can optimize, or is it too hard to optimize at
 this level?
 I cannot reproduce this with German or English. Some general ideas:

 -Was this real profiling or just sampling? I'm not sure if real
 profiling adds some overhead for these low-level methods, maybe sampling
 is better.

 -Was this after the process had warmed up?

 -You could comment out the spell checker to see if it comes from the
 tagger or from the spell checker.

 -You could try to use org.languagetool.rules.patterns.PerformanceTest to
 see if it's reproducible there.

 Regards
Daniel




Re: Suggestion: find POS tag of portion of a word in XML rules

2014-09-10 Thread Marcin Miłkowski
On 2014-09-09 at 23:10, Dominique Pellé wrote:
 Daniel Naber daniel.na...@languagetool.org wrote:

 On 2014-09-09 22:38, Dominique Pellé wrote:

  * why does your example give a message in
    the Java rule? Why can't we use <message>…</message>
    instead?

 You're right, my example was misleading. <message> can be used.

  * you wrote that args="no:1" refers to the token.
    What about if we need to use this for one of the
    <exception>...</exception> inside a token?

 We could introduce more attributes like maybe 'regexp_negate'.

  In other words, the rule matches token (.*)-tu where
  the POS of the portion in parentheses has to be a verb (V.*).
  But there is an exception if the POS of the portion in parentheses
  matches V.* 2 .*. So that rule would work correctly:

 Couldn't that also be expressed with V.* [13] .*?



 No, that would miss at least infinitive verbs V inf
 (e.g. chanter), participles V ppa m s (chanté),
 and V ppr (chantant).

 We could of course come up with a regexp that
 matches all the possible verb POS tags except those
 V.* 2 .* to avoid an exception, but:

 * that regexp might be rather long, as there are
   many kinds of verb POS tags. Using an exception is
   thus more natural.
 * and more generally speaking, being able to
   match the POS of a portion of a token in an exception
   can be useful in some other cases anyway too.

Let me understand your problem:

* you want to match all verbs (V.*) that have -tu at the end (this is 
<token postag="V.*" postag_regexp="yes" regexp="yes">.*-tu</token>)

* but not the ones where the verb is in the second person: V.* 2 .*. So why 
not simply use the old <exception postag="V.* 2 .*" postag_regexp="yes"/>? 
It will be a little bit slow due to regular expressions but does everything 
you need, right? Or am I missing something?
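
Putting the two fragments together, the token would look roughly like this
(a sketch; whether the tagger assigns such tags to the hyphenated form is
exactly what is at issue below):

 <token postag="V.*" postag_regexp="yes" regexp="yes">.*-tu
   <exception postag="V.* 2 .*" postag_regexp="yes"/>
 </token>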

Regards,
Marcin



Re: Suggestion: find POS tag of portion of a word in XML rules

2014-09-10 Thread Marcin Miłkowski
On 2014-09-10 at 11:34, Dominique Pellé wrote:
 Marcin Miłkowski list-addr...@wp.pl wrote:

 On 2014-09-09 at 23:10, Dominique Pellé wrote:
   Daniel Naber daniel.na...@languagetool.org wrote:
  
   On 2014-09-09 22:38, Dominique Pellé wrote:
  
    * why does your example give a message in
      the Java rule? Why can't we use <message>…</message>
      instead?
   
    You're right, my example was misleading. <message> can be used.
   
    * you wrote that args="no:1" refers to the token.
      What about if we need to use this for one of the
      <exception>...</exception> inside a token?
   
    We could introduce more attributes like maybe 'regexp_negate'.
  
In other words, the rule matches token (.*)-tu where
the POS of the portion in parentheses has to be a verb (V.*).
But there is an exception if the POS of the portion in parentheses
matches V.* 2 .*. So that rule would work correctly:
  
   Couldn't that also be expressed with V.* [13] .*?
  
  
  
   No, that would miss at least infinitive verbs V inf
   (e.g. chanter) participles  V ppa m s  (chanté)
   and V ppr (chantant).
  
   We could of course come up with a regexp that
   matches all the possible verbs POS  except those
   V.* 2 .* to avoid an exception, but:
  
    * that regexp might be rather long, as there are
   many kinds of verb POS tags. Using an exception is
   thus more natural.
   * and more generally speaking, being able to
  match POS of portion of token in exception
  can be useful in some other cases anyway too.

 Let me understand your problem:

  * you want to match all verbs (V.*) that have -tu at the end (this is
  <token postag="V.*" postag_regexp="yes" regexp="yes">.*-tu</token>)
  * but not the ones where the verb is in the second person: V.* 2 .*. So why
  not simply use the old <exception postag="V.* 2 .*" postag_regexp="yes"/>?
  It will be a little bit slow due to regular
  expressions but does everything you need, right? Or am I missing
  something?



 Hi Marcin

 Not exactly.

 I want to find error in things like Peut-tu and Peux-il which
 are both incorrect in French. Correct should be Peux-tu (= Can you...)
 and Peut-il (Can he...)

 Peut-tu token does not have a POS tag (so what you wrote above
 does not work).  It's an invalid word. Interestingly, it's not even marked
 as invalid by the spelling checker, because Hunspell splits it with
 the dash, and Peut as well as tu are both valid words.

 To detect Peut-tu as a mistake, a grammar rule could check that
 the POS tag of the portion Peut is V ind pres 3 s and since -tu
 expects a verb before the dash with POS like V .* 2 .*,
 there is a mistake.

  For the correct Peux-tu, the portion Peux has POS tag V ind pres 2 s
  and since -tu expects a verb before the dash with POS like V .* 2 .*,
  no error would be given.

 However, I currently don't have a mechanism to find the POS tag of a
 portion of a token such Peut-tu so I  can't write such a rule.

 I hope that's clearer now.

Yes, I understand now. But isn't it simpler and more universal to make 
tokenization a little bit different? Even voulez-vous is not analyzed 
correctly, and it's pretty trivial to add special cases for the 
tokenizer to split words containing a bunch of personal pronoun tokens, as long 
as the first word is a verb. I do such intra-word tokenization for 
Polish for complex adjectives, and it's helpful for many uses, not just 
for the rule you mentioned.

Regards,
Marcin




Re: Interpunction issue?

2014-09-05 Thread Marcin Miłkowski
On 2014-09-05 at 11:33, R.J. Baars wrote:
 When there is no space, it is reported.

 I just thought the , means continuation, and the ... does too.

Yes, but in mathematical contexts … may mean omission. See:

i1, i2, …, in (imagine all the numbers and 'n' in subscript).

Marcin


 Ruud

 On 2014-09-05 at 11:00, R.J. Baars wrote:
 Is ,… to be considered strange? Seems to me to be two punctuation
 characters, both indicating there is more to it.
 Not necessarily, because there may be good uses: n1, n2, n3, …

 (Though it requires a space after the comma).

 Regards,
 Marcin

 Ruud




Re: MorfologikSpeller

2014-09-04 Thread Marcin Miłkowski
On 2014-09-04 at 07:41, R.J. Baars wrote:
 Checking the results for Dutch Morfologik-speller, I found this issue:

 4459.) Line 36899, column 1, Rule ID: MORFOLOGIK_RULE_EN_GB
 Message: Possible spelling mistake found
 Suggestion: afbakening; afbakenings-
 afbakenings-
 ^^^

 This word is refused, but it is correct. It will be correct in phrases like:
 Afbakenings- en andere problemen.

This is an uppercase version. You have a case-sensitive dictionary.


 But approving 'afbakenings' is not the correct solution, since that is
 always wrong.

 I assume 'afbakenings' has been sent to the speller, while it should have
 been afbakenings-.

 The same could happen for a word that has been shortened at the start, but
 that is rather rare.

 Is there an explanation and a solution for this?

Please check how the sentence is analyzed. It's possible that the Dutch 
word tokenizer splits 'afbakenings-' into 'afbakenings' and '-'.
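
One quick way to check is the command-line version with -v, which prints
the analyzed tokens (a sketch following the invocation used elsewhere on
this list; the jar name depends on your LT version):

 echo "Afbakenings- en andere problemen." | java -jar languagetool-commandline.jar -c utf-8 -l nl -v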

Regards,
Marcin



Re: MorfologikSpeller

2014-09-04 Thread Marcin Miłkowski
On 2014-09-03 at 20:06, R.J. Baars wrote:
 I replace the English dictionary with the newly generated Dutch one.

 Running the complete list of wrong and correct words through LT works. The
 output is less structured than I would like though. When there is no
 suggestion, the entire suggestion line is missing; also the word is not
 recognizable in the output, just underlined, which is more difficult to
 process. I will have to build a program around this to get the data I need
 to judge the suggestions. Task for tomorrow.

 But it works, with the following conclusions:
 - there are still a lot of words that should have been accepted (missing
 compounding parts in Hunspell)

Daniel is working on that for German.

 - numbers as a whole (0123456) should be skipped, but ordinal numbers like
 100e and words with digits like mp3, F16 should be checked. As far as I
 could see, there are no options for that.

Interesting. This is probably a bug, as I don't expect numbers to be 
checked by a spell checker.

 - When a word is completely in upper-case (UPPERCASE) (which is not in the
 dictionary and set not to be accepted), the alternatives Uppercase and
 uppercase are not suggested.

This is probably because your dictionary is case-sensitive.


 These are no showstoppers, but a small step back from Hunspell.

 Maybe some of these are general things, useful to put on the todo-list.

It seems to me that the number checking is a genuine bug. I never had 
the 'check words with numbers' option set, so this is why I didn't 
encounter this.

Regards,
Marcin



Re: Bug is disambiguator?

2014-09-03 Thread Marcin Miłkowski
On 2014-09-03 at 06:22, Dominique Pellé wrote:
 Hi

 Have a look at the following debug output
 of LanguageTool, where a token gets the nonsensical
 POS tag N.* (multiple times) after a disambiguation
 rule is applied.

 Is it a bug in the disambiguator?
 Or am I writing an incorrect disambiguation rule?

 $ echo An eil| java -jar
 languagetool-standalone/target/LanguageTool-2.7-SNAPSHOT/LanguageTool-2.7-SNAPSHOT/languagetool-commandline.jar
 -c utf-8 -l br -v
 Expected text language: Breton
 Working on STDIN...
 664 rules activated for language Breton
 <S> An[mont/V pres 1 s,monet/V pres 1 s,an/D e sp,]
 eil[eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,</S>,]<P/>
 Disambiguator log:

 UR_N:2 eil[eilañ/V pres 3 s,eilañ/V impe 2 s,eil/K e sp
 o,eil/J,eilañ/SENT_END] ->
 eil[eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/SENT_END]


 Notice that the token eil gets the POS tag N.* (which
 is an invalid POS tag; it's not meant to be a regexp) and,
 furthermore, it gets that same POS tag 5 times after
 disambiguation.

 The disambiguation rule UR_N:2 in
 languagetool-language-modules/br/src/main/resources/org/languagetool/resource/br/disambiguation.xml
 is...

 <rule>
   <pattern>
     <token regexp="yes">u[ln]|a[nlr]</token>
     <marker>
       <token postag="V.*" postag_regexp="yes"/>
     </marker>
   </pattern>
   <disambig action="filter" postag="N.*"/>
 </rule>

 The idea of the disambiguation rule is that, if the
 word following an (or al, or ar, etc.) is a verb (V.*),
 then keep only its noun POS tag (N.*)
 in case it happens to be also a noun.
 But obviously, this is not what's happening here.

Actually, this is not, strictly speaking, a bug. What happens is this: 
when you try to filter out a token that does not have a certain tag at 
all, the filter action simply replaces all existing tags with the POS 
tag you specified. I think it was convenient for some purposes or I 
didn't mind (I don't remember).

The easiest way to prevent such things from happening is to add an 
additional condition to the pattern, for example:

   <marker>
     <and>
       <token postag="V.*" postag_regexp="yes"/>
       <token postag="N.*" postag_regexp="yes"/>
     </and>
   </marker>

This will stop eil from getting wrong tags.
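
Spliced into the Breton rule quoted above, the fixed version would read
(simply combining the two fragments from this thread):

 <rule>
   <pattern>
     <token regexp="yes">u[ln]|a[nlr]</token>
     <marker>
       <and>
         <token postag="V.*" postag_regexp="yes"/>
         <token postag="N.*" postag_regexp="yes"/>
       </and>
     </marker>
   </pattern>
   <disambig action="filter" postag="N.*"/>
 </rule>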

We could, in principle, try to add this kind of test to the 
disambiguator action but I'm not sure if it won't break something.

Regards,
Marcin



Re: Current limitations of MorfologikSpeller

2014-09-03 Thread Marcin Miłkowski
On 2014-09-03 at 07:40, R.J. Baars wrote:

 I could, if I were able to code. I only do things on the XML level.

Actually, you don't have to. The current morfologik dictionary 
implementation supports this normalization via the fsa.dict.input-conversion 
property. See:

http://wiki.languagetool.org/hunspell-support

Regards,
Marcin


 Ruud

 In UkrainianWordTokenizer.java I am replacing the Unicode apostrophes
 U+2019 and U+02BC with the good old single quote (') to unify all apostrophe
 handling. If the Dutch case is similar, you could borrow this code.

 Andriy

 On 09/02/2014 08:11 AM, R.J. Baars wrote:
 The Dutch tokenizer is a little bit different from the others, because
 of words with a ' in them.

 That works fine, unless the text does not have a ' but a ’, which
 happens quite often.

 Since I am not able to edit the Java program (little knowledge), could
 someone have a look at this, please?

 Ruud




Re: MorfologikSpeller

2014-09-03 Thread Marcin Miłkowski
On 2014-09-03 at 11:19, R.J. Baars wrote:

 The wiki states:

 LanguageTool's stand-alone version comes with a tool .
 and the .info file that's already part of LanguageTool. ...

 A bit further on, it says:

 Configuring the dictionary: The dictionary can be further configured using
 an .info file.

 Are we talking about 1 or 2 .info files?

One.

Marcin


 Ruud




Re: MorfologikSpeller

2014-09-03 Thread Marcin Miłkowski
On 2014-09-03 at 10:58, R.J. Baars wrote:
 To add the word frequencies, I am directed by the wiki to an address where
 there is a frequency list indeed. But it has only 187,000 words, while I
 have 1.2 million Dutch words and their frequencies myself.

Probably the probabilities of their occurrence are quite low. I tried 
replacing that list with a bigger one for Polish, and my results indeed 
made the dictionary file bigger, but nothing else changed much.


 The frequency is just a number; what is expected there? Is this number a
 plain ratio, an occurrence count, or something else, like logarithmic?
 Will I have to convert to that format, or is a plain word<TAB>number an
 option too?

Log scale, I believe. You might want to filter out some of the lower 
results, as well, as they don't really help and only make files bigger.

Marcin


 Ruud




Re: MorfologikSpeller

2014-09-03 Thread Marcin Miłkowski
On 2014-09-03 at 12:30, R.J. Baars wrote:
 Marcin,

 For English, there are .info files in /resource/ as well as in
 /resource/hunspell.
 The first seems to be for the tagging dictionary, the second for the speller.
Ah, of course, there should be one .info file per .dict file. I 
thought you were asking about one dictionary file.


 (I would prefer 'spell-checker' as the directory name.)

 The content of the info file for Dutch should probably be:
 fsa.dict.speller.ignore-numbers=false
 fsa.dict.speller.ignore-all-uppercase=false
 fsa.dict.speller.ignore-camel-case=true
 fsa.dict.speller.ignore-punctuation=false
Note: if you don't have all punctuation in your dictionary, this will 
make the speller complain on all commas, colons, hyphens etc.

 fsa.dict.input-conversion=ij &#307;, IJ &#306;

You need to use normal Unicode here or Java escaping, not HTML escaping.

 fsa.dict.output-conversion=&#307; ij, &#306; IJ
Do you have such characters in the dictionary file? If not, then you 
don't need the output conversion.

 fsa.dict.speller.runon-words=false
 fsa.dict.speller.locale=nl_NL
 fsa.dict.speller.convert-case=false
 fsa.dict.speller.ignore-diacritics=true
 fsa.dict.speller.replacement-pairs=y &#307;, ei &#307;
 fsa.dict.speller.equivalent-chars=
 fsa.dict.frequency-included=true
 fsa.dict.encoding=utf-8
 fsa.dict.separator=
 fsa.dict.author=R. Baars;

 I am not sure about separator, equivalent chars and the locale.
Separator is just used for internal management (usually it's a plus 
character). Doesn't really matter unless you want to use + as an entry 
(and you would have to if you have ignore-punctuation set to false).

 I don't quite get the difference between diacritics, equivalent chars and
 replacement pairs. Diacritics seem to me to be part of equivalents and a
 kind of automatic replacement.
Diacritics handling is automatic and faster than replacement pairs, and 
roughly the same as equivalent chars.

 ei ij is a replacement, á and a are taken care of by diacritics, and I
 guess Dutch does not have equivalents ...

 Right?
What about apostrophes? Do you want them normalized or not?
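
If the answer is yes, the same input-conversion mechanism ought to work. A
hedged sketch, mapping the typographic apostrophe (U+2019) to the plain one
(pairs are comma-separated, so it could be appended to the ij pairs above):

 fsa.dict.input-conversion=’ '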

Regards,
Marcin




 On 2014-09-03 at 10:58, R.J. Baars wrote:
 To add the word frequencies, I am directed by the wiki to an address
 where there is a frequency list indeed. But it has only 187,000 words,
 while I have 1.2 million Dutch words and their frequencies myself.
 Probably the probabilities of their occurrence are quite low. I tried
 replacing that list with a bigger one for Polish, and my results indeed
 made the dictionary file bigger, but nothing else changed much.

 The frequency is just a number; what is expected there? Is this number a
 plain ratio, an occurrence count, or something else, like logarithmic?
 Will I have to convert to that format, or is a plain word<TAB>number an
 option too?
 Log scale, I believe. You might want to filter out some of the lower
 results, as well, as they don't really help and only make files bigger.

 Marcin

 Ruud




Re: MorfologikSpeller

2014-09-03 Thread Marcin Miłkowski
On 2014-09-03 at 14:26, R.J. Baars wrote:

 Marcin,

 I filtered the frequencies for any word found more than 50 times; thus
 800.000 frequencies, about 4 times the size of the internet file.
 It adds about 0,4 MB to the dictionary, now in total 9.7 MB.

 The dictionary still needs some improvement (fully uppercase words longer
 than 5 chars are in there, e.g., not conforming to the advice of the Dutch
 Language Union).
 But that is a concern for later.

 I added lower- and uppercased words, because I am not sure what algorithms
 are used for case. If the word found is 'Fuond', and 'found' is in the
 dictionary, I assume default behaviour is to suggest 'Found'. Accepted
 forms are 'found', 'Found' and 'FOUND'. (Is that assumption correct?)

Yes.


 I need some words to be only accepted in lowercase, like 'tv', which only
 has the correct forms 'Tv' and 'tv'; 'TV' is wrong. Same for some other
 words. (In Hunspell I used the keepcase flag on those words.)

Hm, I'm not sure. But you can easily put that into a separate common 
simple-mistakes file (for SimpleReplaceRule). I found maintaining such a 
file easier than trying to use the same dictionary-search method for 
suggestions. It was particularly difficult for two- and three-letter 
words, and with a SimpleReplaceRule it's just a matter of putting the 
word into the file like this:

TV  tv

And appropriate uppercasing will be applied by the rule anyway.


 So I now have a dictionary to test, and to tune for replacements.
 Is there a way to run a word list through this speller and get the
 suggestions out?

You could simply replace the file for one of the English variants and 
run LT on the command line with only spelling rule enabled. For example, 
for British English, simply enable only MORFOLOGIK_RULE_EN_GB (the 
command-line switch is -e MORFOLOGIK_RULE_EN_GB). That should be the 
easiest way. And you can then compare how it worked on the same file 
with the Dutch hunspell enabled (as you don't have to touch the Dutch 
files yet).
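
In practice that boils down to something like this (a sketch; the jar and
file names are placeholders, and depending on the LT version you may also
need to disable the remaining rules):

 java -jar languagetool-commandline.jar -l en-GB -e MORFOLOGIK_RULE_EN_GB wordlist.txt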

Marcin


 Ruud

 On 2014-09-03 at 12:30, R.J. Baars wrote:
 Marcin,

 For English, there are .info files in /resource/ as well as in
 /resource/hunspell.
 First seems to be for the tagging dict, second for the speller.
 Ah, of course, there should be one .info file per one .dict file. I
 thought you were asking about one dictionary file.


 (I would prefer spell-checker for directory name.)

 The content of the info file for Dutch should probably be:
 fsa.dict.speller.ignore-numbers=false
 fsa.dict.speller.ignore-all-uppercase=false
 fsa.dict.speller.ignore-camel-case=true
 fsa.dict.speller.ignore-punctuation=false
 Note: if you don't have all punctuation in your dictionary, this will
 make the speller complain on all commas, colons, hyphens etc.

 fsa.dict.input-conversion=ij &#307;, IJ &#306;

 You need to use normal Unicode here or Java escaping, not HTML escaping.

 fsa.dict.output-conversion=&#307; ij, &#306; IJ
 Do you have such characters in the dictionary file? If not, then you
 don't need the output conversion.

 fsa.dict.speller.runon-words=false
 fsa.dict.speller.locale=nl_NL
 fsa.dict.speller.convert-case=false
 fsa.dict.speller.ignore-diacritics=true
 fsa.dict.speller.replacement-pairs=y &#307;, ei &#307;
 fsa.dict.speller.equivalent-chars=
 fsa.dict.frequency-included=true
 fsa.dict.encoding=utf-8
 fsa.dict.separator=
 fsa.dict.author=R. Baars;

 I am not sure about separator , equivalent chars and the locale.
 Separator is just used for internal management (usually it's a plus
 character). Doesn't really matter unless you want to use + as an entry
 (and you would have to if you have ignore-punctuation set to false).

 I don't quite get the difference between diacritics, equivalent chars and
 replacement pairs. Diacritics seem to me to be part of equivalents and a
 kind of automatic replacement.
 Diacritics is automatic and faster than replacement pairs. Roughly the
 same as equivalent chars.

 ei ij is a replacement, á and a are taken care of by diacritics, and I
 guess Dutch does not have equivalents ...

 Right?
 What about apostrophes? Do you want them normalized or not?

 Regards,
 Marcin




  On 2014-09-03 at 10:58, R.J. Baars wrote:
  To add the word frequencies, I am directed by the wiki to an address
  where there is a frequency list indeed. But it has only 187,000 words,
  while I have 1.2 million Dutch words and their frequencies myself.
  Probably the probabilities of their occurrence are quite low. I tried
  replacing that list with a bigger one for Polish, and my results indeed
  made the dictionary file bigger, but nothing else changed much.
 
  The frequency is just a number; what is expected there? Is this number a
  plain ratio, an occurrence count, or something else, like logarithmic?
  Will I have to convert to that format, or is a plain word<TAB>number an
  option too?
 Log scale, I believe. You might want to filter out some of the lower
 results, as well, as they don't really help and only make files bigger.

 Marcin

 Ruud



Re: Bug is disambiguator?

2014-09-03 Thread Marcin Miłkowski
Hi all,

OK, Jaume fixed the Catalan rules, so I could integrate the change. Now the 
filter action works only when the filter matches the token in the 
pattern. We'll see if it has any impact on today's nightly diff. If it 
doesn't, we'll keep the change and add some documentation.

Regards,
Marcin



Re: Questions about new date checking rule

2014-08-31 Thread Marcin Miłkowski
On 2014-08-30 at 23:35, Dominique Pellé wrote:
 Daniel Naber daniel.na...@languagetool.org wrote:

 On 2014-08-29 21:50, Dominique Pellé wrote:

   Message: The date 31 September 2014 is not a Monday, but a Wednesday.
   Monday, 31 September 2014

 I've now made date parsing more strict, but the rule won't complain
 about these dates and will just ignore them. So to catch them, you need other
 rules. See for example the rulegroup with id 'UNGUELTIGES_DATUM' in
 de/grammar.xml.



 Thanks.  That's a rule that can be useful in most languages.
 I've just added it for French. I improved it to detect incorrect
 dates such as 29 février 2014 (= 29 February 2014),
 since 2014 is not a leap year, so February has only 28 days.
 See French rule DATE. I had fun detecting leap years
 using regexp :-)

Another kind of error: many people mistype 20014 instead of 2014. 
See YEAR_20001 in English rules.
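
A sketch of what such a rule can look like (rule id and regex invented
here; see the actual English rule for the real details):

 <rule id="FIVE_DIGIT_YEAR_SKETCH" name="Five-digit year (sketch)">
   <pattern>
     <!-- a five-digit token starting with 200, e.g. 20014 -->
     <token regexp="yes">200\d\d</token>
   </pattern>
   <message>Did you mean the year
     <suggestion><match no="1" regexp_match="^200(\d\d)$"
       regexp_replace="20$1"/></suggestion>?</message>
 </rule>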

Regards,
Marcin



Re: locqualityissuetype

2014-08-27 Thread Marcin Miłkowski
On 2014-08-27 at 21:41, Jaume Ortolà i Font wrote:
 2014-08-27 19:26 GMT+02:00 R.J. Baars r.j.ba...@xs4all.nl:

 I see. But I don't understand. What I do understand is that it is meant to
 specify something out of an issue list.

 Is there an issue list somewhere (these documents are so complicated...)?



 See the list of values here:
 http://www.w3.org/TR/its20/#lqissue-typevalues

 Some of them are meaningful in the context of LanguageTool (misspelling,
 grammar, style, typographical...). Others are not.

And the use of the issue types is now actually restricted to some other 
software, such as translation quality assurance. For example, this is 
now used for CheckMate Translation QA tool:

http://languagetool-user-forum.2306527.n4.nabble.com/ITS-Localization-Quality-Issue-information-td4640872.html

Best,
Marcin



Re: chunks in exceptions

2014-08-18 Thread Marcin Miłkowski
W dniu 2014-08-18 04:11, Andriy Rysin pisze:
 On 08/16/2014 06:07 PM, Daniel Naber wrote:
 On 2014-08-11 01:47, Andriy Rysin wrote:

 I was writing a rule where I had to catch a phrase whose last word is a
 noun, but only if that noun is not part of an adverb chunk (with another
 word following). The best way to do that seems to be to use the adverb
 chunk in an exception, but it looks like this is not supported.
 Sorry for the late reply. If by chunks you mean phrases (and not chunks
 in the sense that the Language class has getChunker() implemented): the
 reason that they are not supported is probably that adding support might
 be difficult. The matching algorithm is already complicated.

 Note that you can use antipattern to specify patterns that prevent
 matching. These match on the whole sentence, though, not at a specific
 token.

 No, I actually mean the chunker (if I understand the concept correctly). I
 have some adverb chunks defined in multiwords, and it would be nice to be
 able to use them in exceptions and not just in tokens.
 E.g. «показати тією мірою» has an adv chunk marker on «тією» and /adv on
 «мірою» (besides their POS tags), so I would like to be able to use those.
 And I would like to stick to a localized scope, so using antipattern is not
 the best approach here.

To have chunks, you'd have to add a separate interface for the chunker. 
The adv tags you mention are *not* chunker tags, these are simple POS 
tags and you can use them in exceptions as POS tags.
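
Schematically, something like this — with the exact Ukrainian tag spelling
assumed rather than checked:

    <token postag="noun.*" postag_regexp="yes">
      <exception postag="adv.*" postag_regexp="yes"/>
    </token>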

Hope that helps.

Best,
Marcin



Re: spell suggestions for irregular verbs

2014-07-30 Thread Marcin Miłkowski
W dniu 2014-07-27 20:23, Daniel Naber pisze:
 On 2014-07-27 11:20, Marcin Miłkowski wrote:

 I think we should use simpleReplaceRule instead. I think I use it for
 contractions already.

 The problem with that is that incorrectly used irregular verbs are often
 already detected by the spelling rule, it's just that its suggestion
 isn't helpful. If we add another rule we have two matches at the same
 word and the client won't know which one to use. So I guess the better
 suggestions either need to be added in the Java code of the rule, or by
 using replacement pairs?

For contractions, I simply added them to the list of words ignored by 
the speller, so that I get no matches for couldnt from the spelling 
rule. Then, the contraction rule matches couldnt and offers a proper 
correction.
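
Roughly like this — the file names and the separator are from memory, so
treat this as a sketch rather than the exact data format:

    # speller ignore list (one word per line): never flag the bare form
    couldnt

    # simple replace data for the contraction rule
    couldnt=couldn't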

Replacement pairs can be slower, as they add some overhead for searching 
suggestions. They are a hacky solution to a simple problem, which we can 
solve better with a simple replace rule.

Regards,
Marcin



Re: spell suggestions for irregular verbs

2014-07-27 Thread Marcin Miłkowski
I think we should use simpleReplaceRule instead. I think I use it for
contractions already.

Regards
Marcin
27 lip 2014 10:14 Daniel Naber daniel.na...@languagetool.org napisał(a):

 Hi,

 what's the best way to provide good suggestions for misspelled irregular
 verbs, like buyed, beginned, or dealed? I see we have a small
 number of them (like teached/taught) in the en_GB.info file but is it
 okay to add them all there?

 Regards
   Daniel







Re: help with English style rules

2014-07-20 Thread Marcin Miłkowski
W dniu 2014-07-17 06:12, Kumara Bhikkhu pisze:
 Excellent answers. I'm no native speaker, but hope you don't mind me adding.

 Perhaps the only place where LT could *suggest* a comma after "for
 example" is when it begins the sentence.

I think there will also be genuine cases of a missing comma whenever 
"for example" is followed by a noun or an adjective.
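
A sketch of what such a rule could look like (illustrative only; a real
rule would need antipatterns and more carefully chosen examples):

    <rule id="FOR_EXAMPLE_COMMA_SKETCH" name="for example + noun/adjective">
      <pattern>
        <token>for</token>
        <token>example</token>
        <token postag="NN.*|JJ.*" postag_regexp="yes"/>
      </pattern>
      <message>Consider a comma: <suggestion>for example, <match no="3"/></suggestion></message>
      <example type="incorrect">There are, <marker>for example birds</marker> in the garden.</example>
      <example type="correct">There are, for example, birds in the garden.</example>
    </rule>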

Best,
Marcin


 kb

 Mike Unwalla wrote thus at 02:13 AM 17-07-14:
 I agree that 'for example' does not have to be followed by a comma.
 Sometimes, 'for example' is at the end of a sentence.

 You can write this, for example.

 Regards,

 Mike Unwalla
 Contact: www.techscribe.co.uk/techw/contact.htm


 -Original Message-
 From: Robin Dunn [mailto:rd...@iparadigms.com]
 Sent: 16 July 2014 17:35
 To: development discussion for LanguageTool
 Subject: Re: help with English style rules

 Hi,

 I am a native English speaker and I agree with the answer at the
 following link which advises 'for example' does not always have to
 be followed by a comma.

 http://english.stackexchange.com/questions/132359/any-exception-with-commas-before-and-after-for-example


 From the link here's an example of a valid case where a comma is
 not required after 'for example':

 While it is common practice to do recalibration between trials, for
 example in reading research, this is not always possible or feasible.


 I'm not aware of a complete comprehensive English grammar guide
 which is freely available online but I think
 english.stackexchange.com is an excellent resource for these kinds
 of questions.

 Regards
 Robin




 On Wed, Jul 16, 2014 at 5:04 PM, Daniel Naber
 daniel.na...@languagetool.org wrote:


  Hi,

  someone suggests that for example should always be followed by a
  comma:
  https://github.com/languagetool-org/languagetool/issues/136

  My question to the English native speakers:

  1.) Do you agree with this?

  2.) Is there an established style guide that we can use as
 a reference
  for questions like these? If possible, it should be available on the
  internet for free, and it should also be comprehensive
 enough to answer
  most questions.

  Regards
Daniel












Re: Is exception\2/exception supposed to work?

2014-07-20 Thread Marcin Miłkowski
W dniu 2014-07-17 15:06, Dominique Pellé pisze:



 On Thu, Jul 17, 2014 at 2:49 PM, Dominique Pellé
 dominique.pe...@gmail.com wrote:

 Daniel Naber daniel.na...@languagetool.org wrote:

 On 2014-07-17 10:52, Dominique Pellé wrote:

   I glanced at the Polish grammar.xml, but I could not find
 such rules.

 Sorry, I guess my grep command was wrong and I actually found
 match
 outside the exception element.

   cvc-complex-type.2.4.d: Invalid content was found starting with
   element 'match'. No child element is expected at this point.

 I think this answers the original question: it's not supposed to
 work.
 You might try to patch it anyway of course, if it doesn't add a
 lot of
 complexity.


 Hi Daniel

 I'm not sure how to fix it without spending time studying
 the code. But I found a workaround anyway by replacing...

 <token regexp="yes">vue?s?<exception>\2</exception></token>

 ... with...

 <and>
   <token regexp="yes">vue?s?</token>
   <token negate="yes"><match no="1"/></token>
 </and>

 I just committed it in git (French rule VU_DE_MES_YEUX_VU).

 By the way, I also had to use <match no="1"/>
 instead of <match no="2"/>. This is confusing
 to me. It seems that <match no=".."/> counts tokens
 sometimes from 0 (in fact I can see some <match no="0"/>
 in some rules) and sometimes from 1!? Is this explained
 anywhere?


 Answering to myself, as I found the link explaining
 how tokens are numbered with <match no=".."/>.

 === BEGIN QUOTE http://wiki.languagetool.org/development-overview ===
 [...] matches are numbered from zero, so it's <match no="0"/> [...]

 A similar mechanism can be used in suggestions, however there are
 more features, and tokens are numbered from 1 (for compatibility
 with the older notation \1 for the first matched token).
 === END QUOTE ===

 So indeed, depending on what <match no=".."/> is used for,
 tokens are numbered from 0 or from 1. Confusing, but at least
 I understand it now and it's documented.
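
 In other words (schematic):

    <!-- in a pattern/disambiguation context, the first matched token is 0: -->
    <match no="0"/>
    <!-- inside <suggestion>, the same first token is 1, for compatibility
         with the older \1 notation: -->
    <suggestion><match no="1"/></suggestion>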

There was some reason for keeping it this way, and I don't remember it 
right now.

I added a pointer in the wiki in case someone wants to add support for 
matches inside exceptions:

http://wiki.languagetool.org/xml-pattern-rule-extensions

Regards,
Marcin




Re: English native speaker help

2014-07-20 Thread Marcin Miłkowski
W dniu 2014-07-18 17:58, Mike Unwalla pisze:
 Hi Daniel,

 I am a native speaker of English.

 They are used to manage transfers through the PQR.
 LT: This verb is used with the gerund form: used to managing

 Possible false alarm, but only the writer knows. The verb 'used to' is used
 with the gerund. However, the sentence can be parsed as passive voice 'are
 used' + to. (= Some things [they] are used [by people] to manage transfers
 through the PQR.) Thus, I suggest that you change the LT message. (For more
 examples of sentences that can be parsed in more than one way, refer to
 http://www.simplified-english.co.uk/analysis.html.)

 Also because when you're interested in something then it helps learning
 because you're familiar with it.
 LT: The verb 'help' is used with infinitive: to learn

 The usual structure is 'help to do', but 'help verb+ing' is possible in
 'cannot help verb+ing'. Although she has a nasty temper, I cannot help
 liking her.
 In the context of the sentence, the phrase 'then it helps learning' does not
 sound wrong to me. (I do not mean to say that from a traditional grammatical
 perspective, it is not wrong.)

It's not wrong, IMHO, because 'learning' is also an uncountable noun 
(think of 'conditional learning' in the context of biology). I'll add an 
exception for such words ('learning' is recognized as 'NN:U').


 This did not effect his views too much.
 LT: Did you mean: affect

 In this context, LT's evaluation is correct. However, 'effect' as a verb is
 possible, but unusual. Example, Management must effect the change
 immediately. From Longman Dictionary of Contemporary English: verb
 (transitive) formal: to make something happen. Synonym: bring about.

 If any one on the mailing list has had an overlay assessment...
 LT: Did you mean: anyone

 For the example sentence, it is not a false alarm. However, in 'If any one
 is' where 'one' is a pronoun, then it is a false alarm. Example, These
 components are critical. If any one is defective, the system can fail.

 A part of me can't help but think that they are right.
 LT: This is a nonstandard phrase. Use: thinking

 False alarm. Refer to my previous comment about 'help'.

I made this rule a long time ago. Now I have made it off by default and 
added a URL with an explanation. Some style guides consider this 
incorrect, so I leave the rule in the file for users that might need it.


 The hearing is being rushed because the principle is going out of town.
 LT: This word is normally spelled with hyphen: out-of-town

 False alarm. 'Out-of-town' as an adjective is fine. Example, The new
 out-of-town shopping center is very popular. In this sentence, 'out of
 town' is not an adjective.

This rule works without any context, which is why it may frequently be 
wrong. Ditto for all subsequent cases.

Regards,
Marcin



 Highlight key words and ideas.
 LT: Did you mean: keywords

 Possible false alarm, but only the writer knows. If 'key' means 'important'
 (as in 'key concept'), and the writer wants to mean important words and
 important ideas, then it is a false alarm.

 I have a web site on famous dyslexics.
 LT: Did you mean: website

 The choice between 'web site' and 'website' is a style preference. However,
 'website' is probably much more popular than 'web site'. Microsoft Manual of
 Style and the Yahoo! Style Guide recommend 'website'.

 Regards,

 Mike Unwalla
 Contact: www.techscribe.co.uk/techw/contact.htm

 -Original Message-
 From: Daniel Naber [mailto:daniel.na...@languagetool.org]
 Sent: 18 July 2014 14:24
 To: LanguageTool Developer List
 Subject: English native speaker help

 Hi,

 I'm trying to evaluate LT results, but there are some cases where I'm
 not sure if the message by LT is actually a false alarm of not. Could a
 native speaker maybe have a look at these sentences and the LT output
 and let me know if the sentence is actually okay or not, or if it's okay
 but maybe bad style?

 snip

 Thanks
Daniel








Re: New Member to LT - for Tamil

2014-07-14 Thread Marcin Miłkowski
W dniu 2014-07-14 09:12, Elanjelian Venugopal pisze:
 Hi, have installed JDK 1.8.0_05 and tested. No changes. :(

 And, BTW, how do I push my changes to grammar.xml back to you? It
 appears I don't have sufficient permission to push it to the master. -e.

Use a pull request.

Regards,
Marcin



 On 13 July 2014 20:33, Daniel Naber daniel.na...@languagetool.org wrote:

 On 2014-07-13 13:22, Panagiotis Minos wrote:

   There is a bug report about this issue for more than a year, see
   https://bugs.openjdk.java.net/browse/JDK-8008572 [1]

 The bug report says Java 7 is affected - does it maybe work on Java 8
 without a work-around?

 Regards
Daniel


 






Re: Questions about creating a synthesizer dictionary

2014-07-12 Thread Marcin Miłkowski
W dniu 2014-07-11 23:01, Daniel Naber pisze:
 On 2014-07-11 22:43, Dominique Pellé wrote:

 1/ Why does the above command create files in /tmp rather than
 providing command line options to specify the outputs?

 There's no specific reason that I can remember. Feel free to change the
 command.

That would make it definitely more useful.


 2/ LanguageTool source tree contains *.sh and *.pl scripts to
 create  dictionaries for several languages.  But why do none
 of them use the java program
 org.languagetool.dev.SynthDictionaryBuilder
   to build the synthesizer dictionaries?

 These scripts are older than the Java command, i.e. the Java command was
 supposed to replace the scripts. But as I don't know all the languages,
 I could not test it properly. But maybe we should indeed simply delete
 all the script to force people to use the Java command?

You can't force everybody. I will still use my scripts as I have to 
process my dictionary more than the Java command allows.

Regards,
Marcin



Re: Morphologic Analyser to solve concordance issue for Portuguese

2014-07-08 Thread Marcin Miłkowski
W dniu 2014-07-08 17:34, Marco A.G.Pinto pisze:
 Hello!

 I have contacted my Minho University friends who make the pt_PT
 dictionaries for Mozilla and OpenOffice/LibreOffice.

 They said they can create the postag dictionary and help.

But you're reinventing the wheel. Why? There is a good dictionary 
already available in FreeLing. I can add the tagger dictionary in 15 
minutes if you want. Creating the dictionary from hunspell is a *BAD* 
idea if you already have a tagged wordlist.

Regards,
Marcin


 :-P

 Kind regards,
  Marco A.G.Pinto
---


 On 08/07/2014 10:08, Jaume Ortolà i Font wrote:
 2014-07-08 9:37 GMT+02:00 Marcin Miłkowski list-addr...@wp.pl:


 The Portuguese dictionary is already built. We simply haven't included
 it yet because we usually start from a certain number of rules,
 and then
 add the tagger. Using the tags in rules is a very good idea overall.


 I agree with Marcin. The most sensible thing to do is to add the
 Freeling POS tag dictionary for Portuguese. As the same tags are used
 in other languages, existing rules can be used as models, or those who
 are familiar with them can help readily.

 As an example, I have created a rule in the online rule editor for
 non-agreement (determinant plural - noun singular) in Galician.


 <!-- Galician rule, 2014-07-08 -->
 <rule id="ID" name="concordancia determinante substantivo">
   <pattern>
     <token postag='D...P.' postag_regexp='yes'></token>
     <token postag='N..S.*' postag_regexp='yes'>
       <exception postag='N..P.*' postag_regexp='yes'/>
       <exception regexp='yes'>que|de</exception>
     </token>
   </pattern>
   <message>Error de concordancia</message>
   <example type='incorrect'><marker>Os amigo</marker></example>
   <example type='correct'>Os amigos</example>
   <example type='correct'>os dous termos</example>
   <example type='correct'>os que son requiridos</example>
 </rule>


 Regards,
 Jaume Ortolà





Re: These|Those + Singular Noun

2014-06-01 Thread Marcin Miłkowski
W dniu 2014-05-31 12:35, Kumara Bhikkhu pisze:
 Marcin Miłkowski wrote thus at 04:00 PM 31-05-14:
 W dniu 2014-05-31 08:29, Kumara Bhikkhu pisze:
 Here's what it doesn't catch: *I find _those translation_ misleading.

 It does:
 1.) Line 1, column 8, Rule ID: THIS_NNS[2]
 Message: Did you mean 'this translation' or 'those translations'?
 Suggestion: this translation; those translations
 I find those translation misleading.
  ^

 *I've just installed the latest snapshot to be sure.

 Don't you get a match as above?

 Nope, but pardon me, my latest is
 LanguageTool-20140529-snapshot.oxt. Now 2 days old.
 Perhaps you made some changes recently.

Definitely not.


 (FYI, I don't have good connection where I am,
 and rely on others to get large files.)

The error is found also online using our form at http://www.languagetool.org

Maybe it could be useful for you to use the form first. Anyway, if 
there's no match in your installation, it could mean that you disabled 
the rule in OO/LO.

Regards,
Marcin




Re: These|Those + Singular Noun

2014-05-31 Thread Marcin Miłkowski
W dniu 2014-05-31 08:29, Kumara Bhikkhu pisze:
 Marcin Miłkowski wrote thus at 07:31 PM 30-05-14:
 Hm, there's already THIS_NNS[2] rule that finds these|those + singular
 noun. Is there any mistake that it does not find? It definitely detects
 the mistake as specified in your example above.

 You could have told me earlier, you know? Never mind. My analytical side
 of the brain probably needs some workout.

Well, I didn't follow the list for the last couple of days. Sorry. 
Anyway, the web rule editor usually does tell you that the mistake is 
already found. At least it did for me in other cases...


 Here's what it doesn't catch: *I find _those translation_ misleading.

It does:

1.) Line 1, column 8, Rule ID: THIS_NNS[2]
Message: Did you mean 'this translation' or 'those translations'?
Suggestion: this translation; those translations
I find those translation misleading.
^


 *I've just installed the latest snapshot to be sure.

Don't you get a match as above?

Regards,
Marcin




Re: Found in a few grammar.xml files (en, de, ru)

2014-05-30 Thread Marcin Miłkowski
W dniu 2014-05-29 10:01, Marcin Miłkowski pisze:
 W dniu 2014-05-28 21:42, Dominique Pellé pisze:
 Hi

 Searching for > in grammar.xml files, I see things that
 are wrong, or at least suspicious:

 $ ack-grep --xml '>' languagetool-language-modules/*/src

 languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml
 25390:<token negate="yes">></token>
 25400:<token>></token>
 25423:<token negate="yes">></token>

 languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
 10243:<marker>><token postag="CD"/>

 languagetool-language-modules/ru/src/main/resources/org/languagetool/rules/ru/grammar.xml
 935:<!--Перед сравнительным оборотом стоит не или слова:
 совсем, совершенно, почти, именно > - запятая не ставится.


 I'm surprised that tests did not pick up automatically
 the > inside the marker tags, at least in the English
 grammar.xml. The marker tag should never contain text
 but only other sub-tags. Probably this kind of error can
 be detected automatically.

 Indeed, this could be detected during validation. The only problem is
 that the marker tag is used to mark up simple text content inside
 example tags, and it's not so trivial to define XML Schema to allow no
 text content inside one tag (pattern), but some content inside another
 (example). At least, I couldn't find an easy way. Anyway, XML
 specialists are welcome to look at rules.xsd and pattern.xsd.

OK, it turned out that it was pretty easy to set up. Now we test the 
marker element correctly. I found one mistake in French rules this way, 
and I fixed it.
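
For the curious, the trick boils down to giving the marker element two
content models in the schema, local to different parent types — roughly
like this (schematic, not the literal rules.xsd):

    <!-- declared inside the pattern type: element-only content, no text -->
    <xs:element name="marker">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="token" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

    <!-- declared inside the example type: mixed content, so text is fine -->
    <xs:element name="marker">
      <xs:complexType mixed="true">
        <xs:sequence>
          <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>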

Regards,
Marcin



Re: Also without comma

2014-05-30 Thread Marcin Miłkowski
W dniu 2014-05-30 05:28, Kumara Bhikkhu pisze:
 Current LT flags sentences beginning with "Also" without a comma, and
 suggests adding a comma. I think an exception should be made when the
 following word is a verb. E.g.: "Also specify your gender."

Thanks! I fixed this, and I spotted some further mistakes in the rule.

Regards,
Marcin


 kb








Re: These|Those + Singular Noun

2014-05-30 Thread Marcin Miłkowski
W dniu 2014-05-29 06:26, Kumara Bhikkhu pisze:
 I need help on this:

 <rule id="ID" name="These|Those + Singular Noun">
   <pattern>
     <token regexp='yes'>these|those</token>
     <token postag='NN|NN:UN' postag_regexp='yes'>
       <exception postag='IN|VBP' postag_regexp='yes'/>
     </token>
   </pattern>
   <message><suggestion><match no="1"/></suggestion> should be followed
     by a singular noun.</message>
   <short>Grammar</short>
   <example type='incorrect'>I find <marker>these translation</marker>
     misleading.</example>
   <example type='correct'>I find these translations misleading.</example>
   <example type='correct'>I find this translation misleading.</example>
 </rule>

 I get one false alarm:
 113 of _these mollusk_ species have never been collected outside of the
 state

 So, I tried adding this to token 2:

 <exception chunk="E-NP-singular"/>

 But got an error message.

Hm, there's already THIS_NNS[2] rule that finds these|those + singular 
noun. Is there any mistake that it does not find? It definitely detects 
the mistake as specified in your example above.

Regards,
Marcin




Re: Found in a few grammar.xml files (en, de, ru)

2014-05-29 Thread Marcin Miłkowski
W dniu 2014-05-28 21:42, Dominique Pellé pisze:
 Hi

 Searching for > in grammar.xml files, I see things that
 are wrong, or at least suspicious:

 $ ack-grep --xml '>' languagetool-language-modules/*/src

 languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml
 25390:<token negate="yes">></token>
 25400:<token>></token>
 25423:<token negate="yes">></token>

 languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
 10243:<marker>><token postag="CD"/>

 languagetool-language-modules/ru/src/main/resources/org/languagetool/rules/ru/grammar.xml
 935:<!--Перед сравнительным оборотом стоит не или слова:
 совсем, совершенно, почти, именно > - запятая не ставится.


 I'm surprised that tests did not pick up automatically
 the > inside the marker tags, at least in the English
 grammar.xml. The marker tag should never contain text
 but only other sub-tags. Probably this kind of error can
 be detected automatically.

Indeed, this could be detected during validation. The only problem is 
that the marker tag is used to mark up simple text content inside 
example tags, and it's not so trivial to define XML Schema to allow no 
text content inside one tag (pattern), but some content inside another 
(example). At least, I couldn't find an easy way. Anyway, XML 
specialists are welcome to look at rules.xsd and pattern.xsd.

Regards,
Marcin



Re: possible new English rule

2014-05-28 Thread Marcin Miłkowski
W dniu 2014-05-28 13:46, Jaume Ortolà i Font pisze:
 Could it be a useful rule?


 <!-- English rule, 2014-05-28 -->
 <rule id="ID" name="a compete/complete">
   <pattern>
     <token regexp='yes'>a|an|the</token>
     <token postag='VB|VBP' postag_regexp='yes'>
       <exception postag='VB|VBP' postag_regexp='yes' negate_pos='yes'/>
     </token>
   </pattern>
   <message>Probably a bad construction: a/the + infinitive</message>
   <example type='incorrect'><marker>a compete</marker> catastrophe an
     argue in</example>
   <example type='correct'>a complete catastrophe</example>
   <example type='correct'>a show</example>
 </rule>

Looks perfect to me. Please commit it.

Regards,
Marcin




Re: Dump

2014-05-27 Thread Marcin Miłkowski
Hi,

maybe it was because of a simple mistake in the isNumberOrDot() method. 
I fixed it, so today's build should run fine. Could you download the 
nightly and see whether you get crashes on your data?
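
Judging from the trace below, the crash was a charAt(0) call on an empty
string. A minimal sketch of the kind of guard that prevents it (not the
literal patch):

    // hypothetical guard: never call charAt(0) on an empty token
    private static boolean isNumberOrDot(String str) {
      if (str.isEmpty()) {
        return false;
      }
      final char c = str.charAt(0);
      return c == '.' || Character.isDigit(c);
    }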

Best,
Marcin

W dniu 2014-05-27 09:20, R.J. Baars pisze:
 Hi.

 I am currently using languagetool-commandline to check billions of
 paragraphs.

 It works fine, except for some dumps, like the ones below.

 I need the tool to continue, for I need the data. When it has been
 processed, I might try to find the items it crashes on.

 It looks like it is all string things. Could it crash on UTF-8 encoding
 errors?

 Ruud

   java.util.concurrent.ExecutionException:
 java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at
 org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:101)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:576)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:534)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:530)
  at
 org.languagetool.commandline.CommandLineTools.checkText(CommandLineTools.java:96)
  at org.languagetool.commandline.Main.handleLine(Main.java:386)
  at
 org.languagetool.commandline.Main.runOnFileLineByLine(Main.java:286)
  at org.languagetool.commandline.Main.runOnFile(Main.java:166)
  at org.languagetool.commandline.Main.main(Main.java:519)
 Caused by: java.util.concurrent.ExecutionException:
 java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:188)
  at
 org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:98)
  ... 8 more
 Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
 range: 0
  at java.lang.String.charAt(String.java:658)
  at
 org.languagetool.rules.CommaWhitespaceRule.isNumberOrDot(CommaWhitespaceRule.java:130)
  at
 org.languagetool.rules.CommaWhitespaceRule.match(CommaWhitespaceRule.java:92)
  at
 org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:686)
  at
 org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:995)
  at
 org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:962)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
 Exception in thread main java.lang.RuntimeException:
 java.util.concurrent.ExecutionException:
 java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at
 org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:101)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:576)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:534)
  at org.languagetool.JLanguageTool.check(JLanguageTool.java:530)
  at
 org.languagetool.commandline.CommandLineTools.checkText(CommandLineTools.java:96)
  at org.languagetool.commandline.Main.handleLine(Main.java:386)
  at
 org.languagetool.commandline.Main.runOnFileLineByLine(Main.java:286)
  at org.languagetool.commandline.Main.runOnFile(Main.java:166)
  at org.languagetool.commandline.Main.main(Main.java:519)
 Caused by: java.util.concurrent.ExecutionException:
 java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:188)
  at
 org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:98)
  ... 8 more
 Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
 range: 0
  at java.lang.String.charAt(String.java:658)
  at
 org.languagetool.rules.CommaWhitespaceRule.isNumberOrDot(CommaWhitespaceRule.java:130)
  at
 org.languagetool.rules.CommaWhitespaceRule.match(CommaWhitespaceRule.java:92)
  at
 org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:686)
  at
 org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:995)
  at
 org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:962)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
 Exception in 

Re: Dump

2014-05-27 Thread Marcin Miłkowski
W dniu 2014-05-27 12:42, Daniel Naber pisze:
 On 2014-05-27 11:06, Marcin Miłkowski wrote:

 maybe it was because of a simple mistake in the isNumberOrDot() method.
 I fixed it,

 Are you sure you have pushed it? I cannot see it in the list of changes.

Apparently, something weird happened after an IDEA update, and it switched 
to a detached head for no obvious reason. I synced manually; it should 
display now.

Regards,
marcin


 Regards
Daniel







Re: New rule for English

2014-05-19 Thread Marcin Miłkowski
W dniu 2014-05-19 05:21, Kumara Bhikkhu pisze:
 Please consider adding this. I'm unable to test it due to the <and>.

Well, I don't see any mistake being detected here. "It was/is... that" is 
a way to express stress on some facts in the statement. This is 
perfect English, and you could probably find Jane Austen or Charles 
Dickens using such constructions. Therefore, I'm not really sure if we 
need to be stricter than good writers. This belongs to the deplorable 
tradition of nit-picking absurd advice such as Strunk and White's 
disingenuous 'The Elements of Style', which criticized perfect English 
(such as the 'split infinitive', which has never been a mistake in reality).

Regards,
Marcin


   <rule id="It was/is... that" name="It was/is... that">
     <pattern>
       <token>It</token>
       <token regexp='yes'>was|is</token>
       <and>
         <token chunk="B-NP-singular" min="0"/>
         <token postag="PRP$|PDT|POS" postag_regexp="yes"/>
       </and>
       <token chunk="I-NP-singular" min="0" max="-1"/>
       <token chunk="E-NP-singular"/>
       <token>that</token>
     </pattern>
     <message>You may want to make this concise.</message>
     <suggestion>\3 \4 \5</suggestion>
     <short>Wordiness</short>
     <example type='incorrect'><marker>It was her last argument
       that</marker> finally persuaded me.</example>
     <example type='correct'>Her last argument finally persuaded
       me.</example>
   </rule>








Re: False flag: I had to say no to them

2014-05-13 Thread Marcin Miłkowski
Thanks, I just fixed this,

m.

W dniu 2014-05-13 12:59, Kumara Bhikkhu pisze:
 Found a false flag. In this sentence:
 I had to say /no/ to them.
 LT flags "no", saying it probably should be "now".

 kb







Re: homophone detection

2014-05-07 Thread Marcin Miłkowski
W dniu 2014-05-07 16:16, Daniel Naber pisze:
 Hi,

 as you may know, After the Deadline is an Open Source text checker,
 quite similar to LT. It's not maintained anymore, so why not use some of
 its ideas in LT? A paper describing AtD is available at [1], it's
 well-written and provides a good overview of AtD.

 One interesting idea is to detect wrong words based on statistics. AtD
 has a (manually created) set of words that can be easily confused. If
 such a word is found in a text, the probability of that word in its
 context is calculated and compared to the probability of the similar
 words in the same context. If the word from the text is less probable,
 an error is assumed, and a more probable word is suggested.

 If this approach works, it's easier than writing rules: just add a set
 of easily confused words like adapt, adopt to a file, and the rest
 will happen automatically. What you need though is a huge corpus to
 calculate the probabilities. The Google n-gram corpus[2] might be used
 for that.

 AtD has been evaluated against a dyslexia corpus[3] with a recall of
 27%. Running LT on the same corpus (see RealWordCorpusEvaluator), we get
 only 19% recall, and that only considers if an error was detected, not
 if the correction was correct. So there's clearly something to gain for
 LT here.

That may be true but at the same time, I found that AtD almost never 
found mistakes in my English where LT surely did. So I think a hybrid 
approach is a nice idea (see however below).

I also started to play with collocations, and our rule editor could use 
some of the collocation statistics for detecting word confusion:

http://pelcra.pl/hask_pl/Home

The idea is similar to what I used in generating our rules 
automatically. BTW, I got around 100% recall and 40% precision by using 
my method, which is definitely better than AtD. I simply did not 
generate the word confusion sets as I never had the time, and my code 
was composed of different scripts and languages (ultimately, I did not 
use Java TBL). See my paper here:

http://arxiv.org/abs/1211.6887

Note that I never used just a confusion set, but I seeded a clean corpus 
with mistakes. The details are in the paper.

Regards,
Marcin



Re: incorrect antipattern IDs (bug in XML parser?) + antipattern sanity check

2014-05-06 Thread Marcin Miłkowski
W dniu 2014-05-06 00:30, Dominique Pellé pisze:
 Hi

 I've added antipattern sanity checks.

 It detects some problems in antipatterns for German
 and Polish.

 However, I have not checked it in yet because the
 antiPattern.getId() is incorrect. It seems to contain the ID
 of the previous rule, rather than the rule owning the
 antipattern.  I believe that the problem is in the SAX XML
 parser, as /antipattern is found before /pattern
 and the rule ID is set when encountering /pattern
 (not 100% sure whether that's the root cause).
 I have not fixed that.  Maybe Marcin can be quicker to
 fix than me (hint...) :-)

Fixed, indeed ;)

Marcin





Re: Possible bug in XML rule/disambiguation parsing

2014-05-05 Thread Marcin Miłkowski
Hi,

W dniu 2014-05-04 07:06, Dominique Pellé pisze:
 Hi

 I've added a new pattern rule checker
 (commit commit e26967dc4663283574a8d536308c13ad188b44a0)
 and it finds this issue:

 The Catalan rule: FORCA2:6, token [1], contains força
   that contains token separators, so can't possibly
 be matched.
 The Catalan rule: FORCA2:7, token [1], contains força
   that contains token separators, so can't possibly
 be matched.

 The problem is detected in
   
 languagetool-language-modules/ca/target/classes/org/languagetool/resource/ca/disambiguation.xml
 which looks like this:

 <rule>
  <pattern>
  <marker>
  <token postag="_GN_FS">força<exception postag="_GV_"/>
  </token>
  </marker>
  </pattern>
  <disambig action="filter" postag="N.*|_GN_.*"></disambig>
 </rule>

 It means that the newline and spaces after
 the <exception…/> are slurped into the
 value of the token, which is unexpected
 to me.

 Removing the spaces and newline after the
 exception, as follows, silences the error, but I
 wonder whether spaces and newline should not
 have been removed automatically after the exception:

Yes, they should. I need to add the same code as I used in the grammar 
rule pattern loader.

Regards,
Marcin



Re: Possible bug in XML rule/disambiguation parsing

2014-05-05 Thread Marcin Miłkowski
W dniu 2014-05-05 11:21, Marcin Miłkowski pisze:
 Hi,

 W dniu 2014-05-04 07:06, Dominique Pellé pisze:
 Hi

 I've added a new pattern rule checker
 (commit commit e26967dc4663283574a8d536308c13ad188b44a0)
 and it finds this issue:

 The Catalan rule: FORCA2:6, token [1], contains força
that contains token separators, so can't possibly
 be matched.
 The Catalan rule: FORCA2:7, token [1], contains força
that contains token separators, so can't possibly
 be matched.

 The problem is detected in

 languagetool-language-modules/ca/target/classes/org/languagetool/resource/ca/disambiguation.xml
 which looks like this:

 <rule>
  <pattern>
  <marker>
  <token postag="_GN_FS">força<exception postag="_GV_"/>
  </token>
  </marker>
  </pattern>
  <disambig action="filter" postag="N.*|_GN_.*"></disambig>
 </rule>

 It means that the newline and spaces after
 the <exception…/> are slurped into the
 value of the token, which is unexpected
 to me.

 Removing the spaces and newline after the
 exception, as follows, silences the error, but I
 wonder whether spaces and newline should not
 have been removed automatically after the exception:

 Yes, they should. I need to add the same code as I used in the grammar
 rule pattern loader.

I just fixed this. Now, all whitespace around tokens is trimmed, and all 
repeating whitespace is removed. _But_ single spaces inside the word are 
retained (this was needed as the Catalan tokenizer keeps numbers as single 
tokens, even if they contain spaces).
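
In effect something like this, though the real code lives in the Element
class:

    // illustrative: trim outer whitespace and collapse inner runs of
    // whitespace to one space, so single inner spaces survive
    String normalized = rawTokenText.trim().replaceAll("\\s{2,}", " ");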

The same code works now in the disambiguator and the grammar rules 
(basically, I moved the space trimming call to the Element class).

I touched a Danish rule because I thought some whitespace could not be 
retained, and I changed the regex to one that is a little bit faster. So 
I kept my change there.

Regards,
Marcin



Re: Suggestion: find POS tag of portion of a word in XML rules

2014-04-29 Thread Marcin Miłkowski
W dniu 2014-04-29 07:02, Dominique Pellé pisze:
 Daniel Naber daniel.na...@languagetool.org wrote:

 On 2014-04-27 22:18, Dominique Pellé wrote:

   <token regexp="yes" postag_group1="foo">ez-(.*)</token>

 I'm not sure how this could be implemented in a clean way... wouldn't
 this be a rather ugly special case in the tagger to ignore the
 tokenization and also split at the hyphen?



 I'm not sure either how it would be implemented not knowing
 that code well enough. I have not tried to implement it. But I don't
 think there should be a special case for the hyphen. My example
 contains a hyphen, but hyphens should not be special.
 The POS tag should rather be probed on the
 pattern.matcher.group(1) or the regexp.

 It's not an ugly case. It's a useful general-purpose feature, which
 can avoid writing Java rules. Writing a Java rule is uglier.

 Another example where I could use it is for French conjugated
 verbs in interrogations such as Peux-tu  (=can you),
 Peut-il (=can he)... where the verb and the pronoun are in
 the same token in interrogations (again with a hyphen in this
 example).

 Right now, erroneous French conjugations such as *Peut-tu* are
 not detected as an error by LanguageTool (false negative).
 I could detect it as an error if I could do something more or less
 like this:

 <pattern>
   <token regexp="yes" postag_group1="V.*" postag_group1_regexp="yes">(.*)-tu
     <exception regexp="yes" postag_group1="V.* 2 .*"
                postag_group1_regexp="yes">(.*)-tu</exception>
   </token>
 </pattern>

 This would check that what matches (.*) in the token is a
 conjugated verb in the 2nd person singular form (i.e. "V.* 2 .*").

 The French grammar checker Grammalecte, based on Lightproof,
 correctly detects *peut-tu* as an error. Grammalecte and Lightproof do
 not tokenize, so it's quite different from LanguageTool. Glancing at
 Grammalecte rules (Grammalecte-0.3.9.1/fr-rules.txt), it detects the
 error using such a rule:

 (\w+)-tu - option(inte) and not morph(\1,
 po:(.pre|.imp|ipsi|ifut|cond).* po:2sg, False) and spell(\1) and not
 re.match((?i)vite$, \1)
  -1 _# Forme interrogative. « \1 » n’est pas un verbe à la
 deuxième personne du singulier.

Well, why should we invent a new piece of XML machinery when we already 
have something similar with the match element? Basically, you want to 
search and replace the token surface form, and then tag it. I think we 
could simply adapt the syntax we already use for the synthesizer:

<token postag="V.*" postag_regexp='yes'><match regexp_match="(.*)-tu" 
regexp_replace="$1" setpos="yes"/>[whatever you want here]</token>

And this would simply apply the regexp replace and run the tagger on it.

Note that this syntax is almost correct right now, and LT won't complain 
about it; only weird things will happen, as it doesn't have any 
consistent semantics. Almost, because you need to say:

<token postag="V.*" postag_regexp='yes'><match no="3" 
regexp_match="(.*)-tu" regexp_replace="$1" setpos="yes"/>[whatever you 
want here]</token>

And the @no attribute is required.

So basically, the functionality is almost there and it would be fairly 
easy to add it via the reference setting in our code.

All in all, there are several more uses of match that are not yet 
fully supported and the code is a bit ugly, I must say.

Regards,
Marcin



Re: infix vs. prefix in morfologik

2014-04-26 Thread Marcin Miłkowski
W dniu 2014-04-26 20:07, Daniel Naber pisze:
 Hi Marcin and all,

 this is an older change, but I wonder: doesn't infix encoding imply
 prefix encoding? If so, shouldn't then the if .. else if be the other
 way round here in line 73 (DictionaryBuilder.java)?

 https://github.com/languagetool-org/languagetool/commit/29c5215dd516bade89ad79e98fd770680b9337c4#diff-c308bb8e112132635dbb79fd42c83c89

I don't think it makes any difference as you cannot use several encoders 
at once anyway, as far as I remember.

Regards,
Marcin




