Re: SRX

2014-08-19 Thread R.J. Baars
I am currently checking the output of all the rules on the 20GB corpus; some rules are perfect, some less (though hard to tweak). Result will be a major update, I guess... Ruud > On 2014-08-18 17:18, R.J. Baars wrote: > >> I was able to test, and removed 2 of my additions to make it work. > > Th

Re: SRX

2014-08-19 Thread Daniel Naber
On 2014-08-18 17:18, R.J. Baars wrote: > I was able to test, and removed 2 of my additions to make it work. Thanks, I have committed it. It will be part of the daily builds tonight: https://languagetool.org/download/snapshots/?C=M;O=D Regards Daniel

Re: SRX

2014-08-18 Thread Daniel Naber
On 2014-08-18 16:56, R.J. Baars wrote: > But how can I test it if it is not in the runtime version? It's inside libs/languagetool-core.jar, which is just a ZIP file. Unzip it, edit segment.srx, re-zip it and test it. Regards Daniel
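The unzip, edit, re-zip steps can also be scripted. The following is a minimal sketch only, not part of LanguageTool: `replace_member` is a hypothetical helper name, and the member path used in the usage note is an assumption derived from the source location of segment.srx, so check it against the actual jar contents first.

```python
import zipfile


def replace_member(jar_path, member, new_data, out_path):
    """Copy the archive at jar_path to out_path, swapping one member's bytes.

    A ZIP member cannot be edited in place, so every entry is copied into a
    new archive and only the requested member gets the new content.
    """
    found = False
    with zipfile.ZipFile(jar_path) as src, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                data = new_data
                found = True
            # Reusing the original ZipInfo keeps the entry name and metadata.
            dst.writestr(info, data)
    if not found:
        raise KeyError(member)
    return out_path
```

Usage might look like `replace_member("libs/languagetool-core.jar", "org/languagetool/resource/segment.srx", open("segment.srx", "rb").read(), "languagetool-core-edited.jar")`, where the member path is assumed rather than confirmed; keep a backup of the original jar before swapping in the edited copy.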

Re: SRX

2014-08-18 Thread R.J. Baars
I am not qualified to edit sources; I am just not a programmer. Unfortunately, the srx is not separated per language either. I found the source on Github (which I don't really understand) so I will be able to adjust it, and send it to you. But how can I test it if it is not in the runtime version? Ruud

Re: SRX

2014-08-18 Thread Daniel Naber
On 2014-08-18 16:16, R.J. Baars wrote: > There is an adjustment to make in the sentence splitter. But where did the .srx go? It's at languagetool-core/src/main/resources/org/languagetool/resource/segment.srx > Could this be added to the Dutch srx rules? Sure, could

Re: SRX

2014-08-18 Thread R.J. Baars
Same applies to [0-9]{1,2}[-]pers. Ruud > There is an adjustment to make in the sentence splitter. But where did the > .srx go? > > I detected an abbreviation that is commonly used and as for now seen as > sentence end: > > milj. > > Could this be added to th

SRX

2014-08-18 Thread R.J. Baars
There is an adjustment to make in the sentence splitter. But where did the .srx go? I detected an abbreviation that is commonly used and is currently treated as a sentence end: milj. Could this be added to the Dutch srx rules? Ruud
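To illustrate what such an exception does, here is a small sketch in Python rather than SRX. It is not LanguageTool's implementation: the splitter, `split_sentences`, and both patterns are assumptions made for illustration, with the second pattern modelled on the [0-9]{1,2}[-]pers. case from the follow-up message.

```python
import re

# Hypothetical "no break" exceptions: text ending in one of these
# abbreviations is not treated as the end of a sentence.
NO_BREAK_BEFORE = [
    re.compile(r"\bmilj\.$"),
    re.compile(r"\b[0-9]{1,2}-pers\.$"),
]

# Naive break candidate: sentence punctuation followed by whitespace.
BREAK = re.compile(r"[.!?]\s+")


def split_sentences(text):
    """Split on BREAK unless the text before the break ends in an exception."""
    sentences, start = [], 0
    for m in BREAK.finditer(text):
        # Text of the current sentence up to and including the punctuation.
        before = text[start:m.start() + 1]
        if any(p.search(before) for p in NO_BREAK_BEFORE):
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With the exceptions in place, "Het kost 3 milj. euro. Dat is veel." stays as two sentences instead of being cut after "milj.".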

Re: switching to SRX sentence tokenizer

2014-04-14 Thread Daniel Naber
On 2014-04-12 11:45, Marcin Miłkowski wrote: > Of course, we could make it possible to use another .srx file but then > a > new language module would be incompatible with others, and more work > would be needed to integrate it. Do we want it? There's now a new class LocalSRX

Re: switching to SRX sentence tokenizer

2014-04-12 Thread Marcin Miłkowski
On 2014-04-12 09:55, Daniel Naber wrote: > On 2014-04-12 09:34, Marcin Miłkowski wrote: > >> SRX file can be easily edited and we will happily accept all patches, >> also for languages without complete support in LT. Where's the problem? > > Today, you can extend

Re: switching to SRX sentence tokenizer

2014-04-12 Thread Daniel Naber
On 2014-04-12 09:34, Marcin Miłkowski wrote: > SRX file can be easily edited and we will happily accept all patches, > also for languages without complete support in LT. Where's the problem? Today, you can extend the Language class and have a Regex-based tokenizer with your

Re: switching to SRX sentence tokenizer

2014-04-12 Thread Marcin Miłkowski
On 2014-04-11 22:16, Daniel Naber wrote: > Hi, > > the following languages have been switched to use an SRX-based sentence > tokenizer so we use the same approach for all languages and not a > mixture of different methods: > > Asturian, Italian, Lithuanian, Malayalam, S

switching to SRX sentence tokenizer

2014-04-11 Thread Daniel Naber
Hi, the following languages have been switched to use an SRX-based sentence tokenizer so we use the same approach for all languages and not a mixture of different methods: Asturian, Italian, Lithuanian, Malayalam, Swedish, Tagalog I don't speak these languages so I cannot properly tes

Re: SRX Sentence Tokenizer

2013-05-02 Thread Marcin Miłkowski
Most srx-compliant software uses a single file for all languages, AFAIK. Regards, Marcin On 02-05-2013 09:08, "Daniel Naber" wrote: > On 01.05.2013, 12:18:41 Andriy Rysin wrote: > > > P.S. BTW would not it make sense to split segment.srx by language > > m

Re: SRX Sentence Tokenizer

2013-05-02 Thread Daniel Naber
On 01.05.2013, 12:18:41 Andriy Rysin wrote: > P.S. BTW would not it make sense to split segment.srx by language > modules? Absolutely. This isn't very high on my personal TODO list though, so any help/patches are welcome. Regards Daniel -- http://www.danielnaber.de

Re: SRX Sentence Tokenizer

2013-05-01 Thread Andriy Rysin
Thanks, that helped! Andriy On 05/01/2013 02:54 PM, Piotr wrote: Maybe the part after the \. should be in the afterbreak element? Regards, Piotr On Wed, May 1, 2013 at 6:18 PM, Andriy Rysin wrote: Hi all I need a bit of help with srx sentence

Re: SRX Sentence Tokenizer

2013-05-01 Thread Piotr
Maybe the part after the \. should be in the afterbreak element? Regards, Piotr On Wed, May 1, 2013 at 6:18 PM, Andriy Rysin wrote: > Hi all > > I need a bit help with srx sentence tokenizer, I've added this rule to > prevent sentence split on Name abbreviation+Surname,
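The suggestion relies on how an SRX rule is split in two: the beforebreak pattern must match the text ending at a candidate position, and the afterbreak pattern must match the text starting there. A rough Python sketch of that check follows; the two patterns are hypothetical stand-ins for illustration, not the rule from the original message.

```python
import re

# Hypothetical patterns: a single capital initial plus period before the
# position, and a capitalised surname starting right after it.
BEFORE = r"\b[А-ЯІЇЄҐ]\."
AFTER = r"[А-ЯІЇЄҐ]"


def rule_matches_at(text, pos, before=BEFORE, after=AFTER):
    """Return True if `before` matches ending at pos and `after` starts at pos."""
    ends_here = re.search(r"(?:%s)\Z" % before, text[:pos]) is not None
    starts_here = re.match(after, text[pos:]) is not None
    return ends_here and starts_here
```

For "Т.Шевченко" the check succeeds at the position right after the period, which is where a no-break rule would suppress the split. Keeping the surname part in the after-pattern means it is tested against the text that follows the position instead of being required in the text that precedes it.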

SRX Sentence Tokenizer

2013-05-01 Thread Andriy Rysin
Hi all I need a bit of help with the srx sentence tokenizer. I've added this rule to prevent a sentence split on Name abbreviation+Surname, e.g. "Т.Шевченко", which often occurs in texts. The rule will need to be a bit more complex but I am trying something simple first. \b[А-ЯІЇЄҐ]\.[А-Я