Re: Pluggable preprocessing and OpenNLP

2017-01-18 Thread Matt Post
Hi,

Sorry, what file format are you talking about? Can you point me to an example 
of the Moses file format? Is this just plain text, one sentence per line?

In general the Moses format is the standard, to the extent that there are any 
standards in MT (they are all mostly informal).

matt

PS. Are you on dev@joshua, or do I need to keep CC'ing you at your address?


> On Jan 16, 2017, at 5:42 PM, Joern Kottmann  wrote:
> 
> Hello,
> 
> we came to the conclusion that it would make sense to add direct
> formats support for letsmt and moses files.
> 
> Here our two issues:
> https://issues.apache.org/jira/browse/OPENNLP-938
> https://issues.apache.org/jira/browse/OPENNLP-939
> 
> Does it make sense for you if we support those formats?
> Did we miss an important format?
> 
> The training works quite fine, but it will take me a bit more time to
> get the evaluation to return something useful. The OpenNLP Sentence
> Detector can only split on end-of-sentence (eos) chars. And if there is
> a sentence without an eos chars it gets treated as a mistake by the
> evaluation.
> 
> Do you have a specific language which would be good for testing for
> you?
> 
> The tokenizer can probably trained as well, I saw a couple of tokenized
> data sets. Maybe that makes sense for you too.
> 
> Jörn
> 
> 
> 
> On Fri, 2017-01-13 at 09:48 -0500, Matt Post wrote:
>> Hi Jörn,
>> 
>> [Sent again without the picture since Apache rejects those,
>> unfortunately...]
>> 
>> You just need monolingual text, so I suggest downloading either the
>> tokenized or untokenized versions. Unfortunately, Opus doesn't make
>> it easy to provide directly links to individual languages. But do
>> this:
>> 
>> 1. Go to http://opus.lingfil.uu.se
>> 
>> 2. Choose de → en (or some other language pair)
>> 
>> 3. In the "mono" or "raw" columns (depending on whether you want
>> tokenized or untokenized text), click the language file for the
>> dataset you want.
>> 
>> matt
>> 
>> 
>>> On Jan 12, 2017, at 6:07 AM, Joern Kottmann 
>>> wrote:
>>> 
>>> Do you have a pointer to an actual file? Or download package?
>>> 
>>> Jörn
>>> 
>>> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili >> gmail.com
 wrote:
 I think the parallel corpuses are taken from [1], so we could
 start with
 training sentdetect for language packs at [2].
 
 Regards,
 Tommaso
 
 [1] : http://opus.lingfil.uu.se/
 [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language
 +Packs
 
 Il giorno lun 9 gen 2017 alle ore 11:39 Joern Kottmann >>> gmail.com
 ha scritto:
 
> Sorry, for late reply, can you point me to a link for the
> parallel
 corpus?
> We might just want to add formats support for it to OpenNLP.
> 
> Do you use tokenize.pl for all languages or do you have
> language
 specific
> heuristics?
> It would be great to have an additional more capable rule based
> tokenizer
> in OpenNLP.
> 
> The sentence splitter can be trained on a few thousand
> sentences or so, I
> think that will work out nicely.
> 
> Jörn
> 
> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post 
> wrote:
> 
>>> On Dec 21, 2016, at 10:36 AM, Joern Kottmann >> l.com>
> wrote:
>>> I am happy to support a bit with this, we can also see if
>>> things in
>> OpenNLP
>>> need to be changed to make this work smoothly.
>> 
>> Great!
>> 
>> 
>>> One challenge is to train OpenNLP on all the languages you
>>> support.
 Do
>> you
>>> have training data that could be used to train the
>>> tokenizer and
> sentence
>>> detector?
>> 
>> For the sentence-splitter, I imagine you could make use of
>> the source
> side
>> of our parallel corpus, which has thousands to millions of
>> sentences,
 one
>> per line.
>> 
>> For tokenization (and normalization), we don't typically
>> train models
 but
>> instead use a set of manually developed heuristics, which may
>> or may
 not
> be
>> sentence-specific. See
>> 
>>https://github.com/apache/incubator-joshua/blob/master
>> /
>> scripts/preparation/tokenize.pl
>> 
>> How much training data do you generally need for each task?
>> 
>> 
>>> Jörn
>>> 
>> 
>> 



Re: Pluggable preprocessing and OpenNLP

2017-01-16 Thread Joern Kottmann
Hello,

we came to the conclusion that it would make sense to add direct
formats support for letsmt and moses files.

Here our two issues:
https://issues.apache.org/jira/browse/OPENNLP-938
https://issues.apache.org/jira/browse/OPENNLP-939

Does it make sense for you if we support those formats?
Did we miss an important format?

The training works quite fine, but it will take me a bit more time to
get the evaluation to return something useful. The OpenNLP Sentence
Detector can only split on end-of-sentence (eos) chars. And if there is
a sentence without an eos chars it gets treated as a mistake by the
evaluation.

Do you have a specific language which would be good for testing for
you?

The tokenizer can probably trained as well, I saw a couple of tokenized
data sets. Maybe that makes sense for you too.

Jörn



On Fri, 2017-01-13 at 09:48 -0500, Matt Post wrote:
> Hi Jörn,
> 
> [Sent again without the picture since Apache rejects those,
> unfortunately...]
> 
> You just need monolingual text, so I suggest downloading either the
> tokenized or untokenized versions. Unfortunately, Opus doesn't make
> it easy to provide directly links to individual languages. But do
> this:
> 
> 1. Go to http://opus.lingfil.uu.se
> 
> 2. Choose de → en (or some other language pair)
> 
> 3. In the "mono" or "raw" columns (depending on whether you want
> tokenized or untokenized text), click the language file for the
> dataset you want.
> 
> matt
> 
> 
> > On Jan 12, 2017, at 6:07 AM, Joern Kottmann 
> > wrote:
> > 
> > Do you have a pointer to an actual file? Or download package?
> > 
> > Jörn
> > 
> > On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili  > gmail.com
> > > wrote:
> > > I think the parallel corpuses are taken from [1], so we could
> > > start with
> > > training sentdetect for language packs at [2].
> > > 
> > > Regards,
> > > Tommaso
> > > 
> > > [1] : http://opus.lingfil.uu.se/
> > > [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language
> > > +Packs
> > > 
> > > Il giorno lun 9 gen 2017 alle ore 11:39 Joern Kottmann  > > gmail.com
> > > ha scritto:
> > > 
> > > > Sorry, for late reply, can you point me to a link for the
> > > > parallel
> > > corpus?
> > > > We might just want to add formats support for it to OpenNLP.
> > > > 
> > > > Do you use tokenize.pl for all languages or do you have
> > > > language
> > > specific
> > > > heuristics?
> > > > It would be great to have an additional more capable rule based
> > > > tokenizer
> > > > in OpenNLP.
> > > > 
> > > > The sentence splitter can be trained on a few thousand
> > > > sentences or so, I
> > > > think that will work out nicely.
> > > > 
> > > > Jörn
> > > > 
> > > > On Wed, Dec 21, 2016 at 7:24 PM, Matt Post 
> > > > wrote:
> > > > 
> > > > > > On Dec 21, 2016, at 10:36 AM, Joern Kottmann  > > > > > l.com>
> > > > wrote:
> > > > > > I am happy to support a bit with this, we can also see if
> > > > > > things in
> > > > > OpenNLP
> > > > > > need to be changed to make this work smoothly.
> > > > > 
> > > > > Great!
> > > > > 
> > > > > 
> > > > > > One challenge is to train OpenNLP on all the languages you
> > > > > > support.
> > > Do
> > > > > you
> > > > > > have training data that could be used to train the
> > > > > > tokenizer and
> > > > sentence
> > > > > > detector?
> > > > > 
> > > > > For the sentence-splitter, I imagine you could make use of
> > > > > the source
> > > > side
> > > > > of our parallel corpus, which has thousands to millions of
> > > > > sentences,
> > > one
> > > > > per line.
> > > > > 
> > > > > For tokenization (and normalization), we don't typically
> > > > > train models
> > > but
> > > > > instead use a set of manually developed heuristics, which may
> > > > > or may
> > > not
> > > > be
> > > > > sentence-specific. See
> > > > > 
> > > > >    https://github.com/apache/incubator-joshua/blob/master
> > > > > /
> > > > > scripts/preparation/tokenize.pl
> > > > > 
> > > > > How much training data do you generally need for each task?
> > > > > 
> > > > > 
> > > > > > Jörn
> > > > > > 
> 
> 


Re: Pluggable preprocessing and OpenNLP

2017-01-13 Thread Matt Post
Hi Jörn,

[Sent again without the picture since Apache rejects those, unfortunately...]

You just need monolingual text, so I suggest downloading either the tokenized 
or untokenized versions. Unfortunately, Opus doesn't make it easy to provide 
directly links to individual languages. But do this:

1. Go to http://opus.lingfil.uu.se 

2. Choose de → en (or some other language pair)

3. In the "mono" or "raw" columns (depending on whether you want tokenized or 
untokenized text), click the language file for the dataset you want.

matt


> On Jan 12, 2017, at 6:07 AM, Joern Kottmann  > wrote:
> 
> Do you have a pointer to an actual file? Or download package?
> 
> Jörn
> 
> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili  
>> wrote:
> 
>> I think the parallel corpuses are taken from [1], so we could start with
>> training sentdetect for language packs at [2].
>> 
>> Regards,
>> Tommaso
>> 
>> [1] : http://opus.lingfil.uu.se/ 
>> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs 
>> 
>> 
>> Il giorno lun 9 gen 2017 alle ore 11:39 Joern Kottmann > 
>>> 
>> ha scritto:
>> 
>>> Sorry, for late reply, can you point me to a link for the parallel
>> corpus?
>>> We might just want to add formats support for it to OpenNLP.
>>> 
>>> Do you use tokenize.pl for all languages or do you have language
>> specific
>>> heuristics?
>>> It would be great to have an additional more capable rule based tokenizer
>>> in OpenNLP.
>>> 
>>> The sentence splitter can be trained on a few thousand sentences or so, I
>>> think that will work out nicely.
>>> 
>>> Jörn
>>> 
>>> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post >> > wrote:
>>> 
 
> On Dec 21, 2016, at 10:36 AM, Joern Kottmann  >
>>> wrote:
> 
> I am happy to support a bit with this, we can also see if things in
 OpenNLP
> need to be changed to make this work smoothly.
 
 Great!
 
 
> One challenge is to train OpenNLP on all the languages you support.
>> Do
 you
> have training data that could be used to train the tokenizer and
>>> sentence
> detector?
 
 For the sentence-splitter, I imagine you could make use of the source
>>> side
 of our parallel corpus, which has thousands to millions of sentences,
>> one
 per line.
 
 For tokenization (and normalization), we don't typically train models
>> but
 instead use a set of manually developed heuristics, which may or may
>> not
>>> be
 sentence-specific. See
 
https://github.com/apache/incubator-joshua/blob/master/ 
 
 scripts/preparation/tokenize.pl
 
 How much training data do you generally need for each task?
 
 
> 
> Jörn
> ​
 
 
>>> 
>> 



Re: Pluggable preprocessing and OpenNLP

2017-01-12 Thread Joern Kottmann
Do you have a pointer to an actual file? Or download package?

Jörn

On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili  wrote:

> I think the parallel corpuses are taken from [1], so we could start with
> training sentdetect for language packs at [2].
>
> Regards,
> Tommaso
>
> [1] : http://opus.lingfil.uu.se/
> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>
> Il giorno lun 9 gen 2017 alle ore 11:39 Joern Kottmann  >
> ha scritto:
>
> > Sorry, for late reply, can you point me to a link for the parallel
> corpus?
> > We might just want to add formats support for it to OpenNLP.
> >
> > Do you use tokenize.pl for all languages or do you have language
> specific
> > heuristics?
> > It would be great to have an additional more capable rule based tokenizer
> > in OpenNLP.
> >
> > The sentence splitter can be trained on a few thousand sentences or so, I
> > think that will work out nicely.
> >
> > Jörn
> >
> > On Wed, Dec 21, 2016 at 7:24 PM, Matt Post  wrote:
> >
> > >
> > > > On Dec 21, 2016, at 10:36 AM, Joern Kottmann 
> > wrote:
> > > >
> > > > I am happy to support a bit with this, we can also see if things in
> > > OpenNLP
> > > > need to be changed to make this work smoothly.
> > >
> > > Great!
> > >
> > >
> > > > One challenge is to train OpenNLP on all the languages you support.
> Do
> > > you
> > > > have training data that could be used to train the tokenizer and
> > sentence
> > > > detector?
> > >
> > > For the sentence-splitter, I imagine you could make use of the source
> > side
> > > of our parallel corpus, which has thousands to millions of sentences,
> one
> > > per line.
> > >
> > > For tokenization (and normalization), we don't typically train models
> but
> > > instead use a set of manually developed heuristics, which may or may
> not
> > be
> > > sentence-specific. See
> > >
> > > https://github.com/apache/incubator-joshua/blob/master/
> > > scripts/preparation/tokenize.pl
> > >
> > > How much training data do you generally need for each task?
> > >
> > >
> > > >
> > > > Jörn
> > > > ​
> > >
> > >
> >
>


Re: Pluggable preprocessing and OpenNLP

2017-01-11 Thread Tommaso Teofili
I think the parallel corpuses are taken from [1], so we could start with
training sentdetect for language packs at [2].

Regards,
Tommaso

[1] : http://opus.lingfil.uu.se/
[2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

Il giorno lun 9 gen 2017 alle ore 11:39 Joern Kottmann 
ha scritto:

> Sorry, for late reply, can you point me to a link for the parallel corpus?
> We might just want to add formats support for it to OpenNLP.
>
> Do you use tokenize.pl for all languages or do you have language specific
> heuristics?
> It would be great to have an additional more capable rule based tokenizer
> in OpenNLP.
>
> The sentence splitter can be trained on a few thousand sentences or so, I
> think that will work out nicely.
>
> Jörn
>
> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post  wrote:
>
> >
> > > On Dec 21, 2016, at 10:36 AM, Joern Kottmann 
> wrote:
> > >
> > > I am happy to support a bit with this, we can also see if things in
> > OpenNLP
> > > need to be changed to make this work smoothly.
> >
> > Great!
> >
> >
> > > One challenge is to train OpenNLP on all the languages you support. Do
> > you
> > > have training data that could be used to train the tokenizer and
> sentence
> > > detector?
> >
> > For the sentence-splitter, I imagine you could make use of the source
> side
> > of our parallel corpus, which has thousands to millions of sentences, one
> > per line.
> >
> > For tokenization (and normalization), we don't typically train models but
> > instead use a set of manually developed heuristics, which may or may not
> be
> > sentence-specific. See
> >
> > https://github.com/apache/incubator-joshua/blob/master/
> > scripts/preparation/tokenize.pl
> >
> > How much training data do you generally need for each task?
> >
> >
> > >
> > > Jörn
> > > ​
> >
> >
>


Re: Pluggable preprocessing and OpenNLP

2017-01-09 Thread Joern Kottmann
Sorry, for late reply, can you point me to a link for the parallel corpus?
We might just want to add formats support for it to OpenNLP.

Do you use tokenize.pl for all languages or do you have language specific
heuristics?
It would be great to have an additional more capable rule based tokenizer
in OpenNLP.

The sentence splitter can be trained on a few thousand sentences or so, I
think that will work out nicely.

Jörn

On Wed, Dec 21, 2016 at 7:24 PM, Matt Post  wrote:

>
> > On Dec 21, 2016, at 10:36 AM, Joern Kottmann  wrote:
> >
> > I am happy to support a bit with this, we can also see if things in
> OpenNLP
> > need to be changed to make this work smoothly.
>
> Great!
>
>
> > One challenge is to train OpenNLP on all the languages you support. Do
> you
> > have training data that could be used to train the tokenizer and sentence
> > detector?
>
> For the sentence-splitter, I imagine you could make use of the source side
> of our parallel corpus, which has thousands to millions of sentences, one
> per line.
>
> For tokenization (and normalization), we don't typically train models but
> instead use a set of manually developed heuristics, which may or may not be
> sentence-specific. See
>
> https://github.com/apache/incubator-joshua/blob/master/
> scripts/preparation/tokenize.pl
>
> How much training data do you generally need for each task?
>
>
> >
> > Jörn
> > ​
>
>


Re: Pluggable preprocessing and OpenNLP

2016-12-22 Thread Tommaso Teofili
created https://issues.apache.org/jira/browse/JOSHUA-326 for this

Il giorno mer 21 dic 2016 alle ore 19:38 Matt Post  ha
scritto:

> 7 → master is indeed the plan, as soon as we ship 6.1.
>
> matt
>
>
> > On Dec 21, 2016, at 1:25 PM, Tommaso Teofili 
> wrote:
> >
> > Il giorno mer 21 dic 2016 alle ore 16:00 Matt Post  ha
> > scritto:
> >
> >> Sure, that'd be nice to do. I'd love to get rid of the Perl scripts. Are
> >> you just throwing out an idea or are you interested in doing this?
> >
> >
> > I'd be happy to do it. If Joern can help out that'd be of course very
> > appreciated.
> >
> >
> >> I think the way to go would be to set this up on a branch (off 7), and
> >> then I could test it on some languages.
> >>
> >
> > sure, and hopefully branch 7 becomes our new master soon after the 6.1
> > release.
> >
> > Regards,
> > Tommaso
> >
> >
> >>
> >>
> >>> On Dec 21, 2016, at 5:33 AM, Tommaso Teofili <
> tommaso.teof...@gmail.com>
> >> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I was talking to Joern (Apache OpenNLP committer) recently and it came
> up
> >>> the idea that we could use OpenNLP for the data preprocessing phase in
> >>> Joshua as to allow tokenization, sentence detection, etc.
> >>> As I was reading through our doc [1] this is currently done with
> >> dedicated
> >>> scripts; we could make that part pluggable (with a default simple Java
> >>> implementation) and allow more fine grained control over it using
> >> libraries
> >>> like OpenNLP:
> >>>
> >>> What would people think?
> >>>
> >>> Regards,
> >>> Tommaso
> >>>
> >>> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Project+Ideas
> >>
> >>
>
>


Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post
7 → master is indeed the plan, as soon as we ship 6.1.

matt


> On Dec 21, 2016, at 1:25 PM, Tommaso Teofili  
> wrote:
> 
> Il giorno mer 21 dic 2016 alle ore 16:00 Matt Post  ha
> scritto:
> 
>> Sure, that'd be nice to do. I'd love to get rid of the Perl scripts. Are
>> you just throwing out an idea or are you interested in doing this?
> 
> 
> I'd be happy to do it. If Joern can help out that'd be of course very
> appreciated.
> 
> 
>> I think the way to go would be to set this up on a branch (off 7), and
>> then I could test it on some languages.
>> 
> 
> sure, and hopefully branch 7 becomes our new master soon after the 6.1
> release.
> 
> Regards,
> Tommaso
> 
> 
>> 
>> 
>>> On Dec 21, 2016, at 5:33 AM, Tommaso Teofili 
>> wrote:
>>> 
>>> Hi all,
>>> 
>>> I was talking to Joern (Apache OpenNLP committer) recently and it came up
>>> the idea that we could use OpenNLP for the data preprocessing phase in
>>> Joshua as to allow tokenization, sentence detection, etc.
>>> As I was reading through our doc [1] this is currently done with
>> dedicated
>>> scripts; we could make that part pluggable (with a default simple Java
>>> implementation) and allow more fine grained control over it using
>> libraries
>>> like OpenNLP:
>>> 
>>> What would people think?
>>> 
>>> Regards,
>>> Tommaso
>>> 
>>> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Project+Ideas
>> 
>> 



Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Tommaso Teofili
Il giorno mer 21 dic 2016 alle ore 16:00 Matt Post  ha
scritto:

> Sure, that'd be nice to do. I'd love to get rid of the Perl scripts. Are
> you just throwing out an idea or are you interested in doing this?


I'd be happy to do it. If Joern can help out that'd be of course very
appreciated.


> I think the way to go would be to set this up on a branch (off 7), and
> then I could test it on some languages.
>

sure, and hopefully branch 7 becomes our new master soon after the 6.1
release.

Regards,
Tommaso


>
>
> > On Dec 21, 2016, at 5:33 AM, Tommaso Teofili 
> wrote:
> >
> > Hi all,
> >
> > I was talking to Joern (Apache OpenNLP committer) recently and it came up
> > the idea that we could use OpenNLP for the data preprocessing phase in
> > Joshua as to allow tokenization, sentence detection, etc.
> > As I was reading through our doc [1] this is currently done with
> dedicated
> > scripts; we could make that part pluggable (with a default simple Java
> > implementation) and allow more fine grained control over it using
> libraries
> > like OpenNLP:
> >
> > What would people think?
> >
> > Regards,
> > Tommaso
> >
> > [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Project+Ideas
>
>


Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post

> On Dec 21, 2016, at 10:36 AM, Joern Kottmann  wrote:
> 
> I am happy to support a bit with this, we can also see if things in OpenNLP
> need to be changed to make this work smoothly.

Great!


> One challenge is to train OpenNLP on all the languages you support. Do you
> have training data that could be used to train the tokenizer and sentence
> detector?

For the sentence-splitter, I imagine you could make use of the source side of 
our parallel corpus, which has thousands to millions of sentences, one per line.

For tokenization (and normalization), we don't typically train models but 
instead use a set of manually developed heuristics, which may or may not be 
sentence-specific. See


https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl

How much training data do you generally need for each task?


> 
> Jörn
> ​



Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Joern Kottmann
I am happy to support a bit with this, we can also see if things in OpenNLP
need to be changed to make this work smoothly.

One challenge is to train OpenNLP on all the languages you support. Do you
have training data that could be used to train the tokenizer and sentence
detector?

Jörn
​


Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post
Sure, that'd be nice to do. I'd love to get rid of the Perl scripts. Are you 
just throwing out an idea or are you interested in doing this? I think the way 
to go would be to set this up on a branch (off 7), and then I could test it on 
some languages.


> On Dec 21, 2016, at 5:33 AM, Tommaso Teofili  
> wrote:
> 
> Hi all,
> 
> I was talking to Joern (Apache OpenNLP committer) recently and it came up
> the idea that we could use OpenNLP for the data preprocessing phase in
> Joshua as to allow tokenization, sentence detection, etc.
> As I was reading through our doc [1] this is currently done with dedicated
> scripts; we could make that part pluggable (with a default simple Java
> implementation) and allow more fine grained control over it using libraries
> like OpenNLP:
> 
> What would people think?
> 
> Regards,
> Tommaso
> 
> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Project+Ideas