Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Tommaso Teofili
On Wed, Dec 21, 2016 at 16:00, Matt Post wrote:

> Sure, that'd be nice to do. I'd love to get rid of the Perl scripts. Are
> you just throwing out an idea or are you interested in doing this?


I'd be happy to do it. If Joern can help out, that would of course be much
appreciated.


> I think the way to go would be to set this up on a branch (off 7), and
> then I could test it on some languages.
>

Sure, and hopefully branch 7 becomes our new master soon after the 6.1
release.

Regards,
Tommaso


>
>
> > On Dec 21, 2016, at 5:33 AM, Tommaso Teofili wrote:
> >
> > Hi all,
> >
> > I was talking to Joern (Apache OpenNLP committer) recently and the idea
> > came up that we could use OpenNLP for the data preprocessing phase in
> > Joshua, to allow tokenization, sentence detection, etc.
> > As I was reading through our doc [1], this is currently done with
> > dedicated scripts; we could make that part pluggable (with a default
> > simple Java implementation) and allow more fine-grained control over it
> > using libraries like OpenNLP.
> >
> > What would people think?
> >
> > Regards,
> > Tommaso
> >
> > [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Project+Ideas
>
>


Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post

> On Dec 21, 2016, at 10:36 AM, Joern Kottmann wrote:
> 
> I am happy to help a bit with this; we can also see if things in OpenNLP
> need to be changed to make this work smoothly.

Great!


> One challenge is to train OpenNLP on all the languages you support. Do you
> have training data that could be used to train the tokenizer and sentence
> detector?

For the sentence-splitter, I imagine you could make use of the source side of 
our parallel corpus, which has thousands to millions of sentences, one per line.
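
To make the idea concrete, here is a minimal sketch of turning the source side of a corpus into sentence-detector training data. It assumes the trainer accepts plain text with one sentence per line (my understanding of the OpenNLP sentence detector format, worth confirming against its manual). The class and method names are hypothetical, not existing Joshua code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: prepares one-sentence-per-line training data. */
final class SentenceTrainingData {

    /** Trims each line, drops empty lines and repeats, and keeps the original order. */
    static List<String> clean(List<String> corpusLines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : corpusLines) {
            String sentence = line.trim();
            if (!sentence.isEmpty()) {
                seen.add(sentence);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // In practice the lines would be read from the source side of the corpus.
        List<String> lines = Arrays.asList(" First sentence. ", "", "Second one.", "First sentence.");
        for (String sentence : clean(lines)) {
            System.out.println(sentence);
        }
    }
}
```

Dropping repeats is a design choice for this sketch only; whether duplicates help or hurt training is an open question.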

For tokenization (and normalization), we don't typically train models but 
instead use a set of manually developed heuristics, which may or may not be 
language-specific. See


https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl

How much training data do you generally need for each task?
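
For a sense of what such hand-written heuristics look like in Java, here is a small sketch. It is not a port of tokenize.pl and covers far fewer cases; the class name and the single regular expression are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative rule-based tokenizer; NOT a port of tokenize.pl. */
final class HeuristicTokenizer {

    // A token is either a run of letters/digits (allowing internal
    // apostrophes or hyphens) or any single non-space symbol.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*|\\S");

    /** Splits one line of text into tokens using the pattern above. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", tokenize("Hello, world! It's a well-known test.")));
    }
}
```

Language-specific rules (abbreviation lists, clitics, and so on) would need extra patterns or lookup tables on top of this.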


> 
> Jörn



Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post
Sure, that'd be nice to do. I'd love to get rid of the Perl scripts. Are you 
just throwing out an idea or are you interested in doing this? I think the way 
to go would be to set this up on a branch (off 7), and then I could test it on 
some languages.


> On Dec 21, 2016, at 5:33 AM, Tommaso Teofili wrote:
> 
> Hi all,
> 
> I was talking to Joern (Apache OpenNLP committer) recently and the idea
> came up that we could use OpenNLP for the data preprocessing phase in
> Joshua, to allow tokenization, sentence detection, etc.
> As I was reading through our doc [1], this is currently done with dedicated
> scripts; we could make that part pluggable (with a default simple Java
> implementation) and allow more fine-grained control over it using libraries
> like OpenNLP.
> 
> What would people think?
> 
> Regards,
> Tommaso
> 
> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Project+Ideas
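
One possible shape for the pluggable preprocessing proposed above, as a rough sketch only. Every name here (Preprocessor, SimplePreprocessor, Preprocessors) is hypothetical and nothing like this exists in Joshua yet; an OpenNLP-backed implementation would be registered next to the default and would wrap that library's sentence detector and tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical extension point: one implementation per preprocessing backend. */
interface Preprocessor {
    List<String> splitSentences(String text);
    List<String> tokenize(String sentence);
}

/** Default implementation with no external dependencies. */
final class SimplePreprocessor implements Preprocessor {
    @Override
    public List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        // Break after '.', '!' or '?' when followed by whitespace.
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    @Override
    public List<String> tokenize(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        // Whitespace tokenization only; punctuation stays attached to words.
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }
}

/** Looks up a backend by name, falling back to the simple default. */
final class Preprocessors {
    private static final Map<String, Preprocessor> REGISTRY = new HashMap<>();

    static void register(String name, Preprocessor preprocessor) {
        REGISTRY.put(name, preprocessor);
    }

    static Preprocessor forName(String name) {
        return REGISTRY.getOrDefault(name, new SimplePreprocessor());
    }

    public static void main(String[] args) {
        // Nothing is registered under "opennlp" here, so the default is used.
        Preprocessor p = forName("opennlp");
        for (String sentence : p.splitSentences("Joshua decodes. OpenNLP tokenizes!")) {
            System.out.println(p.tokenize(sentence));
        }
    }
}
```

Keeping the interface this small means the existing scripts could stay in place until a Java backend has been tested on a branch.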



[jira] [Commented] (JOSHUA-324) Address Apache Joshua 6.1 RC#2 Issues

2016-12-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766912#comment-15766912
 ] 

Hudson commented on JOSHUA-324:
---

SUCCESS: Integrated in Jenkins build joshua_master #165 (See 
[https://builds.apache.org/job/joshua_master/165/])
JOSHUA-324 - added missing incubating suffix (tommaso: rev 
5aa5308cd108c60c19b883dade54d810d1e0966e)
* (edit) pom.xml


> Address Apache Joshua 6.1 RC#2 Issues
> -
>
>              Key: JOSHUA-324
>              URL: https://issues.apache.org/jira/browse/JOSHUA-324
>          Project: Joshua
>       Issue Type: Task
> Affects Versions: 6.1
>         Reporter: Lewis John McGibbney
>         Assignee: Lewis John McGibbney
>         Priority: Blocker
>          Fix For: 6.1
>
>
> Feedback from [~jmclean] (thank you Justin) on our RC#2 is as follows
> {code}
> ==
> - You're missing "incubating" in the release artifact name. [1]
> - There are a number of binary files in the source release that look to be
> compiled source code.
> I checked:
> - name doesn’t include incubating
> - signatures and hashes correct
> - DISCLAIMER exists
> - LICENSE is missing a few things (see below)
> - a source file is missing an Apache header [7]
> - Several unexpected binary files are contained in the source release
> [8][9][10][11]
> - Can compile from source
> License is missing:
> - MIT licensed normalize.css v3.0.3 bundled in [5]
> - glyph icon fonts [6]
> Not an issue but it's a little odd to have LICENSE and NOTICE.txt - usually
> both are bare or both have .txt extension.
> Also while looking at your site I noticed that the download links of your
> incubating site [2] point to GitHub; please change them to point to the
> official release area.
> Also the 6.1 release has already been tagged and is available for public
> download on GitHub [4] before this vote is finished. This is IMO against
> Apache release policy [3]; please remove it.
> I also notice you recently released the language packs (18th Nov) but there
> doesn’t seem to have been a vote for that? Any reason for this?
> ===
> [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
> [2] 
> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
> [3] http://www.apache.org/dev/release.html#what
> [4] https://github.com/apache/incubator-joshua/releases
> [5] ./demo/bootstrap/css/bootstrap.min.css
> [6] apache-joshua-6.1/demo/bootstrap/fonts/*
> [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
> [8] ./bin/GIZA++
> [9] ./bin/mkcls
> [10] ./bin/snt2cooc.out
> [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
> [12] http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
> [13] http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
> {code}
> This is a blocking issue and until addressed we cannot release 6.1-incubating



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (JOSHUA-324) Address Apache Joshua 6.1 RC#2 Issues

2016-12-21 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766822#comment-15766822
 ] 

Tommaso Teofili commented on JOSHUA-324:


Changed artifactId to _joshua-incubating_ on the master branch.



Re: joshua release

2016-12-21 Thread Tommaso Teofili
AFAIU JOSHUA-324 still has a few things to address (e.g. missing incubating
suffix in artifact name).


On Tue, Dec 20, 2016 at 15:40, Matt Post wrote:

> Lewis — any chance you can pick this back up? I think we've covered all of
> the issues?