Re: [VOTE] Graduate the Apache Joshua (Incubating) Project

2018-04-27 Thread kellen sunderland

+1 (binding)

On Thu, Apr 26, 2018 at 7:06 PM, Mattmann, Chris A (1761) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> +1!
>
> Sent from my iPhone
>
> On Apr 26, 2018, at 5:04 PM, Tom Barber wrote:
>
> +1
>
> On Fri, 27 Apr 2018, 00:58 Thamme Gowda wrote:
> +1 (binding)
>
>
> Cheers,
> TG
>
> --
> *Thamme Gowda *
> @thammegowda  | https://isi.edu/~tg
> ~Sent via somebody's Webmail server
>
> 2018-04-24 22:02 GMT-07:00 lewis john mcgibbney:
>
> > Hi Folks,
> > I would like to open a VOTE for graduating the Apache Joshua (Incubating)
> > project.
> > For those that are interested, the Incubator guidelines on graduation can
> > be found at [0].
> > Joshua has been reporting to the IPMC since 16th March 2016 and made one
> > Incubating release.
> >
> > Joshua Basics
> >
> > - Podling Proposal
> > - Status: current
> > - Established: 2016-02-13
> > - Incubating for 802 days
> > - Prior Board Reports
> >
> > There are a few issues to resolve before drafting the graduation
> > resolution; however, this community VOTE is timely. The VOTE will be open
> > at least 72 hours and will pass if 3 +1's are received from the Joshua PPMC.
> >
> > [ ] +1 Graduate the Apache Joshua (Incubating) Project
> > [ ] -1 DO NOT Graduate the Apache Joshua (Incubating) Project... please
> > provide reasoning
> >
> > P.S. Here is my binding +1
> >
> > [0]
> > https://incubator.apache.org/guides/graduation.html#the_graduation_process
> >
> >
> > --
> > http://home.apache.org/~lewismc/
> > http://people.apache.org/keys/committer/lewismc
> >
>
>

RE: problems with LM loading

2017-10-16 Thread kellen sunderland

The feature function initialization message is just a general purpose exception 
handler.  I’ve seen this quite often when language models fail to load.  The 
most interesting part of the log to me is:

> Caused by: java.lang.RuntimeException: Something wrong with I/O.
>
> at edu.berkeley.nlp.lm.io.ArpaLmReader.parseHeader(ArpaLmReader.java:114)
>
> at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:76)


To me it looks like it could only be caused by the lack of the text 
"\\1-grams:" in the file you’re opening.  Reference this function: 
https://github.com/smilli/berkeleylm/blob/master/src/edu/berkeley/nlp/lm/io/ArpaLmReader.java#L105

Are you trying to load a binary lm with an Arpa reader by any chance?  Do you 
have the quoted text in your text based LM?
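
As a quick sanity check, something like the sketch below could tell a text ARPA file from a binary or gzipped one before it is handed to the reader. It is only an illustration: the names `ArpaSniffer` and `looksLikeArpa` are made up here and are not part of Joshua or BerkeleyLM, and it assumes the usual ARPA layout, where the `\data\` header and its n-gram counts are followed by a `\1-grams:` section marker.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of Joshua or BerkeleyLM.
final class ArpaSniffer {

    // Only the start of the file is scanned; the marker follows the short header.
    private static final int MAX_LINES = 1000;

    /** Returns true if a "\1-grams:" line appears within the first MAX_LINES lines. */
    static boolean looksLikeArpa(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            int seen = 0;
            while (seen < MAX_LINES && (line = reader.readLine()) != null) {
                if (line.trim().equals("\\1-grams:")) {
                    return true;
                }
                seen++;
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A tiny hand-written ARPA fragment, then a string that is not ARPA at all.
        String arpa = "\\data\\\nngram 1=2\n\n\\1-grams:\n-0.3\t<s>\n-0.3\t</s>\n\n\\end\\\n";
        System.out.println(looksLikeArpa(new StringReader(arpa)));
        System.out.println(looksLikeArpa(new StringReader("not a language model")));
    }
}
```

For a real file, opening it with `Files.newBufferedReader(path, StandardCharsets.ISO_8859_1)` should avoid decoding errors on binary content, since every byte maps to a character in that charset.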

-Kellen
From: Tommaso Teofili
Sent: Monday, October 16, 2017 4:09 PM
To: dev@joshua.incubator.apache.org
Subject: Re: problems with LM loading

p.s.:
I've tried with other LPs (e.g. sd-en) and I get the same ...

On Mon, 16 Oct 2017 at 15:06, Tommaso Teofili <
tommaso.teof...@gmail.com> wrote:

> Hi all,
>
> I am trying to use the ES-EN language pack from our "Language Packs" page
> with Joshua 6.1, but when I try to load the two language models I get an
> I/O exception.
> The config looks like:
>
> feature-function = LanguageModel -lm_type berkeleylm -lm_order 4 -lm_file
> model/lm.berkeleylm
> feature-function = Distortion
> feature-function = LanguageModel -lm_type berkeleylm -lm_order 4 -lm_file
> model/en.giga.twopercent.4.lm.berkeleylm
> feature-function = PhrasePenalty
>
> and I get the following:
>
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate feature function 'LanguageModel -lm_type berkeleylm -lm_order 4
> -lm_file model/lm.berkeleylm'!
> ...
> Caused by: java.lang.RuntimeException: Unable to instantiate feature
> function 'LanguageModel -lm_type berkeleylm -lm_order 4 -lm_file
> model/lm.berkeleylm'!
> at org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:642)
> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
> Caused by: java.lang.reflect.InvocationTargetException: null
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:638)
> ... 58 common frames omitted
> Caused by: java.lang.RuntimeException: Something wrong with I/O.
> at edu.berkeley.nlp.lm.io.ArpaLmReader.parseHeader(ArpaLmReader.java:114)
> at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:76)
> at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:18)
> at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
> at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
> at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:171)
> at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:151)
> at org.apache.joshua.decoder.ff.lm.berkeley_lm.LMGrammarBerkeley.<init>(LMGrammarBerkeley.java:94)
> at org.apache.joshua.decoder.ff.lm.LanguageModelFF.initializeLM(LanguageModelFF.java:158)
> at org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(LanguageModelFF.java:132)
>
> Any hints on what I could be doing wrong ? Encoding ?
> Did anyone else experience such issue ?
>
> BTW I am running this from within a Java application, Decoder is
> initialized as follows:
>
> JoshuaConfiguration configuration = new JoshuaConfiguration();
> configuration.readConfigFile(pathToJoshuaConfig);
> configuration.use_structured_output = true;
> Decoder decoder = new Decoder(configuration, pathToJoshuaConfig);
>
> Regards,
> Tommaso
>

Re: merging 7 branch to master

2017-09-20 Thread kellen sunderland

Sounds like a good plan to me as well.

+1

On Sep 20, 2017 8:05 PM, "Mattmann, Chris A (3010)" <
chris.a.mattm...@jpl.nasa.gov> wrote:

> +1 from me!
>
> ++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, NSF & Open Source Projects Formulation and Development Offices
> (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
> On 9/20/17, 2:50 AM, "Tommaso Teofili"  wrote:
>
> hi all,
>
> how about :
> - moving the master branch into a 6.x branch
> - merging 7 branch into master
>
> This way we can support 6.1 with bugfixes and minor releases and go
> ahead
> with development on version 7.
>
> Regards,
> Tommaso
>
>
>

Re: [VOTE] Release Apache Joshua 6.1 (Incubating)

2017-03-01 Thread kellen sunderland

For a short term fix for the unit test we can delete lines 48 and 50 from
LMGrammarBerkeleyTest.java.

A longer-term solution would be a @BeforeClass setup method that simply
gzips the uncompressed files.
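
Roughly, the compression step could look like the sketch below. This is a minimal illustration using only the JDK: `GzipFixture` and its methods are hypothetical names, the temp files stand in for the real test resources, and in the actual test class `gzip` would be called from the @BeforeClass method.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; in the real test it would be called from a @BeforeClass method.
final class GzipFixture {

    /** Writes a gzip-compressed copy of source to target, replacing any existing file. */
    static void gzip(Path source, Path target) {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            Files.copy(source, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a gzip file back into memory; only used here to show the round trip. */
    static byte[] gunzip(Path compressed) {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo: writes data to a temp file, gzips it, and returns the decompressed bytes. */
    static byte[] roundTrip(byte[] data) {
        try {
            Path plain = Files.createTempFile("fixture", ".txt");
            Path zipped = Files.createTempFile("fixture", ".txt.gz");
            try {
                Files.write(plain, data);
                gzip(plain, zipped);
                return gunzip(zipped);
            } finally {
                Files.deleteIfExists(plain);
                Files.deleteIfExists(zipped);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] back = roundTrip("\\data\\\nngram 1=1\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(back, StandardCharsets.UTF_8));
    }
}
```

Generating the compressed file at test time would keep the binary out of the source release while still exercising the compressed-file code path.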

Thanks again for putting this together Tommaso.


On Wed, Mar 1, 2017 at 6:43 PM, Tommaso Teofili <tommaso.teof...@gmail.com>
wrote:

> thanks Kellen,
>
> I get the very same issues.
> It's probably my fault having copied .md5 and .sha files from the staging
> repo as I didn't have them within my target directory.
> I also get the same test failure.
>
> Hence -1 from me too.
> I'll roll it back, fix the issues and create RC4.
>
> Regards,
> Tommaso
>
>
>
> On Wed, 1 Mar 2017 at 17:54, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > I have to -1 this release for the time being.  For me the signatures and
> > hashes don't seem to match the binaries downloaded.  Could you double
> check
> > that they match for you Tommaso?  I'm also getting a unit test that fails
> > when I run 'mvn clean package'.  I'm digging a little more into this one,
> > but suspect a missing file.
> >
> >
> > 
> > Here's what I've checked so far:
> >
> > Release artifacts must include incubating in the final file name - YES
> > Release artifacts must include a disclaimer within the release
> artifact(s)
> > as noted - YES
> > Every ASF release MUST contain one or more source packages, which MUST be
> > sufficient for a user to build and test the release provided they have
> > access to the appropriate platform and tools. - NO
> > -Not building due to failing test (BerkeleyLM failure).  I'm digging a
> > bit more into this.
> >
> > Every artifact distributed to the public through Apache channels MUST be
> > accompanied by one file containing an OpenPGP compatible ASCII armored
> > detached signature and another file containing an MD5 checksum.
> > - .asc - NO
> > I get warning:
> > "gpg --verify joshua-incubating-6.1-src.tar.gz.asc
> > joshua-incubating-6.1-src.tar.gz
> > gpg: Signature made Thu Feb 23 09:15:17 2017 CET using RSA key ID
> > 891768A5
> > gpg: Good signature from "Tommaso Teofili <tomm...@apache.org>"
> > [unknown]
> > gpg: WARNING: This key is not certified with a trusted signature!
> > gpg:  There is no indication that the signature belongs to
> the
> > owner."
> > - .md5 - NO
> > My md5 of joshua-incubating-6.1-src.tar.gz is
> > 504976876b01294811293aa45b5400f5, the joshua-incubating-6.1-src.tar.gz.md5
> > indicates it should be 22b738eeae45757715080702a5bd2789
> > - .sha - NO
> > My sha of joshua-incubating-6.1-src.tar.gz is
> > 4AB5BA24301590F36AE6452DACC3F21CBD8B3FEC, the
> > joshua-incubating-6.1-src.tar.gz.sha indicates it should be
> > 2a55b6d341dddc5369b22a4802a86ec40accd0a1
> > - KEYS - YES
> >
> > On Mon, Feb 27, 2017 at 3:55 AM, Matt Post <p...@cs.jhu.edu> wrote:
> >
> > > Hi folks,
> > >
> > > First, Tommaso, thank you for pulling this together!
> > >
> > > I want to remind everyone that there's a checklist to go through before
> > > sending your +1. Here's from an email from Tom Barber a while back:
> > >
> > > > Hello folks,
> > > >
> > > > I see plenty of +1's going through the release vote,  which is great
> to
> > > see
> > > > people taking an active role in getting the release shipped.
> > > >
> > > > For those of you who are new to the ASF there are a bunch of
> > requirements
> > > > to sign off for a release which you can find here:
> > > >
> > > > http://incubator.apache.org/guides/releasemanagement.html#check-list
> <
> > > http://incubator.apache.org/guides/releasemanagement.html#check-list>
> > > >
> > > > My current concern is that people who are new to the incubator are
> > +1'ing
> > > > software for release without check all or part of the release cycle.
> > > Whilst
> > > > not mandatory, when you +1 a release please can you try to indicate
> > what
> > > > you've checked. The reason for this is,  the tag Lewis has built off
> > > isn't
> > > > the tip of master, so if you're basing  your +1 on your day to day
> > > > development and knowledge of the code base, that's not always whats
> > > > shipped. Also in the branching process

Re: [VOTE] Release Apache Joshua 6.1 (Incubating)

2017-03-01 Thread kellen sunderland

I have to -1 this release for the time being.  For me the signatures and
hashes don't seem to match the binaries downloaded.  Could you double check
that they match for you Tommaso?  I'm also getting a unit test that fails
when I run 'mvn clean package'.  I'm digging a little more into this one,
but suspect a missing file.



Here's what I've checked so far:

Release artifacts must include incubating in the final file name - YES
Release artifacts must include a disclaimer within the release artifact(s)
as noted - YES
Every ASF release MUST contain one or more source packages, which MUST be
sufficient for a user to build and test the release provided they have
access to the appropriate platform and tools. - NO
-Not building due to failing test (BerkeleyLM failure).  I'm digging a
bit more into this.

Every artifact distributed to the public through Apache channels MUST be
accompanied by one file containing an OpenPGP compatible ASCII armored
detached signature and another file containing an MD5 checksum.
- .asc - NO
I get warning:
"gpg --verify joshua-incubating-6.1-src.tar.gz.asc
joshua-incubating-6.1-src.tar.gz
gpg: Signature made Thu Feb 23 09:15:17 2017 CET using RSA key ID
891768A5
gpg: Good signature from "Tommaso Teofili "
[unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the
owner."
- .md5 - NO
My md5 of joshua-incubating-6.1-src.tar.gz is
504976876b01294811293aa45b5400f5, the joshua-incubating-6.1-src.tar.gz.md5
indicates it should be 22b738eeae45757715080702a5bd2789
- .sha - NO
My sha of joshua-incubating-6.1-src.tar.gz is
4AB5BA24301590F36AE6452DACC3F21CBD8B3FEC, the
joshua-incubating-6.1-src.tar.gz.sha indicates it should be
2a55b6d341dddc5369b22a4802a86ec40accd0a1
- KEYS - YES
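
For anyone wanting to repeat the checksum comparison programmatically, here is a minimal sketch using the JDK's `MessageDigest`. The class and method names (`ChecksumCheck`, `hexDigest`, `matches`) are invented for illustration and are not part of any release tooling. The comparison ignores case, so the upper-case versus lower-case hex above is not what causes the mismatch; the values themselves differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of any release tooling.
final class ChecksumCheck {

    /** Hex-encodes the digest of data using the named algorithm, e.g. "MD5" or "SHA-1". */
    static String hexDigest(String algorithm, byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
    }

    /** Compares a computed digest with a published one, ignoring case and outer whitespace. */
    static boolean matches(String computed, String published) {
        return computed.trim().equalsIgnoreCase(published.trim());
    }

    public static void main(String[] args) {
        // Small in-memory input; a real check would read the archive's bytes instead.
        byte[] data = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println("md5:  " + hexDigest("MD5", data));
        System.out.println("sha1: " + hexDigest("SHA-1", data));

        // The two SHA values reported in this mail, compared case-insensitively.
        System.out.println(matches("4AB5BA24301590F36AE6452DACC3F21CBD8B3FEC",
                                   "2a55b6d341dddc5369b22a4802a86ec40accd0a1"));
    }
}
```

To check the actual artifact, the bytes from `Files.readAllBytes` on the downloaded archive would be passed to `hexDigest` and the result compared against the contents of the published checksum file.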

On Mon, Feb 27, 2017 at 3:55 AM, Matt Post  wrote:

> Hi folks,
>
> First, Tommaso, thank you for pulling this together!
>
> I want to remind everyone that there's a checklist to go through before
> sending your +1. Here's from an email from Tom Barber a while back:
>
> > Hello folks,
> >
> > I see plenty of +1's going through the release vote,  which is great to
> see
> > people taking an active role in getting the release shipped.
> >
> > For those of you who are new to the ASF there are a bunch of requirements
> > to sign off for a release which you can find here:
> >
> > http://incubator.apache.org/guides/releasemanagement.html#check-list <
> http://incubator.apache.org/guides/releasemanagement.html#check-list>
> >
> > My current concern is that people who are new to the incubator are +1'ing
> > software for release without check all or part of the release cycle.
> Whilst
> > not mandatory, when you +1 a release please can you try to indicate what
> > you've checked. The reason for this is,  the tag Lewis has built off
> isn't
> > the tip of master, so if you're basing  your +1 on your day to day
> > development and knowledge of the code base, that's not always whats
> > shipped. Also in the branching process,  its possible merges or
> alterations
> > were accidentally made that Lewis has missed (this is very unlikely I
> know
> > but you know, code changes). Also people build software on different
> OS's,
> > versions of OS's etc so just because it builds on  Lewis's laptop doesn't
> > mean it builds on mine, for example.
> >
> > Also regarding licenses, disclaimers etc, people notice different things
> or
> > interpret stuff differently. its always possible that someone might miss
> a
> > library etc so its important multiple eyes run over the same stuff.
> >
> > Cheers,
> >
> > Tom
>
> I'm hoping I'll have time to go through this tomorrow.
>
> matt
>
>
>
> > On Feb 25, 2017, at 2:41 AM, Tommaso Teofili 
> wrote:
> >
> > Hi Folks,
> > Please VOTE on the Apache Joshua 6.1 Release Candidate #3.
> >
> > We solved 36 issues:
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12319720=12335049
> >
> > Git source tag (3447715b3aa0a48ed79465d80618bd5a2f7a7558):
> > https://s.apache.org/XIxJ
> >
> > Staging repo:
> > https://repository.apache.org/content/repositories/orgapachejoshua-1004
> >
> > Source Release Artifacts:
> > https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
> >
> > PGP release keys (signed using 891768A5):
> > https://git1-us-west.apache.org/repos/asf?p=incubator-joshua.git;a=blob_plain;f=KEYS;h=aa18365bf5c8c8fb17b084f783a75c3a2460a98d;hb=HEAD
> >
> > Vote will be open for 72 hours.
> > Thank you to everyone that is able to VOTE as well as everyone that
> > contributed to Apache Joshua 6.1.
> >
> > [ ] +1, let's get it released!!!
> > [ ] +/-0, fine, but consider to fix few issues before...
>

Re: Cutting RC3

2017-02-28 Thread kellen sunderland

Thanks for taking this on Tommaso.  I've blocked some time off for myself
tomorrow to help review the RC.

-Kellen

On Fri, Feb 24, 2017 at 12:26 PM, Tommaso Teofili  wrote:

> cool, release:perform succeeded, now the artifacts are on a single Nexus
> staging repo.
> I'm pushing the artifacts to /dist and will open the VOTE once uploading
> has finished (hopefully before I get old).
>
> Regards,
> Tommaso
>
> On Thu, 23 Feb 2017 at 15:39, Matt Post wrote:
>
> > Thank you for heading this up, Tommaso! I'll be able to catch up on this
> > after today.
> >
> > matt
> >
> >
> > > On Feb 23, 2017, at 3:06 AM, Tommaso Teofili <
> tommaso.teof...@gmail.com>
> > wrote:
> > >
> > > probably because of the mentioned network issues the artifacts ended up
> > in
> > > two separate staging repositories in Nexus, which is undesired.
> > > I'll drop those repos, rollback the changes on the pom, delete the
> > current
> > > tag in git and perform again mvn release:prepare / perform today.
> > >
> > > Regards,
> > > Tommaso
> > >
> > > On Wed, 22 Feb 2017 at 16:39, Tommaso Teofili <
> > > tommaso.teof...@gmail.com> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Maven is in the extremely slow (because of my bandwidth) process of
> > >> deploying stuff on Nexus as part of the mvn release:perform phase.
> > >> In the meantime perhaps is a good idea not to commit to the master
> > branch,
> > >> until we get the RC3 voted and hence approved / rejected.
> > >>
> > >> Thanks and regards,
> > >> Tommaso
> > >>
> >
> >
>

Re: Issues to Fix with Apache Joshua 6.1 RC#2

2016-12-02 Thread kellen sunderland

[7] has been fixed.

Tom's comments lead me to think that [8][9][10] can be removed from the
release.

I'm not totally clear on what we need to do to resolve the licensing issues
[5] and [6].  Do we simply need to give attribution to these projects in
our LICENSE.txt file?



On Thu, Dec 1, 2016 at 10:44 PM, Matt Post  wrote:

> Hi folks,
>
> What's the status of this? Can we check off items from the list below that
> have been completed?
>
> matt
>
>
> > On Nov 29, 2016, at 4:24 PM, lewis john mcgibbney 
> wrote:
> >
> > Hi Folks,
> > We have a number of issues to fix which were picked up over on general@.
> > In particular, we received excellent feedback from my good friend Justin
> > [12] [13]. As the general@ VOTE has not had 72 hours to stew I am not
> > going to close it, however we should take this time to fix the issues with
> > master before we spin an RC#3. These can be summarized as follows.
> > I've opened a Jira issue to track all of this.
> > https://issues.apache.org/jira/browse/JOSHUA-324
> > Lets track the progress on the Jira ticket.
> >
> > ==
> > - You're missing incubating in the release artifacts name. [1]
> > - There are a number of binary files in the source release that look to be
> > compiled source code.
> >
> > I checked:
> > - name doesn’t include incubating
> > - signatures and hashes correct
> > - DISCLAIMER exists
> > - LICENSE is missing a few things (see below)
> > - a source file is missing an Apache header [7]
> > - Several unexpected binary files are contained in the source release
> > [8][9][10][11]
> > - Can compile from source
> >
> > License is missing:
> > - MIT licensed normalize.css v3.0.3 bundled in [5]
> > - glyph icon fonts [6]
> >
> > Not an issue but it's a little odd to have LICENSE and NOTICE.txt - usually
> > both are bare or both have .txt extension.
> >
> > Also while looking at your site I noticed that the download links of your
> > incubating site [2] point to github, please change to point to the official
> > release area.
> > Also the 6.1 release has already been tagged and is available for public
> > download on github [4]  before this vote is finished. This is IMO against
> > Apache release policy [3] please remove.
> >
> > I also notice you recently released the language packs (18th Nov) but there
> > doesn’t seem to have been a vote for that? Any reason for this?
> > ===
> >
> > [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
> > [2]
> > https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
> > [3] http://www.apache.org/dev/release.html#what
> > [4] https://github.com/apache/incubator-joshua/releases
> > [5] ./demo/bootstrap/css/bootstrap.min.css
> > [6] apache-joshua-6.1/demo/bootstrap/fonts/*
> > [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
> > [8] ./bin/GIZA++
> > [9] ./bin/mkcls
> > [10] ./bin/snt2cooc.out
> > [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
> > [12]
> > http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
> > [13]
> > http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
> >
> >
> > --
> > http://home.apache.org/~lewismc/
> > @hectorMcSpector
> > http://www.linkedin.com/in/lmcgibbney
>
>

[jira] [Comment Edited] (JOSHUA-324) Address Apache Joshua 6.1 RC#2 Issues

2016-11-30 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15708118#comment-15708118
 ] 

Kellen Sunderland edited comment on JOSHUA-324 at 11/30/16 10:15 AM:
-

Opening a pull request shortly to add a license header to [7].  Sorry for missing this one.

The binary file highlighted in [11] is used in the regression test 
org.apache.joshua.decoder.ff.lm.berkeley_lm.LMGrammarBerkeleyTest.  I think 
it's valuable to include it as part of the test suite.  I don't think it 
includes any executable code or compiled source if that's a concern.  It's just 
a serialized POJO.  We can remove this test if needed for the release, but I'm 
going to read up on the binary policy to see if there's some way we can leave 
it in.
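
To illustrate what "a serialized POJO" means in practice, here is a minimal sketch. It assumes the file is a gzip-compressed stream produced by standard Java serialization, which this thread does not confirm in detail; `TinyModel` is an invented stand-in, not the BerkeleyLM class. A Java serialization stream records class names and field values only, not bytecode.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustration only: round-trips a plain data object through gzip + Java serialization.
final class PojoRoundTrip {

    /** Invented stand-in for a language model object; not the BerkeleyLM class. */
    static final class TinyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final int order;
        final float[] logProbs;

        TinyModel(int order, float[] logProbs) {
            this.order = order;
            this.logProbs = logProbs;
        }
    }

    /** Serializes the model and gzips the result, returning the compressed bytes. */
    static byte[] write(TinyModel model) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decompresses and deserializes bytes produced by write(). */
    static TinyModel read(byte[] compressed) {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed)))) {
            return (TinyModel) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TinyModel back = read(write(new TinyModel(4, new float[] {-0.5f, -1.25f})));
        System.out.println(back.order + " " + Arrays.toString(back.logProbs));
    }
}
```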


was (Author: kellen.sunderland):
Opening a pull shortly to add a license to [7].  Sorry for missing this one.

The binary file highlighted in [11] is used in the regression test 
org.apache.joshua.decoder.ff.lm.berkeley_lm.LMGrammarBerkeleyTest.  I think 
it's valuable to include it as part of the test suite.  I don't think it 
includes any executable code if that's a concern.  It's just a serialized POJO. 
 We can remove this test if needed for the release, but I'm going to read up on 
the binary policy to see if there's some way we can leave it in.

> Address Apache Joshua 6.1 RC#2 Issues
> -
>
> Key: JOSHUA-324
> URL: https://issues.apache.org/jira/browse/JOSHUA-324
> Project: Joshua
>  Issue Type: Task
>Affects Versions: 6.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 6.1
>
>
> Feedback from [~jmclean] (thank you Justin) on our RC#2 is as follows
> {code}
> ==
> - You're missing incubating in the release artifacts name. [1]
> - There are a number of binary files in the source release that look to be
> compiled source code.
> I checked:
> - name doesn’t include incubating
> - signatures and hashes correct
> - DISCLAIMER exists
> - LICENSE is missing a few things (see below)
> - a source file is missing an Apache header [7]
> - Several unexpected binary files are contained in the source release
> [8][9][10][11]
> - Can compile from source
> License is missing:
> - MIT licensed normalize.css v3.0.3 bundled in [5]
> - glyph icon fonts [6]
> Not an issue but it's a little odd to have LICENSE and NOTICE.txt - usually
> both are bare or both have .txt extension.
> Also while looking at your site I noticed that the download links of your
> incubating site [2] point to github, please change to point to the official
> release area.
> Also the 6.1 release has already been tagged and is available for public
> download on github [4]  before this vote is finished. This is IMO against
> Apache release policy [3] please remove.
> I also notice you recently released the language packs (18th Nov) but there
> doesn’t seem to have been a vote for that? Any reason for this?
> ===
> [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
> [2] 
> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
> [3] http://www.apache.org/dev/release.html#what
> [4] https://github.com/apache/incubator-joshua/releases
> [5] ./demo/bootstrap/css/bootstrap.min.css
> [6] apache-joshua-6.1/demo/bootstrap/fonts/*
> [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
> [8] ./bin/GIZA++
> [9] ./bin/mkcls
> [10] ./bin/snt2cooc.out
> [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
> [12] http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
> [13] http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
> {code}
> This is a blocking issue and until addressed we cannot release 6.1-incubating




Re: Translator Questions

2016-11-27 Thread kellen sunderland

1) It is usually not possible, because a few components of a model are not
symmetric. For example, Joshua uses a language model, and that language
model only models the target language.  To translate German to English in
your example you would need to download a separate German-to-English model.

2) There are various public datasets used in the academic community.
There are a few good links here http://opendata.stackexchange.com/q/3888 and
here http://www.statmt.org/ .  Others likely have more pointers to
datasets, but that might get you started.

3) You could look into the TCP server or the HTTP server.  There are
instructions in the README on how to start those.  The stdin method will
also accept any kind of script or process that passes input to stdin as a
stream.

4) I can't volunteer for this, but someone else in the community might be
able to help.  In general if there are setup issues that you find too
complicated let us know so that we can improve the process in the future.

On Nov 26, 2016 11:21 AM, "Aliaksei Rudak"  wrote:

> Hi,
>
> 1) Is it possible to translate vice-versa using downloaded language pair ?
> For example I downloaded en-ger (English - German). How can I translate
> from German to English ?
>
> 2) Do you know sources where I can get more data for translation pairs ?
>
> 3) How to make translator work without closing session so I can translate
> every time I'm making a request without server restart ?
>
> 4) Can someone from your team help me with Joshua setup and tuning. I will
> pay for your time.
>
> Regards,
> Alexei
>

Re: Signing off a Joshua Release

2016-11-27 Thread kellen sunderland

Definitely guilty of this.  I'll check the release checklist in the
future.  Thanks for the reminder Tom.

On Nov 26, 2016 1:27 PM, "Tom Barber"  wrote:

Hello folks,

I see plenty of +1's going through the release vote,  which is great to see
people taking an active role in getting the release shipped.

For those of you who are new to the ASF there are a bunch of requirements
to sign off for a release which you can find here:

http://incubator.apache.org/guides/releasemanagement.html#check-list

My current concern is that people who are new to the incubator are +1'ing
software for release without check all or part of the release cycle. Whilst
not mandatory, when you +1 a release please can you try to indicate what
you've checked. The reason for this is,  the tag Lewis has built off isn't
the tip of master, so if you're basing  your +1 on your day to day
development and knowledge of the code base, that's not always whats
shipped. Also in the branching process,  its possible merges or alterations
were accidentally made that Lewis has missed (this is very unlikely I know
but you know, code changes). Also people build software on different OS's,
versions of OS's etc so just because it builds on  Lewis's laptop doesn't
mean it builds on mine, for example.

Also regarding licenses, disclaimers etc, people notice different things or
interpret stuff differently. its always possible that someone might miss a
library etc so its important multiple eyes run over the same stuff.

Cheers,

Tom

--
Tom Barber
CTO Spicule LTD
t...@spicule.co.uk

http://spicule.co.uk

GB: +44(0)5603641316
US: +18448141689

Re: Dockerhub hosted images

2016-11-23 Thread kellen sunderland

Yeah, it should just be 'docker pull kellens/apache-joshua-es-en-2016-10-05'
then 'docker run -it kellens/apache-joshua-es-en-2016-10-05 /bin/bash' or
something similar.  I think the default command should eventually be to run
the HTTP server, so ideally we'd just do 'docker run -p 5674:5674
kellens/apache-joshua-es-en-2016-10-05' and that would start up the HTTP
server on port 5674.

Good point on Perl + Python, I can add them.

-Kellen

On Wed, Nov 23, 2016 at 3:22 PM, Matt Post <p...@cs.jhu.edu> wrote:

> Okay, I have this with
>
> docker run -it kellens/apache-joshua-es-en-2016-10-05 bash
>
> It seems we are missing Perl (./prepare.sh fails), and we should replace
> the LanguageModel line with a KenLM instance and build that. I bet we'll
> need Python, too.
>
>
>
>
> > On Nov 23, 2016, at 8:15 AM, Matt Post <p...@cs.jhu.edu> wrote:
> >
> > Kellen, can I bother you to post a few first steps? I've successfully
> pulled this down to my mac but now do not know how to find it, edit it, or
> run it. I'm porting through the documentation and will find it eventually
> but this would save me a bit of time.
> >
> >
> >> On Nov 23, 2016, at 8:07 AM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> >>
> >> Yes my next step was going to be getting it hosted officially.
> >>
> >> I'll go ahead and open a ticket.  I think I'll hold off on pushing to
> the
> >> Apache account until I've done a little more testing though.
> >>
> >> On Nov 23, 2016 5:22 AM, "lewis john mcgibbney" <lewi...@apache.org>
> wrote:
> >>
> >>> Hi Kellen,
> >>> Nice :)
> >>> Another option is for us to host these via the Apache account.
> >>> https://hub.docker.com/r/apache/
> >>> We could then add a badge to our README which points to the
> Dockerfile(s).
> >>> Do you want to open a ticket over on the INFRA Jira for this?
> >>>
> >>> On Tue, Nov 22, 2016 at 1:57 PM, <
> >>> dev-digest-h...@joshua.incubator.apache.org> wrote:
> >>>
> >>>> From: kellen sunderland <kellen.sunderl...@gmail.com>
> >>>> To: "dev@joshua.incubator.apache.org" <dev@joshua.incubator.apache.
> org>
> >>>> Cc:
> >>>> Date: Tue, 22 Nov 2016 22:56:56 +0100
> >>>> Subject: Re: Dockerhub hosted images
> >>>> Ok, the first image should be properly uploaded now.
> >>>>
> >>>> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
> >>>>
> >>>> -Kellen
> >>>>
> >>>>
> >>>
> >
>
>

Re: Dockerhub hosted images

2016-11-23 Thread kellen sunderland

Yes my next step was going to be getting it hosted officially.

I'll go ahead and open a ticket.  I think I'll hold off on pushing to the
Apache account until I've done a little more testing though.

On Nov 23, 2016 5:22 AM, "lewis john mcgibbney" <lewi...@apache.org> wrote:

> Hi Kellen,
> Nice :)
> Another option is for us to host these via the Apache account.
> https://hub.docker.com/r/apache/
> We could then add a badge to our README which points to the Dockerfile(s).
> Do you want to open a ticket over on the INFRA Jira for this?
>
> On Tue, Nov 22, 2016 at 1:57 PM, <
> dev-digest-h...@joshua.incubator.apache.org> wrote:
>
> > From: kellen sunderland <kellen.sunderl...@gmail.com>
> > To: "dev@joshua.incubator.apache.org" <dev@joshua.incubator.apache.org>
> > Cc:
> > Date: Tue, 22 Nov 2016 22:56:56 +0100
> > Subject: Re: Dockerhub hosted images
> > Ok, the first image should be properly uploaded now.
> >
> > https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
> >
> > -Kellen
> >
> >
>

Re: Dockerhub hosted images

2016-11-22 Thread kellen sunderland

Ok, the first image should be properly uploaded now.

https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/

-Kellen

On Tue, Nov 22, 2016 at 9:09 PM,  wrote:

> Yeah sorry, the upload keeps failing for some reason.  I’m trying again
> now.
>
>
>
> -Kellen
>
>
>
> *From: *Matt Post 
> *Sent: *Tuesday, November 22, 2016 8:22 PM
> *To: *dev@joshua.incubator.apache.org
> *Subject: *Re: Dockerhub hosted images
>
>
>
> I am able to pull others' repos on Docker hub, so it seems like there is
> something wrong with yours? e.g., there are no "Dockerfile" and "Build
> Details" tabs, like this guy's:
>
>
>
> [image: cid:30D41DAE-93E9-4342-86DE-BE873EB51DA9]
>
>
>
>
>
>
>
>
>
> On Nov 22, 2016, at 2:12 PM, Matt Post  wrote:
>
>
>
> How do I clone this? Docker tells me there is no tag "latest", using "-a"
> tells me the repo is not found, and I can't seem to figure out how to tell
> Docker to use hub.docker.com...
>
>
>
> Here's a link to the first image I've been playing with, es-en.
> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
>
>
>
>
>
>
>

Re: Dockerfile Issue

2016-11-22 Thread kellen sunderland

Yep, looks like it's been applied in this PR, so nothing blocking on my
side.

https://github.com/apache/incubator-joshua/pull/77

-Kellen

On Tue, Nov 22, 2016 at 10:53 PM, lewis john mcgibbney 
wrote:

> Hi Kellen,
> Have you fixed your Dockerfile issue? If so then please confirm and I can
> spin out RC#2 for 6.1.
> Thanks in advance Kellen.
> Lewis
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>

Re: language packs blog post

2016-11-21 Thread kellen sunderland

Looks good to me, no objection to tweeting it.  Nice work putting them all
together.

On Mon, Nov 21, 2016 at 9:00 PM, Matt Post  wrote:

> Hi folks,
>
> I just drafted this; any objections to tweeting it?
>
> https://cwiki.apache.org/confluence/display/JOSHUA/
> 2016/11/21/Apache+Joshua+Language+Packs
>
> matt

Re: [VOTE] Release Apache Joshua (Incubating) 6.1

2016-11-21 Thread kellen sunderland

I'd vote for a respin fixing the issues Henry raised.  There's also a small break
in the Dockerfile which I can fix at some point today (we need to switch it
to use mvn package I believe).

On Mon, Nov 21, 2016 at 10:40 AM, Tommaso Teofili  wrote:

> I think this settles for a respin of a new RC, doesn't it ?
>
> Regards,
> Tommaso
>
> On Sat, Nov 19, 2016 at 06:20, Henry Saputra <
> henry.sapu...@gmail.com> wrote:
>
> > Sorry I was late on dev@ list for this Vote, Lewis
> >
> > Looks like I have to -1 for this one:
> > Missing DISCLAIMER file for the source release artifact
> > NOTICE.txt file contains Apache HTrace instead of Apache Joshua
> >
> > Minor issue:
> > Extra file "pom.xml.release.releaseBackup"
> >
> > - Henry
> >
> > On Fri, Nov 18, 2016 at 2:11 PM, lewis john mcgibbney <
> lewi...@apache.org>
> > wrote:
> >
> > > Hello general@incubator,
> > > Please VOTE on the Apache Joshua 6.1 Release Candidate #1. The release
> > VOTE
> > > has passed over on user@ and dev@joshua with the following results
> > > http://www.mail-archive.com/dev%40joshua.incubator.apache.
> > > org/msg01884.html.
> > >
> > > We solved 44 issues: https://s.apache.org/joshua6.1
> > >
> > > Git source tag (167489bbd78526b9833fe7c88646bf96101d5d2b):
> > > https://s.apache.org/joshua6.1tag
> > >
> > > Staging repo: https://repository.apache.org/content/repositories/
> > > orgapachejoshua-1000/
> > >
> > > Source Release Artifacts: https://dist.apache.org/repos/
> > > dist/dev/incubator/joshua/
> > >
> > > PGP release keys (signed using 48BAEBF6):
> https://dist.apache.org/repos/
> > > dist/release/incubator/joshua/KEYS
> > >
> > > Vote will be open for 72 hours.
> > > Thank you to everyone that is able to VOTE as well as everyone that
> > > contributed to Apache Joshua 6.1.
> > >
> > > [ ] +1, let's get it released!!!
> > > [ ] +/-0, fine, but consider to fix few issues before...
> > > [ ] -1, nope, because... (and please explain why)
> > >
> > > P.S. here is my +1
> > >
> > > --
> > > http://home.apache.org/~lewismc/
> > > @hectorMcSpector
> > > http://www.linkedin.com/in/lmcgibbney
> > >
> >
>

Re: [VOTE] Release Apache Joshua (Incubating) 6.1

2016-11-14 Thread kellen sunderland

+1 .  Thanks to Lewis and Matt for all the recent work.

On Nov 14, 2016 7:11 PM, "Matt Post"  wrote:

+1

Thanks for starting this off, Lewis!


> On Nov 14, 2016, at 12:54 PM, Ramirez, Paul M (398M) <
paul.m.rami...@jpl.nasa.gov> wrote:
>
> +1, let's get it released!!!
>
> --Paul
>
> ==
> Paul Ramirez - Group Supervisor
> Computer Science for Data Intensive Applications (398M)
> NASA - Jet Propulsion Laboratory
> 4800 Oak Grove Dr.
> Pasadena, CA 91109 USA
> Mailstop: 158-242
> Office: 818-354-1015
> Cell: 818-395-8194
> ==
>
> On 11/14/16, 9:16 AM, "lewis john mcgibbney"  wrote:
>
>Hi Folks,
>Please VOTE on the Apache Joshua 6.1 Release Candidate #1.
>
>We solved 44 issues: https://s.apache.org/joshua6.1
>
>Git source tag (167489bbd78526b9833fe7c88646bf96101d5d2b):
>https://s.apache.org/joshua6.1tag
>
>Staging repo:
>https://repository.apache.org/content/repositories/
orgapachejoshua-1000/
>
>Source Release Artifacts:
>https://dist.apache.org/repos/dist/dev/incubator/joshua/
>
>PGP release keys (signed using 48BAEBF6):
>https://dist.apache.org/repos/dist/release/incubator/joshua/KEYS
>
>Vote will be open for 72 hours.
>Thank you to everyone that is able to VOTE as well as everyone that
>contributed to Apache Joshua 6.1.
>
>[ ] +1, let's get it released!!!
>[ ] +/-0, fine, but consider to fix few issues before...
>[ ] -1, nope, because... (and please explain why)
>
>P.S. here is my +1
>
>--
>http://home.apache.org/~lewismc/
>@hectorMcSpector
>http://www.linkedin.com/in/lmcgibbney
>
>

Re: Lewis Volunteering for 6.1 Release Manager

2016-11-11 Thread kellen sunderland

Sounds great to me Lewis, thanks for taking this on.

On Thu, Nov 10, 2016 at 10:55 PM, Matt Post  wrote:

> Just landing back in the states from Berlin. This sounds great Lewis!
>
> matt (from my phone)
>
> > On Nov 10, 2016, at 12:02, lewis john mcgibbney wrote:
> >
> > Hi Folks,
> > I would like to put myself forward as release manager for 6.1.
> > I've got a lot of experience working with Incubating releases and have
> been
> > successful in the position of release manager resulting in the release of
> > around 20-30 official incubating and top level projects here at Apache.
> > I'll make sure to document the entire release procedure on our wiki for
> > future reference.
> > Does anyone object? If not then I will get to it today.
> > Lewis
> >
> > --
> > http://home.apache.org/~lewismc/
> > @hectorMcSpector
> > http://www.linkedin.com/in/lmcgibbney
>
>

Re: Joshua Model Input Format(s) and LM Loading

2016-11-10 Thread kellen sunderland

I've done a fair amount of measurement and profiling in this area over the
last year, so I can offer a little bit of advice as well.

First of all make sure you're not just using a Java profiler if you're
looking at a model with KenLM.  The perf impact of language model calls
will be under-reported.  If you want to optimize LM calls using a Java
profiler I'd recommend using a BerkeleyLM model and measuring the reduction
in LM calls there.  If you can reduce the amount of work (LM calls) with
BerkeleyLM, the speed improvements should carry over to KenLM as well
(provided, of course, that the optimizations are in the Joshua code).
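
One way to measure LM calls from the Java side is to wrap the language model in a counting decorator. This is only a sketch: the NgramScorer interface and both class names are hypothetical and are not taken from the Joshua codebase, whose real language model interface may look different.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical interface; Joshua's real language model API may differ. */
interface NgramScorer {
  float logProb(int[] ngram);
}

/** Decorator that counts how many times the wrapped model is queried. */
final class CountingScorer implements NgramScorer {
  private final NgramScorer delegate;
  // LongAdder keeps counting cheap when many decoder threads call in parallel.
  private final LongAdder calls = new LongAdder();

  CountingScorer(NgramScorer delegate) {
    this.delegate = delegate;
  }

  @Override
  public float logProb(int[] ngram) {
    calls.increment();
    return delegate.logProb(ngram);
  }

  /** Total number of queries seen so far. */
  long callCount() {
    return calls.sum();
  }
}
```

Comparing callCount() before and after a change gives a profiler-independent view of whether the decoder is doing less LM work.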

If you want to optimize KenLM models you're better off using some
combination of a native profiler and a JVM profiler.  Even then I'd
consider the impact of the calls to the LM as under-reported.  This is
because making frequent JNI calls amplifies the amount of work required in
GC (I think we've discussed that before).

Another idea would be to call KenLM with an RPC framework (like gRPC).
It'll likely be slower, but then you could measure the Java process with a
Java Profiler (without ill-effects on GC).  You could also then
independently measure the KenLM process with a native profiler.  This might
give you a fairly accurate view of what to optimize.

-Kellen

On Wed, Oct 26, 2016 at 9:34 AM, lewis john mcgibbney 
wrote:

> I hear ye loud and clear Matt :) Thank you for the response.
>
> On Wed, Oct 26, 2016 at 12:30 AM, <
> dev-digest-h...@joshua.incubator.apache.org> wrote:
>
> >
> > From: Matt Post 
> > To: dev@joshua.incubator.apache.org
> > Cc:
> > Date: Tue, 25 Oct 2016 08:49:19 -0400
> > Subject: Re: Joshua Model Input Format(s) and LM Loading
> > Hi Lewis,
> >
> > Joshua supports two language model representation packages: KenLM [0] and
> > BerkeleyLM [1]. These were both developed at about the same time, and
> > represented huge gains in doing this task efficiently, over what had
> > previously been the standard approach (SRILM). Ken Heafield (who has
> > contributed a lot to Joshua) went on to contribute a lot of other
> > improvements to language model representation, decoder integration, and
> > also the actual construction of language models and their efficient
> > interpolation. His goal for a while was to make SRILM completely
> > unnecessary, and I think he succeeded.
> >
> > BerkeleyLM was more of a one-off project. It is slower than KenLM and
> > hasn't been touched in years. If you want to understand, your efforts are
> > probably best spent looking into KenLM papers. But it's also worth noting
> > that Ken is a crack C++ programmer who has spent years hacking away on
> > these problems, and your chances of finding any further efficiencies
> there
> > are probably quite limited unless you have a lot of background in the
> area.
> > But even if you did, I would recommend you not spend your time that way
> — I
> > basically consider the LM representation problem to have been solved by
> > KenLM. That's not to say that there are some improvements to be had on
> the
> > Joshua / JNI bridge, but even there, there are probably better things to
> do.
> >
> > matt
> >
> > [0] KenLM: Faster and Smaller Language Model Queries
> > http://www.kheafield.com/professional/avenue/kenlm.pdf
> >
> > [1] Faster and Smaller N-Gram Language Models
> > http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf
> >
> >
>

Re: moses2 vs. joshua

2016-10-06 Thread kellen sunderland

Will do, but it might be a few days before I get the time to do a proper
test.  Thanks for hosting Matt.

On Thu, Oct 6, 2016 at 2:19 AM, Matt Post <p...@cs.jhu.edu> wrote:

> Hi folks,
>
> Sorry this took so long, long story. But the four models that Hieu shared
> with me are ready. You can download them here; they're each about 15–20 GB.
>
>   http://cs.jhu.edu/~post/files/joshua-hiero-ar-en.tbz
>   http://cs.jhu.edu/~post/files/joshua-phrase-ar-en.tbz
>   http://cs.jhu.edu/~post/files/joshua-hiero-ru-en.tbz
>   http://cs.jhu.edu/~post/files/joshua-hiero-ru-en.tbz
>
> It'd be great if someone could test them on a machine with lots of cores,
> to see how things scale.
>
> matt
>
> On Sep 22, 2016, at 9:09 AM, Matt Post <p...@cs.jhu.edu> wrote:
>
> Hi folks,
>
> I have finished the comparison. Here you can find graphs for ar-en and
> ru-en. The ground-up rewrite of Moses is
> about 2x–3x faster than Joshua.
>
> http://imgur.com/a/FcIbW
>
> One implication (untested) is that we are likely as fast as or faster than
> Moses.
>
> We could brainstorm things to do to close this gap. I'd be much happier
> with 2x or even 1.5x than with 3x, and I bet we could narrow this down. But
> I'd like to get the 6.1 release out of the way, first, so I'm pushing this
> off to next month. Sound cool?
>
> matt
>
>
> On Sep 19, 2016, at 6:26 AM, Matt Post <p...@cs.jhu.edu> wrote:
>
> I can't believe I did this, but I mis-colored one of the hiero lines, and
> the Numbers legend doesn't show the line type. If you reload the dropbox
> file, it's fixed now. The difference is about 3x for both. Here's the table.
>
> Threads  Joshua  Moses2  Joshua (hiero)  Moses2 (hiero)  Phrase rate  Hiero rate
> 1        178     65      2116            1137            2.74         1.86
> 2        109     42      1014            389             2.60         2.61
> 4        78      29      596             213             2.69         2.80
> 6        72      25      473             154             2.88         3.07
>
> I'll put the models together and share them later today. This was on a
> 6-core machine and I agree it'd be nice to test with something much higher.
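
The thread-count timings reported above can also be used to check how close each decoder comes to linear scaling. The sketch below computes speedup and parallel efficiency from the phrase-based columns; the class and method names are illustrative only, and the units are whatever the original timings were reported in.

```java
/** Illustrative helper; these names are not from Joshua or Moses2. */
final class ScalingSketch {

  /** Speedup of a multi-threaded run relative to the single-thread run. */
  static double speedup(double singleThreadTime, double multiThreadTime) {
    return singleThreadTime / multiThreadTime;
  }

  /** Parallel efficiency: 1.0 would mean perfectly linear scaling. */
  static double efficiency(double singleThreadTime, double multiThreadTime, int threads) {
    return speedup(singleThreadTime, multiThreadTime) / threads;
  }

  public static void main(String[] args) {
    int[] threads = {1, 2, 4, 6};
    // Phrase-based timings copied from the table in this thread.
    double[] joshua = {178, 109, 78, 72};
    double[] moses2 = {65, 42, 29, 25};
    for (int i = 0; i < threads.length; i++) {
      System.out.printf(
          "threads=%d  joshua: %.2fx (eff %.2f)  moses2: %.2fx (eff %.2f)%n",
          threads[i],
          speedup(joshua[0], joshua[i]), efficiency(joshua[0], joshua[i], threads[i]),
          speedup(moses2[0], moses2[i]), efficiency(moses2[0], moses2[i], threads[i]));
    }
  }
}
```

On these numbers both decoders gain roughly 2.5x going from 1 to 6 threads (178/72 for Joshua, 65/25 for Moses2), well short of 6x. If the timings are end-to-end and include model loading, as described elsewhere in the thread, that fixed cost would account for part of the gap.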
>
> matt
>
>
> On Sep 19, 2016, at 5:33 AM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> Do we just want to store these models somewhere temporarily?  I've got a
> OneDrive account and could share the models from there (as long as they're
> below 500GBs or so).
>
> On Mon, Sep 19, 2016 at 11:32 AM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> Very nice results.  I think getting to within 25% of an optimized C++
> decoder from a Java decoder is impressive.  Great that Hieu has put in the
> work to make moses2 so fast as well, that gives organizations two quite
> nice decoding engines to choose from, both with reasonable performance.
>
> Matt: I had a question about the x axis here.  Is that number of threads?
> We should be scaling more or less linearly with the number of threads, is
> that the case here?  If you post the models somewhere I can also do a quick
> benchmark on a machine with a few more cores.
>
> -Kellen
>
>
> On Mon, Sep 19, 2016 at 10:53 AM, Tommaso Teofili <
> tommaso.teof...@gmail.com> wrote:
> On Sat, Sep 17, 2016 at 15:23, Matt Post <p...@cs.jhu.edu> wrote:
>
> I'll ask Hieu; I don't anticipate any problems. One potential problem is
> that that models occupy about 15--20 GB; do you think Jenkins would host
> this?
>
>
> I'm not sure, can such models be downloaded and pruned at runtime, or do
> they need to exist on the Jenkins machine ?
>
>
>
> (ru-en grammars still packing, results will probably not be in until much
> later today)
>
> matt
>
>
> On Sep 17, 2016, at 3:19 PM, Tommaso Teofili <tommaso.teof...@gmail.com>
> wrote:
>
>
> Hi Matt,
>
> I think it'd be really valuable if we could be able to repeat the same
> tests (given parallel corpus is available) in the future, any chance you
> can share script / code to do that ? We may even consider adding a
>
> Jenkins
>
> job dedicated to continuously monitor performances as we work on Joshua
> master branch.
>
> WDYT?
>
> Anyway thanks for sharing the very interesting comparisons.
> Regards,
> Tommaso
>
> Il giorno sab 17 set 20

Re: moses2 vs. joshua

2016-09-19 Thread kellen sunderland

Very nice results.  I think getting to within 25% of an optimized C++
decoder from a Java decoder is impressive.  Great that Hieu has put in the
work to make moses2 so fast as well, that gives organizations two quite
nice decoding engines to choose from, both with reasonable performance.

Matt: I had a question about the x axis here.  Is that number of threads?
We should be scaling more or less linearly with the number of threads, is
that the case here?  If you post the models somewhere I can also do a quick
benchmark on a machine with a few more cores.

-Kellen


On Mon, Sep 19, 2016 at 10:53 AM, Tommaso Teofili  wrote:

> On Sat, Sep 17, 2016 at 15:23, Matt Post wrote:
>
> > I'll ask Hieu; I don't anticipate any problems. One potential problem is
> > that that models occupy about 15--20 GB; do you think Jenkins would host
> > this?
> >
>
> I'm not sure, can such models be downloaded and pruned at runtime, or do
> they need to exist on the Jenkins machine ?
>
>
> >
> > (ru-en grammars still packing, results will probably not be in until much
> > later today)
> >
> > matt
> >
> >
> > > On Sep 17, 2016, at 3:19 PM, Tommaso Teofili <
> tommaso.teof...@gmail.com>
> > wrote:
> > >
> > > Hi Matt,
> > >
> > > I think it'd be really valuable if we could be able to repeat the same
> > > tests (given parallel corpus is available) in the future, any chance
> you
> > > can share script / code to do that ? We may even consider adding a
> > Jenkins
> > > job dedicated to continuously monitor performances as we work on Joshua
> > > master branch.
> > >
> > > WDYT?
> > >
> > > Anyway thanks for sharing the very interesting comparisons.
> > > Regards,
> > > Tommaso
> > >
> > > On Sat, Sep 17, 2016 at 12:29, Matt Post wrote:
> > >
> > >> Ugh, I think the mailing list deleted the attachment. Here is an
> attempt
> > >> around our censors:
> > >>
> > >> https://www.dropbox.com/s/80up63reu4q809y/ar-en-joshua-
> moses2.png?dl=0
> > >>
> > >>
> > >>> On Sep 17, 2016, at 12:21 PM, Matt Post  wrote:
> > >>>
> > >>> Hi everyone,
> > >>>
> > >>> One thing we did this week at MT Marathon was a speed comparison of
> > >> Joshua 6.1 (release candidate) with Moses2, which is a ground-up
> > rewrite of
> > >> Moses designed for speed (see the attached paper). Moses2 is 4–6x
> faster
> > >> than Moses phrase-based, and 100x (!) faster than Moses hiero.
> > >>>
> > >>> I tested using two moderate-to-large sized datasets that Hieu Hoang
> > >> (CC'd) provided me with: ar-en and ru-en. Timing results are from
> 10,000
> > >> sentences in each corpus. The average ar-en sentence length is 7.5,
> and
> > for
> > >> ru-en is 28. I only ran one test for each language, so there could be
> > some
> > >> variance if I averaged, but I think the results look pretty
> consistent.
> > The
> > >> timing is end-to-end (including model load times, which Moses2 tends
> to
> > be
> > >> a bit faster at).
> > >>>
> > >>> Note also that Joshua does not have lexicalized distortion, while
> > Moses2
> > >> does. This means the BLEU scores are a bit lower for Joshua: 62.85
> > versus
> > >> 63.49. This shouldn't really affect runtime, however.
> > >>>
> > >>> I'm working on the ru-en, but here are the ar-en results:
> > >>>
> > >>>
> > >>>
> > >>> Some conclusions:
> > >>>
> > >>> - Hieu has done some bang-up work on the Moses2 rewrite! Joshua is in
> > >> general about 3x slower than Moses2
> > >>>
> > >>> - We don't have a Moses comparison, but extrapolating from Hieu's
> > paper,
> > >> it seems we might be as fast as or faster than Moses phrase-based
> > decoding,
> > >> and are a ton faster on Hiero. I'm going to send my models to Hieu so
> he
> > >> can test on his machine, and then we'll have a better feel for this,
> > >> including how it scales on a machine with many more processors.
> > >>>
> > >>> matt
> > >>>
> > >>>
> > >>
> > >>
> >
> >
>

Heads up if you're getting JVM Crashes

2016-09-13 Thread kellen sunderland

Hello everyone,

Just wanted to give a heads up that as of this commit
https://github.com/apache/incubator-joshua/commit/90fff5ab1de3da23c0f64f90e69ce0da2392fd49
the ABI for libkenlm.so has changed.  That means you may have to recompile
it, or it could crash when calling probRule.

-Kellen

[jira] [Resolved] (JOSHUA-296) Refactor threading code

2016-08-29 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland resolved JOSHUA-296.
--
Resolution: Fixed

Fixed in this PR https://github.com/apache/incubator-joshua/pull/45

> Refactor threading code
> ---
>
> Key: JOSHUA-296
> URL: https://issues.apache.org/jira/browse/JOSHUA-296
> Project: Joshua
>  Issue Type: Improvement
>Reporter: Matt Post
>    Assignee: Kellen Sunderland
>Priority: Minor
> Fix For: 6.1
>
>
> The thread-handling code is a bit more complicated than it needs to be. We'd 
> like to simplify this using Executors while maintaining the current 
> stream-based processing features:
> - Input stream: decoding starts and is multithreaded even before the whole 
> input has been received (e.g., so that STDIN works)
> - Multithreading: translations are automatically assigned across threads in a 
> thread pool
> - Output stream: decoding returns right away and callers can block while 
> waiting for translations to assemble
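
The three requirements in this description can be sketched with a plain ExecutorService. This is a minimal illustration under stated assumptions, not the code from the PR: the class name, the decode method and the use of a function in place of the real decoder are all hypothetical.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/** Hypothetical sketch; none of these names come from the Joshua codebase. */
final class StreamingDecodeSketch {
  // Sentinel placed on the queue once the input stream is exhausted.
  private static final Future<String> END = CompletableFuture.completedFuture(null);

  /** Returns right away; the iterator blocks until each translation is ready. */
  static Iterator<String> decode(
      Iterator<String> input, Function<String, String> translate, int threads) {
    ExecutorService workers = Executors.newFixedThreadPool(threads);
    BlockingQueue<Future<String>> pending = new LinkedBlockingQueue<>();

    // Feeder thread: submits each sentence as soon as it is read, so decoding
    // starts before the whole input has arrived.
    Thread feeder = new Thread(() -> {
      try {
        while (input.hasNext()) {
          String sentence = input.next();
          Callable<String> task = () -> translate.apply(sentence);
          pending.add(workers.submit(task));
        }
      } finally {
        pending.add(END);
        workers.shutdown();
      }
    });
    feeder.setDaemon(true);
    feeder.start();

    return new Iterator<String>() {
      private Future<String> head;

      @Override
      public boolean hasNext() {
        if (head == null) {
          try {
            head = pending.take(); // blocks until the feeder has queued something
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
          }
        }
        return head != END;
      }

      @Override
      public String next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        try {
          String result = head.get(); // blocks until this translation is done
          head = null;
          return result;
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        } catch (ExecutionException e) {
          throw new IllegalStateException(e.getCause());
        }
      }
    };
  }
}
```

Because futures are queued in input order, results come back in the order the sentences were read even though the pool may finish them out of order.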



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (JOSHUA-307) Java-based tokenization and normalization

2016-08-29 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447105#comment-15447105
 ] 

Kellen Sunderland commented on JOSHUA-307:
--

+1.  This would be great, and could go into the CLI module.

> Java-based tokenization and normalization
> -
>
> Key: JOSHUA-307
> URL: https://issues.apache.org/jira/browse/JOSHUA-307
> Project: Joshua
>  Issue Type: Improvement
>Reporter: Matt Post
>Priority: Minor
> Fix For: 6.2
>
>
> Currently, Joshua expects data to be lowercased, normalized, and tokenized 
> consistent with the way the training data was prepared before being passed 
> in. This requires calling Perl scripts on the input data. It would be nice if 
> these Perl scripts (located under $JOSHUA/scripts/preparation) were rewritten 
> in Java (under org.apache.joshua.util) so that Joshua could do this 
> normalization itself. This would be particularly useful for the language 
> packs.




[jira] [Commented] (JOSHUA-285) Not all RuntimeExceptions are caught

2016-08-29 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446080#comment-15446080
 ] 

Kellen Sunderland commented on JOSHUA-285:
--

This is fixed in PR https://github.com/apache/incubator-joshua/pull/45 .  Any 
uncaught exception will now be propagated from the threadpool thread that it 
occurs on, back to the main thread that is iterating over translation results.  
The main thread can have control over how to handle these failures, but they 
will likely be fatal.  In the case of the CLI tool for example we can just 
crash with a stack trace.

There's also a test specifically causing a runtime exception on a worker thread 
and ensuring that it propagates to the main response thread.
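
The propagation described here follows the standard Future pattern, sketched below. The class and method names are hypothetical and this is not the code from the PR; it only shows how a failure on a pool thread can resurface on the thread that consumes the result.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical sketch of worker-to-caller exception propagation. */
final class WorkerFailureSketch {

  /** Runs the task on a pool thread and rethrows any failure on the caller. */
  static String runOnWorker(ExecutorService pool, Callable<String> task) {
    Future<String> future = pool.submit(task);
    try {
      return future.get();
    } catch (ExecutionException e) {
      // Future.get() wraps whatever the worker threw; unwrap it so the caller
      // sees the original exception instead of a silent hang.
      Throwable cause = e.getCause();
      if (cause instanceof RuntimeException) {
        throw (RuntimeException) cause;
      }
      if (cause instanceof Error) {
        throw (Error) cause;
      }
      throw new RuntimeException(cause);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

The caller then decides what to do with the exception; a command-line tool might simply let it terminate the process with a stack trace.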

> Not all RuntimeExceptions are caught
> 
>
> Key: JOSHUA-285
> URL: https://issues.apache.org/jira/browse/JOSHUA-285
> Project: Joshua
>  Issue Type: Bug
>Reporter: Matt Post
>    Assignee: Kellen Sunderland
> Fix For: 6.1
>
>
> In many instances Joshua threads will throw a RuntimeException that is not 
> caught, causing the decoder to hang indefinitely. These should be caught and, 
> if serious enough, cause the decoder to die. An example of an error that is 
> caught is running out of memory.




[jira] [Commented] (JOSHUA-95) Vocabulary locking

2016-08-18 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426055#comment-15426055
 ] 

Kellen Sunderland commented on JOSHUA-95:
-

Yes, all the contention on Vocabulary has been removed.  

> Vocabulary locking
> --
>
> Key: JOSHUA-95
> URL: https://issues.apache.org/jira/browse/JOSHUA-95
> Project: Joshua
>  Issue Type: Bug
>Reporter: Matt Post
>Assignee: Juri Ganitkevitch
> Fix For: 6.2
>
>
> Vocabulary::id() is still synchronized and a potential point of contention. 
> It would be nice to resolve this.
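
For illustration, one common way to avoid a synchronized id() method is a concurrent map with atomic ID allocation. This is only a sketch of that general technique: the thread does not say how the contention was actually removed in Joshua, and the class below is hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; not Joshua's Vocabulary class. */
final class VocabularySketch {
  private final ConcurrentHashMap<String, Integer> ids = new ConcurrentHashMap<>();
  private final AtomicInteger nextId = new AtomicInteger(0);

  /** Returns the ID for a word, assigning a fresh one on first sight. */
  int id(String word) {
    // computeIfAbsent is atomic per key, so two threads asking for the same
    // new word get the same ID without a global lock.
    return ids.computeIfAbsent(word, w -> nextId.getAndIncrement());
  }
}
```

Lookups of words that are already known do not block each other, which is the case that matters most during decoding. A reverse mapping from ID to word is omitted here.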




[jira] [Updated] (JOSHUA-295) Revamp dependency organization in Joshua

2016-08-18 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland updated JOSHUA-295:
-
Affects Version/s: 6.2

> Revamp dependency organization in Joshua
> 
>
> Key: JOSHUA-295
> URL: https://issues.apache.org/jira/browse/JOSHUA-295
> Project: Joshua
>  Issue Type: Improvement
>Affects Versions: 6.2
>    Reporter: Kellen Sunderland
>
> We would like to separate dependencies in Joshua by creating a multi-module 
> maven project.  This will allow us to decouple our codebase and make it more 
> modular.  This means consumers of Joshua who are only interested in a core 
> library do not have to pull in dependencies for things like Http servers or 
> database clients.




[jira] [Created] (JOSHUA-303) Simplify feature handling code within Joshua

2016-08-18 Thread Kellen Sunderland (JIRA)

Kellen Sunderland created JOSHUA-303:


 Summary: Simplify feature handling code within Joshua
 Key: JOSHUA-303
 URL: https://issues.apache.org/jira/browse/JOSHUA-303
 Project: Joshua
  Issue Type: Improvement
Affects Versions: 6.2, 7
Reporter: Kellen Sunderland


There's currently a lot of code branching and special cases necessary in Joshua 
to properly handle sparse versus dense features.  We could refactor this code 
to remove the distinction which would simplify many classes (FeatureVector, 
Rule, etc.).




[jira] [Commented] (JOSHUA-221) ArrayIndexOutOfBoundsException when passing arguments to JoshuaDecoder.main

2016-08-18 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426045#comment-15426045
 ] 

Kellen Sunderland commented on JOSHUA-221:
--

Maybe we could resolve this by using args4j for the JoshuaDecoder.main?
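
Whichever library ends up being used, the underlying fix is to never index past the end of the argument array. The sketch below shows bounds-checked parsing using only the standard library; it is a hypothetical illustration of the general failure mode and does not reproduce the actual ArgsParser code or the args4j API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of bounds-checked argument parsing. */
final class ArgsSketch {

  /** Parses "-flag" and "-key value" arguments without reading past the array. */
  static Map<String, String> parse(String[] args) {
    Map<String, String> options = new LinkedHashMap<>();
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      if (!arg.startsWith("-")) {
        throw new IllegalArgumentException("Unexpected argument: " + arg);
      }
      String key = arg.substring(1);
      // Only consume a value if one exists and it is not itself a flag.
      boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
      options.put(key, hasValue ? args[++i] : "true");
    }
    return options;
  }
}
```

With this approach a lone -version, or -version -v, parses as two boolean flags instead of failing on a missing value.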

> ArrayIndexOutOfBoundsException when passing arguments to JoshuaDecoder.main
> ---
>
> Key: JOSHUA-221
> URL: https://issues.apache.org/jira/browse/JOSHUA-221
> Project: Joshua
>  Issue Type: Bug
>Reporter: Lewis John McGibbney
> Fix For: 6.2
>
>
> {code}
> lmcgibbn@LMC-032857 /usr/local/joshua(master) $ java -jar class/joshua.jar 
> -version
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
>   at joshua.decoder.ArgsParser.(ArgsParser.java:43)
>   at joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:30)
> lmcgibbn@LMC-032857 /usr/local/joshua(master) $ java -jar class/joshua.jar 
> -version -v
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
>   at joshua.decoder.ArgsParser.(ArgsParser.java:43)
>   at joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:30)
> {code}




[jira] [Resolved] (JOSHUA-287) KenLM.java catches UnsatisfiedLinkError when attempting to load libken.so (libken.dylib on OSX)

2016-08-13 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland resolved JOSHUA-287.
--
Resolution: Fixed

UnsatisfiedLinkErrors are now wrapped with a descriptive RuntimeException.  The 
message has been updated to indicate that kenlm has not been found, but that 
this may not be a fatal error (e.g. if you're using BerkeleyLM).  Tests have been 
updated to skip any tests relying on KenLM by wrapping previously mentioned 
descriptive RuntimeExceptions in a TestNG SkipException.
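
The wrapping pattern can be sketched as follows. The class name and message text are hypothetical and are not copied from KenLM.java; only System.loadLibrary and UnsatisfiedLinkError are standard Java.

```java
/** Hypothetical sketch of wrapping a native-library load failure. */
final class NativeLibrarySketch {

  /** Loads a native library, turning a link failure into a descriptive exception. */
  static void load(String name) {
    try {
      System.loadLibrary(name);
    } catch (UnsatisfiedLinkError e) {
      // Keep the original error as the cause so the full trace is preserved.
      throw new RuntimeException(
          "Could not load native library '" + name
              + "'. This may not be fatal if no feature that needs it is used.",
          e);
    }
  }
}
```

A test that depends on the native library can catch this RuntimeException and rethrow it as a TestNG SkipException, so a missing library shows up as a skipped test rather than a failure.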

> KenLM.java catches UnsatisfiedLinkError when attempting to load libken.so 
> (libken.dylib on OSX)
> ---
>
> Key: JOSHUA-287
> URL: https://issues.apache.org/jira/browse/JOSHUA-287
> Project: Joshua
>  Issue Type: Bug
>  Components: core, kenlm
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>    Assignee: Kellen Sunderland
> Fix For: 6.1
>
>
> As explained in 
> http://www.mail-archive.com/dev%40joshua.incubator.apache.org/msg01189.html 
> currently we have an issue, where, when checked out from master the following 
> RuntimeException is thrown.
> {code}
> ---
>  T E S T S
> ---
> Running TestSuite
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> tm_pt_0=-2.000 tm_glue_0=3.000 lm_0=-206.718 lm_0_oov=2.000 
> OOVPenalty=-200.000 | -198.000
> ERROR - * FATAL: Can't find libken.so (libken.dylib on OS X) in $JOSHUA/lib
> ERROR - *This probably means that the KenLM library didn't compile.
> ERROR - *Make sure that BOOST_ROOT is set to the root of your boost
> ERROR - *installation (it's not /opt/local/, the default), change to
> ERROR - *$JOSHUA, and type 'ant kenlm'. If problems persist, see the
> ERROR - *website (joshua-decoder.org).
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> {code}
> We need to fix this such that we can run static source code analysis via 
> sonar and have our results available on analysis.apache.org.




Re: Podling Report Reminder - August 2016

2016-08-01 Thread kellen sunderland

Looks good to me.  Should we mention that we're planning a release that
moves our build system to maven?

On Mon, Aug 1, 2016 at 3:10 PM, Matt Post <p...@cs.jhu.edu> wrote:

> Hi folks,
>
> I just loaded this with a draft. Comments / unilateral changes will not
> meet resistance from me.
> --
> Joshua
>
> Joshua is a statistical machine translation toolkit
>
> Joshua has been incubating since 2016-02-13.
>
> Three most important issues to address in the move towards graduation:
>
>   1. Ensure first release of Joshua Incubating artifacts (6.1)
>   2. Continue to build the Joshua PPMC and user community
>   3. Investigate targeted user communities within Apache
>
> Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
> aware of?
>
>   None.
>
> How has the community developed since the last report?
>
>   We have gained a new contributor, and have continued developing and
>   updating the web page to increase interest. We have not made
>   any real advertising or publicity pushes, but hope to around the
>   time of our first formal release under the Apache banner (targeted
>   for September).
>
> How has the project developed since the last report?
>
>   We have been steadily pushing up stability and design improvements,
>   and are also deep in discussion about further changes. We have
>   made some changes to our team infrastructure, including enabling
>   Travis-CI for continual integration testing.
>
> Date of last release:
>
>   N/A
>
> When were the last committers or PMC members elected?
>
>   April 11, 2016 (Kellen Sunderland and Felix Hieber)
>
> Signed-off-by:
>
>   [ ](joshua) Paul Ramirez
>   [ ](joshua) Lewis John McGibbney
>   [ ](joshua) Chris Mattmann
>   [ ](joshua) Tom Barber
>   [ ](joshua) Henri Yandell
>
> Shepherd/Mentor notes:
> --
> matt
>
>
> > On Jul 31, 2016, at 9:15 AM, johndam...@apache.org wrote:
> >
> > Dear podling,
> >
> > This email was sent by an automated system on behalf of the Apache
> > Incubator PMC. It is an initial reminder to give you plenty of time to
> > prepare your quarterly board report.
> >
> > The board meeting is scheduled for Wed, 17 August 2016, 10:30 am PDT.
> > The report for your podling will form a part of the Incubator PMC
> > report. The Incubator PMC requires your report to be submitted 2 weeks
> > before the board meeting, to allow sufficient time for review and
> > submission (Wed, August 03).
> >
> > Please submit your report with sufficient time to allow the Incubator
> > PMC, and subsequently board members to review and digest. Again, the
> > very latest you should submit your report is 2 weeks prior to the board
> > meeting.
> >
> > Thanks,
> >
> > The Apache Incubator PMC
> >
> > Submitting your Report
> >
> > --
> >
> > Your report should contain the following:
> >
> > *   Your project name
> > *   A brief description of your project, which assumes no knowledge of
> >the project or necessarily of its field
> > *   A list of the three most important issues to address in the move
> >towards graduation.
> > *   Any issues that the Incubator PMC or ASF Board might wish/need to be
> >aware of
> > *   How has the community developed since the last report
> > *   How has the project developed since the last report.
> >
> > This should be appended to the Incubator Wiki page at:
> >
> > http://wiki.apache.org/incubator/August2016
> >
> > Note: This is manually populated. You may need to wait a little before
> > this page is created from a template.
> >
> > Mentors
> > ---
> >
> > Mentors should review reports for their project(s) and sign them off on
> > the Incubator wiki page. Signing off reports shows that you are
> > following the project - projects that are not signed may raise alarms
> > for the Incubator PMC.
> >
> > Incubator PMC
>
>

Re: master pushes

2016-07-28 Thread kellen sunderland

Ahh, I see we don't yet have a .travis.yml.  Let me try to create one; a
pull request should be coming soon.

-Kellen

On Thu, Jul 28, 2016 at 5:30 PM, kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Hey Matt, what still needs to be set up?  I can try and help out.
>
> On Thu, Jul 28, 2016 at 4:49 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
>> Hi folks,
>>
>> Sorry for the continued pushes to master. We have had Travis-CI enabled,
>> but I haven't taken the time to get it set up. Someone else should feel free
>> to take charge, here; otherwise, I hope to have time to do this after my
>> workshop is done, at the end of next week.
>>
>> matt
>
>
>

[jira] [Comment Edited] (JOSHUA-287) KenLM.java catches UnsatisfiedLinkError when attempting to load libken.so (libken.dylib on OSX)

2016-07-28 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397701#comment-15397701
 ] 

Kellen Sunderland edited comment on JOSHUA-287 at 7/28/16 3:42 PM:
---

I should have addressed this in my latest PR.  
https://github.com/apache/incubator-joshua/pull/33

I think the correct behaviour is to throw a RuntimeException here.  There are
two environments to consider: testing and the general case.  At test time my
feeling is that if a user doesn't have libkenlm we should simply skip any tests
that rely on KenLM.  We shouldn't force users to download/compile KenLM just to
run unit tests.  In the general case, if we make a call into the KenLM class and
the library is not on our java.library.path, it's a serious error.  We should
throw a descriptive exception (now a KenLMLoadException, which extends
RuntimeException) and let the caller deal with it.

Can you provide some details on how this breaks Sonar?  Is it still broken 
after the PR?
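
To make the proposal above concrete, here is a minimal sketch of the load-and-wrap pattern. It is not the code from the PR: the KenLMLoadException constructor, the NativeLibraryLoader helper and its isAvailable method are assumptions made up for illustration. Only the exception name, the library name "ken" and the java.library.path lookup come from this thread.

```java
// Hypothetical sketch; not the code from the PR. Only the exception name and
// the library name "ken" are taken from the discussion.

/** Unchecked exception raised when the native library cannot be loaded. */
class KenLMLoadException extends RuntimeException {
  KenLMLoadException(String libraryName, UnsatisfiedLinkError cause) {
    super("Could not load native library '" + libraryName
        + "'; check that it is on java.library.path", cause);
  }
}

/** Illustrative helper (assumed name) that wraps System.loadLibrary. */
final class NativeLibraryLoader {
  private NativeLibraryLoader() {}

  /** General case: load the library or throw a descriptive unchecked exception. */
  static void load(String libraryName) {
    try {
      System.loadLibrary(libraryName);
    } catch (UnsatisfiedLinkError e) {
      throw new KenLMLoadException(libraryName, e);
    }
  }

  /** Test time: lets a test decide to skip instead of failing. */
  static boolean isAvailable(String libraryName) {
    try {
      load(libraryName);
      return true;
    } catch (KenLMLoadException e) {
      return false;
    }
  }
}

class KenLMLoadSketch {
  public static void main(String[] args) {
    // The result depends on whether libken is installed on this machine.
    if (NativeLibraryLoader.isAvailable("ken")) {
      System.out.println("KenLM available: run the KenLM-backed tests");
    } else {
      System.out.println("KenLM not available: skip the KenLM-backed tests");
    }
  }
}
```

A test set-up method could call isAvailable("ken") and mark the test as skipped when it returns false, while production code calls load("ken") and lets the exception propagate to the caller.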


was (Author: kellen.sunderland):
I should have addressed this in my latest PR.  
https://github.com/apache/incubator-joshua/pull/33

I think the correct behaviour is to throw a RuntimeException here.  There are
two environments to consider: testing and the general case.  At test time my
feeling is that we should simply skip any tests that rely on KenLM.  We
shouldn't force users to download/compile KenLM just to run unit tests.  In
the general case, if we make a call into the KenLM class and the library is not
on our java.library.path, it's a serious error.  We should throw a descriptive
exception (now a KenLMLoadException, which extends RuntimeException) and let
the caller deal with it.

Can you provide some details on how this breaks Sonar?  Is it still broken 
after the PR?

> KenLM.java catches UnsatisfiedLinkError when attempting to load libken.so 
> (libken.dylib on OSX)
> ---
>
> Key: JOSHUA-287
> URL: https://issues.apache.org/jira/browse/JOSHUA-287
> Project: Joshua
> Issue Type: Bug
> Components: core, kenlm
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Fix For: 6.1
>
>
> As explained in 
> http://www.mail-archive.com/dev%40joshua.incubator.apache.org/msg01189.html 
> currently we have an issue, where, when checked out from master the following 
> RuntimeException is thrown.
> {code}
> ---
>  T E S T S
> ---
> Running TestSuite
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> tm_pt_0=-2.000 tm_glue_0=3.000 lm_0=-206.718 lm_0_oov=2.000 
> OOVPenalty=-200.000 | -198.000
> ERROR - * FATAL: Can't find libken.so (libken.dylib on OS X) in $JOSHUA/lib
> ERROR - *This probably means that the KenLM library didn't compile.
> ERROR - *Make sure that BOOST_ROOT is set to the root of your boost
> ERROR - *installation (it's not /opt/local/, the default), change to
> ERROR - *$JOSHUA, and type 'ant kenlm'. If problems persist, see the
> ERROR - *website (joshua-decoder.org).
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> {code}
> We need to fix this such that we can run static source code analysis via 
> sonar and have our results available on analysis.apache.org.




[jira] [Commented] (JOSHUA-287) KenLM.java catches UnsatisfiedLinkError when attempting to load libken.so (libken.dylib on OSX)

2016-07-28 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397701#comment-15397701
 ] 

Kellen Sunderland commented on JOSHUA-287:
--

I should have addressed this in my latest PR.  
https://github.com/apache/incubator-joshua/pull/33

I think the correct behaviour is to throw a RuntimeException here.  There are
two environments to consider: testing and the general case.  At test time my
feeling is that we should simply skip any tests that rely on KenLM.  We
shouldn't force users to download/compile KenLM just to run unit tests.  In
the general case, if we make a call into the KenLM class and the library is not
on our java.library.path, it's a serious error.  We should throw a descriptive
exception (now a KenLMLoadException, which extends RuntimeException) and let
the caller deal with it.

Can you provide some details on how this breaks Sonar?  Is it still broken 
after the PR?

> KenLM.java catches UnsatisfiedLinkError when attempting to load libken.so 
> (libken.dylib on OSX)
> ---
>
> Key: JOSHUA-287
> URL: https://issues.apache.org/jira/browse/JOSHUA-287
> Project: Joshua
> Issue Type: Bug
> Components: core, kenlm
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Fix For: 6.1
>
>
> As explained in 
> http://www.mail-archive.com/dev%40joshua.incubator.apache.org/msg01189.html 
> currently we have an issue, where, when checked out from master the following 
> RuntimeException is thrown.
> {code}
> ---
>  T E S T S
> ---
> Running TestSuite
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> tm_pt_0=-2.000 tm_glue_0=3.000 lm_0=-206.718 lm_0_oov=2.000 
> OOVPenalty=-200.000 | -198.000
> ERROR - * FATAL: Can't find libken.so (libken.dylib on OS X) in $JOSHUA/lib
> ERROR - *This probably means that the KenLM library didn't compile.
> ERROR - *Make sure that BOOST_ROOT is set to the root of your boost
> ERROR - *installation (it's not /opt/local/, the default), change to
> ERROR - *$JOSHUA, and type 'ant kenlm'. If problems persist, see the
> ERROR - *website (joshua-decoder.org).
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> {code}
> We need to fix this such that we can run static source code analysis via 
> sonar and have our results available on analysis.apache.org.




Re: master pushes

2016-07-28 Thread kellen sunderland

Hey Matt, what still needs to be set up?  I can try and help out.

On Thu, Jul 28, 2016 at 4:49 PM, Matt Post wrote:

> Hi folks,
>
> Sorry for the continued pushes to master. We have had Travis-CI enabled,
> but I haven't taken the time to get it set up. Someone else should feel free
> to take charge, here; otherwise, I hope to have time to do this after my
> workshop is done, at the end of next week.
>
> matt

Re: Issue Building LM on master branch

2016-07-16 Thread kellen sunderland

Hey Lewis, as an alternative you can try fast_align. It's been working well
for us.  10 minutes seems a little bit faster than what I'd expect.  IIRC
it may take a few hours (4-8?) to align that much data.

https://github.com/clab/fast_align

-Kellen

On Sun, Jul 17, 2016 at 12:01 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
> When attempting to build a Hiero model using 5K sentences for tuning, many
> many more than that for testing and again many many more than that for the
> actual corpus (~880K) I get the following error within the GIZA alignment
> pipeline phase.
>
> Anyone have a clue what this means? I have the full GIZA logs if they are
> useful.
> I did find a thread on a VERY similar issue at [0]. The solution seems to
> be to use absolute paths to all input data for the pipeline however that is
> exactly what I've done e.g.
>
> $JOSHUA/bin/pipeline.pl  --rundir . --type hiero --corpus
> /usr/local/joshua_input/commoncrawl.ru-en --tune
> /usr/local/joshua_input/commoncrawl.ru-en.tune --test
> /usr/local/joshua_input/commoncrawl.ru-en.test --source en --target ru
> --rundir experiment1/1 --readme "Experiment 1 Run 1 Hiero Russian to
> English Translation model" --mbr
>
> Where the parallel .en and .ru sentence files exist for all of the above
> corpus, tune and test paths respectively.
>
> [0] http://comments.gmane.org/gmane.comp.nlp.moses.user/10489
>
> I have been having trouble consistently when generating models using
> GIZA... is there a suggested alignment substitute which I should be trying
> out?
>
> One last question... roughly how long should a Hiero-based LM for a corpus
> of ~880K sentences take on, say, a MacBook Pro with a 2.7GHz Intel Core i7
> and 16GB of memory? I remember reading a while ago on the old Joshua site
> that a pipeline would run in 10 or so minutes... this is clearly not the
> case, and I would like to share/compare some results if possible with others
> who are in the business of generating LMs and language packs.
>
> Thanks
>
> ==
> Executing: bash -c rm -f alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz
> Executing: bash -c gzip alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final
> Waiting for second GIZA process...
> (3) generate word alignment @ Fri Jul 15 16:38:42 PDT 2016
> Combining forward and inverted alignment from files:
>   alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.{bz2,gz}
>   alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.{bz2,gz}
> Executing: bash -c mkdir -p alignments/0/model
> Executing: bash -c /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d
> <(gzip -cd alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
> |/usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> symal: computing grow alignment: diagonal (1) final (1)both-uncovered (0)
> skip=<0> counts=<817962>
> symal(9081,0x7fff76241310) malloc: *** error for object 0x7fff74472250:
> pointer being freed was not allocated
> *** set a breakpoint in malloc_error_break to debug
> bash: line 1:  9080 Done
> /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d <(gzip -cd
> alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
>   9081 Abort trap: 6   |
> /usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> Exit code: 134
> ERROR: Can't generate symmetrized alignment file
>
>
>
> --
> *Lewis*
>

Re: Avoiding master failures with CI

2016-07-13 Thread kellen sunderland

Friday is good for me.  Chat with you then.

If anyone else wants to hop on a hangout let me know.

On Wed, Jul 13, 2016 at 6:58 PM, Matt Post <p...@cs.jhu.edu> wrote:

> I misread the day, here, and thought you meant today. I can't do tomorrow
> afternoon, but that time on Friday works for me. We could also go into next
> week if that's better.
>
>
> > On Jul 13, 2016, at 9:41 AM, Matt Post <p...@cs.jhu.edu> wrote:
> >
> > That works for me. I've watched the video you linked so I have a feel
> > for this, but I still think it'd be good to chat.
> >
> > matt
> >
> >
> >> On Jul 13, 2016, at 8:21 AM, kellen sunderland
> >> <kellen.sunderl...@gmail.com> wrote:
> >>
> >> Would everyone be ok with tomorrow at 5PM UTC?
> >>
> >> -Kellen
> >>
> >> On Tue, Jul 12, 2016 at 8:35 PM, Matt Post <p...@cs.jhu.edu> wrote:
> >>
> >>> Hi Kellen,
> >>>
> >>> No worries, and you did provide a link. I think a Google Hangouts
> >>> walkthrough would be an efficient way to go about this. What day and
> >>> time work for you? I am mostly open this week.
> >>>
> >>> matt
> >>>
> >>>
> >>>> On Jul 11, 2016, at 6:50 PM, kellen sunderland
> >>>> <kellen.sunderl...@gmail.com> wrote:
> >>>>
> >>>> Sorry, I should have provided the link to this page:
> >>>> https://travis-ci.org/ . If you scroll down a bit on that page there's
> >>>> a Pull Request flow section; it's the flow I'd be most in favour of.
> >>>> There's also a decent (but rushed) demo here:
> >>>> https://www.youtube.com/watch?v=Uft5KBimzyk . We actually don't need
> >>>> to do a lot of the work that he demos, i.e. no node or gulp
> >>>> configuration.  Our setup is close enough to a default Java project
> >>>> that we just have to tell it to build with Java 8 and then it runs
> >>>> Maven properly.
> >>>>
> >>>> Using a CI server would have some aspects that are similar to the
> >>>> branching document you mention, and some benefits that are a bit
> >>>> orthogonal.  Most of these benefits have to do with unit testing, which
> >>>> isn't covered in the doc.
> >>>>
> >>>> First the orthogonal benefits:  The main benefit we would get from
> >>>> using CI is that we guarantee code in our repo is never broken.  That
> >>>> is to say, tests always pass and it always builds correctly.  CI
> >>>> servers are really useful for preventing problems where one developer
> >>>> has everything working properly on his/her machine, but later realizes
> >>>> it's not working on another dev's machine.  A good example of this is
> >>>> the class-based-lm-test we pushed recently.  It works fine for me
> >>>> locally but it would fail for anyone without kenlm.so.  There are many
> >>>> other examples (javadoc errors, code style, etc.), but what will happen
> >>>> in these cases is we'll see a big obvious 'The build has problems'
> >>>> message in the PR page on Github.  If the CI server runs all of our
> >>>> code quality checks and finds that everything is good, we'll get a big
> >>>> 'This PR is ready to merge' message.
> >>>>
> >>>> Now to the part that overlaps a bit with branching.  There are various
> >>>> branching strategies that we could adopt for the project.  The master /
> >>>> dev branch one is a possibility.  I'd suggest we try to commit code
> >>>> strictly in PRs rather than pushing to git.  This would be the
> >>>> equivalent of feature branching from your link.  The reason I'd suggest
> >>>> that approach is that from what I've seen it'll be dead simple to get
> >>>> working with Github and Travis, and it gives us the same goal of having
> >>>> a stable master branch.
> >>>>
> >>>> If you'd like we can walk through setting this up together on a forked
> >>>> version of our Github repo.  We could do a

Re: joshua - Build # 49 - Still Failing!

2016-07-13 Thread kellen sunderland

Ahh, ok.  I guess I'll just keep an eye on it.  Thanks Tom (and thanks for
doing the work to set this up).

On Wed, Jul 13, 2016 at 2:43 PM, Tom Barber <t...@analytical-labs.com> wrote:

> That one did you are correct, but a few builds later it reverted back to
> fine, so whatever it was was transient I guess.
>
> --
>
> Director Meteorite.bi - Saiku Analytics Founder
> Tel: +44(0)5603641316
>
> (Thanks to the Saiku community we reached our Kickstart
> <
> http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/
> >
> goal, but you can always help by sponsoring the project
> <http://www.meteorite.bi/products/saiku/sponsorship>)
>
> On 13 July 2016 at 13:41, kellen sunderland <kellen.sunderl...@gmail.com>
> wrote:
>
> > To me it reads as if it failed when trying to upload.
> >
> > [WARNING] *** CHECKSUM FAILED - Checksum failed on download: local =
> > 'feabc96bb65f9ea4da42af561362d0f429ea7ded'; remote =
> > '1252f3767e96442e19af8fb760ed07156f4a70cc' - RETRYING
> > [WARNING] *** CHECKSUM FAILED - Checksum failed on download: local =
> > 'feabc96bb65f9ea4da42af561362d0f429ea7ded'; remote =
> > '1252f3767e96442e19af8fb760ed07156f4a70cc' - IGNORING
> > Uploading:
> > https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar
> > 4/1118K
> > 8/1118K
> > 12/1118K
> > 16/1118K
> >
> > 1116/1118K
> > 1118/1118K
> > [INFO]
> > [ERROR] BUILD ERROR
> > [INFO]
> > [INFO] Error deploying artifact: Failed to transfer file:
> > https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar.
> > Return code is: 401
> >
> >
> >
> > -Kellen
> >
> >
> > On Wed, Jul 13, 2016 at 2:24 PM, Tom Barber <tom.bar...@meteorite.bi>
> > wrote:
> >
> > >
> > >
> >
> https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/
> > >
> > > Snapshots are uploading, it's just missing the version you're looking
> > > for.
> > >
> > > On Wed, Jul 13, 2016 at 1:20 PM, kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Strange, https://builds.apache.org/job/joshua_master/78/ is passing.
> > > > Looks like the analysis CI is getting a 401 when trying to upload a
> > > > build artifact to here:
> > > >
> > > >
> > > >
> > >
> >
> https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar
> > > >
> > > >
> > > > Anyone know who has admin access on this CI server?  I think we might
> > > need
> > > > to double check the auth settings for this step.
> > > >
> > > > -Kellen
> > > >
> > > >
> > > > --
> > > >
> > > > joshua - Build # 49 - Still Failing:
> > > >
> > > >
> > > > Check console output at
> > > https://analysis.apache.org/jenkins/job/joshua/49/
> > > > to view the results.
> > > >
> > >
> >
>

Re: joshua - Build # 49 - Still Failing!

2016-07-13 Thread kellen sunderland

To me it reads as if it failed when trying to upload.

[WARNING] *** CHECKSUM FAILED - Checksum failed on download: local =
'feabc96bb65f9ea4da42af561362d0f429ea7ded'; remote =
'1252f3767e96442e19af8fb760ed07156f4a70cc' - RETRYING
[WARNING] *** CHECKSUM FAILED - Checksum failed on download: local =
'feabc96bb65f9ea4da42af561362d0f429ea7ded'; remote =
'1252f3767e96442e19af8fb760ed07156f4a70cc' - IGNORING
Uploading:
https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar
4/1118K
8/1118K
12/1118K
16/1118K





1116/1118K
1118/1118K
[INFO]
[ERROR] BUILD ERROR
[INFO]
[INFO] Error deploying artifact: Failed to transfer file:
https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar.
Return code is: 401



-Kellen


On Wed, Jul 13, 2016 at 2:24 PM, Tom Barber <tom.bar...@meteorite.bi> wrote:

>
> https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/
>
> Snapshots are uploading, it's just missing the version you're looking for.
>
> On Wed, Jul 13, 2016 at 1:20 PM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Strange, https://builds.apache.org/job/joshua_master/78/ is passing.
> > Looks like the analysis CI is getting a 401 when trying to upload a build
> > artifact to here:
> >
> >
> >
> https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar
> >
> >
> > Anyone know who has admin access on this CI server?  I think we might
> need
> > to double check the auth settings for this step.
> >
> > -Kellen
> >
> >
> > --
> >
> > joshua - Build # 49 - Still Failing:
> >
> >
> > Check console output at
> https://analysis.apache.org/jenkins/job/joshua/49/
> > to view the results.
> >
>

Re: joshua - Build # 49 - Still Failing!

2016-07-13 Thread kellen sunderland

Strange, https://builds.apache.org/job/joshua_master/78/ is passing.  Looks
like the analysis CI is getting a 401 when trying to upload a build
artifact to here:

https://repository.apache.org/content/repositories/snapshots/org/apache/joshua/joshua/6.0.6-SNAPSHOT/joshua-6.0.6-20160713.055028-28.jar


Anyone know who has admin access on this CI server?  I think we might need
to double check the auth settings for this step.

-Kellen


--

joshua - Build # 49 - Still Failing:


Check console output at https://analysis.apache.org/jenkins/job/joshua/49/
to view the results.

Re: [IMPORTANT] Roadmap for 6.1 Release

2016-07-11 Thread kellen sunderland

Thanks for organizing Lewis, sorry for the late replies.  Looking at the
frequency of our updates I'd suggest quarterly, or bi-annual releases.  If
we can keep the master branch stable (which should really be a goal of
ours) then hopefully it's not too much work to create the releases.  I do
appreciate that there's probably some effort required to create release
notes + documentation.  Hopefully JIRA will be able to help us create some
of this documentation.

I'd agree that we should shoot for a 6.1 release fairly soon.  I'll review
the PRs that came from our side early after the Apache switch.  They should
probably have JIRA tickets tracking the changes with fix version assigned
as 6.1.

-Kellen



On Thu, Jun 23, 2016 at 11:01 PM, Tom Barber wrote:

> Hey Matt
>
> Over on OODT our releases are few and far between, although that said,
> I've been trying to increase the frequency even if they are very minor. The
> main reason being, if someone commits some code, they don't want to wait 12
> months for it to hit a stable release! So you might say yearly major
> releases and patch releases at sporadic points in between to include patches
> people have submitted. This also keeps drive-by committers interested,
> because if they get some stuff into the codebase they then may commit more,
> rather than say "well I submitted a fix for issue x ages ago and it's got
> nowhere".  Releases don't need to be set in stone, but I would try and
> keep them ticking over.
>
> Just my own 2 cents.
>
> Tom
>
> --
>
> Director Meteorite.bi - Saiku Analytics Founder
> Tel: +44(0)5603641316
>
> (Thanks to the Saiku community we reached our Kickstart
> <
> http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/
> >
> goal, but you can always help by sponsoring the project
> )
>
> On 23 June 2016 at 21:56, Matt Post wrote:
>
> > Hi Lewis,
> >
> > Sorry for taking some time to get back to you. I think the roadmap looks
> > great. One thing, though, is that the Amazon folks and I have discussed
> > making a number of backwards-incompatible changes in an effort to
> > modernize some pieces of the code. This would have to do with things like
> > the config file format, a totally new pipeline based on duct tape, and
> > some other ideas. We think those changes would be suitable for a 7.0
> > release (major version number change signals backwards incompatibility).
> >
> > I think we've been doing some good work on improving Joshua, but at the
> > same time, I think the release cycle is still a little too accelerated
> > for me. I would like to push back to semi-yearly or even yearly releases,
> > with bug fixes in between. However, I'm also curious how this might
> > affect our ability to move out of incubation. Do you have any thoughts on
> > this?
> >
> > The major downside to releases is documentation. It's just hard to find
> > the time to do.
> >
> > My own thoughts for what I'd like to do:
> >
> > - Maybe a 6.1 release (soon, to get it out of the way? or otherwise this
> > fall?), where we formalize the Apache move and maybe formalize the
> > release of a handful of language packs, without a lot of other changes
> >
> > - Write a linux.com article advertising this, hopefully attracting some
> > attention
> >
> > - Shoot for a 7.0 release with many of the changes we've discussed (some
> > offline). If we get a good showing at MT Marathon in Prague this year,
> > that could be a good time to get all of that in order.
> >
> > - Start getting to work on a version of Joshua that swaps out the core
> > decoder for a neural approach
> >
> > matt
> >
> >
> >
> >
> > > On Jun 23, 2016, at 4:13 PM, Tom Barber wrote:
> > >
> > > I would volunteer some cycles for multi-model support in the server, an
> > > improved REST interface, and a basic UI for end user interaction if you
> > > fancy it.
> > >
> > > --
> > >
> > > Director Meteorite.bi - Saiku Analytics Founder
> > > Tel: +44(0)5603641316
> > >
> > > (Thanks to the Saiku community we reached our Kickstart
> > > <
> >
> http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/
> > >
> > > goal, but you can always help by sponsoring the project
> > > )
> > >
> > > On 23 June 2016 at 21:10, Lewis John Mcgibbney <
> > lewis.mcgibb...@gmail.com>
> > > wrote:
> > >
> > >> Hi Folks,
> > >> Anyone have any comments on this?
> > >> Seeing that the Maven multimodule project seems to be taking flight,
> > >> it would be nice to see where the roadmap is going?
> > >> Any comments would be great. Also, I'm kinda lost as to what is
> > >> happening with Jira but it looks like it is not really being used for
> > >> much.
> > >> Thanks
> > >>
> > >> On Mon, Jun 20, 2016 at 11:34 AM, Lewis John Mcgibbney <
> > >>

[jira] [Commented] (JOSHUA-279) Cannot build Joshua master branch

2016-07-11 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370387#comment-15370387
 ] 

Kellen Sunderland commented on JOSHUA-279:
--

Hey Lewis, sorry about these tests failing. We should not be breaking master
like this; I'll try to ensure this doesn't happen in the future.  I'd propose
that in the short term we ignore any tests that rely on KenLM.

There are a few options we can look at in the long term:

*  Figure out an acceptable way to grab the KenLM binaries as a dependency when
our project is built.  This would mean downloading them from a reliable source,
and ensuring the binaries match our platform.
*  Download KenLM source as a dependency and build it locally when Joshua
builds (but don't include KenLM source in our repository).
*  I was going to look into the feasibility of mocking out the language model
for these tests.  I'm skeptical that this will work, as there are likely
millions of calls that would need to be mocked, even in the case of a small
integration test example.  That being said, maybe we can perform some kind of
simplification to allow the tests to be useful and still be mockable.
*  Use a different, Java-based LM for unit tests.  We could for example use
BerkeleyLM, or a very simple Java LM implementation.
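
To give a feel for the last option, a "very simple Java LM" for unit tests could be as small as the sketch below. Everything here is hypothetical: ToyUnigramLM, logProb and sentenceLogProb are invented names, this is neither Joshua's language model interface nor BerkeleyLM, and a real test LM would probably need n-gram context. It only illustrates that a pure-Java stand-in needs no native library.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in LM for unit tests; the class and method names are
// invented for illustration and do not match any existing Joshua interface.
final class ToyUnigramLM {
  private final Map<String, Integer> counts = new HashMap<>();
  private int totalTokens = 0;

  ToyUnigramLM(List<String> trainingTokens) {
    for (String token : trainingTokens) {
      counts.merge(token, 1, Integer::sum);
      totalTokens++;
    }
  }

  /**
   * Natural-log unigram probability with add-one smoothing. The vocabulary is
   * the set of seen word types plus one slot shared by all unseen words.
   */
  double logProb(String word) {
    int vocabularySize = counts.size() + 1;
    int count = counts.getOrDefault(word, 0);
    return Math.log((count + 1.0) / (totalTokens + vocabularySize));
  }

  /** Sum of per-word log probabilities; no n-gram context is used. */
  double sentenceLogProb(List<String> sentence) {
    double sum = 0.0;
    for (String word : sentence) {
      sum += logProb(word);
    }
    return sum;
  }
}

class ToyUnigramLMDemo {
  public static void main(String[] args) {
    ToyUnigramLM lm = new ToyUnigramLM(Arrays.asList("a", "a", "b"));
    System.out.println(lm.sentenceLogProb(Arrays.asList("a", "b", "c")));
  }
}
```

Because the model is deterministic and built from a handful of tokens inside the test, expected scores can be computed by hand, which is harder to do with a compiled KenLM model.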



> Cannot build Joshua master branch
> -
>
> Key: JOSHUA-279
> URL: https://issues.apache.org/jira/browse/JOSHUA-279
> Project: Joshua
> Issue Type: Bug
> Components: build, documentation, tests
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 6.1
>
>
> Hi Folks,
> We need to be cautious of whatever is committed to master branch... the build 
> has been broken for quite some time and there are constant Javadoc issues 
> which make the build unstable as well.
> For example, when I make an attempt to build master branch we have failing 
> tests
> {code}
> lmcgibbn@LMC-032857 /usr/local/incubator-joshua(master) $ mvn clean install
> ...
> ---
>  T E S T S
> ---
> Running TestSuite
> tm_pt_0=-2.000 tm_glue_0=3.000 lm_0=-206.718 lm_0_oov=2.000 
> OOVPenalty=-200.000 | -198.000
> ERROR - * FATAL: Can't find libken.so (libken.dylib on OS X) in $JOSHUA/lib
> ERROR - *This probably means that the KenLM library didn't compile.
> ERROR - *Make sure that BOOST_ROOT is set to the root of your boost
> ERROR - *installation (it's not /opt/local/, the default), change to
> ERROR - *$JOSHUA, and type 'ant kenlm'. If problems persist, see the
> ERROR - *website (joshua-decoder.org).
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - sentence 0 too long 401, truncating to length 200
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> WARN - no grammars supplied!  Supplying dummy glue grammar.
> %
> %
> %
> %
> %
> %
> %
> %
> %
> Tests run: 126, Failures: 1, Errors: 0, Skipped: 6, Time elapsed: 1.818 sec 
> <<< FAILURE! - in TestSuite
> setUp(org.apache.joshua.decoder.ff.lm.class_lm.ClassBasedLanguageModelTest)  
> Time elapsed: 0.075 sec  <<< FAILURE!
> java.lang.ExceptionInInitializerError
>   at 
> org.apache.joshua.decoder.ff.lm.class_lm.ClassBasedLanguageModelTest.setUp(ClassBasedLanguageModelTest.java:52)
> Caused by: java.lang.RuntimeException: java.lang.UnsatisfiedLinkError: no ken 
> in java.library.path
>   at 
> org.apache.joshua.decoder.ff.lm.class_lm.ClassBasedLanguageModelTest.setUp(ClassBasedLanguageModelTest.java:52)
> Caused by: java.lang.UnsatisfiedLinkError: no ken in java.library.path
>   at 
> org.apache.joshua.decoder.ff.lm.class_lm.ClassBasedLanguageModelTest.setUp(ClassBasedLanguageModelTest.java:52)
> Results :
> Failed tests:
> org.apache.joshua.decoder.ff.lm.class_lm.ClassBasedLanguageModelTest.setUp(org.apache.joshua.decoder.ff.lm.class_lm.ClassBasedLanguageModelTest)
>   Run 1: ClassBasedLanguageModelTest.setUp:52 » ExceptionInInitializer
>   Run 2: PASS

[jira] [Commented] (JOSHUA-274) Use another HTTPServer other than Suns

2016-05-31 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307596#comment-15307596
 ] 

Kellen Sunderland commented on JOSHUA-274:
--

I would propose we don't put anything web-server-specific in the core Joshua
jar.  In my mind the Joshua package should really contain just a translation
library.  We could provide other jars with CLIs and a RESTful service consuming
this library (maybe even built by default).

My reasoning is that if you are consuming Joshua strictly as a translation
library you may prefer there be no web-service code there.  If you're already
hosting this code in your own web service it could be quite confusing to have
similar functionality in the library you are exposing.  Worse would be the case
where through a configuration error you accidentally turn on a second web
service (maybe on a different port).

The other point I'd make is that by including HTTP functionality in the main
package we're adding a bunch of dependencies on things like JSON libs, etc.
These dependencies could conflict with those of anyone wanting to use the
package as a service.

> Use another HTTPServer other than Suns
> --
>
> Key: JOSHUA-274
> URL: https://issues.apache.org/jira/browse/JOSHUA-274
> Project: Joshua
>  Issue Type: Improvement
>  Components: decoders
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: 6.1
>
>
> This issue concerns the use of the 
> [HttpServer|https://github.com/apache/incubator-joshua/blob/master/src/joshua/decoder/JoshuaDecoder.java#L31]
>  within JoshuaDecoder.java. 
> We should replace the com.sun.net.httpserver.HttpServer implementation and 
> other Sun classes with ones from the Java API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (JOSHUA-275) Revamp the Configuration System

2016-05-27 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland updated JOSHUA-275:
-
Description: 
I'd like to propose we centralize Joshua's configuration system to make use of
typesafe/config https://github.com/typesafehub/config .  Its format (HOCON) is
a superset of JSON that allows comments, so it's easy to read.  Being
JSON-based, it supports hierarchies of configurations, lists of configuration,
etc. quite easily.  It has some nice features like parsing durations
automatically.  The main advantage here though is that we have a standard
config system that doesn't have to be manually parsed.

Here's a quick example of how we can use it:

{code:java}
@Inject
public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
 String grammar_dir,
 @TypesafeConfig("PackedGrammar.span_limit")
 int span_limit, 
 String owner, 
 String type) throws FileNotFoundException, IOException 
...
{code}

and then a config similar to

\# Joshua configuration file
{code:javascript}
config = {
default-non-terminal = X
goal-symbol = GOAL
...

PackedGrammar: {
type: thrax,
grammar_dir: /local/grammars/...
span_limit: 50
}
...
}
{code}

Version: TBD, but it's a breaking change so we may consider putting it in 
Joshua 7.

Totally open to other config / injection systems if others want to suggest any 
of their favorites.

  was:
I'd like to propose we centralize Joshua's configuration system to make use of 
typesafe/config https://github.com/typesafehub/config .  This config system 
looks like JSON but with comments so it's easy to read.  Because it's JSON it 
supports hierarchies of configurations, lists of configuration etc quite 
easily.  It has some nice features like parsing time automatically.  The main 
advantage here though is that we have a standard config system that doesn't 
have to be manually parsed.

Here's a quick example of how we can use it:

@Inject
public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
 String grammar_dir,
 @TypesafeConfig("PackedGrammar.span_limit")
 int span_limit, 
 String owner, 
 String type) throws FileNotFoundException, IOException 
...

and then a config similar to

\# Joshua configuration file
config = {
default-non-terminal = X
goal-symbol = GOAL
...

PackedGrammar: {
type: thrax,
grammar_dir: /local/grammars/...
span_limit: 50
}
...
}

Version: TBD, but it's a breaking change so we may consider putting it in 
Joshua 7.

Totally open to other config / injection systems if others want to suggest any 
of their favorites.


> Revamp the Configuration System
> ---
>
> Key: JOSHUA-275
> URL: https://issues.apache.org/jira/browse/JOSHUA-275
> Project: Joshua
>  Issue Type: Improvement
>Affects Versions: 6.1, 6.2, 7
>Reporter: Kellen Sunderland
>
> I'd like to propose we centralize Joshua's configuration system to make use 
> of typesafe/config https://github.com/typesafehub/config .  This config 
> system looks like JSON but with comments so it's easy to read.  Because it's 
> JSON it supports hierarchies of configurations, lists of configuration etc 
> quite easily.  It has some nice features like parsing time automatically.  
> The main advantage here though is that we have a standard config system that 
> doesn't have to be manually parsed.
> Here's a quick example of how we can use it:
> {code:java}
> @Inject
> public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
>  String grammar_dir,
>  @TypesafeConfig("PackedGrammar.span_limit")
>  int span_limit, 
>  String owner, 
>  String type) throws FileNotFoundException, 
> IOException ...
> {code}
> and then a config similar to
> \# Joshua configuration file
> {code:javascript}
> config = {
> default-non-terminal = X
> goal-symbol = GOAL
> ...
> 
> PackedGrammar: {
> type: thrax,
> grammar_dir: /local/grammars/...
> span_limit: 50
> }
> ...
> }
> {code}
> Version: TBD, but it's a breaking change so we may consider putting it in 
> Joshua 7.
> Totally open to other config / injection systems if others want to suggest 
> any of their favorites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (JOSHUA-275) Revamp the Configuration System

2016-05-27 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland updated JOSHUA-275:
-
Description: 
I'd like to propose we centralize Joshua's configuration system to make use of 
typesafe/config https://github.com/typesafehub/config .  This config system 
looks like JSON but with comments so it's easy to read.  Because it's JSON it 
supports hierarchies of configurations, lists of configuration etc quite 
easily.  It has some nice features like parsing time automatically.  The main 
advantage here though is that we have a standard config system that doesn't 
have to be manually parsed.

Here's a quick example of how we can use it:

@Inject
public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
 String grammar_dir,
 @TypesafeConfig("PackedGrammar.span_limit")
 int span_limit, 
 String owner, 
 String type) throws FileNotFoundException, IOException 
...

and then a config similar to

\# Joshua configuration file
config = {
default-non-terminal = X
goal-symbol = GOAL
...

PackedGrammar: {
type: thrax,
grammar_dir: /local/grammars/...
span_limit: 50
}
...
}

Version: TBD, but it's a breaking change so we may consider putting it in 
Joshua 7.

Totally open to other config / injection systems if others want to suggest any 
of their favorites.

  was:
I'd like to propose we centralize Joshua's configuration system to make use of 
typesafe/config https://github.com/typesafehub/config .  This config system 
looks like JSON but with comments so it's easy to read.  Because it's JSON it 
supports hierarchies of configurations, lists of configuration etc quite 
easily.  It has some nice features like parsing time automatically.  The main 
advantage here though is that we have a standard config system that doesn't 
have to be manually parsed.

Here's a quick example of how we can use it:

@Inject
public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
 String grammar_dir,
 @TypesafeConfig("PackedGrammar.span_limit")
 int span_limit, 
 String owner, 
 String type) throws FileNotFoundException, IOException 
...

and then a config similar to

\# Joshua configuration file
config = {
default-non-terminal = X
goal-symbol = GOAL
...

tm: {
type: thrax,
grammar_dir: /local/grammars/...
span_limit: 50
}
...
}

Version: TBD, but it's a breaking change so we may consider putting it in 
Joshua 7.

Totally open to other config / injection systems if others want to suggest any 
of their favorites.


> Revamp the Configuration System
> ---
>
> Key: JOSHUA-275
> URL: https://issues.apache.org/jira/browse/JOSHUA-275
> Project: Joshua
>  Issue Type: Improvement
>Affects Versions: 6.1, 6.2, 7
>Reporter: Kellen Sunderland
>
> I'd like to propose we centralize Joshua's configuration system to make use 
> of typesafe/config https://github.com/typesafehub/config .  This config 
> system looks like JSON but with comments so it's easy to read.  Because it's 
> JSON it supports hierarchies of configurations, lists of configuration etc 
> quite easily.  It has some nice features like parsing time automatically.  
> The main advantage here though is that we have a standard config system that 
> doesn't have to be manually parsed.
> Here's a quick example of how we can use it:
> @Inject
> public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
>  String grammar_dir,
>  @TypesafeConfig("PackedGrammar.span_limit")
>  int span_limit, 
>  String owner, 
>  String type) throws FileNotFoundException, 
> IOException ...
> and then a config similar to
> \# Joshua configuration file
> config = {
> default-non-terminal = X
> goal-symbol = GOAL
> ...
> 
> PackedGrammar: {
> type: thrax,
> grammar_dir: /local/grammars/...
> span_limit: 50
> }
> ...
> }
> Version: TBD, but it's a breaking change so we may consider putting it in 
> Joshua 7.
> Totally open to other config / injection systems if others want to suggest 
> any of their favorites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (JOSHUA-275) Revamp the Configuration System

2016-05-27 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland updated JOSHUA-275:
-
Description: 
I'd like to propose we centralize Joshua's configuration system to make use of 
typesafe/config https://github.com/typesafehub/config .  This config system 
looks like JSON but with comments so it's easy to read.  Because it's JSON it 
supports hierarchies of configurations, lists of configuration etc quite 
easily.  It has some nice features like parsing time automatically.  The main 
advantage here though is that we have a standard config system that doesn't 
have to be manually parsed.

Here's a quick example of how we can use it:

@Inject
public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
 String grammar_dir,
 @TypesafeConfig("PackedGrammar.span_limit")
 int span_limit, 
 String owner, 
 String type) throws FileNotFoundException, IOException 
...

and then a config similar to

\# Joshua configuration file
config = {
default-non-terminal = X
goal-symbol = GOAL
...

tm: {
type: thrax,
grammar_dir: /local/grammars/...
span_limit: 50
}
...
}

Version: TBD, but it's a breaking change so we may consider putting it in 
Joshua 7.

Totally open to other config / injection systems if others want to suggest any 
of their favorites.

  was:
I'd like to propose we centralize Joshua's configuration system to make use of 
typesafe/config https://github.com/typesafehub/config .  This config system 
looks like JSON but with comments so it's easy to read.  Because it's JSON it 
supports hierarchies of configurations, lists of configuration etc quite 
easily.  It has some nice features like parsing time automatically.  The main 
advantage here though is that we have a standard config system that doesn't 
have to be manually parsed.

Here's a quick example of how we can use it:

@Inject
public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
 String grammar_dir,
 @TypesafeConfig("PackedGrammar.span_limit")
 int span_limit, 
 String owner, 
 String type) throws FileNotFoundException, IOException 
...

and then a config similar to

\# Joshua configuration file
config = {
\# Joshua configuration file
default-non-terminal = X
goal-symbol = GOAL
...

tm: {
type: thrax,
grammar_dir: /local/grammars/...
span_limit: 50
}
...
}

Version: TBD, but it's a breaking change so we may consider putting it in 
Joshua 7.

Totally open to other config / injection systems if others want to suggest any 
of their favorites.


> Revamp the Configuration System
> ---
>
> Key: JOSHUA-275
> URL: https://issues.apache.org/jira/browse/JOSHUA-275
> Project: Joshua
>  Issue Type: Improvement
>Affects Versions: 6.1, 6.2, 7
>Reporter: Kellen Sunderland
>
> I'd like to propose we centralize Joshua's configuration system to make use 
> of typesafe/config https://github.com/typesafehub/config .  This config 
> system looks like JSON but with comments so it's easy to read.  Because it's 
> JSON it supports hierarchies of configurations, lists of configuration etc 
> quite easily.  It has some nice features like parsing time automatically.  
> The main advantage here though is that we have a standard config system that 
> doesn't have to be manually parsed.
> Here's a quick example of how we can use it:
> @Inject
> public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
>  String grammar_dir,
>  @TypesafeConfig("PackedGrammar.span_limit")
>  int span_limit, 
>  String owner, 
>  String type) throws FileNotFoundException, 
> IOException ...
> and then a config similar to
> \# Joshua configuration file
> config = {
> default-non-terminal = X
> goal-symbol = GOAL
> ...
> 
> tm: {
> type: thrax,
> grammar_dir: /local/grammars/...
> span_limit: 50
> }
> ...
> }
> Version: TBD, but it's a breaking change so we may consider putting it in 
> Joshua 7.
> Totally open to other config / injection systems if others want to suggest 
> any of their favorites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (JOSHUA-275) Revamp the Configuration System

2016-05-27 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland updated JOSHUA-275:
-
Description: 
I'd like to propose we centralize Joshua's configuration system to make use of 
typesafe/config https://github.com/typesafehub/config .  This config system 
looks like JSON but with comments so it's easy to read.  Because it's JSON it 
supports hierarchies of configurations, lists of configuration etc quite 
easily.  It has some nice features like parsing time automatically.  The main 
advantage here though is that we have a standard config system that doesn't 
have to be manually parsed.

Here's a quick example of how we can use it:

@Inject
public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
 String grammar_dir,
 @TypesafeConfig("PackedGrammar.span_limit")
 int span_limit, 
 String owner, 
 String type) throws FileNotFoundException, IOException 
...

and then a config similar to

\# Joshua configuration file
config = {
\# Joshua configuration file
default-non-terminal = X
goal-symbol = GOAL
...

tm: {
type: thrax,
grammar_dir: /local/grammars/...
span_limit: 50
}
...
}

Version: TBD, but it's a breaking change so we may consider putting it in 
Joshua 7.

Totally open to other config / injection systems if others want to suggest any 
of their favorites.

  was:
I'd like to propose we centralize Joshua's configuration system to make use of 
typesafe/config https://github.com/typesafehub/config .  This config system 
looks like JSON but with comments so it's easy to read.  Because it's JSON it 
supports hierarchies of configurations, lists of configuration etc quite 
easily.  It has some nice features like parsing time automatically.  The main 
advantage here though is that we have a standard config system that doesn't 
have to be manually parsed.

Here's a quick example of how we can use it:

@Inject
public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
 String grammar_dir,
 @TypesafeConfig("PackedGrammar.span_limit")
 int span_limit, 
 String owner, 
 String type) throws FileNotFoundException, IOException 
...

and then a config similar to

# Joshua configuration file
config = {
# Joshua configuration file
default-non-terminal = X
goal-symbol = GOAL
...

tm: {
type: thrax,
grammar_dir: /local/grammars/...
span_limit: 50
}
...
}

Version: TBD, but it's a breaking change so we may consider putting it in 
Joshua 7.

Totally open to other config / injection systems if others want to suggest any 
of their favorites.


> Revamp the Configuration System
> ---
>
> Key: JOSHUA-275
> URL: https://issues.apache.org/jira/browse/JOSHUA-275
> Project: Joshua
>  Issue Type: Improvement
>Affects Versions: 6.1, 6.2, 7
>Reporter: Kellen Sunderland
>
> I'd like to propose we centralize Joshua's configuration system to make use 
> of typesafe/config https://github.com/typesafehub/config .  This config 
> system looks like JSON but with comments so it's easy to read.  Because it's 
> JSON it supports hierarchies of configurations, lists of configuration etc 
> quite easily.  It has some nice features like parsing time automatically.  
> The main advantage here though is that we have a standard config system that 
> doesn't have to be manually parsed.
> Here's a quick example of how we can use it:
> @Inject
> public PackedGrammar(@TypesafeConfig("PackedGrammar.grammar_dir")
>  String grammar_dir,
>  @TypesafeConfig("PackedGrammar.span_limit")
>  int span_limit, 
>  String owner, 
>  String type) throws FileNotFoundException, 
> IOException ...
> and then a config similar to
> \# Joshua configuration file
> config = {
> \# Joshua configuration file
> default-non-terminal = X
> goal-symbol = GOAL
> ...
> 
> tm: {
> type: thrax,
> grammar_dir: /local/grammars/...
> span_limit: 50
> }
> ...
> }
> Version: TBD, but it's a breaking change so we may consider putting it in 
> Joshua 7.
> Totally open to other config / injection systems if others want to suggest 
> any of their favorites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: too many emails

2016-05-26 Thread kellen sunderland

I'd +1 as well.  Your breakdown looks good to me Chris.

On Thu, May 26, 2016 at 4:12 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> +1 to a separate list for GitHub stuff. Many communities (Kudu,
> Spark, etc.) end up doing this.
>
> How about:
>
> revi...@joshua.incubator.apache.org
> git...@joshua.incubator.apache.org
> iss...@joshua.incubator.apache.org
>
> Any of those?
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
>
>
>
>
>
>
>
> On 5/26/16, 6:00 AM, "Matt Post"  wrote:
>
> >I agree it's good to have Github stuff archived on Apache-owned domains,
> >I just think that the list gets overwhelmed with garbage that most people
> >are just deleting. I mean, I like the idea of skimming through commits, but
> >today I am waking up to over 100 emails, and I have to pick out the
> >auto-generated emails that I don't have time to read from the important
> >ones. If most people are just saving things to a separate folder, that they
> >are never going to read, isn't it better to turn off those auto-emails?
> >
> >Why not use a separate list like git@ or archive@ for such posts? Then
> >it's there for people to search, but no one has to wade through it.
> >
> >
> >
> >
> >> On May 26, 2016, at 12:45 AM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
> >>
> >> Hi Matt,
> >>
> >> As Henry said. Either we get them going to a different list or else you
> >> subscribe to dev-dig...@joshua.incubator.apache.org (subscribe through
> >> dev-digest-subscr...@joshua.incubator.apache.org)?
> >> Which do you prefer?
> >> Quick reasoning as to why Github convo is shadowed on the Apache lists.
> >> If Github ever goes away, then we lose all of the conversation. We
> >> archive it @Apache so we cover our communities.
> >> Thanks
> >>
> >>
> >> On Wed, May 25, 2016 at 2:11 PM, <
> >> dev-digest-h...@joshua.incubator.apache.org> wrote:
> >>
> >>>
> >>> From: Matt Post 
> >>> To: dev@joshua.incubator.apache.org
> >>> Cc:
> >>> Date: Wed, 25 May 2016 15:48:24 -0400
> >>> Subject: too many emails
> >>> Does someone know how to turn off the mailing of all github comments to
> >>> dev?
> >>>
> >>> The way I see it, we all have to be on dev, so it should be for people,
> >>> not robots. I am getting every comment about three times.
> >>>
> >>> I would just do it but I don't know how.
> >>>
> >>>
> >
>

[jira] [Created] (JOSHUA-266) Refactor key interfaces and core code for a future release.

2016-05-13 Thread Kellen Sunderland (JIRA)

Kellen Sunderland created JOSHUA-266:


 Summary: Refactor key interfaces and core code for a future 
release. 
 Key: JOSHUA-266
 URL: https://issues.apache.org/jira/browse/JOSHUA-266
 Project: Joshua
  Issue Type: Improvement
Reporter: Kellen Sunderland
Priority: Minor


We've discussed making some modifications to the key interfaces.  This ticket 
can focus on making large changes to the codebase for a future release.  This 
work will likely take some time and some collaboration.  I'd suggest the
code for this live on a separate release branch.

Some issues we can work on:
*  I'd propose we conform to the SOLID principles for our major interfaces.  
https://en.wikipedia.org/wiki/SOLID_(object-oriented_design)  . 
*  We can look at Sparse / Dense feature vectors and how to handle them 
naturally in Joshua.
*  Refactor objects that may now be used more broadly than was originally 
intended (for example Vocabulary class).
*  We should have a general discussion around what parts of the codebase are 
responsible for what functions.  We should clearly define what logic should be 
a part of the Grammar versus the Feature Functions for example, and make sure 
logic doesn't leak from one of these objects to the others.

 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (JOSHUA-265) Refactor key interfaces and core code for a future release.

2016-05-13 Thread Kellen Sunderland (JIRA)

Kellen Sunderland created JOSHUA-265:


 Summary: Refactor key interfaces and core code for a future 
release. 
 Key: JOSHUA-265
 URL: https://issues.apache.org/jira/browse/JOSHUA-265
 Project: Joshua
  Issue Type: Improvement
Reporter: Kellen Sunderland
Priority: Minor


We've discussed making some modifications to the key interfaces.  This ticket 
can focus on making large changes to the codebase for a future release.  This 
work will likely take some time and some collaboration.  I'd suggest the
code for this live on a separate release branch.

Some issues we can work on:
*  I'd propose we conform to the SOLID principles for our major interfaces.  
https://en.wikipedia.org/wiki/SOLID_(object-oriented_design)  . 
*  We can look at Sparse / Dense feature vectors and how to handle them 
naturally in Joshua.
*  Refactor objects that may now be used more broadly than was originally 
intended (for example Vocabulary class).
*  We should have a general discussion around what parts of the codebase are 
responsible for what functions.  We should clearly define what logic should be 
a part of the Grammar versus the Feature Functions for example, and make sure 
logic doesn't leak from one of these objects to the others.

 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (JOSHUA-264) Remove system exits and replace with RuntimeExceptions

2016-05-13 Thread Kellen Sunderland (JIRA)

Kellen Sunderland created JOSHUA-264:


 Summary: Remove system exits and replace with RuntimeExceptions
 Key: JOSHUA-264
 URL: https://issues.apache.org/jira/browse/JOSHUA-264
 Project: Joshua
  Issue Type: Improvement
Reporter: Kellen Sunderland


When Joshua is used as a library it's much more convenient to get
RuntimeExceptions when a fatal error happens.  This way the host process can
possibly handle the error or take some appropriate action (alarm, log, etc).
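
To illustrate the change this ticket asks for, here is a minimal sketch. The class names and the span_limit parsing are hypothetical and are not taken from the Joshua codebase; the only point is that a fatal condition surfaces as an unchecked exception instead of a call to System.exit, so the embedding process gets to decide what happens next.

```java
// Hypothetical helper; the name and the span_limit parsing are illustrative only.
final class SpanLimitParser {
  // Before: on bad input, code like this might print an error and call
  // System.exit(1), which would also terminate any application embedding it.
  // After: the failure is reported as an unchecked exception.
  static int parse(String raw) {
    if (raw == null) {
      throw new IllegalArgumentException("span_limit is missing");
    }
    try {
      return Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("span_limit is not an integer: " + raw, e);
    }
  }
}

// Hypothetical host application that embeds the library.
final class HostProcessSketch {
  // The host decides how to react: log, raise an alarm, or fall back to a default.
  static int spanLimitOrDefault(String raw, int fallback) {
    try {
      return SpanLimitParser.parse(raw);
    } catch (RuntimeException e) {
      System.err.println("Falling back to default: " + e.getMessage());
      return fallback;
    }
  }

  public static void main(String[] args) {
    System.out.println(spanLimitOrDefault("50", 20));
    System.out.println(spanLimitOrDefault("fifty", 20));
  }
}
```

With System.exit in place of the throw, the second call in main would end the whole JVM rather than return the fallback value.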





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: Language Pack size

2016-05-13 Thread kellen sunderland

That's a great idea, can we pre-sort the grammar as well?
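
The pre-computation Matt suggests in the quoted message below (collapsing each rule's feature values into one quantized score using the published weights) could look roughly like this. It is only a sketch: the class and method names are hypothetical, and the uniform 8-bit quantizer is an assumption standing in for whatever scheme the grammar packer actually uses.

```java
// Hypothetical sketch; not the grammar packer's actual code.
final class RuleScoreSketch {

  // Dot product of the model weights with one rule's feature values.
  static float collapse(float[] weights, float[] features) {
    if (weights.length != features.length) {
      throw new IllegalArgumentException("weight and feature vectors differ in length");
    }
    float score = 0f;
    for (int i = 0; i < weights.length; i++) {
      score += weights[i] * features[i];
    }
    return score;
  }

  // Uniform 8-bit quantization over [min, max]; the byte holds an unsigned bucket 0..255.
  static byte quantize(float score, float min, float max) {
    if (!(max > min)) {
      throw new IllegalArgumentException("max must be greater than min");
    }
    float clamped = Math.max(min, Math.min(max, score));
    int bucket = Math.round((clamped - min) / (max - min) * 255f);
    return (byte) bucket;
  }

  // Maps a stored bucket back to an approximate score.
  static float dequantize(byte bucket, float min, float max) {
    return min + ((bucket & 0xFF) / 255f) * (max - min);
  }

  public static void main(String[] args) {
    float[] weights = {1f, 2f};
    float[] features = {3f, 4f};
    float score = collapse(weights, features);
    byte stored = quantize(score, -20f, 20f);
    System.out.println(score + " is stored as bucket " + (stored & 0xFF));
  }
}
```

Each rule then carries one byte instead of a full feature vector, at the cost Matt mentions: the individual weights can no longer be changed without repacking.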

On Fri, May 13, 2016 at 1:47 PM, Matt Post <p...@cs.jhu.edu> wrote:

> Quantization is also supported in the grammar packer.
>
> Another idea: since we know the model weights when we publish a language
> pack, we should pre-compute the dot product of the weight vector against
> the grammar weights and reduce it to a single (quantized) score.
>
> (This would reduce the ability for users to play with the individual
> weights, but I don't think that's a huge loss, since the main weight is LM
> vs. TM).
>
> matt
>
>
> > On May 13, 2016, at 4:45 PM, Matt Post <p...@cs.jhu.edu> wrote:
> >
> > Oh, yes, of course. That's in build_binary.
> >
> >
> >> On May 13, 2016, at 4:39 PM, kellen sunderland <
> >> kellen.sunderl...@gmail.com> wrote:
> >>
> >> Could we also use quantization with the language model to reduce the
> >> size? KenLM supports this, right?
> >>
> >> On Fri, May 13, 2016 at 1:19 PM, Matt Post <p...@cs.jhu.edu> wrote:
> >>
> >>> Great idea, hadn't thought of that.
> >>>
> >>> I think we could also get some leverage out of:
> >>>
> >>> - Reducing the language model to a 4-gram one
> >>> - Doing some filtering of the phrase table to reduce low-probability
> >>> translation options
> >>>
> >>> These would be a bit lossier but I doubt it would matter much at all.
> >>>
> >>> matt
> >>>
> >>>
> >>>> On May 13, 2016, at 4:02 PM, Tom Barber <t...@analytical-labs.com>
> >>>> wrote:
> >>>>
> >>>> Out of curiosity more than anything else I tested XZ compression on a
> >>>> model instead of Gzip, it takes the Spain pack down from 1.9GB to 1.5GB,
> >>>> not the most ever, but obviously does mean 400MB+ less in remote storage
> >>>> and data going over the wire.
> >>>>
> >>>> Worth considering I guess.
> >>>>
> >>>> Tom
> >>>> --
> >>>>
> >>>> Director Meteorite.bi - Saiku Analytics Founder
> >>>> Tel: +44(0)5603641316
> >>>>
> >>>> (Thanks to the Saiku community we reached our Kickstart
> >>>> <
> >>>
> http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/
> >>>>
> >>>> goal, but you can always help by sponsoring the project
> >>>> <http://www.meteorite.bi/products/saiku/sponsorship>)
> >>>
> >>>
> >
>
>

[jira] [Closed] (JOSHUA-263) Standardize logging across Joshua

2016-05-13 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland closed JOSHUA-263.

Resolution: Duplicate

> Standardize logging across Joshua
> -
>
> Key: JOSHUA-263
> URL: https://issues.apache.org/jira/browse/JOSHUA-263
> Project: Joshua
>  Issue Type: Improvement
>    Reporter: Kellen Sunderland
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (JOSHUA-263) Standardize logging across Joshua

2016-05-13 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland updated JOSHUA-263:
-
Description: (was: We would like to standardize logging across Joshua.  
The purpose is to provide a very loose coupling to concrete logging systems, 
such that organizations can plug in whatever loggers they want at runtime.

There's also a surprisingly large performance consideration that can be
addressed here as well.  There are a few cases where much of the CPU work we're
doing is actually building strings that don't get logged at the logging levels
we're running under.  Lazy evaluation of these strings should prevent this
issue.)

> Standardize logging across Joshua
> -
>
> Key: JOSHUA-263
> URL: https://issues.apache.org/jira/browse/JOSHUA-263
> Project: Joshua
>  Issue Type: Improvement
>    Reporter: Kellen Sunderland
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (JOSHUA-263) Standardize logging across Joshua

2016-05-13 Thread Kellen Sunderland (JIRA)

Kellen Sunderland created JOSHUA-263:


 Summary: Standardize logging across Joshua
 Key: JOSHUA-263
 URL: https://issues.apache.org/jira/browse/JOSHUA-263
 Project: Joshua
  Issue Type: Improvement
Reporter: Kellen Sunderland


We would like to standardize logging across Joshua.  The purpose is to provide 
a very loose coupling to concrete logging systems, such that organizations can 
plug in whatever loggers they want at runtime.

There's also a surprisingly large performance consideration that can be
addressed here as well.  There are a few cases where much of the CPU work we're
doing is actually building strings that don't get logged at the logging levels
we're running under.  Lazy evaluation of these strings should prevent this
issue.
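
The lazy-evaluation point can be shown with a small sketch. The ticket does not name a logging framework, so java.util.logging is used here purely as an assumption to keep the example dependency-free; the class name and the counter are hypothetical. The sketch counts how often an expensive message is built while FINE logging is disabled.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical demo class; not part of Joshua.
final class LazyLoggingSketch {
  private static final Logger LOG = Logger.getLogger("joshua.sketch");

  // Counts how many times the expensive message is actually built.
  private static final AtomicInteger BUILDS = new AtomicInteger();

  // Stand-in for an expensive string construction (e.g. dumping a hypothesis).
  private static String describe(int[] hypothesis) {
    BUILDS.incrementAndGet();
    return "hypothesis: " + Arrays.toString(hypothesis);
  }

  // Returns {builds with eager logging, builds with lazy logging} at level INFO.
  static int[] countBuilds() {
    LOG.setUseParentHandlers(false); // keep the demo quiet
    LOG.setLevel(Level.INFO);        // FINE messages are disabled
    int[] hypothesis = {4, 8, 15};

    BUILDS.set(0);
    LOG.fine(describe(hypothesis));       // eager: string is built, then discarded
    int eager = BUILDS.get();

    BUILDS.set(0);
    LOG.fine(() -> describe(hypothesis)); // lazy: the supplier is never invoked
    int lazy = BUILDS.get();

    return new int[] {eager, lazy};
  }

  public static void main(String[] args) {
    int[] builds = countBuilds();
    System.out.println("eager builds: " + builds[0] + ", lazy builds: " + builds[1]);
  }
}
```

The eager call pays for the string even though nothing is logged, while the supplier-based call does not. Parameterized messages in other logging facades achieve a similar effect.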



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (JOSHUA-260) Integrate IoC (Inversion of Control) into Joshua

2016-05-13 Thread Kellen Sunderland (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kellen Sunderland reassigned JOSHUA-260:


Assignee: Kellen Sunderland

> Integrate IoC (Inversion of Control) into Joshua
> 
>
> Key: JOSHUA-260
> URL: https://issues.apache.org/jira/browse/JOSHUA-260
> Project: Joshua
>  Issue Type: Improvement
>    Reporter: Kellen Sunderland
>    Assignee: Kellen Sunderland
>
> I'd like to propose we look into using Guice
> (https://github.com/google/guice) in conjunction with Joshua's configuration
> system.  I believe it would give us a nice way to map what is in the
> configuration to the code paths and implementations used within Joshua.  It
> would also go a long way toward allowing us to integrate unit tests throughout
> all the important classes in Joshua.  What does everyone think?  Would IoC be
> a good pattern to adopt?  Is everyone OK with using Guice (versus, say, some
> other IoC library)?
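
As a rough illustration of the pattern under discussion, here is a hand-wired sketch of constructor injection with no container at all. Every name in it (LanguageModel, ConstantLanguageModel, DecoderSketch, WiringSketch, and the lm.type and lm.value keys) is hypothetical. A library such as Guice would take over the role of the wiring class; the sketch deliberately does not show Guice's own API.

```java
import java.util.Map;

// Hypothetical abstraction the decoder depends on.
interface LanguageModel {
  double score(String sentence);
}

// Hypothetical implementation selected through configuration.
final class ConstantLanguageModel implements LanguageModel {
  private final double value;

  ConstantLanguageModel(double value) {
    this.value = value;
  }

  @Override
  public double score(String sentence) {
    return value;
  }
}

// The decoder receives its collaborator instead of constructing it.
final class DecoderSketch {
  private final LanguageModel languageModel;

  DecoderSketch(LanguageModel languageModel) {
    this.languageModel = languageModel;
  }

  double decode(String sentence) {
    return languageModel.score(sentence);
  }
}

// Maps configuration values to implementations; an IoC container would replace this.
final class WiringSketch {
  static DecoderSketch fromConfig(Map<String, String> config) {
    String type = config.getOrDefault("lm.type", "constant");
    if ("constant".equals(type)) {
      double value = Double.parseDouble(config.getOrDefault("lm.value", "0"));
      return new DecoderSketch(new ConstantLanguageModel(value));
    }
    throw new IllegalArgumentException("Unknown lm.type: " + type);
  }
}
```

Because DecoderSketch only sees the interface, a unit test can hand it a stub such as `new DecoderSketch(s -> s.length())` without touching configuration or native libraries, which is the testing benefit the ticket describes.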



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: ApacheCon Meetup

2016-05-12 Thread kellen sunderland

I just wanted to discuss it as a group.  Your approach looks good to me.

On Thu, May 12, 2016 at 6:05 PM, Henry Saputra <henry.sapu...@gmail.com>
wrote:

> Ah sorry, trigger happy
>
> About logging. Are you proposing to use the log4j interface in the code? I
> would recommend using slf4j [1] as a facade abstraction.
> The implementation could then be provided by log4j or logback.
>
> Love to see API access to Joshua.
>
> - Henry
>
> [1] http://www.slf4j.org
>
> On Thu, May 12, 2016 at 6:03 PM, Henry Saputra <henry.sapu...@gmail.com>
> wrote:
>
> > About logging. Are you proposing to use log4j interface in the code? I
> > would recommend to use slf4j [1]
> >
> >
> > [
> >
> > On Thu, May 12, 2016 at 2:30 PM, kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> >> Thanks for organizing Lewis,
> >>
> >> Here's some topics for discussion I've been noting while working with
> >> Joshua.  None of these are high priority issues for me, but if we are
> all
> >> in agreement on them it might make sense to log them.
> >>
> >> Boring code convention stuff: Logging with log4j, throw Runtime
> Exceptions
> >> instead of Typed, remove all system exits (replace with
> >> RuntimeExceptions),
> >> refactor some large files.
> >>
> >> Testing: Integrate existing unit tests, provide some good test examples
> so
> >> others can begin adding more tests.
> >>
> >> Configuration: We also touched on IoC, CLI args, and configuration
> changes
> >> that are possible.
> >>
> >> OO stuff: Joshua is pretty good here, but I would personally prefer more
> >> granular interfaces.  I wouldn't advocate radical changes, but maybe a
> >> little refactoring might make sense to better align with the interface
> >> segregation principle.
> >> https://en.wikipedia.org/wiki/Interface_segregation_principle
> >>
> >> JNI reliance:  We've found KenLM works really well with Joshua, but
> there
> >> is one issue with using it.  It requires many JNI calls during decoding
> >> and
> >> these calls impact GC performance.  In fact when a JNI call happens the
> GC
> >> throws out any work it may have done and quits until the JNI call
> >> completes.  The GC will then resume and start marking objects for
> >> collection from scratch.  This is not ideal especially for programs with
> >> large heaps (Joshua / Spark).  There's a couple ways we could mitigate
> >> this
> >> and I think they'd all speed up Joshua quite a lot.
> >>
> >> High level roadmap topics:
> >>
> >> *  Distributed Decoding is something I'll likely continue working on.
> >> There are some obvious things we can do given usage patterns of translation
> >> engines that can help us out here (I think).
> >> *  Providing a way to optimize Joshua for low-latency, low-throughput
> >> calls
> >> could be interesting for those with near real-time use cases.
> Providing a
> >> way to optimize for high-latency, high-throughput could be interesting
> for
> >> async/batch use cases.
> >> *  The machine learning optimization algorithms could be cleaned up a
> bit
> >> (MERT/MIRA).
> >> *  The Vocabulary could probably be replaced with a simpler
> implementation
> >> (without sacrificing performance).
> >>
> >> -Kellen
> >>
> >>
> >>
> >> On Thu, May 12, 2016 at 12:32 PM, Lewis John Mcgibbney <
> >> lewis.mcgibb...@gmail.com> wrote:
> >>
> >> > Hi Folks,
> >> > Kellen, Henri and I are going to get together tomorrow 13th around
> >> > lunchtime PST to talk everything Joshua.
> >> > Would be great to have others online via GChat if possible.
> >> > Let's say around 11am PST for the time being.
> >> > See you then folks.
> >> > Thanks
> >> > Lewis
> >> >
> >> >
> >> > --
> >> > *Lewis*
> >> >
> >>
> >
> >
>

Re: ApacheCon Meetup

2016-05-12 Thread kellen sunderland

Thanks for organizing Lewis,

Here are some topics for discussion that I've been noting while working with
Joshua.  None of these are high priority issues for me, but if we are all
in agreement on them it might make sense to log them.

Boring code convention stuff: logging with log4j, throwing RuntimeExceptions
instead of typed exceptions, removing all system exits (replacing them with
RuntimeExceptions), and refactoring some large files.

Testing: Integrate existing unit tests, provide some good test examples so
others can begin adding more tests.

Configuration: We also touched on IoC, CLI args, and configuration changes
that are possible.

OO stuff: Joshua is pretty good here, but I would personally prefer more
granular interfaces.  I wouldn't advocate radical changes, but maybe a
little refactoring might make sense to better align with the interface
segregation principle.
https://en.wikipedia.org/wiki/Interface_segregation_principle

JNI reliance:  We've found KenLM works really well with Joshua, but there
is one issue with using it.  It requires many JNI calls during decoding and
these calls impact GC performance.  In fact when a JNI call happens the GC
throws out any work it may have done and quits until the JNI call
completes.  The GC will then resume and start marking objects for
collection from scratch.  This is not ideal especially for programs with
large heaps (Joshua / Spark).  There are a couple of ways we could mitigate this
and I think they'd all speed up Joshua quite a lot.

High level roadmap topics:

*  Distributed Decoding is something I'll likely continue working on.
There are some obvious things we can do, given the usage patterns of translation
engines, that can help us out here (I think).
*  Providing a way to optimize Joshua for low-latency, low-throughput calls
could be interesting for those with near real-time use cases.  Providing a
way to optimize for high-latency, high-throughput could be interesting for
async/batch use cases.
*  The machine learning optimization algorithms could be cleaned up a bit
(MERT/MIRA).
*  The Vocabulary could probably be replaced with a simpler implementation
(without sacrificing performance).

-Kellen

On Thu, May 12, 2016 at 12:32 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
> Kellen, Henri and I are going to get together tomorrow 13th around
> lunchtime PST to talk everything Joshua.
> Would be great to have others online via GChat if possible.
> Let's say around 11am PST for the time being.
> See you then folks.
> Thanks
> Lewis
>
>
> --
> *Lewis*
>

[jira] [Commented] (JOSHUA-260) Integrate IoC (Inversion of Control) into Joshua

2016-05-02 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267687#comment-15267687
 ] 

Kellen Sunderland commented on JOSHUA-260:
--

This isn't the kind of change that can be made overnight, so don't worry about 
not looking into it by June.  It's a longer-term consideration, and I can 
try to sell you a bit more on it next week.

If we use Guice alone, the benefit is that all of our 
implementations will be configured and hooked up in a single class at launch 
time, based on our launch configuration.  We won't need branch points in 
the codebase to handle the different arguments that were passed in when the 
library was launched.  An example of code that could be simplified (in 
Decoder.java) would be:

  if (joshuaConfiguration.amortized_sorting) {
    Decoder.LOG(1, "Grammar sorting happening lazily on-demand.");
  } else {
    long pre_sort_time = System.currentTimeMillis();
    for (Grammar grammar : this.grammars) {
      grammar.sortGrammar(this.featureFunctions);
    }
    Decoder.LOG(1, String.format("Grammar sorting took %d seconds.",
        (System.currentTimeMillis() - pre_sort_time) / 1000));
  }

We could replace this kind of code with a subclass of Decoder that is used 
automatically when a configuration option is set (in this case, when the 
option amortized_sorting is false).  This would help keep a class like Decoder 
small: the logic is spread across various subclasses, and the correct subclass 
is chosen automatically at launch time.
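
As a minimal plain-Java sketch of that idea (deliberately without Guice, so it 
stays self-contained): GrammarSorter, EagerSorter, LazySorter, DecoderSketch 
and WiringSketch are hypothetical names, and only the amortized_sorting option 
comes from the snippet above.  The single wiring method plays roughly the role 
that the bindings of an IoC container would play:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical wiring class: the only place that branches on configuration.
class WiringSketch {
  static DecoderSketch decoderFor(boolean amortizedSorting) {
    GrammarSorter sorter = amortizedSorting ? new LazySorter() : new EagerSorter();
    return new DecoderSketch(sorter);
  }

  public static void main(String[] args) {
    List<String> grammars = new ArrayList<>(Arrays.asList("pt", "glue"));
    System.out.println(decoderFor(false).start(grammars));
    System.out.println(decoderFor(true).start(grammars));
  }
}

// Hypothetical strategy interface; not part of Joshua.
interface GrammarSorter {
  String prepare(List<String> grammars);
}

// Used when amortized_sorting is false: sort everything up front.
class EagerSorter implements GrammarSorter {
  public String prepare(List<String> grammars) {
    Collections.sort(grammars);
    return "sorted eagerly";
  }
}

// Used when amortized_sorting is true: defer sorting until first use.
class LazySorter implements GrammarSorter {
  public String prepare(List<String> grammars) {
    return "sorting lazily on demand";
  }
}

// The decoder no longer inspects the configuration itself.
class DecoderSketch {
  private final GrammarSorter sorter;

  DecoderSketch(GrammarSorter sorter) {
    this.sorter = sorter;
  }

  String start(List<String> grammars) {
    return sorter.prepare(grammars);
  }
}
```

The decoder receives its collaborator through the constructor, which is also 
what makes it easy to hand in a fake implementation from a unit test.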

So that's the benefit of just using Guice and doing some OO refactoring, but 
there are some nice libraries that will do some of the things you have on your 
wish-list.  I think we can use some combination of args4j and Typesafe Config 
to accomplish most of the functionality you want.  Args4j in particular will 
make it easy to generate documentation and help for any CLI arguments (it looks 
like this is already somewhat the case for the GrammarPacker).  Typesafe Config 
also allows you to override any configuration from the CLI as an argument.

We of course don't have to make these changes all at once.  We can gradually 
introduce Guice and Args4j and then consider how to update the config aspects 
of Joshua.


> Integrate IoC (Inversion of Control) into Joshua
> 
>
> Key: JOSHUA-260
> URL: https://issues.apache.org/jira/browse/JOSHUA-260
> Project: Joshua
>  Issue Type: Improvement
>Reporter: Kellen Sunderland
>
> I'd like to propose we look into using Guice 
> (https://github.com/google/guice) in conjunction with Joshua's configuration 
> system.  I believe it would give us a nice way to map what is in the 
> configuration to the code paths and implementations used within Joshua.  It 
> would also go a long way toward letting us integrate unit tests throughout 
> all the important classes in Joshua.  What does everyone think?  Would IoC be 
> a good pattern to adopt?  Is everyone ok with using Guice (versus, say, some 
> other IoC library)?




[jira] [Commented] (JOSHUA-172) Speed up grammar file reading with memory-mapped files

2016-05-02 Thread Kellen Sunderland (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267193#comment-15267193
 ] 

Kellen Sunderland commented on JOSHUA-172:
--

This ticket shouldn't be open, should it?  In the current source it seems that 
the grammar is being memory mapped.

> Speed up grammar file reading with memory-mapped files
> --
>
> Key: JOSHUA-172
> URL: https://issues.apache.org/jira/browse/JOSHUA-172
> Project: Joshua
>  Issue Type: Bug
>Reporter: Matt Post
> Fix For: 6.1
>
>
> This document should be helpful:
> http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly




[jira] [Created] (JOSHUA-259) Integration tests are failing

2016-05-02 Thread Kellen Sunderland (JIRA)

Kellen Sunderland created JOSHUA-259:


 Summary: Integration tests are failing
 Key: JOSHUA-259
 URL: https://issues.apache.org/jira/browse/JOSHUA-259
 Project: Joshua
  Issue Type: Bug
Reporter: Kellen Sunderland


Several integration tests are currently failing with Joshua.  I have a quick 
fix coming for one of the tests but just in case we need more discussion around 
the failures I'll open a bug.

The currently failing tests for me:
test/decoder/too-long
test/server/http
test/server/tcp-text
test/thrax/extraction

and 

test/decoder/moses-compat (but this is easy to fix, simple extra space in the 
expected file)

These are failing under OS X 10.11.  If they pass for you in another 
environment, feel free to post a 'works for me'.





Re: [GitHub] incubator-joshua pull request: More work on structuring translatio...

2016-04-28 Thread kellen sunderland

Hey Matt.  I'd suggest we hook up the unit test runs in Vancouver (only a
week away).  They should be runnable with junit.
On Apr 28, 2016 9:55 PM, "mjpost"  wrote:

> Github user mjpost commented on the pull request:
>
>
> https://github.com/apache/incubator-joshua/pull/9#issuecomment-215626822
>
> Okay, this is merged. This is nice work. A few notes:
>
> - Can you tell me how to run the unit tests? @KellenSunderland @fhieber
>
> - The phrase-based decoder can use the exact same improvements you
> pushed up, so I converted them over to this
>
> - include_align_index is no longer used anywhere. I may add that back
> to the phrase-based decoder if I need it, but am leaving it out for now (so
> it will really just do nothing if you ask for it)
>
>
>

Re: joshua_api

2016-04-27 Thread kellen sunderland

Hey Matt,

If you had time that would be fantastic.  I've created a new PR in case you
want to pull it in.  There are actually 4 tests failing for me currently
(casing issues are causing at least one).  If you want to wait until we fix
these tests, that's also completely fine.

-Kellen

On Wed, Apr 27, 2016 at 11:32 AM, Matt Post <p...@cs.jhu.edu> wrote:

> Do you want me to fix the recapitalization? Or are you going to do that? I
> looked a bit, and it seems I'll have to add a method to get a word
> alignment object instead of just the string, so that I can poke through
> them. This approach is as good as true-casing in some languages.
>
> A few other things:
>
> - I saw a comment in the commit about the changes not working for
> phrase-based translation. Can you (or Felix) elaborate? What exactly will
> no longer work?
>
> - Currently, there are multiple places where the "output-format" string
> has to get edited (KBestExtractor and in Translation). After you push your
> changes in, I'm going to make some edits so that this all occurs in one
> place.
>
> matt
>
>
> > On Apr 27, 2016, at 2:25 PM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> >
> > Thanks for taking a look Matt,
> >
> > I think this is all we've got planned as far as changes relating to an
> API
> > would go.  We have a few more commits coming but they're just performance
> > improvements and they don't change too much in the way of interfaces or
> > method signatures.
> >
> > -Kellen
> >
> > On Wed, Apr 27, 2016 at 4:47 AM, Matt Post <p...@cs.jhu.edu> wrote:
> >
> >> Kellen,
> >>
> >> Great. I had a chance to start looking over the ReworkedExtractions
> >> branch. I'll have some more time today. It looks good to me so far. Is
> >> there anything else you plan to do, or does that branch contain
> basically
> >> all of it (apart from the recapitalization fix, which I see should be
> >> applied more selectively, maybe only when a -recapitalize flag is
> present,
> >> to save on time).
> >>
> >> matt
> >>
> >>
> >>> On Apr 26, 2016, at 1:56 AM, kellen sunderland <
> >> kellen.sunderl...@gmail.com> wrote:
> >>>
> >>> Hey Matt,
> >>>
> >>> I've opened a new pull request with a few of our commits, feel free to
> >> take
> >>> a look when you have some time.
> >>>
> >>> More importantly I've pushed our queue of upcoming commits to the
> >> following
> >>> branch in my fork:
> >>>
> >>
> https://github.com/KellenSunderland/incubator-joshua/commits/ReworkedExtractions
> >>> .  From there you can get an idea for the work we've done so far.  I
> >>> haven't opened a PR yet for these commits because there's still some
> >>> merging I have to do (there's a few failing tests and I had to
> >> temporarily
> >>> comment out some of your casing code).  Once that's fixed I'll do a
> >> proper
> >>> PR for these commits.
> >>>
> >>> -Kellen
> >>>
> >>> On Mon, Apr 25, 2016 at 1:35 PM, Matt Post <p...@cs.jhu.edu> wrote:
> >>>
> >>>> Great. On that first point, I meant that translate() would return a
> >>>> Translation object, which would know its hypergraph and could iterate
> >> over
> >>>> a KBestExtractor. In any case, though, it sounds like you are a bit
> >> ahead
> >>>> of me on this, so I'll wait for a push that I can see, and then we can
> >>>> converge on the design.
> >>>>
> >>>> matt
> >>>>
> >>>>
> >>>>> On Apr 25, 2016, at 4:10 PM, Hieber, Felix <fhie...@amazon.de>
> wrote:
> >>>>>
> >>>>> Hi Matt,
> >>>>>
> >>>>> These are some nice suggestions. Most of the work we have done is in
> >>>> line of what you propose so I would agree with Kellen that we should
> >>>> synchronize and compare better earlier than later.
> >>>>>
> >>>>> Best,
> >>>>> Felix
> >>>>>
> >>>>>> On 25.04.2016, at 07:44, kellen sunderland <
> >> kellen.sunderl...@gmail.com>
> >>>> wrote:
> >>>>>>
> >>>>>> Hey Matt,
> >>>>>>
> >>>>>> Sorry for the late reply.  The Joshua-6 folder and tst may have jus

Re: joshua_api

2016-04-27 Thread kellen sunderland

Thanks for taking a look Matt,

I think this is all we've got planned as far as changes relating to an API
would go.  We have a few more commits coming but they're just performance
improvements and they don't change too much in the way of interfaces or
method signatures.

-Kellen

On Wed, Apr 27, 2016 at 4:47 AM, Matt Post <p...@cs.jhu.edu> wrote:

> Kellen,
>
> Great. I had a chance to start looking over the ReworkedExtractions
> branch. I'll have some more time today. It looks good to me so far. Is
> there anything else you plan to do, or does that branch contain basically
> all of it (apart from the recapitalization fix, which I see should be
> applied more selectively, maybe only when a -recapitalize flag is present,
> to save on time).
>
> matt
>
>
> > On Apr 26, 2016, at 1:56 AM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> >
> > Hey Matt,
> >
> > I've opened a new pull request with a few of our commits, feel free to
> take
> > a look when you have some time.
> >
> > More importantly I've pushed our queue of upcoming commits to the
> following
> > branch in my fork:
> >
> https://github.com/KellenSunderland/incubator-joshua/commits/ReworkedExtractions
> > .  From there you can get an idea for the work we've done so far.  I
> > haven't opened a PR yet for these commits because there's still some
> > merging I have to do (there's a few failing tests and I had to
> temporarily
> > comment out some of your casing code).  Once that's fixed I'll do a
> proper
> > PR for these commits.
> >
> > -Kellen
> >
> > On Mon, Apr 25, 2016 at 1:35 PM, Matt Post <p...@cs.jhu.edu> wrote:
> >
> >> Great. On that first point, I meant that translate() would return a
> >> Translation object, which would know its hypergraph and could iterate
> over
> >> a KBestExtractor. In any case, though, it sounds like you are a bit
> ahead
> >> of me on this, so I'll wait for a push that I can see, and then we can
> >> converge on the design.
> >>
> >> matt
> >>
> >>
> >>> On Apr 25, 2016, at 4:10 PM, Hieber, Felix <fhie...@amazon.de> wrote:
> >>>
> >>> Hi Matt,
> >>>
> >>> These are some nice suggestions. Most of the work we have done is in
> >> line of what you propose so I would agree with Kellen that we should
> >> synchronize and compare better earlier than later.
> >>>
> >>> Best,
> >>> Felix
> >>>
> >>>> On 25.04.2016, at 07:44, kellen sunderland <
> kellen.sunderl...@gmail.com>
> >> wrote:
> >>>>
> >>>> Hey Matt,
> >>>>
> >>>> Sorry for the late reply.  The Joshua-6 folder and tst may have just
> >> been
> >>>> artifacts of some symlinks I have locally.  Sorry they may have been
> >> pushed
> >>>> by mistake, I can clean that up.
> >>>>
> >>>> Good idea to have the api code in a separate branch.  We can merge the
> >> work
> >>>> that we've done some time next week.
> >>>>
> >>>> KBestExtractor is one of the things we want to return via the API.  We
> >>>> already have some of this implemented though as you suggest.  I'll try
> >> and
> >>>> push the remaining work we've done into my github branch so you can
> >> compare.
> >>>>
> >>>> -Kellen
> >>>>
> >>>>> On Mon, Apr 25, 2016 at 6:11 AM, Matt Post <p...@cs.jhu.edu> wrote:
> >>>>>
> >>>>> Okay, after looking at this a bit more, I have a better
> understanding,
> >> and
> >>>>> an idea for how to move forward.
> >>>>>
> >>>>> First, I see that Translation.java has provisions for structured
> >> output.
> >>>>> I'm guessing StructuredTranslation was added by mistake?
> >>>>>
> >>>>> Moving forward, on the joshua_api branch, I was thinking of the
> >> following,
> >>>>> but want to make sure it doesn't collide with what you've done or are
> >> doing:
> >>>>>
> >>>>> - Factor KBestExtractor to return Translation objects instead of
> >> printing,
> >>>>> and also turn it into an iterator
> >>>>>
> >>>>> - There's a real discrepancy with competing forest representations.
> >> There
> >>>>> are operatio

Re: joshua_api

2016-04-25 Thread kellen sunderland

Hey Matt,

Sorry for the late reply.  The Joshua-6 folder and tst may have just been
artifacts of some symlinks I have locally.  Sorry, they may have been pushed
by mistake; I can clean that up.

Good idea to have the API code in a separate branch.  We can merge the work
that we've done some time next week.

KBestExtractor is one of the things we want to return via the API.  We
already have some of this implemented, though, as you suggest.  I'll try to
push the remaining work we've done into my GitHub branch so you can compare.

-Kellen

On Mon, Apr 25, 2016 at 6:11 AM, Matt Post  wrote:

> Okay, after looking at this a bit more, I have a better understanding, and
> an idea for how to move forward.
>
> First, I see that Translation.java has provisions for structured output.
> I'm guessing StructuredTranslation was added by mistake?
>
> Moving forward, on the joshua_api branch, I was thinking of the following,
> but want to make sure it doesn't collide with what you've done or are doing:
>
> - Factor KBestExtractor to return Translation objects instead of printing,
> and also turn it into an iterator
>
> - There's a real discrepancy with competing forest representations. There
> are operations on the hypergraph (via WalkerFunction), and then also
> operations on Derivations. This leads to code that operates on both. It
> would be nice if the KBestExtractor just returned something like a reduced
> "slice" of a forest, with new nodes containing only single back pointers,
> representing exactly the nth-best derivation. Then we could generically use
> the WalkerFunctions on that (e.g., viterbi extraction), and get rid of many
> of the DerivationVisitor classes
>
> - Related: constructing the k-best list is expensive, even for just the
> first item, since you have to set up all the candidate lists and so on.
> This led to me implementing top-n = 0, where you can get the translation
> and some limited information (not replayed features) via Viterbi extractors
> on the hypergraph, and you only have to call KBestExtractor if you actually
> want k-best lists. This leads to dual code, e.g., substitutions of
> output_format in multiple places. The first item the KBestIterator returns
> should be constructed more efficiently, on the assumption that the caller
> might not ask for more items. The StructuredTranslation object already is
> lazy about returning things that are asked for (e.g., it will only replay
> features if you ask for the feature functions).
>
> I will probably implement most of these tonight and tomorrow unless there
> are objections from anyone (including an objection asking for more time to
> evaluate!)
>
> matt
>
>
> > On Apr 23, 2016, at 7:22 PM, Matt Post  wrote:
> >
> > Hi,
> >
> > Kellen suggested we create a Joshua API, which I think is an excellent
> idea. I've just made a start at this. It is not done and needs more work,
> but I know that the Amazon folks have done some things on the backend, and
> I wanted to make sure not to duplicate any work they might have done. Also,
> it's something we should discuss.
> >
> > First, I was a bit confused about the joshua-6 subdirectory, and the
> files there (also, what is tst/? Both of these were from a recent commit).
> I moved those over and then things didn't compile. I got things compiling
> and then made a few changes to StructuredTranslation.
> >
> > The biggest change I hope doesn't create problems is that I simplified
> StructuredTranslation to no longer contain the Hypergraph object; instead,
> it contains a DerivationState object. This represents a particular k-best
> derivation, using Huang & Chiang (2005)-style ranked back pointers. The
> nice thing is that you can simply define a DerivationVisitor class and
> pass it to DerivationState::visit, and it will see every node in a
> particular derivation.
> >
> > This is distinct from WalkerFunction, which walks an entire *HyperGraph*.
> >
> > Let me know what you guys think about these changes, and maybe we can
> spec out the API, and then clean things up inside a bit to use it (there's
> no reason to be passing output stream writers to KBestExtractor, for
> example...).
> >
> > matt
> >
> >
> >
> >> Begin forwarded message:
> >>
> >> From: mjp...@apache.org
> >> Subject: incubator-joshua git commit: Simplified StructuredTranslation
> to use derivations instead of hypergraphs, now using in KBestExtractor
> >> Date: April 23, 2016 at 7:12:19 PM EDT
> >> To: comm...@joshua.incubator.apache.org
> >> Reply-To: dev@joshua.incubator.apache.org
> >>
> >> Repository: incubator-joshua
> >> Updated Branches:
> >> refs/heads/joshua_api [created] 824319561
> >>
> >>
> >> Simplified StructuredTranslation to use derivations instead of
> hypergraphs, now using in KBestExtractor
> >>
> >> The StructuredTranslation object is a great idea. I rewrote it here to
> do the following:
> >>
> >> - It now compiles. I'm not sure why it was tucked under
> $JOSHUA/joshua-6, but I just

Re: programmatic API usage?

2016-04-15 Thread kellen sunderland

Yes we’re using Joshua as a library and will be sharing some more code soon
(hopefully tomorrow) that should help a little for this.  I think it might
make sense to come up with a simple v0.1 API that we could start using to
call Joshua in an interoperable manner.


A simple API could look like this:


StructuredTranslation translate(String source);


Where StructuredTranslation contains the translated string, but also some
information about how it got the translations (n-best lists, etc).
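
To make that shape a bit more concrete, here is a minimal sketch.  Only the 
translate signature and the name StructuredTranslation come from the proposal 
above; the Translator interface, the two fields, and the toy upper-casing 
"decoder" are placeholders for illustration and not Joshua's actual classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical entry point with a placeholder decoder.
class ApiSketch {
  // Toy stand-in for a real decoder: "translates" by upper-casing the input.
  static Translator toyTranslator() {
    return source -> {
      String best = source.toUpperCase(Locale.ROOT);
      return new StructuredTranslation(best, Arrays.asList(best, source));
    };
  }

  public static void main(String[] args) {
    StructuredTranslation t = toyTranslator().translate("hallo welt");
    System.out.println(t.getTranslation());
    System.out.println(t.getNBest());
  }
}

// The proposed v0.1 call, wrapped in a hypothetical interface.
interface Translator {
  StructuredTranslation translate(String source);
}

// Placeholder result type: the translated string plus an n-best list.
final class StructuredTranslation {
  private final String translation;
  private final List<String> nBest;

  StructuredTranslation(String translation, List<String> nBest) {
    this.translation = translation;
    // Defensive copy so callers cannot modify the result afterwards.
    this.nBest = Collections.unmodifiableList(new ArrayList<>(nBest));
  }

  String getTranslation() {
    return translation;
  }

  List<String> getNBest() {
    return nBest;
  }
}
```

Returning an immutable result object keeps the call side simple and leaves 
room to add further fields later without changing the translate signature.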


-Kellen




On 4/12/16, 5:09 PM, "Matt Post"  wrote:


Hi Tommaso,


There isn't really, unfortunately. I have never used Joshua as a library;
it would be nice if the Amazon folks (who I infer have done so, from a
comment on their last commit) would contribute a doc on this front.


What is the preferred avenue for developer documentation? Javadocs, or
something else?


matt



On Apr 12, 2016, at 6:09 AM, Tommaso Teofili 
wrote:


Hi all,


I am going through the code (so I'll probably figure it out at some point),
however I wonder if there's a quick guide on how to start using Joshua
programmatically, as I am starting to look at how it could be integrated
into other projects.


Regards,

Tommaso

Apologies for the late replies

2016-04-15 Thread kellen sunderland

Looks like my emails from amazon.de are getting filtered (are they stuck in
a moderation queue?), so sorry for the late replies everyone.

First of all thanks to everyone for inviting me as a committer.  Great to
be working on an interesting project.  Some quick background about me: I've
been a developer for 10 years, currently working at Amazon full time on a
machine translation project involving Joshua.  I'm originally from Western
Canada (now living in Berlin) so I can help point people to good pubs for
ApacheCon.

-Kellen

Re: ApacheCon 2016 and Joshua

2016-03-22 Thread kellen sunderland

My use case for Joshua involves creating internal scalable web services to
translate text across several language arcs.  Most of the code changes I
(and others on my team) aim to contribute focus on Joshua stability and
performance.  So far my work has mostly been about speeding up decoding
and training (I hope to have some significant patches incoming around
the time of the Con).

I'll go ahead and book flights to Vancouver as well.  Hope to see most of
you there.

-Kellen


I'm going to look into making a short trip. I think I'd arrive on the
11th and then leave on
Friday the 13th. Could we plan a meet up for the night of Thursday the
12th? It'd be great
to meet everyone (and having a deadline would help me prioritize :) )

matt


> On Mar 15, 2016, at 12:24 PM, Lewis John Mcgibbney 
wrote:
>
> Hi Matt,
>
> On Mon, Mar 14, 2016 at 8:26 AM, Matt Post  wrote:
>
>> Whoa! Lewis, can you give some more detail on this talk, what you
>> proposed, and what you plan to talk about?
>>
>
> http://sched.co/6OJI
>
>
>>
>> I haven't ever been to ApacheCon, but am interested in going. I don't have
>> much of a feel for what motivates folks outside the academic research
>> community, and that would be good to have in laying out projects that might
>> interest people.
>>
>
> I agree. Would be great to meet you there. We could have a Joshua meetup.
>
>
>>
>> Regarding those project, I have a number of them. Perhaps it would be
>> useful to flesh them out with some more detail, and perhaps post them, for
>> those who are interested. First, with respect to Tommaso's question, the
>> following:
>>
>> - Use cases. I'd really like to push machine translation as a black box,
>> where people can download and use models, not caring how they work, and
>> building on top of them. I think this could be transformative. I've just
>> added to Joshua the ability to add, store, and manage custom phrasal
>> translation rules, which would let people take a model and add their own
>> translations on top of it, perhaps correcting mistakes as they encounter
>> them. There's a JSON API for it (undocumented).
>>
>> Building this up would also require pulling together lots of different
>> test sets, evaluating changes, and so on.
>>
>> - Neural nets. This is a huge research area. I think the advantages are
>> that it could enable releasing models that are much smaller. However, on
>> the down side, it's not clear what the best way to integrate these models
>> into Joshua is. Fully neural attention models would require re-architecting
>> Joshua, as they are essentially a new paradigm. Adding neural components as
>> feature functions that interact with the existing decoding algorithm would
>> be an intermediate step.
>>
>
> OK. This sounds like bang on for a meet up topic. Regardless of who is
> there, we could have a Webex or something similar for the incubating
> community,
>
>
>>
>> For other projects, I'd love:
>>
>> - Better documentation, developer and end-user (probably I need to write a
>> lot of this; if nothing else, it would be hugely useful to me in terms of
>> prioritizing to know that people want it)
>
>
>> - Rewriting certain components. The tuning modules, in particular, are a
>> real mess, and should be synthesized and improved.
>>
>> - Replacing Moses components. Joshua can call out to Moses to build phrase
>> tables; it would be nice to get rid of this (and wouldn't be that hard)
>> with our own Java implementations. It would also be good to add a
>> lexicalized distortion model to the phrase-based decoder.
>>
>>
> These all sound excellent and would all make very reasonable GSoC projects,
> Thanks
> Lewis
