Re: modernmt

2017-07-02 Thread John Hewitt
I found the reference for that 1,000,000 number a bit too late -- according
to this more recent paper from Koehn, it's more like 15,000,000 tokens for
NMT to match phrase-based MT, and the paper omits syntax-based MT.

https://arxiv.org/pdf/1706.03872.pdf

-John

On Sun, Jul 2, 2017 at 12:38 PM, John Hewitt <john...@seas.upenn.edu> wrote:

> I've talked with the ModernMT people; they're well aware that they're in a
> neural MT world, and they also know that there's a sizable market for
> non-neural MT solutions.
> To back this up -- Philipp Koehn gave a talk in March on comparing
> phrase-based, syntax-based, and neural MT in low-resource settings, that
> is, when the amount of bilingual text to train on is small.
>
> Neural MT needs (if I remember correctly) about 1,000,000 tokens of
> training data to outpace syntax-based MT.
> Many language pairs (and, for that matter, domains within a single
> language pair) do not meet that requirement, and in those cases
> syntax-based MT performs best.
>
> That being said, there are some cool opportunities to combine neural and
> syntax-based MT. I can't commit the work hours right now, necessarily, but
> I've worked with xnmt <https://github.com/neulab/xnmt>, an MIT-licensed
> neural MT library that is purpose-built to be highly modular. It may offer
> some good opportunities to make an ensemble system.
>
> On Sun, Jul 2, 2017 at 4:22 AM, Tommaso Teofili <tommaso.teof...@gmail.com
> > wrote:
>
>> I think it's interesting as it extends some features that also Joshua has,
>> it's open source and has good results compared with NMT.
>>
>> Tommaso
>>
>> Il giorno sab 1 lug 2017 alle ore 18:56 Suneel Marthi <
>> suneel.mar...@gmail.com> ha scritto:
>>
>> > Is this the latest/greatest paper around MT @tommaso ??
>> >
>> > On Sat, Jul 1, 2017 at 7:55 AM, Tommaso Teofili <
>> tommaso.teof...@gmail.com
>> > >
>> > wrote:
>> >
>> > > I accidentally found the paper about mmt [1]
>> > >
>> > > [1] :
>> > > https://ufal.mff.cuni.cz/eamt2017/user-project-product-papers/papers/user/EAMT2017_paper_88.pdf
>> > >
>> > > Il giorno gio 1 dic 2016 alle ore 22:19 Mattmann, Chris A (3010) <
>> > > chris.a.mattm...@jpl.nasa.gov> ha scritto:
>> > >
>> > > > Guys I want to point you at the DARPA D3M program:
>> > > >
>> > > > http://www.darpa.mil/program/data-driven-discovery-of-models
>> > > >
>> > > > I’m part of the Government Team for the program. This will be a good
>> > > > connection
>> > > > to have b/c it’s focused on automatically doing model and code
>> building
>> > > > for ML based
>> > > > approaches.
>> > > >
>> > > >
>> > > > ++
>> > > > Chris Mattmann, Ph.D.
>> > > > Principal Data Scientist, Engineering Administrative Office (3010)
>> > > > Manager, Open Source Projects Formulation and Development Office
>> (8212)
>> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> > > > Office: 180-503E, Mailstop: 180-503
>> > > > Email: chris.a.mattm...@nasa.gov
>> > > > WWW:  http://sunset.usc.edu/~mattmann/
>> > > > ++
>> > > > Director, Information Retrieval and Data Science Group (IRDS)
>> > > > Adjunct Associate Professor, Computer Science Department
>> > > > University of Southern California, Los Angeles, CA 90089 USA
>> > > > WWW: http://irds.usc.edu/
>> > > > ++
>> > > >
>> > > >
>> > > > On 12/1/16, 1:15 PM, "Matt Post" <p...@cs.jhu.edu> wrote:
>> > > >
>> > > > John,
>> > > >
>> > > > Thanks for sharing, this is really helpful. I didn't realize
>> that
>> > > > Marcello was involved.
>> > > >
>> > > > I think we can identify with the NMT danger. I still think there
>> > is a
>> > > > big niche that deep learning approaches won't reach for a few years,
>> > > until
>> > > > GPUs become super prevalent. Which is why I like ModernMT's
>> approaches,
>> > > > which overlap with many of the things I've been thinking. One thing
>> I

Re: modernmt

2017-07-02 Thread John Hewitt
I've talked with the ModernMT people; they're well aware that they're in a
neural MT world, and they also know that there's a sizable market for
non-neural MT solutions.
To back this up -- Philipp Koehn gave a talk in March on comparing
phrase-based, syntax-based, and neural MT in low-resource settings, that
is, when the amount of bilingual text to train on is small.

Neural MT needs (if I remember correctly) about 1,000,000 tokens of
training data to outpace syntax-based MT.
Many language pairs (and, for that matter, domains within a single language
pair) do not meet that requirement, and in those cases syntax-based MT
performs best.

That being said, there are some cool opportunities to combine neural and
syntax-based MT. I can't commit the work hours right now, necessarily, but
I've worked with xnmt <https://github.com/neulab/xnmt>, an MIT-licensed
neural MT library that is purpose-built to be highly modular. It may offer
some good opportunities to make an ensemble system.

On Sun, Jul 2, 2017 at 4:22 AM, Tommaso Teofili <tommaso.teof...@gmail.com>
wrote:

> I think it's interesting as it extends some features that also Joshua has,
> it's open source and has good results compared with NMT.
>
> Tommaso
>
> Il giorno sab 1 lug 2017 alle ore 18:56 Suneel Marthi <
> suneel.mar...@gmail.com> ha scritto:
>
> > Is this the latest/greatest paper around MT @tommaso ??
> >
> > On Sat, Jul 1, 2017 at 7:55 AM, Tommaso Teofili <
> tommaso.teof...@gmail.com
> > >
> > wrote:
> >
> > > I accidentally found the paper about mmt [1]
> > >
> > > [1] :
> > > https://ufal.mff.cuni.cz/eamt2017/user-project-product-papers/papers/user/EAMT2017_paper_88.pdf
> > >
> > > Il giorno gio 1 dic 2016 alle ore 22:19 Mattmann, Chris A (3010) <
> > > chris.a.mattm...@jpl.nasa.gov> ha scritto:
> > >
> > > > Guys I want to point you at the DARPA D3M program:
> > > >
> > > > http://www.darpa.mil/program/data-driven-discovery-of-models
> > > >
> > > > I’m part of the Government Team for the program. This will be a good
> > > > connection
> > > > to have b/c it’s focused on automatically doing model and code
> building
> > > > for ML based
> > > > approaches.
> > > >
> > > >
> > > >
> > > >
> > > > On 12/1/16, 1:15 PM, "Matt Post" <p...@cs.jhu.edu> wrote:
> > > >
> > > > John,
> > > >
> > > > Thanks for sharing, this is really helpful. I didn't realize that
> > > > Marcello was involved.
> > > >
> > > > I think we can identify with the NMT danger. I still think there
> > is a
> > > > big niche that deep learning approaches won't reach for a few years,
> > > until
> > > > GPUs become super prevalent. Which is why I like ModernMT's
> approaches,
> > > > which overlap with many of the things I've been thinking. One thing I
> > > > > really like is their automatic context-switching approach. This is a
> > > great
> > > > way to build general-purpose models, and I'd like to mimic it. I have
> > > some
> > > > general ideas about how this should be implemented but am also
> looking
> > > into
> > > > the literature here.
> > > >
> > > > matt
> > > >
> > > >
> > > > > On Dec 1, 2016, at 1:46 PM, John Hewitt <
> john...@seas.upenn.edu>
> > > > wrote:
> > > > >
> > > > > I had a few good conversations over dinner with this team at
> AMT

Re: [ANNOUNCE] - Apache Joshua 6.1 incubating release

2017-06-22 Thread John Hewitt
Related note: I've begun to announce to the Penn NLP communities; I can
talk to Mark Liberman at the LDC about getting a note in there as well.

-John

On Thu, Jun 22, 2017 at 10:11 AM, lewis john mcgibbney 
wrote:

> Hi Tommaso,
> EXCELLENT :)
> @Matt are you able to Tweet this out and make some tags?
> @Tommaso, where else did you announce this? Is it possible for us to make
> some more noise on various other communication forums/channels?
> This is brilliant news. Thank you Tommaso for being persistent with the
> release process, I am glad that we were able to recover the artifacts.
> Lewis
>
> On Thu, Jun 22, 2017 at 5:55 AM, <
> dev-digest-h...@joshua.incubator.apache.org> wrote:
>
> >
> > From: Tommaso Teofili 
> > To: annou...@apache.org
> > Cc: "dev@joshua.incubator.apache.org" 
> > Bcc:
> > Date: Thu, 22 Jun 2017 12:54:49 +
> > Subject: [ANNOUNCE] - Apache Joshua 6.1 incubating release
> > Hi Folks,
> >
> > The Apache Joshua team (PPMC) is pleased to announce the immediate
> > availability of Apache Joshua 6.1 (incubating).
> >
> > Apache Joshua is a statistical machine translation decoder for
> > phrase-based, hierarchical, and syntax-based machine translation, written
> > in Java.
> >
> > Apache Joshua is released as both source code, downloads for which can be
> > found at ASF dist download site [0] as well as Maven artifacts which can
> be
> > found on Maven central [1].
> >
> > The full Jira release report can be found here [3].
> >
> > Thank you,
> > Tommaso (on behalf of Apache Joshua PPMC)
> >
> > — DISCLAIMER Apache Joshua is an effort undergoing incubation at The
> Apache
> > Software Foundation (ASF), sponsored by the Apache Incubator PMC.
> > Incubation is required of all newly accepted projects until a further
> > review indicates that the infrastructure, communications, and decision
> > making process have stabilized in a manner consistent with other
> successful
> > ASF projects. While incubation status is not necessarily a reflection of
> > the completeness or stability of the code, it does indicate that the
> project
> > has yet to be fully endorsed by the ASF.
> >
> > [0] http://apache.org/dist/incubator/joshua/6.1/
> > [1] http://search.maven.org/#search|ga|1|g%3A%22org.apache.joshua%22
> > [3]
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> > projectId=12319720=12335049
> >
> >
>
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>


Re: modernmt

2016-12-01 Thread John Hewitt
I had a few good conversations over dinner with this team at AMTA in Austin
in October.
They seem to be in the interesting position where their work is good, but
is in danger of being superseded by neural MT as they come out of the gate.
Clearly, it has benefits over NMT, and is easier to adopt, but may not be
the winner over the long run.

Here's the link to their AMTA tutorial.

-John

On Thu, Dec 1, 2016 at 10:17 AM, Mattmann, Chris A (3010) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Wow seems like this kind of overlaps with BigTranslate as well.. thanks
> for passing
> along Matt
>
> ++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
> On 12/1/16, 4:47 AM, "Matt Post"  wrote:
>
> Just came across this, and it's really cool:
>
> https://github.com/ModernMT/MMT
>
> See the README for some great use cases. I'm surprised I'd never heard
> of this before as it's EU funded and associated with U Edinburgh.
>
>


Re: Any symal experts?

2016-11-23 Thread John Hewitt
Porting atools will be a headache because it also has no documentation, but
to be fair it may be less of a headache, and a better long-term solution,
than trying to move forward with the hackier symal approach.

I'll keep the symal use on the backburner and start putting together an
atools port.

-John

On Wed, Nov 23, 2016 at 12:18 PM, Matt Post <p...@cs.jhu.edu> wrote:

> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
> indeed replaced them with "atools"; how much work would it be to port that?
>
>
> > On Nov 23, 2016, at 12:11 PM, John Hewitt <john...@seas.upenn.edu>
> wrote:
> >
> > Hey everyone,
> >
> > I'm packaging up a Java port of Fast Align for Joshua and integrating it
> into
> > the pipeline.
> > Fast Align does not produce symmetrical alignments -- it relies on a tool
> > that I haven't ported to Java.
> > We package symal (which symmetricizes alignments) with Joshua right now
> for
> > GIZA++, so I'm attempting to re-use that.
> > However, symal uses the .bal format, which it fails to describe.
> > It gets away with this because files from GIZA++ are piped through
> > giza2bal.pl, which itself is not well documented.
> > I'm attempting to write, say, fastalign2bal.py.
> > With a bit of tinkering, I got at the .bal format:
> >
> > 1
> >
> > 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
> >
> > 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
> >
> > A template for which would be
> >
> > 1
> >
> > NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
> > alignment2 ... alignmentN]
> > NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
> > alignment2 ... alignmentN]
> >
> >
> > However, I'm hitting some pretty nasty errors with symal when I pipe in
> > some fastalign2bal.py output.
> > A few hours with gdb made some progress (as far as I can tell, the
> > formats are identical) but if anyone has experience with symal, I would
> > greatly appreciate some consultation.
> >
> > -John
>
>


Any symal experts?

2016-11-23 Thread John Hewitt
Hey everyone,

I'm packaging up a Java port of Fast Align for Joshua and integrating it into
the pipeline.
Fast Align does not produce symmetrical alignments -- it relies on a tool
that I haven't ported to Java.
We package symal (which symmetricizes alignments) with Joshua right now for
GIZA++, so I'm attempting to re-use that.
However, symal uses the .bal format, which it fails to describe.
It gets away with this because files from GIZA++ are piped through
giza2bal.pl, which itself is not well documented.
I'm attempting to write, say, fastalign2bal.py.
With a bit of tinkering, I got at the .bal format:

1

7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8

8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7

A template for which would be

1

NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
alignment2 ... alignmentN]
NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
alignment2 ... alignmentN]


However, I'm hitting some pretty nasty errors with symal when I pipe in
some fastalign2bal.py output.
A few hours with gdb made some progress (as far as I can tell, the
formats are identical) but if anyone has experience with symal, I would
greatly appreciate some consultation.

-John
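
For illustration, the template above could be produced by something like the
following. This is only a sketch of what a fastalign2bal.py might contain, not
the script from this thread. It assumes that fast_align's "i-j" links are
0-based, that .bal positions are 1-based (which matches the example), that 0
marks an unaligned token, and that the blank lines in the example are mail
formatting rather than part of the format. The function names are hypothetical.

```python
def links_to_positions(links, length, side):
    """Turn "i-j" links (0-based) into one 1-based position per token.

    side=0 indexes by the left number of each link, side=1 by the right one.
    Tokens with no link get 0 (assumed here to mean "unaligned").
    If a token has several links, the first one wins.
    """
    positions = [0] * length
    for link in links.split():
        left, right = (int(x) for x in link.split("-"))
        own, other = (left, right) if side == 0 else (right, left)
        if positions[own] == 0:
            positions[own] = other + 1
    return positions


def format_bal_record(tgt_tokens, tgt_align, src_tokens, src_align):
    """Render one sentence pair in the three-line layout from the template."""
    def line(tokens, align):
        # Two spaces before '#' are copied from the example in the message.
        return "{} {}  # {}".format(
            len(tokens), " ".join(tokens), " ".join(str(a) for a in align)
        )

    return "1\n{}\n{}\n".format(
        line(tgt_tokens, tgt_align), line(src_tokens, src_align)
    )
```

Since each line carries exactly one alignment per token, filling both lines
would need fast_align run once in each direction, with links_to_positions
applied to each output.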


Re: Updating Incubator summary

2016-11-17 Thread John Hewitt
@Matt, that sounds like an interesting goal. What's the hook?

@Henri, that sounds good. I like the idea of showing people snippets, as MT
isn't necessarily intuitive to the average Linux.com reader.

On Thu, Nov 17, 2016 at 5:44 AM, Matt Post  wrote:

> My thinking on that roadmap was a comment Lewis made a while ago about
> incubator graduation being judged by the number of releases. If you think
> we can get out sooner, then I'm all for it! Maybe we can get the docker
> containers out and then push for it after that?
>
> I like your idea about a more concerted advertising effort. We could also
> try to pull together a demo paper for ACL, which is
> due in February. I think I might have a hook that would appeal to reviewers
> there.
>
>
> > On Nov 17, 2016, at 2:12 AM, Henri Yandell  wrote:
> >
> > Sounds good :)
> >
> > My basic mantra is 'get the summary page all signed off, then start
> asking
> > "when graduate?"'. Projects can tend to linger in the Incubator awaiting
> > perfection.
> >
> > I wonder how you could take the 3rd item (Linux.com article) and make
> that
> > bigger. Perhaps encourage every committer to write a blog post so you end
> > up with the article as an intro, and then each committer's blog entry or
> > website hosted article as a personal "how I got into this" or "what I
> work
> > on" or "a commit I recently did, a commit I keep meaning to getting
> around
> > to working on". Random thought :)
> >
> > Hen
> >
> > On Tue, Nov 15, 2016 at 11:09 AM, Matt Post  wrote:
> >
> >> We're still waiting on our first software release, so it seems to me a
> bit
> >> premature to graduate? Though I don't know how these decisions are made
> —
> >> what goes into it?
> >>
> >> Here is the roadmap that I have in mind:
> >>
> >> - 6.1 release (imminent)
> >> - Large-scale release of language packs (imminent)
> >> - Linux.com article introducing people to MT, Joshua, language packs,
> and
> >> adding custom rules
> >> - Release of docker-based language packs (including KenLM)
> >> - 7.0 release (spring)
> >> - Graduate
> >>
> >> If we keep that rough schedule, we'll have incubated a year and have a
> lot
> >> to show for it.
> >>
> >> matt
> >>
> >>
> >>> On Nov 15, 2016, at 12:13 PM, Henri Yandell  wrote:
> >>>
> >>> Thanks :)
> >>>
> >>> Reason for asking being that it felt that the standard checklist things
> >>> were complete and I was wondering what the path to graduation is?
> >>>
> >>> Any reason not to start thinking about a vote?
> >>>
> >>> On Tue, Nov 15, 2016 at 04:02 Matt Post  wrote:
> >>>
>  Thanks, Lewis, and Henri, for pointing this out.
> 
> 
> > On Nov 15, 2016, at 1:18 AM, lewis john mcgibbney <
> lewi...@apache.org>
>  wrote:
> >
> > Hi Henri,
> > I just pushed the update to SVN. Should update asynch reasonably
> soon.
> >
> > http://incubator.apache.org/projects/joshua.html
> >
> > Thanks
> >
> > On Sun, Nov 13, 2016 at 1:22 PM, <
> > dev-digest-h...@joshua.incubator.apache.org> wrote:
> >
> >>
> >> From: Henri Yandell 
> >> To: dev@joshua.incubator.apache.org
> >> Cc:
> >> Date: Sun, 13 Nov 2016 01:17:57 -0800
> >> Subject: Updating Incubator summary
> >> Would be useful to update this page:
> >>
> >> http://incubator.apache.org/projects/joshua.html
> >>
> >>
> >> Are there any of the checklist items that are still open?
> >>
> >>
> > As far as I am aware no :)
> 
> 
> >>
> >>
>
>


Re: [VOTE] Release Apache Joshua (Incubating) 6.1

2016-11-14 Thread John Hewitt
+1 Let's do it.

-John

On Mon, Nov 14, 2016 at 1:13 PM, kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> +1 .  Thanks to Lewis and Matt for all the recent work.
>
> On Nov 14, 2016 7:11 PM, "Matt Post"  wrote:
>
> +1
>
> Thanks for starting this off, Lewis!
>
>
> > On Nov 14, 2016, at 12:54 PM, Ramirez, Paul M (398M) <
> paul.m.rami...@jpl.nasa.gov> wrote:
> >
> > +1, let's get it released!!!
> >
> > --Paul
> >
> > ==
> > Paul Ramirez - Group Supervisor
> > Computer Science for Data Intensive Applications (398M)
> > NASA - Jet Propulsion Laboratory
> > 4800 Oak Grove Dr.
> > Pasadena, CA 91109 USA
> > Mailstop: 158-242
> > Office: 818-354-1015
> > Cell: 818-395-8194
> > ==
> >
> > On 11/14/16, 9:16 AM, "lewis john mcgibbney"  wrote:
> >
> >Hi Folks,
> >Please VOTE on the Apache Joshua 6.1 Release Candidate #1.
> >
> >We solved 44 issues: https://s.apache.org/joshua6.1
> >
> >Git source tag (167489bbd78526b9833fe7c88646bf96101d5d2b):
> >https://s.apache.org/joshua6.1tag
> >
> >Staging repo:
> >https://repository.apache.org/content/repositories/orgapachejoshua-1000/
> >
> >Source Release Artifacts:
> >https://dist.apache.org/repos/dist/dev/incubator/joshua/
> >
> >PGP release keys (signed using 48BAEBF6):
> >https://dist.apache.org/repos/dist/release/incubator/joshua/KEYS
> >
> >Vote will be open for 72 hours.
> >Thank you to everyone that is able to VOTE as well as everyone that
> >contributed to Apache Joshua 6.1.
> >
> >[ ] +1, let's get it released!!!
> >[ ] +/-0, fine, but consider to fix few issues before...
> >[ ] -1, nope, because... (and please explain why)
> >
> >P.S. here is my +1
> >
> >--
> >http://home.apache.org/~lewismc/
> >@hectorMcSpector
> >http://www.linkedin.com/in/lmcgibbney
> >
> >
>


Re: Pipeline Mystery

2016-10-26 Thread John Hewitt
It seems like MERT isn't writing its final config file (which is typical
of MERT, in my experience). I recall giving up and using kbmira. This final
config file is the one used in test, so I can see why skipping to test ends
up failing pretty quickly.

To answer your question, though, I haven't tried. Not in my bandwidth right
now.

-John
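
A quick way to confirm this before jumping to the test step is to check that
tuning actually left its output behind. The sketch below is a hypothetical
helper, not part of pipeline.pl; the only path it knows about is
tune/joshua.config.final, taken from the log quoted below.

```python
import os


def missing_tuner_outputs(rundir, expected=("tune/joshua.config.final",)):
    """Return the expected tuning outputs that do not exist under rundir."""
    return [
        rel for rel in expected
        if not os.path.isfile(os.path.join(rundir, rel))
    ]
```

An empty list means the file the test step depends on is present; anything
else suggests the tuner failed before writing it.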

On Thu, Oct 27, 2016 at 12:44 AM, lewis john mcgibbney 
wrote:

> Hi Folks,
> So I've been plodding away again and feel I am very close to generating my
> first language pack, however I've arrived at the following fankle!!!
> If I run a pipeline from start to finish, it fails at the 'test-bundle-1'
> phase as below, stating "[Errno 2] No such file or directory:
> '/usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config.final'"
>
> lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments/exp3 $
> /usr/local/incubator-joshua/bin/pipeline.pl  --rundir . --type hiero
> --corpus
> /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en
> --tune
> /usr/local/joshua_resources/russian_experiments/data/
> commoncrawl.ru-en.tune
> --test
> /usr/local/joshua_resources/russian_experiments/data/
> commoncrawl.ru-en.test
> --source en --target ru --readme "Experiment 3 Run 1 of ru --> en model
> training" --aligner berkeley --hadoop-mem 10g --tmp
> /usr/local/hadoop-2.5.2/hadoop_tmp_dir
> [train-copy-and-filter] cached, skipping...
> [train-tokenize-en] cached, skipping...
> [train-tokenize-ru] cached, skipping...
> [train-trim] cached, skipping...
> [train-lowercase-en] cached, skipping...
> [train-lowercase-ru] cached, skipping...
> [train-vocab-en] cached, skipping...
> [train-vocab-ru] cached, skipping...
> [tune-copy-and-filter] cached, skipping...
> [tune-tokenize-en] cached, skipping...
> [tune-tokenize-ru] cached, skipping...
> [tune-lowercase-en] cached, skipping...
> [tune-lowercase-ru] cached, skipping...
> [tune-vocab-en] cached, skipping...
> [tune-vocab-ru] cached, skipping...
> [test-copy-and-filter] cached, skipping...
> [test-tokenize-en] cached, skipping...
> [test-tokenize-ru] cached, skipping...
> [test-lowercase-en] cached, skipping...
> [test-lowercase-ru] cached, skipping...
> [test-vocab-en] cached, skipping...
> [test-vocab-ru] cached, skipping...
> [lm-sort-uniq] cached, skipping...
> [kenlm] cached, skipping...
> [compile-kenlm] cached, skipping...
> [glue-tune] cached, skipping...
> [tune-bundle] cached, skipping...
> [mert-1] rebuilding...
>
> dep=/usr/local/joshua_resources/russian_experiments/
> exp3/data/tune/corpus.en
>
> dep=/usr/local/joshua_resources/russian_experiments/
> exp3/tune/joshua.config
> [CHANGED]
>   dep=tune/model/grammar.gz.packed/slice_0.source
>
> dep=/usr/local/joshua_resources/russian_experiments/
> exp3/tune/joshua.config.final
> [NOT FOUND]
>   cmd=/usr/local/incubator-joshua/scripts/training/run_tuner.py
> /usr/local/joshua_resources/russian_experiments/exp3/data/tune/corpus.en
> /usr/local/joshua_resources/russian_experiments/exp3/data/tune/corpus.ru
> --tunedir /usr/local/joshua_resources/russian_experiments/exp3/tune
> --tuner
> mert --decoder
> /usr/local/joshua_resources/russian_experiments/exp3/tune/decoder_command
> --decoder-config
> /usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config
> --decoder-output-file
> /usr/local/joshua_resources/russian_experiments/exp3/tune/output.nbest
> --decoder-log-file
> /usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.log
> --iterations 10 --metric 'BLEU 4 closest'
>   took 27 seconds (27s)
> [test-bundle-1] rebuilding...
>
> dep=/usr/local/joshua_resources/russian_experiments/
> exp3/tune/joshua.config.final
> [NOT FOUND]
>   dep=grammar.gz
>
> dep=/usr/local/joshua_resources/russian_experiments/
> exp3/test/1/model/joshua.config
>   cmd=/usr/local/incubator-joshua/scripts/support/run_bundler.py --force
> --symlink --absolute --verbose -T /usr/local/hadoop-2.5.2/hadoop_tmp_dir
> /usr/local/joshua_resources/russian_experiments/exp3/tune/
> joshua.config.final
> /usr/local/joshua_resources/russian_experiments/exp3/test/1/model
> --copy-config-options '-top-n 300 -pop-limit 5000 -output-format "%i ||| %s
> ||| %f ||| %c" -mark-oovs false' --pack-tm grammar.gz --tm
> /usr/local/joshua_resources/russian_experiments/exp3/data/
> tune/grammar.glue
>   JOB FAILED (return code 2)
> ERROR:root:ERROR: argument config: can't open
> '/usr/local/joshua_resources/russian_experiments/exp3/tune/
> joshua.config.final':
> [Errno 2] No such file or directory:
> '/usr/local/joshua_resources/russian_experiments/exp3/tune/
> joshua.config.final'
>
> However, if I run the pipeline with the --first-step test flag, then I get
> the following where the 'test-bundle-1' phase executes and completes
> flawlessly however the pipeline then goes on to die at the 'test-decode-1'
> phase!!!
>
> lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments/exp3 $
> /usr/local/incubator-joshua/bin/pipeline.pl  

Re: openjdk 8 incompatibility

2016-10-25 Thread John Hewitt
Checks out. Thanks, Matt.

-John

On Tue, Oct 25, 2016 at 3:56 PM, Matt Post <p...@cs.jhu.edu> wrote:

> Hmm, inclusion of that line looks like a mistake. I've seen Eclipse add
> random imports because it sorts the suggestions in a very unhelpful manner.
> I just removed the line and pushed, try again.
>
>
> > On Oct 25, 2016, at 1:11 PM, John Hewitt <john...@seas.upenn.edu> wrote:
> >
> > Hi all,
> >
> > Has anyone been able to compile Joshua with openjdk? I get this message:
> >
> > /home/john/java/incubator-joshua/src/main/java/org/apache/joshua/decoder/ff/lm/KenLM.java:[21,19]
> > error: package javafx.scene does not exist
> >
> > And the following link seems to confirm that javafx is not a part of
> > openjdk.
> > https://ask.fedoraproject.org/en/question/93407/there-is-no-javafx-packages-in-openjdk-180-fedora-gnulinux/
> >
> > -John
>
>


openjdk 8 incompatibility

2016-10-25 Thread John Hewitt
Hi all,

Has anyone been able to compile Joshua with openjdk? I get this message:

/home/john/java/incubator-joshua/src/main/java/org/apache/joshua/decoder/ff/lm/KenLM.java:[21,19]
error: package javafx.scene does not exist

And the following link seems to confirm that javafx is not a part of
openjdk.
https://ask.fedoraproject.org/en/question/93407/there-is-no-javafx-packages-in-openjdk-180-fedora-gnulinux/

-John
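
To see whether any other source files would trip the same error, one could
scan for javafx imports. This is a minimal sketch under the assumption that
the offending lines are ordinary single-line import statements; the function
name is hypothetical.

```python
import re

# Matches "import javafx..." and "import static javafx..." at line start.
JAVAFX_IMPORT = re.compile(r"^\s*import\s+(?:static\s+)?javafx\.")


def javafx_import_lines(java_source):
    """Return the 1-based line numbers of javafx imports in a Java source string."""
    return [
        number
        for number, line in enumerate(java_source.splitlines(), start=1)
        if JAVAFX_IMPORT.match(line)
    ]
```

Running this over each .java file under src/main/java would list the lines to
remove or replace.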


[jira] [Commented] (JOSHUA-288) Port fast_align to java

2016-10-25 Thread John Hewitt (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605667#comment-15605667
 ] 

John Hewitt commented on JOSHUA-288:


Replaced gnu-getopt (not Apache licence-compliant) with commons-cli

> Port fast_align to java
> ---
>
> Key: JOSHUA-288
> URL: https://issues.apache.org/jira/browse/JOSHUA-288
> Project: Joshua
>  Issue Type: New Feature
>Reporter: Matt Post
>    Assignee: John Hewitt
>Priority: Minor
> Fix For: 6.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> It would be great to have a Java port of fast_align, so that we don't have to 
> worry about compiling it, and could distribute it via Maven.
> https://github.com/clab/fast_align
> The port we'll use, in progress, is hosted at:
> https://github.com/john-hewitt/fast_align.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (JOSHUA-288) Port fast_align to java

2016-10-25 Thread John Hewitt (JIRA)

 [ 
https://issues.apache.org/jira/browse/JOSHUA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewitt updated JOSHUA-288:
---
Description: 
It would be great to have a Java port of fast_align, so that we don't have to 
worry about compiling it, and could distribute it via Maven.

https://github.com/clab/fast_align

The port we'll use, in progress, is hosted at:

https://github.com/john-hewitt/fast_align.java

  was:
It would be great to have a Java port of fast_align, so that we don't have to 
worry about compiling it, and could distribute it via Maven.

https://github.com/clab/fast_align


> Port fast_align to java
> ---
>
> Key: JOSHUA-288
> URL: https://issues.apache.org/jira/browse/JOSHUA-288
> Project: Joshua
>  Issue Type: New Feature
>Reporter: Matt Post
>    Assignee: John Hewitt
>Priority: Minor
> Fix For: 6.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> It would be great to have a Java port of fast_align, so that we don't have to 
> worry about compiling it, and could distribute it via Maven.
> https://github.com/clab/fast_align
> The port we'll use, in progress, is hosted at:
> https://github.com/john-hewitt/fast_align.java





Re: language pack #1

2016-10-05 Thread John Hewitt
Quick further note -- I already had $JOSHUA set to a different directory,
so initially all the lookups were failing.

It's possible that current Joshua users will hit the same problem when they
download new language packs. This should be an obvious and quick fix for the
user, but I don't know if there's something we could do to make it even
clearer. (Potentially checking whether $JOSHUA is the same as $PWD after
the directory change in prepare.sh, and printing a warning if it's not?)

-John
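
The check suggested above amounts to a path comparison. Here is the logic
sketched in Python for illustration; in prepare.sh itself it would be a few
lines of shell. The function name and the warning text are hypothetical, and
treating an unset $JOSHUA as "nothing to warn about" is an assumption.

```python
import os


def joshua_env_warning(pack_dir, environ=None):
    """Return a warning if $JOSHUA points somewhere other than pack_dir, else None."""
    environ = os.environ if environ is None else environ
    joshua = environ.get("JOSHUA")
    if joshua is None:
        # Assumption: an unset variable is not the stale-setting case.
        return None
    # realpath normalizes symlinks and trailing slashes before comparing.
    if os.path.realpath(joshua) == os.path.realpath(pack_dir):
        return None
    return "WARNING: $JOSHUA is set to {} but this language pack is in {}".format(
        joshua, pack_dir
    )
```

Calling joshua_env_warning(os.getcwd()) after the directory change would
mirror the $JOSHUA versus $PWD comparison.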

On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt <john...@seas.upenn.edu> wrote:

> Thanks, Matt!
>
> Some notes:
>
> When piping input into prepare.sh, I get the following output:
>
> WARNING: No known abbreviations for language 'es', attempting fall-back to
> English version...
> ERROR: No abbreviations files found in /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
>
> Seems that line 12 of tokenize.pl:
> my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
> should be:
> my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
>
> When I make this modification, it works just fine for me.
> Also, tried in server mode -- seems to work without issue.
>
> (For reference -- executed on an openSUSE cluster)
>
> -John
>
>
>
> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
>> Hi folks,
>>
>> I have managed to assemble an actual working language pack. Consider this
>> a (near-final, I hope) draft of what we're rolling out for lots of
>> languages. Please download it, check out the README and associated files,
>> test it, and let me know what's missing or what needs to change.
>>
>> http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
>>
>> Suggested use:
>>
>> tar xzvf apache-joshua-es-en-2016-10-05.tgz
>> echo "\"Yo quiero Taco Bell,\", él dijo." \
>> | ./apache-joshua-es-en-2016-10-05/prepare.sh \
>> | ./apache-joshua-es-en-2016-10-05/joshua
>>
>> matt
>
>
>


[jira] [Commented] (JOSHUA-288) Port fast_align to java

2016-10-02 Thread John Hewitt (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15541397#comment-15541397
 ] 

John Hewitt commented on JOSHUA-288:


I'm moving to benchmark the port against the original C implementation in 
runtime and AER. The authors of the original papers specify datasets on which 
they evaluate:

- French: Europarl and News Commentary from WMT 12
- Chinese: (LDC2003E14)
- Arabic: all parallel data made available for the NIST 2012 Open MT

At least for French and Arabic, it is unclear where the manual reference 
alignments reside. Any thoughts? [~post]?
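
For reference, a minimal sketch of the AER side of such a benchmark, using the
standard definition AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the
hypothesis alignment, S the sure reference links, and P the possible reference
links (with S treated as a subset of P). It is written in Python for brevity
and is not part of the port. The function names are hypothetical, and it
assumes alignments are given as "i-j" link pairs, one sentence per line.

```python
def parse_pharaoh(line):
    """Parse one line of 'i-j' alignment links (e.g. '0-0 1-2') into a set of (i, j) pairs."""
    links = set()
    for token in line.split():
        src, _, tgt = token.partition("-")
        links.add((int(src), int(tgt)))
    return links


def alignment_error_rate(hypothesis, sure, possible):
    """Corpus-level AER over parallel lists of per-sentence link sets.

    hypothesis, sure, possible: lists of sets of (i, j) pairs, one set per sentence.
    Sure links are always counted as possible links too.
    """
    total_hits = 0
    total_size = 0
    for a, s, p in zip(hypothesis, sure, possible):
        p = p | s  # enforce S as a subset of P
        total_hits += len(a & s) + len(a & p)
        total_size += len(a) + len(s)
    if total_size == 0:
        # No hypothesis links and no sure links: nothing to get wrong.
        return 0.0
    return 1.0 - total_hits / total_size


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    hyp = [parse_pharaoh("0-0 1-1 2-3")]
    sure = [parse_pharaoh("0-0 1-1")]
    poss = [parse_pharaoh("2-2")]
    print(alignment_error_rate(hyp, sure, poss))
```

Running the same function over the output of both implementations on the same
reference alignments would give directly comparable numbers; lower is better,
and 0 means every hypothesis link is at least possible and every sure link was
found.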

> Port fast_align to java
> ---
>
> Key: JOSHUA-288
> URL: https://issues.apache.org/jira/browse/JOSHUA-288
> Project: Joshua
> Issue Type: New Feature
> Reporter: Matt Post
> Assignee: John Hewitt
> Priority: Minor
> Fix For: 6.2
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> It would be great to have a Java port of fast_align, so that we don't have to 
> worry about compiling it, and could distribute it via Maven.
> https://github.com/clab/fast_align



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (JOSHUA-288) Port fast_align to java

2016-09-08 Thread John Hewitt (JIRA)

[ https://issues.apache.org/jira/browse/JOSHUA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474276#comment-15474276 ]

John Hewitt commented on JOSHUA-288:


I've found what is possibly a bug in the original C++ code, which was carried
over into the Java conversion. It seems to be a simple arithmetic error.
I've created a pull request, https://github.com/clab/fast_align/pull/20, and
will determine whether it's actually a bug and should be fixed in the Java port.


> Port fast_align to java
> ---
>
> Key: JOSHUA-288
> URL: https://issues.apache.org/jira/browse/JOSHUA-288
> Project: Joshua
> Issue Type: New Feature
> Reporter: Matt Post
> Priority: Minor
> Fix For: 6.2
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> It would be great to have a Java port of fast_align, so that we don't have to 
> worry about compiling it, and could distribute it via Maven.
> https://github.com/clab/fast_align





[jira] [Commented] (JOSHUA-221) ArrayIndexOutOfBoundsException when passing arguments to JoshuaDecoder.main

2016-08-19 Thread John Hewitt (JIRA)

[ https://issues.apache.org/jira/browse/JOSHUA-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428097#comment-15428097 ]

John Hewitt commented on JOSHUA-221:


The current command line parsing scheme writes the options to a temporary 
config file and then reads the config file as it otherwise would.

If we move to Args4J or commons-cli, I think we should consider avoiding this
round trip, as it means we're parsing the arguments twice.
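
A minimal sketch of the single-pass alternative, written in Python with
argparse for brevity rather than in Java with Args4J or commons-cli. It is not
Joshua's ArgsParser: the option names, the "key = value" config format, and
the function names are all assumptions made for illustration. The idea is to
parse the command line once into a structure and overlay it on whatever the
config file provides, with no temporary file in between.

```python
import argparse


def parse_config_file(lines):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    config = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def build_config(argv, file_lines=()):
    """Build one config dict: file settings first, command-line options on top."""
    parser = argparse.ArgumentParser(prog="decoder-sketch")
    # Hypothetical options. A flag such as -version takes no value, so the
    # parser must not assume that every option is followed by an argument.
    parser.add_argument("-version", action="store_true", default=None)
    parser.add_argument("-top-n", dest="top_n", type=int, default=None)
    args = parser.parse_args(argv)

    config = parse_config_file(file_lines)
    for key, value in vars(args).items():
        if value is not None:  # only options actually given override the file
            config[key] = value
    return config


if __name__ == "__main__":
    print(build_config(["-top-n", "5"], ["# defaults", "top_n = 300"]))
```

Values that come from the file stay strings in this sketch, while command-line
values are typed by the parser; a real implementation would want one place
that converts both.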

> ArrayIndexOutOfBoundsException when passing arguments to JoshuaDecoder.main
> ---
>
> Key: JOSHUA-221
> URL: https://issues.apache.org/jira/browse/JOSHUA-221
> Project: Joshua
> Issue Type: Bug
> Reporter: Lewis John McGibbney
> Fix For: 6.2
>
>
> {code}
> lmcgibbn@LMC-032857 /usr/local/joshua(master) $ java -jar class/joshua.jar -version
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
>   at joshua.decoder.ArgsParser.<init>(ArgsParser.java:43)
>   at joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:30)
> lmcgibbn@LMC-032857 /usr/local/joshua(master) $ java -jar class/joshua.jar -version -v
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
>   at joshua.decoder.ArgsParser.<init>(ArgsParser.java:43)
>   at joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:30)
> {code}





[jira] [Commented] (JOSHUA-288) Port fast_align to java

2016-08-13 Thread John Hewitt (JIRA)

[ https://issues.apache.org/jira/browse/JOSHUA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420046#comment-15420046 ]

John Hewitt commented on JOSHUA-288:


I found an existing direct port of fast_align to Java:
https://github.com/dowobeha/fast_align.java
This port isn't ready to be integrated with Joshua, for the following reasons:
 - It uses GNU getopt, which is licensed under GPLv2.
 - It has no documentation.
 - It has no test cases.
 - It has not been touched since the initial commit.
I am remedying these at https://github.com/john-hewitt/fast_align.java

> Port fast_align to java
> ---
>
> Key: JOSHUA-288
> URL: https://issues.apache.org/jira/browse/JOSHUA-288
> Project: Joshua
> Issue Type: New Feature
> Reporter: Matt Post
> Priority: Minor
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> It would be great to have a Java port of fast_align, so that we don't have to 
> worry about compiling it, and could distribute it via Maven.
> https://github.com/clab/fast_align





[GitHub] incubator-joshua issue #32: JOSHUA-286 - Replace old joshua-decoder.org link...

2016-07-28 Thread john-hewitt
Github user john-hewitt commented on the issue:

https://github.com/apache/incubator-joshua/pull/32
  
@lewismc Improvements addressed. Happy to help.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-joshua pull request #32: JOSHUA-286 - Replace old joshua-decoder.o...

2016-07-27 Thread john-hewitt
GitHub user john-hewitt opened a pull request:

https://github.com/apache/incubator-joshua/pull/32

JOSHUA-286 - Replace old joshua-decoder.org links with joshua.apache.org

- Update links to documentation and support to reflect the
move to Apache.
- Keep the .gitignore entry for the old website to keep the
repo clean.
- Update links to the git repo as well.
- Leave old pages in the `docs` folder unchanged.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/john-hewitt/incubator-joshua JOSHUA-286

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-joshua/pull/32.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #32


commit 36a58e75e5deb71bdeed2980e740502fe3d516c2
Author: John Hewitt <john.hewit...@gmail.com>
Date:   2016-07-28T02:57:04Z

Replace old joshua-decoder.org links with joshua.apache.org

Updating links to documentation and support to reflect the
move to Apache. Gitignore entry was kept to keep the repo
clean. References to the old git repo were updated as well.



