Re: [DISCUSS] Graduation (was Re: Path to TLP)

2018-09-06 Thread Matt Post
Hi folks,

It is fine with me if you want to move to graduation, but at this point I will 
assert that I don't have the time to contribute, and do not wish to be involved 
as a committee member once that threshold is crossed. It has been a good run 
and I have only fond associations with the project, but it is time for me to 
move on, and I wish you all the best.

Sincerely,
Matt



> On Sep 6, 2018, at 11:36 AM, Chris Mattmann  wrote:
> 
> Coming back to this.
> 
> Sorry it took so long :/
> 
> Here is a proposed graduation template. I will call for a VOTE on it 
> by mid-next week once the discussion comes to consensus. 
> 
> WHEREAS, the Board of Directors deems it to be in the best
> interests of the Foundation and consistent with the
> Foundation's purpose to establish a Project Management
> Committee charged with the creation and maintenance of
> open-source software, for distribution at no charge to
> the public, related to statistical and other forms of machine 
> translation.
> 
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> Committee (PMC), to be known as the "Apache Joshua Project",
> be and hereby is established pursuant to Bylaws of the
> Foundation; and be it further
> 
> RESOLVED, that the Apache Joshua Project be and hereby is
> responsible for the creation and maintenance of software
> related to statistical and other forms of machine translation;
> and be it further
> 
> RESOLVED, that the office of "Vice President, Apache Joshua" be
> and hereby is created, the person holding such office to
> serve at the direction of the Board of Directors as the chair
> of the Apache Joshua Project, and to have primary responsibility
> for management of the projects within the scope of
> responsibility of the Apache Joshua Project; and be it further
> 
> RESOLVED, that the persons listed immediately below be and
> hereby are appointed to serve as the initial members of the
> Apache Joshua Project:
> 
> * Tom Barber
> * Thamme Gowda
> * Felix Hieber
> * Lewis John McGibbney
> * Chris Mattmann
> * Matt Post
> * Paul Ramirez
> * Henry Saputra
> * Kellen Sunderland
> * Tommaso Teofili
> 
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Matt Post
> be appointed to the office of Vice President, Apache Joshua to
> serve in accordance with and subject to the direction of the
> Board of Directors and the Bylaws of the Foundation until
> death, resignation, retirement, removal or disqualification,
> or until a successor is appointed; and be it further
> 
> RESOLVED, that the initial Apache Joshua PMC be and hereby is
> tasked with the creation of a set of bylaws intended to
> encourage open development and increased participation in the
> Apache Joshua Project; and be it further
> 
> RESOLVED, that the Apache Joshua Project be and hereby
> is tasked with the migration and rationalization of the Apache
> Incubator Joshua podling; and be it further
> 
> RESOLVED, that all responsibilities pertaining to the Apache
> Incubator Joshua podling encumbered upon the Apache Incubator
> Project are hereafter discharged.
> 
> Cheers,
> Chris
> 
> From: Thamme Gowda 
> Reply-To: "dev@joshua.incubator.apache.org" 
> Date: Saturday, February 3, 2018 at 7:51 PM
> To: "dev@joshua.incubator.apache.org" 
> Subject: Re: [DISCUSS] Graduation (was Re: Path to TLP)
> 
> Great news!
> 
> 2018-02-01 19:48 GMT-08:00 Mattmann, Chris A (1761) <chris.a.mattm...@jpl.nasa.gov>:
> 
> +1 I’ll draft the resolution and send shortly for community vote
> 
> Sent from my iPhone
> 
>> On Feb 1, 2018, at 7:22 PM, Tom Barber  wrote:
>> 
>> I'd just like to dig this one back. Seeing how Matt accepted the
>> proposal and there is action from Tommaso and Lewis to get stuff merged,
>> it seems like there is general consensus to get Joshua out of the incubator.
>> 
>> Tom
>> 
> 
>>>

Re: CJK LPs

2018-02-19 Thread Matt Post
You just have to make sure that the language pack makes it easy to apply the 
same pre-processing to test data that you applied at training time. Which means 
bundling the segmentation model with the language pack (or doing something 
simple, like single-character words—that degrades performance but would be 
easier). I typically use the Stanford segmenter but I'm not sure it would 
matter that much.

matt


> On Feb 19, 2018, at 1:45 PM, Tommaso Teofili <tommaso.teof...@gmail.com> 
> wrote:
> 
> thanks Matt.
> Would you be able to point out such additional step in a bit more detail
> when you have time ?
> Not sure what you used for segmentation, perhaps could use either Lucene's
> CJK [1] or Kuromoji [2] analyzers.
> 
> Regards,
> Tommaso
> 
> [1] :
> https://lucene.apache.org/core/7_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKAnalyzer.html
> [2] : https://lucene.apache.org/core/7_0_0/analyzers-kuromoji/
> 
> On Mon, Feb 19, 2018 at 12:12, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> I don’t think I ever built these. There is an additional step of properly
>> and consistently segmenting Chinese which complicates things and creates an
>> external dependency.
>> 
>> matt (from my phone)
>> 
>>> On Feb 19, 2018, at 10:46, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> I am not sure if I am missing something, but I somewhat recalled that
>>> language packs for Chinese (but also Japanese / Korean) existed at [1],
>>> however I can't find any.
>>> Reading through the comments it seems at least that was the plan.
>>> If that is a leftover from the recent LP migration we could try to fix it
>>> otherwise it'd be nice to build and provide such CJK LPs.
>>> Can anyone help clarify ?
>>> 
>>> Regards,
>>> Tommaso
>>> 
>>> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> 
>> 



Re: [jira] [Commented] (JOSHUA-333) The English-English Language Pack download links are broken.

2018-01-08 Thread Matt Post
Hi folks,

Hope we can dig these up because they’ve been deleted from JHU’s servers. 

matt (from my phone)

> On Jan 5, 2018, at 17:51, Lewis John McGibbney (JIRA) wrote:
> 
> 
> [ https://issues.apache.org/jira/browse/JOSHUA-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313425#comment-16313425 ]
> 
> Lewis John McGibbney commented on JOSHUA-333:
> -
> 
> [~bugg_tb] were these files copied when we migrated from [~post]'s server to 
> Dropbox?
> 
>> The English-English Language Pack download links are broken.
>> 
>> 
>>Key: JOSHUA-333
>>URL: https://issues.apache.org/jira/browse/JOSHUA-333
>>Project: Joshua
>> Issue Type: Bug
>>   Reporter: David Gonzalez
>> 
>> On the Apache Joshua English-English wiki page the ruleset (PPDB v2) 
>> downloads are all broken (404).
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65142863
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.4.14#64029)



[jira] [Commented] (JOSHUA-332) Merge 7 branch into master

2017-10-26 Thread Matt Post (JIRA)

[ https://issues.apache.org/jira/browse/JOSHUA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220971#comment-16220971 ]

Matt Post commented on JOSHUA-332:
--

I am not sure if it's worth merging 7. I'm not saying it's not, I just honestly 
don't know. There was a lot of work there but it was also a number of ideas 
that were never fully implemented. We did do quite a bit of work on it, but it 
may be very far from being able to be merged. You might consider just 
abandoning those unless you have a clear idea how to pull them in.

>  Merge 7 branch into master
> ---
>
> Key: JOSHUA-332
> URL: https://issues.apache.org/jira/browse/JOSHUA-332
> Project: Joshua
>  Issue Type: Task
>  Components: core
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 7
>
>
> As discussed on the mailing list, let's branch _master_ into a _6x_ branch 
> and merge branch _7_ into _master_ in order to keep developing on top of the 
> latest in the main branch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [DISCUSS] Graduation (was Re: Path to TLP)

2017-10-05 Thread Matt Post
Thanks Tommaso. Though, I should say, initial thanks goes to Zhifei Li. I just 
took it over.

I think I can stick around in the capacity Chris suggests. Thanks, all.

matt

> On Sep 27, 2017, at 9:20 AM, Tommaso Teofili <tommaso.teof...@gmail.com> 
> wrote:
> 
> +1 to Chris's proposal.
> 
> Let me also add my thanks to you Matt for making Joshua happen in the first
> place and for bringing it to the ASF and involving me and the rest of the
> team in such an interesting piece of sw and in machine translation in
> general. I do understand the need for you to move into the NMT stuff but at
> the same time I think Joshua is a very good resource (given also the many
> language packs available) for people and / or projects that want to start
> with MT with reasonably good results, so I can still see its value.
> 
> My 2 cents,
> Tommaso
> 
> 
> 
> On Tue, Sep 26, 2017 at 18:57, Chris Mattmann <mattm...@apache.org> wrote:
> 
>> Thanks Matt. My feeling is that, if you are willing, we should make you the
>> chair of the project, which is really an administrative role: it requires a
>> willingness to submit a board report monthly for the first three months, and
>> quarterly after that. This is to recognize your contributions and merit to
>> the project, which will never expire. Even if you are not actively
>> developing, I think you would make a great chair.
>> 
>> Apache Joshua works, has a release, and has a good community around it of
>> people like Lewis, Tommaso, and others, so I think it would withstand even
>> your departure from development. It could also make a good academic/learning
>> tool, and could be something we focus on by getting new GSOC projects to add
>> in the NeuralMT stuff.
>> 
>> If you are OK with that I think we should proceed. Let me know and thanks.
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> 
>> On 9/25/17, 11:24 PM, "Matt Post" <p...@cs.jhu.edu> wrote:
>> 
>>Hi everyone,
>> 
>>I think now is as good a time as any to mention my feelings about
>> Joshua. You may have noticed that I haven't done much active development
>> over the past year; you likely also know that the reason is that the
>> research community has shifted entirely from work on statistical models to
>> work on neural machine translation. On the research side, neural models now
>> consistently outperform phrase-based systems on BLEU score on language
>> pairs where there is enough data (roughly, around 15 million words of
>> training), and work there has injected a lot of new life into a field that
>> many had felt was starting to stagnate. From a production standpoint,
>> neural systems are also a big win: the models do best with a GPU and take
>> some time to train, but the architecture and pipeline are simpler, and the
>> resulting models are constant-sized and on the order of a few gigabytes at
>> most, instead of scaling with training data into the tens of gigabytes, as
>> statistical systems do. Test-time inference can also be run fairly
>> efficiently on CPUs where throughput demands are low enough. All commercial
>> systems are now neural or are quickly moving in that direction, including
>> relatively surprising places like Systran, which until recently was the
>> world's best-known rule-based system. As GPUs become more ubiquitous
>> and cheap, this situation is only going to get better, even for the end
>> user. There is little doubt that neural MT has supplanted statistical
>> approaches to machine translation, across both academic research and
>> industry. And it is still in its relative infancy, with lots of interesting
>> research problems and engineering issues to investigate and resolve.
>> 
>>It's somewhat sad for me because I've been working on or with Joshua
>> for almost seven years, but I also find my feelings here interesting in
>> contrast to a previous time I've felt tugged away from Joshua. As many of
>> you know, Philipp Koehn joined JHU a few years ago, which brought some
>> tension to JHU with respect to collaborating on research. There was
>> pressure for me to switch. Moses had a much bigger development community
>> and was much more feature rich, but despite this, I was reluctant to let go
>> of Joshua, for a number of reasons. Java is nicer to work with than C++
>> (and not really that much slower); our code is better written, IMO; jar
>> files are easier to distribute than C++ in compiled or source form; and, of
>> course, I had much more familiarity with the codebase, not to m

Re: [DISCUSS] Graduation (was Re: Path to TLP)

2017-09-26 Thread Matt Post
Hi everyone,

I think now is as good a time as any to mention my feelings about Joshua. 
You may have noticed that I haven't done much active development over the past 
year; you likely also know that the reason is that the research community has 
shifted entirely from work on statistical models to work on neural machine 
translation. On the research side, neural models now consistently outperform 
phrase-based systems on BLEU score on language pairs where there is enough data 
(roughly, around 15 million words of training), and work there has injected a 
lot of new life into a field that many had felt was starting to stagnate. From 
a production standpoint, neural systems are also a big win: the models do best 
with a GPU and take some time to train, but the architecture and pipeline are 
simpler, and the resulting models are constant-sized and on the order of a few 
gigabytes at most, instead of scaling with training data into the tens of 
gigabytes, as statistical systems do. Test-time inference can also be run 
fairly efficiently on CPUs where throughput demands are low enough. All 
commercial systems are now neural or are quickly moving in that direction, 
including relatively surprising places like Systran, which until recently was 
the world's best-known rule-based system. As GPUs become more 
ubiquitous and cheap, this situation is only going to get better, even for the 
end user. There is little doubt that neural MT has supplanted statistical 
approaches to machine translation, across both academic research and industry. 
And it is still in its relative infancy, with lots of interesting research 
problems and engineering issues to investigate and resolve.

It's somewhat sad for me because I've been working on or with Joshua for almost 
seven years, but I also find my feelings here interesting in contrast to a 
previous time I've felt tugged away from Joshua. As many of you know, Philipp 
Koehn joined JHU a few years ago, which brought some tension to JHU with 
respect to collaborating on research. There was pressure for me to switch. 
Moses had a much bigger development community and was much more feature rich, 
but despite this, I was reluctant to let go of Joshua, for a number of reasons. 
Java is nicer to work with than C++ (and not really that much slower); our code 
is better written, IMO; jar files are easier to distribute than C++ in compiled 
or source form; and, of course, I had much more familiarity with the codebase, 
not to mention something of a personal stake in Joshua. But with neural MT, I 
have none of these reservations. It's nice for one to have the Moses/Joshua 
tension resolved (sometimes, ignoring a problem does make it go away!), but for 
all the reasons I listed in the opening paragraph, NMT is now the clear way to 
go. And the bottom line for me is that I can no longer justify spending time on 
Joshua during my working hours, and with a young family and other interests 
that I want to pursue, I don't have time for it outside of work. I am happy to 
still linger on the project, but am unlikely to be much of an active 
participant unless I'm explicitly asked for something.

As I've written before here, I think there may still be some role for statistical 
systems, and therefore, for Joshua. In low-resource situations, StatMT may 
still be the right approach overall, or even simply the best way to quickly 
build up a working system. There is some promise I think in deploying models 
easily on older hardware that people have, and perhaps getting people to help 
contribute translations and translation memories that could be used to build 
and improve systems. There are surely more good ideas in this space in the vein 
of providing a good tool to users. 

It's been a great experience for me working with the Apache community on 
Joshua. I am grateful to Chris for convincing us to make Joshua an Apache 
incubator project, which put a lot of new life into the project. Lewis has been 
a lot of help throughout, helping smooth over the transition; Tommaso has 
repeatedly helped with tasks large and small; and that is just three of you. 
It's too bad therefore that the timing just didn't work out, but neural MT 
ascended very rapidly. I know there are other members here who are also 
thinking along these lines. At the same time, I hope my departure from active 
development doesn’t mean the end of the project for those of you who wish to 
keep working on it. 

Sincerely,
matt


> On Sep 25, 2017, at 23:10, Tommaso Teofili wrote:
> 
> I would also think we're ready for graduation.
> My only concern relates to how many of the current committers are willing
> to keep contributing to the project, basically if we have a PMC which is
> big enough for the graduation.
> 
> Regards,
> Tommaso
> 
> 
> On Sat, Sep 23, 2017 at 01:21, Chris Mattmann wrote:
> 
>> Tom, glad you raised this issue, IMO, Joshua is ready for 

Re: About how to use Jousha translator

2017-09-12 Thread Matt Post
This is the best I can do:

https://cwiki.apache.org/confluence/display/JOSHUA/RESTful+API


> On Sep 12, 2017, at 9:19 PM, Tehetena Alemu <tehet...@gmail.com> wrote:
> 
> Hi Matt,
> 
> Thanks for your response. Would you mind giving me a clue how I can use this 
> public API to translate from Amharic to English or another language?
> 
> On Tuesday, September 12, 2017, Matt Post <p...@cs.jhu.edu> wrote:
> Hi,
> 
> The mention of Google referred only to the public API. That is, Joshua's 
> server mode will answer to RESTful style queries. This is implemented
> 
> There are not any new language packs forthcoming in the near future that I am 
> aware of.
> 
> matt (from my phone)
> 
> > On Sep 12, 2017, at 14:44, lewis john mcgibbney <lewi...@apache.org> wrote:
> >
> > If I were you I would simply contact dev@joshia with that query then.
> > Someone on the list should hopefully see the comment and respond.
> > It looks like an update to this documentation is possibly required as I am
> > not sure if anyone is actively working on this... I may be wrong however!
> >
> >> On Tue, Sep 12, 2017 at 3:07 AM Tehetena Alemu <tehet...@gmail.com> wrote:
> >>
> >> https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
> >>
> >> "*Version 3 Language Packs Coming Soon*
> >> (March 2017) Version 3 language packs with Kenlm (via Docker) and more
> >> complete Google Translate API support
> >> <https://cloud.google.com/translate/docs/reference/rest> are coming soon.
> >> If you have questions, comments, concerns, or wish to help, please post
> >> questions to the Joshua mailing list: d...@joshua.apache.org."
> >>
> >> Tehetena Alemu
> >>
> >> On Tue, Sep 12, 2017 at 1:45 AM, lewis john mcgibbney <lewi...@apache.org> wrote:
> >>
> >>> Where did you get this information from?
> >>>
> >>> On Mon, Sep 11, 2017 at 12:28 PM, Tehetena Alemu <tehet...@gmail.com> wrote:
> >>>
> >>>> Thank you very much Lewis, it is very kind of you. Your help means a
> >>>>> lot. By the way, 2 weeks is the time I took on trying different options,
> >>>>> but not for getting a response.
> >>>>>
> >>>>
> >>>> On the other way, I just found out Joshua pack 3 will be released soon,
> >>>> with Google translation. When will it be released? It will be a very 
> >>>> good contribution to my paper.
> >>>>
> >>>> Best,
> >>>>
> >>>>
> >>>> --
> >>>> Tehetena Alemu
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> http://home.apache.org/~lewismc/
> >>> @hectorMcSpector
> >>> http://www.linkedin.com/in/lmcgibbney
> >>>
> >>
> >> --
> > http://home.apache.org/~lewismc/
> > @hectorMcSpector
> > http://www.linkedin.com/in/lmcgibbney
> 
> 
> 
> -- 
> Tehetena Alemu
> 



Re: About how to use Jousha translator

2017-09-12 Thread Matt Post
Hi,

The mention of Google referred only to the public API. That is, Joshua's server 
mode will answer to RESTful style queries. This is implemented 

There are not any new language packs forthcoming in the near future that I am 
aware of. 
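
For anyone wanting to try the RESTful server mode mentioned above, the sketch below shows roughly what a client could look like. Treat every specific in it as an assumption rather than documented behaviour: the host, the port (5674), the `/translate` path and the `q` parameter are placeholders to be checked against the RESTful API wiki page, and `build_translate_url` / `fetch_translation` are names invented for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_translate_url(text, host="localhost", port=5674, path="/translate"):
    """Build a GET URL for a Joshua server; host, port and path are assumed."""
    query = urlencode({"q": text})  # URL-encodes the sentence to translate
    return f"http://{host}:{port}{path}?{query}"


def fetch_translation(text, **kwargs):
    """Send the query and return the raw response body as text."""
    with urlopen(build_translate_url(text, **kwargs)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch_translation() once a server is running.
    print(build_translate_url("hello world"))
```

The translation direction is fixed by whichever language pack the server was started with, so a client like this only needs to send the source text.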

matt (from my phone)

> On Sep 12, 2017, at 14:44, lewis john mcgibbney wrote:
> 
> If I were you I would simply contact dev@joshia with that query then.
> Someone on the list should hopefully see the comment and respond.
> It looks like an update to this documentation is possibly required as I am
> not sure if anyone is actively working on this... I may be wrong however!
> 
>> On Tue, Sep 12, 2017 at 3:07 AM Tehetena Alemu  wrote:
>> 
>> https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> 
>> "*Version 3 Language Packs Coming Soon*
>> (March 2017) Version 3 language packs with Kenlm (via Docker) and more
>> complete Google Translate API support
>>  are coming soon.
>> If you have questions, comments, concerns, or wish to help, please post
>> questions to the Joshua mailing list: d...@joshua.apache.org."
>> 
>> Tehetena Alemu
>> 
>> On Tue, Sep 12, 2017 at 1:45 AM, lewis john mcgibbney 
>> wrote:
>> 
>>> Where did you get this information from?
>>> 
>>> On Mon, Sep 11, 2017 at 12:28 PM, Tehetena Alemu 
>>> wrote:
>>> 
>>>> Thank you very much Lewis, it is very kind of you. Your help means a
>>>>> lot. By the way, 2 weeks is the time I took on trying different options,
>>>>> but not for getting a response.
>>>>>
>>>>
>>>> On the other way, I just found out Joshua pack 3 will be released soon,
>>>> with Google translation. When will it be released? It will be a very good
>>>> contribution to my paper.
>>>>
>>>> Best,
>>>>
>>>>
>>>> --
>>>> Tehetena Alemu
>>>>
>>>>
>>> 
>>> 
>>> --
>>> http://home.apache.org/~lewismc/
>>> @hectorMcSpector
>>> http://www.linkedin.com/in/lmcgibbney
>>> 
>> 
>> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: [jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-28 Thread Matt Post
Hi,

There is no Joshua manual, unfortunately, just the Confluence pages.

I looked at your run and it seems that Thrax is failing. I don't know what your 
Hadoop configuration is like, but that is likely the problem (see thrax.log in 
these directories). If you setup Hadoop incorrectly, or don't have enough 
space, or set it up on a network share instead of local disks, all of these 
things can cause problems.

matt


> On Aug 28, 2017, at 2:46 PM, Jeffrey Smith (JIRA)  wrote:
> 
> 
> [ https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143731#comment-16143731 ]
> 
> Jeffrey Smith commented on JOSHUA-277:
> --
> 
> Thank you for your help. Here are the two directories in question. Also, do
> you know of a manual, similar to the moses manual, for joshua? I couldn't
> seem to find one.
> Moses does work on this same computer I am running this on.
> joshua-tutorial.tar.gz
> 
> joshua.tar.gz
> 
> 
> -- 
> *Jeffrey Smith, PhD*
> Chief Systems Engineer and E2 Lead
> Multi Agency Collaboration Environment (MACE)
> Sierra Nevada Corporation
> 3076 Centreville Road, Herndon, VA 20171
> 703-464-6434 (Office)
> 603-566-0124 (Cell)
> jeff.sm...@macefusion.com
> 
> 
>> UnsatisfiedLinkError: no ken in java.library.path
>> -
>> 
>>Key: JOSHUA-277
>>URL: https://issues.apache.org/jira/browse/JOSHUA-277
>>Project: Joshua
>> Issue Type: Bug
>>   Reporter: Thamme Gowda
>> 
>> I followed this guide 
>> http://joshua.incubator.apache.org/6.0/quick-start.html to test the latest 
>> build.
>> Assuming there few things are broken due to newer maven build system, I 
>> tried to fix pipeline.pl to get the quick start guide working.
>> Which files from kenlm build should I add to JNI path? (I am unable to 
>> locate the library file in the kenlm build output)
>> Here is the full log:
>> {code}
>> $JOSHUA/bin/pipeline.pl --source bn --target en --type hiero 
>> --no-prepare --aligner berkeley --corpus input/bn-en/tok/training.bn-en  
>>--tune input/bn-en/tok/dev.bn-en --test input/bn-en/tok/devtest.bn-en
>> [train-copy-and-filter] cached, skipping...
>> [train-vocab-bn] cached, skipping...
>> [train-vocab-en] cached, skipping...
>> [tune-copy-and-filter] cached, skipping...
>> [tune-vocab-bn] cached, skipping...
>> [tune-vocab-en.0] cached, skipping...
>> [tune-vocab-en.1] cached, skipping...
>> [tune-vocab-en.2] cached, skipping...
>> [tune-vocab-en.3] cached, skipping...
>> [test-copy-and-filter] cached, skipping...
>> [test-vocab-bn] cached, skipping...
>> [test-vocab-en.0] cached, skipping...
>> [test-vocab-en.1] cached, skipping...
>> [test-vocab-en.2] cached, skipping...
>> [test-vocab-en.3] cached, skipping...
>> [source-numlines] cached, skipping...
>> [source-numlines] retrieved cached result =>20788
>> [berkeley-aligner-chunk-0] cached, skipping...
>> [aligner-combine] cached, skipping...
>> [pack-grammar] cached, skipping...
>> [lm-sort-uniq] cached, skipping...
>> [kenlm] cached, skipping...
>> [compile-kenlm] cached, skipping...
>> [glue-tune] cached, skipping...
>> Error: Could not find or load main class 
>> joshua.util.encoding.EncoderConfiguration
>> [tune-bundle] cached, skipping...
>> [mert-1] rebuilding...
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn
>>  [CHANGED]
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config
>>  [CHANGED]
>>  dep=tune/model/grammar.packed/slice_0.source [CHANGED]
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config.final
>>  [NOT FOUND]
>>  
>> cmd=/Users/thammegr/work/projects/apache/incubator-joshua/scripts/training/run_tuner.py
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.en
>>  --tunedir 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune 
>> --tuner mert --decoder 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/decoder_command
>>  --decoder-config 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config
>>  --decoder-output-file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/output.nbest
>>  --decoder-log-file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.log
>>  --iterations 10 --metric 'BLEU 4 closest'
>>  JOB FAILED (return code 1)
>> Exception in thread "main" java.lang.RuntimeException: 

[jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-27 Thread Matt Post (JIRA)

[ https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143361#comment-16143361 ]

Matt Post commented on JOSHUA-277:
--

If you want to tar up your whole run directory and make it available somewhere 
I can take a closer look.





> UnsatisfiedLinkError: no ken in java.library.path
> -
>
> Key: JOSHUA-277
> URL: https://issues.apache.org/jira/browse/JOSHUA-277
> Project: Joshua
>  Issue Type: Bug
>Reporter: Thamme Gowda
>
> I followed this guide http://joshua.incubator.apache.org/6.0/quick-start.html 
> to test the latest build.
> Assuming there few things are broken due to newer maven build system, I tried 
> to fix pipeline.pl to get the quick start guide working.
> Which files from kenlm build should I add to JNI path? (I am unable to locate 
> the library file in the kenlm build output)
> Here is the full log:
> {code}
> $JOSHUA/bin/pipeline.pl --source bn --target en --type hiero 
> --no-prepare --aligner berkeley --corpus input/bn-en/tok/training.bn-en   
>   --tune input/bn-en/tok/dev.bn-en --test input/bn-en/tok/devtest.bn-en
> [train-copy-and-filter] cached, skipping...
> [train-vocab-bn] cached, skipping...
> [train-vocab-en] cached, skipping...
> [tune-copy-and-filter] cached, skipping...
> [tune-vocab-bn] cached, skipping...
> [tune-vocab-en.0] cached, skipping...
> [tune-vocab-en.1] cached, skipping...
> [tune-vocab-en.2] cached, skipping...
> [tune-vocab-en.3] cached, skipping...
> [test-copy-and-filter] cached, skipping...
> [test-vocab-bn] cached, skipping...
> [test-vocab-en.0] cached, skipping...
> [test-vocab-en.1] cached, skipping...
> [test-vocab-en.2] cached, skipping...
> [test-vocab-en.3] cached, skipping...
> [source-numlines] cached, skipping...
> [source-numlines] retrieved cached result =>20788
> [berkeley-aligner-chunk-0] cached, skipping...
> [aligner-combine] cached, skipping...
> [pack-grammar] cached, skipping...
> [lm-sort-uniq] cached, skipping...
> [kenlm] cached, skipping...
> [compile-kenlm] cached, skipping...
> [glue-tune] cached, skipping...
> Error: Could not find or load main class 
> joshua.util.encoding.EncoderConfiguration
> [tune-bundle] cached, skipping...
> [mert-1] rebuilding...
>   dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn [CHANGED]
>   dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config [CHANGED]
>   dep=tune/model/grammar.packed/slice_0.source [CHANGED]
>   dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config.final [NOT FOUND]
>   cmd=/Users/thammegr/work/projects/apache/incubator-joshua/scripts/training/run_tuner.py /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.en --tunedir /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune --tuner mert --decoder /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/decoder_command --decoder-config /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config --decoder-output-file /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/output.nbest --decoder-log-file /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.log --iterations 10 --metric 'BLEU 4 closest'
>   JOB FAILED (return code 1)
> Exception in thread "main" java.lang.RuntimeException: Unable to instantiate 
> feature function 'StateMinimizingLanguageModel -lm_order 5 -lm_file 
> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/model/lm.kenlm'!
>   at 
> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:761)
>   at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:514)
>   at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:122)
>   at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Re: [jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-26 Thread Matt Post
You said you're on OS X? This should work, but you might try building in a 
Docker container. There's a Dockerfile in distribution/docker/kenlm.


> On Aug 25, 2017, at 1:24 PM, Jeffrey Smith (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141502#comment-16141502
>  ] 
> 
> Jeffrey Smith commented on JOSHUA-277:
> --
> 
> Thanks. I appreciate you getting back to me on this. I only have JDK 8 on 
> this system. I followed the steps above and hit the same issue. Perhaps this 
> is a problem.
> 
> Joshua is here:
> [ec2-user@ip-172-31-4-253 runs]$ echo $JOSHUA
> /data/joshua
> 
> I installed the joshua tutorial files in:
> /data/joshua-tutorial so I am running tutorial from 
> /data/joshua-tutorial/runs
> 
> when I run:
> $JOSHUA/bin/pipeline.pl \
>  --rundir 1 \
>  --readme "Baseline Hiero run" \
>  --source es \
>  --target en \
>  --type hiero \
>  --corpus $FISHER/corpus/asr/fisher_train \
>  --tune $FISHER/corpus/asr/fisher_dev \
>  --test $FISHER/corpus/asr/fisher_dev2 \
>  --maxlen 11 \
>  --maxlen-tune 11 \
>  --maxlen-test 11 \
>  --tuner-iterations 1 \
>  --lm-order 3
> 
> I still get the error I described
> 
> 
> 
>> UnsatisfiedLinkError: no ken in java.library.path
>> -
>> 
>>Key: JOSHUA-277
>>URL: https://issues.apache.org/jira/browse/JOSHUA-277
>>Project: Joshua
>> Issue Type: Bug
>>   Reporter: Thamme Gowda
>> 
>> I followed this guide 
>> http://joshua.incubator.apache.org/6.0/quick-start.html to test the latest 
>> build.
>> Assuming that a few things are broken due to the newer Maven build system, I 
>> tried to fix pipeline.pl to get the quick start guide working.
>> Which files from kenlm build should I add to JNI path? (I am unable to 
>> locate the library file in the kenlm build output)
>> Here is the full log:
>> {code}
>> $JOSHUA/bin/pipeline.pl --source bn --target en --type hiero 
>> --no-prepare --aligner berkeley --corpus input/bn-en/tok/training.bn-en  
>>--tune input/bn-en/tok/dev.bn-en --test input/bn-en/tok/devtest.bn-en
>> [train-copy-and-filter] cached, skipping...
>> [train-vocab-bn] cached, skipping...
>> [train-vocab-en] cached, skipping...
>> [tune-copy-and-filter] cached, skipping...
>> [tune-vocab-bn] cached, skipping...
>> [tune-vocab-en.0] cached, skipping...
>> [tune-vocab-en.1] cached, skipping...
>> [tune-vocab-en.2] cached, skipping...
>> [tune-vocab-en.3] cached, skipping...
>> [test-copy-and-filter] cached, skipping...
>> [test-vocab-bn] cached, skipping...
>> [test-vocab-en.0] cached, skipping...
>> [test-vocab-en.1] cached, skipping...
>> [test-vocab-en.2] cached, skipping...
>> [test-vocab-en.3] cached, skipping...
>> [source-numlines] cached, skipping...
>> [source-numlines] retrieved cached result =>20788
>> [berkeley-aligner-chunk-0] cached, skipping...
>> [aligner-combine] cached, skipping...
>> [pack-grammar] cached, skipping...
>> [lm-sort-uniq] cached, skipping...
>> [kenlm] cached, skipping...
>> [compile-kenlm] cached, skipping...
>> [glue-tune] cached, skipping...
>> Error: Could not find or load main class joshua.util.encoding.EncoderConfiguration
>> [tune-bundle] cached, skipping...
>> [mert-1] rebuilding...
>>  dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn [CHANGED]
>>  dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config [CHANGED]
>>  dep=tune/model/grammar.packed/slice_0.source [CHANGED]
>>  dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config.final [NOT FOUND]
>>  
>> cmd=/Users/thammegr/work/projects/apache/incubator-joshua/scripts/training/run_tuner.py
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.en
>>  --tunedir 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune 
>> --tuner mert --decoder 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/decoder_command
>>  --decoder-config 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config
>>  --decoder-output-file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/output.nbest
>>  --decoder-log-file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.log
>>  --iterations 10 --metric 'BLEU 4 closest'
>>  JOB FAILED (return code 1)
>> Exception in thread "main" java.lang.RuntimeException: Unable to instantiate 
>> feature function 'StateMinimizingLanguageModel -lm_order 5 -lm_file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/model/lm.kenlm'!
>>  at 
>> 

Re: [jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-23 Thread Matt Post
What's the file size of grammar.gz? Looks like it didn't get extracted.
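
A minimal sketch of that check, assuming Python 3 is on the machine. The helper name inspect_gzip is hypothetical and not part of Joshua; it only reports the on-disk size and the first line, since the NoSuchElementException from LineReader.next in the trace below suggests the packer found nothing to read:

```python
import gzip
import os


def inspect_gzip(path):
    """Return (size_in_bytes, first_line); first_line is None if there is no content."""
    size = os.path.getsize(path)
    if size == 0:
        # A zero-byte file was never written to at all.
        return size, None
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        first_line = handle.readline()
    # readline() returns "" when the archive decompresses to nothing.
    return size, (first_line.rstrip("\n") if first_line else None)
```

For example, inspect_gzip("grammar.gz") returning (0, None), or any result whose first line is None, would point to an empty grammar. A file that is not valid gzip raises an OSError instead.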


> On Aug 23, 2017, at 8:14 PM, Jeffrey Smith (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138814#comment-16138814
>  ] 
> 
> Jeffrey Smith commented on JOSHUA-277:
> --
> 
> Getting closer:
> 
> I deleted runs/1 and re-ran:
> 
> $JOSHUA/bin/pipeline.pl \
> --rundir 1 \
> --readme "Baseline Hiero run" \
> --source es \
> --target en \
> --type hiero \
> --corpus $FISHER/corpus/asr/fisher_train \
> --tune $FISHER/corpus/asr/fisher_dev \
> --test $FISHER/corpus/asr/fisher_dev2 \
> --maxlen 11 \
> --maxlen-tune 11 \
> --maxlen-test 11 \
> --tuner-iterations 1 \
> --lm-order 3
> 
> The example got farther but ended with the following error. 
> ...
> * Packing grammar at "grammar.gz" to 
> "/data/joshua-tutorial/runs/1/tune/model/grammar.gz.packed"
> * Running the grammar-packer.pl script with the command: 
> /data/joshua/scripts/support/grammar-packer.pl -a -T /tmp -g grammar.gz -o 
> /data/joshua-tutorial/runs/1/tune/model/grammar.gz.packed
> Exception in thread "main" java.util.NoSuchElementException
>    at org.apache.joshua.util.io.LineReader.next(LineReader.java:276)
>    at org.apache.joshua.tools.GrammarPacker.getGrammarReader(GrammarPacker.java:239)
>    at org.apache.joshua.tools.GrammarPacker.pack(GrammarPacker.java:184)
>    at org.apache.joshua.tools.GrammarPackerCli.run(GrammarPackerCli.java:120)
>    at org.apache.joshua.tools.GrammarPackerCli.main(GrammarPackerCli.java:137)
> * FATAL: Couldn't pack the grammar.
> 
> 

Re: [jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-23 Thread Matt Post
What is the file size of lm.kenlm and lm.gz? That will tell you whether they 
built correctly.

Check that the path to the LM in the Joshua config is valid. The error that 
was thrown might be misleading.
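
A rough sketch of how both checks could be done in one go, assuming Python 3. The helper report_model_files is hypothetical (not a Joshua script); it follows soft links before measuring, since the model path may be a link to a file elsewhere in the run directory:

```python
import os


def report_model_files(paths):
    """Map each path to (resolved_target, size_in_bytes); size is None if missing."""
    report = {}
    for path in paths:
        # realpath follows soft links, so a dangling link shows up as missing.
        target = os.path.realpath(path)
        size = os.path.getsize(target) if os.path.isfile(target) else None
        report[path] = (target, size)
    return report
```

Calling report_model_files(["lm.gz", "tune/model/lm.kenlm"]) from the run directory gives one entry per file; a size of None or 0 would mean the file is missing or was not built.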

matt (from my phone)

> On Aug 23, 2017, at 3:27 PM, Jeffrey Smith (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138334#comment-16138334
>  ] 
> 
> Jeffrey Smith commented on JOSHUA-277:
> --
> 
> PS. There is a  "runs/1/tune/model/lm.kenlm". It is a soft-link to 
> .../joshua-tutorial/runs/1/lm.kenlm . Perhaps this is not what it is supposed 
> to be?
> 
> 

[jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-22 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137288#comment-16137288
 ] 

Matt Post commented on JOSHUA-277:
--

Is there a file lib/libken.so? And bin/lmplz?


[jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-22 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137271#comment-16137271
 ] 

Matt Post commented on JOSHUA-277:
--

KenLM is not getting built. Did you check out the Getting Started page? 
download-deps.sh downloads and builds KenLM. Likely it is failing.

https://cwiki.apache.org/confluence/display/JOSHUA/Getting+Started


Re: Merging 7.X into master??? + cleaning up branches

2017-07-04 Thread Matt Post
Whether to integrate neural stuff in Joshua is an interesting question. The 
research direction has been to develop fully neural systems that leave behind 
the phrase-based and hierarchical framework entirely. Doing this in Joshua 
would basically require a ground-up rewrite and is probably not worth the time. 
Moses has neural feature functions; for example, you can use a Nematus model as 
a rescore feature (though it breaks dynamic programming). This might be 
reasonable to implement as a project but it would be quite a bit of work and 
introduce GPU requirements that would raise the question of why you'd use 
Joshua if you had a GPU available. I think that it would be better to focus on 
low-resource scenarios and user-focused applications, instead.


> On Jun 29, 2017, at 12:35 PM, Tommaso Teofili <tommaso.teof...@gmail.com> 
> wrote:
> 
> Hi Matt,
> 
> On Thu, Jun 29, 2017 at 05:21, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> This is definitely a good idea. Many of these branches are dead and are
>> unlikely to contain much that can be merged in, and are therefore probably
>> best deleted. The plan for 7 was a big simplification of much of the guts,
>> but with the transition to neural approaches in the research community,
>> this is unlikely to be done unless it finds a new champion.
>> 
> 
> Do you think we should look at NMT in the Joshua project?
> Or is it more that you are more interested in NMT at the moment?
> Or both? :)
> 
> Other than that, let's merge 7 to master and drop the remaining stuff,
> except for the PR for JOSHUA-290 [1], which should be merged into the 7
> branch.
> 
> Regards,
> Tommaso
> 
> [1] : https://github.com/apache/incubator-joshua/pull/71
> 
> 
>> 
>> 
>> 
>> 
>>> On Jun 28, 2017, at 3:43 AM, Tommaso Teofili <tommaso.teof...@gmail.com>
>> wrote:
>>> 
>>> +1 for both cleaning up branches *and* merging 7 branch into master.
>>> 
>>> Regarding branches and Git let me read through the links and I'll share my
>>> opinion.
>>> 
>>> Regards,
>>> Tommaso
>>> 
>>> On Wed, Jun 28, 2017 at 06:41, Chris Mattmann <mattm...@apache.org> wrote:
>>> 
>>>> Hey Team,
>>>> 
>>>> I recommend that Joshua consider adopting the Tika and/or Nutch contribution
>>>> policy RE: branches and Git:
>>>> 
>>>> https://github.com/apache/tika/#contributing-via-github
>>>> https://github.com/apache/nutch/#contributing
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> 
>>>> 
>>>> On 6/27/17, 9:36 PM, "lewis john mcgibbney" <lewi...@apache.org> wrote:
>>>> 
>>>>   Hi Folks,
>>>>   Two things...
>>>> 
>>>>  1. Currently the branches for Joshua are a bit of a mess... it would be
>>>>  better if they were named after JIRA issues such that the mappings back to
>>>>  some concrete development were explicit. Does anyone want to clean these up?
>>>>  2. Now that 6.1-incubating is released and live, Is there any desire to
>>>>  merge 7.X branch into master and continue development there? I was not
>>>>  involved with the 7.X development but it looked like a significant step
>>>>  forward... it would be a shame for that work to stagnate.
>>>> 
>>>>   Thanks,
>>>> 
>>>>   lewis
>>>> 
>>>>   --
>>>>   http://home.apache.org/~lewismc/
>>>>   @hectorMcSpector
>>>>   http://www.linkedin.com/in/lmcgibbney
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 



Re: [ANNOUNCE] - Apache Joshua 6.1 incubating release

2017-06-28 Thread Matt Post
Yes, tighter integration with other Apache projects sounds like a good idea to 
me. Rewriting Thrax to use a more modern tool would also be hugely helpful to 
Joshua in the long term. It is getting harder and harder to find and maintain 
(much less justify) Hadoop clusters that are separate from other research ones.


> On Jun 28, 2017, at 3:42 AM, Tommaso Teofili  
> wrote:
> 
> +1
> 
> Tommaso
> 
> On Wed, Jun 28, 2017 at 07:46, lewis john mcgibbney <
> lewi...@apache.org> wrote:
> 
>> Hi Suneel,
>> I think it's worth opening a JIRA issue and we can possibly mark it for
>> 7.X?
>> lewis
>> 
>> On Tue, Jun 27, 2017 at 9:36 PM, <
>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>> 
>>> 
>>> From: Suneel Marthi 
>>> To: dev@joshua.incubator.apache.org
>>> Cc:
>>> Bcc:
>>> Date: Fri, 23 Jun 2017 01:59:28 -0400
>>> Subject: Re: [ANNOUNCE] - Apache Joshua 6.1 incubating release
>>> Congrats on the release.
>>> 
>>> I have been a silent lurker on this channel since I first heard of Joshua
>>> last September at Amazon, Berlin.
>>> 
>>> Tommaso and I recently gave a talk at Berlin Buzzwords 2017,
>>> 'Embracing Diversity - searching over multiple languages' [1],
>>> using Apache Joshua for machine translation and Apache OpenNLP for
>>> language detection.
>>> 
>>> I have been wondering how much of the present VLPS can be replaced by
>>> OpenNLP with Flink/Beam pipelines.
>>> I gave a talk last week at Hadoop Summit, San Jose, about 'Large Scale Text
>>> processing with Apache OpenNLP and Apache Flink' [2].
>>> 
>>> Also, Thrax, which is presently MapReduce-based, can definitely be
>>> ported over to modern streaming distributed frameworks like Flink/Kafka
>>> Streams/Beam.
>>> 
>>> 
>>> [1]
>>> https://www.youtube.com/watch?v=ZrWxySF-9KY=20=2s;
>>> list=PLq-odUc2x7i-9Nijx-WfoRMoAfHC9XzTt
>>> [2] https://www.slideshare.net/SuneelMarthi/large-scale-text-processing
>>> 
>>> 
>>> 
>> 



Re: Merging 7.X into master??? + cleaning up branches

2017-06-28 Thread Matt Post
This is definitely a good idea. Many of these branches are dead and are 
unlikely to contain much that can be merged in, and are therefore probably best 
deleted. The plan for 7 was a big simplification of much of the guts, but with 
the transition to neural approaches in the research community, this is unlikely 
to be done unless it finds a new champion.




> On Jun 28, 2017, at 3:43 AM, Tommaso Teofili  
> wrote:
> 
> +1 for both cleaning up branches *and* merging 7 branch into master.
> 
> Regarding branches and Git let me read through the links and I'll share my
> opinion.
> 
> Regards,
> Tommaso
> 
> On Wed, Jun 28, 2017 at 06:41, Chris Mattmann wrote:
> 
>> Hey Team,
>> 
>> I recommend that Joshua consider adopting the Tika and/or Nutch contribution
>> policy RE: branches and Git:
>> 
>> https://github.com/apache/tika/#contributing-via-github
>> https://github.com/apache/nutch/#contributing
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> On 6/27/17, 9:36 PM, "lewis john mcgibbney"  wrote:
>> 
>>Hi Folks,
>>Two things...
>> 
>>   1. Currently the branches for Joshua are a bit of a mess... it would be
>>   better if they were named after JIRA issues such that the mappings back to
>>   some concrete development were explicit. Does anyone want to clean these up?
>>   2. Now that 6.1-incubating is released and live, Is there any desire to
>>   merge 7.X branch into master and continue development there? I was not
>>   involved with the 7.X development but it looked like a significant step
>>   forward... it would be a shame for that work to stagnate.
>> 
>>Thanks,
>> 
>>lewis
>> 
>>--
>>http://home.apache.org/~lewismc/
>>@hectorMcSpector
>>http://www.linkedin.com/in/lmcgibbney
>> 
>> 
>> 
>> 



Re: java.lang.UnsatisfiedLinkError: no ken in java.library.path

2017-05-31 Thread Matt Post
Yes, LD_LIBRARY_PATH should also include your system library paths. It looks 
like things are good for you!
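
If it comes up again, here is a minimal sketch for checking it at the terminal, assuming Python 3. The helper find_library is hypothetical (not part of Joshua); it just lists which entries of a search path actually contain a given file:

```python
import os


def find_library(filename, search_path):
    """Return the directories in an os.pathsep-separated path that contain filename."""
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries produced by leading, trailing, or doubled separators.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits
```

Something like find_library("libken.so", os.environ.get("LD_LIBRARY_PATH", "")) returning an empty list would mean that no directory on LD_LIBRARY_PATH contains libken.so.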

matt


> On May 31, 2017, at 12:32 AM, Hoàng Đình Long <long@gmail.com> wrote:
> 
> Hi Matt,
> 
> Thank you very much. That's exactly the reason.
> The tutorial only has a step for setting up Boost, and I had set the
> environment variable LD_LIBRARY_PATH="/usr/include/boost". The libken.so
> file doesn't exist in /usr/include/boost, so I appended the path
> like this:
> 
> LD_LIBRARY_PATH="/usr/include/boost:/home/long/Working/joshua-tutorial/runs/releases/apache-joshua-es-en-2017-05-30/lib"
> 
> Now it works great.
> 
> Is this how it is supposed to work?
> 
> Anyway, thank you!
> I will review the overall picture and see what to do next.
> 
> On Tue, May 30, 2017 at 10:44 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> It looks like it can't find libken.so, so that is probably not in your
>> $LD_LIBRARY_PATH. This is supposed to be set by the "joshua" script, so
>> something must be wrong. I'm not sure what but it shouldn't be too
>> difficult to track down at the terminal.
>> 
>> matt
>> 
>> 
>>> On May 25, 2017, at 10:43 PM, Hoàng Đình Long <long@gmail.com>
>> wrote:
>>> 
>>> Yes, in the language pack/model folder, there are lm.kenlm, grammar.glue
>>> files and a sub folder named grammar.packed
>>> 
>>> On Thu, May 25, 2017 at 5:56 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Is the file model/lm.kenlm in place?
>>>> 
>>>> 
>>>>> On May 24, 2017, at 10:15 PM, Hoàng Đình Long <long@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> In the $JOSHUA home directory (which I built from source code cloned
>> from
>>>>> Github), I found a lib directory which contains only 1 file
>>>>> named libken.so. Then I copy the lib folder into the language pack
>>>> release
>>>>> folder (because the language pack folder doesn't have this sub folder).
>>>>> 
>>>>> The error remains the same when I try to use the language pack:
>>>>> 
>>>>> Exception in thread "main" java.lang.RuntimeException: Unable to
>>>>> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
>>>>> -lm_file model/lm.kenlm'!
>>>>> at
>>>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>>>> Decoder.java:642)
>>>>> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
>>>>> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
>>>>> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>>>> at
>>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(
>>>> NativeConstructorAccessorImpl.java:62)
>>>>> at
>>>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
>>>> DelegatingConstructorAccessorImpl.java:45)
>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>>> at
>>>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>>>> Decoder.java:638)
>>>>> ... 3 more
>>>>> Caused by: org.apache.joshua.decoder.ff.lm.KenLM$KenLMLoadException:
>>>>> java.lang.UnsatisfiedLinkError: no ken in java.library.path
>>>>> at
>>>>> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.
>>>> java:107)
>>>>> at org.apache.joshua.decoder.ff.lm.KenLM.<init>(KenLM.java:58)
>>>>> at
>>>>> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.
>>>> initializeLM(StateMinimizingLanguageModel.java:63)
>>>>> at
>>>>> org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(LanguageModelFF.java:132)
>>>>> at
>>>>> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.<init>(StateMinimizingLanguageModel.java:47)
>>>>> ... 8 more
>>>>> Caused by: java.lang.UnsatisfiedLinkError: no ken in java.library.path
>>>>> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>>>>> at java.lang.Runtime.loadLibrary0(Runtime.java:87

Re: java.lang.UnsatisfiedLinkError: no ken in java.library.path

2017-05-30 Thread Matt Post
It looks like it can't find libken.so, so that is probably not in your 
$LD_LIBRARY_PATH. This is supposed to be set by the "joshua" script, so 
something must be wrong. I'm not sure what but it shouldn't be too difficult to 
track down at the terminal.

matt


> On May 25, 2017, at 10:43 PM, Hoàng Đình Long <long@gmail.com> wrote:
> 
> Yes, in the language pack/model folder, there are lm.kenlm, grammar.glue
> files and a sub folder named grammar.packed
> 
> On Thu, May 25, 2017 at 5:56 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Is the file model/lm.kenlm in place?
>> 
>> 
>>> On May 24, 2017, at 10:15 PM, Hoàng Đình Long <long@gmail.com>
>> wrote:
>>> 
>>> Hello,
>>> 
>>> In the $JOSHUA home directory (which I built from source code cloned from
>>> Github), I found a lib directory which contains only 1 file
>>> named libken.so. Then I copy the lib folder into the language pack
>> release
>>> folder (because the language pack folder doesn't have this sub folder).
>>> 
>>> The error remains the same when I try to use the language pack:
>>> 
>>> Exception in thread "main" java.lang.RuntimeException: Unable to
>>> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
>>> -lm_file model/lm.kenlm'!
>>> at
>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>> Decoder.java:642)
>>> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
>>> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
>>> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(
>> NativeConstructorAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
>> DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>> at
>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>> Decoder.java:638)
>>> ... 3 more
>>> Caused by: org.apache.joshua.decoder.ff.lm.KenLM$KenLMLoadException:
>>> java.lang.UnsatisfiedLinkError: no ken in java.library.path
>>> at
>>> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.
>> java:107)
>>> at org.apache.joshua.decoder.ff.lm.KenLM.<init>(KenLM.java:58)
>>> at
>>> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.
>> initializeLM(StateMinimizingLanguageModel.java:63)
>>> at
>>> org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(LanguageModelFF.java:132)
>>> at
>>> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.<init>(StateMinimizingLanguageModel.java:47)
>>> ... 8 more
>>> Caused by: java.lang.UnsatisfiedLinkError: no ken in java.library.path
>>> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>>> at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>>> at java.lang.System.loadLibrary(System.java:1122)
>>> at
>>> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.
>> java:103)
>>> ... 12 more
>>> 
>>> 
>>> On Wed, May 24, 2017 at 9:06 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> If you are using a language pack, you must have KenLM installed into the
>>>> language pack directory under lib/ since that is where it looks. Can you
>>>> copy libken.so to that directory and see if it works?
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On May 23, 2017, at 10:54 PM, Hoàng Đình Long <long@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I have followed the Fisher call home tutorial and I built 3 models
>> based
>>>> on
>>>>> that document.
>>>>> 
>>>>> Then I use this script to build a language pack based on the model
>>>> number 3:
>>>>> 
>>>>> $JOSHUA/scripts/language-pack/build_lp.sh es-en
>>>> 3/tune/joshua.config.final
>>>>> 4g
>>>>> 
>>>>> The scripts ended well and I created a folder named releases.
>>>>> 
>>>>> In "releases" folder, there is a 

Re: java.lang.UnsatisfiedLinkError: no ken in java.library.path

2017-05-25 Thread Matt Post
Is the file model/lm.kenlm in place?


> On May 24, 2017, at 10:15 PM, Hoàng Đình Long <long@gmail.com> wrote:
> 
> Hello,
> 
> In the $JOSHUA home directory (which I built from source code cloned from
> Github), I found a lib directory which contains only 1 file
> named libken.so. Then I copy the lib folder into the language pack release
> folder (because the language pack folder doesn't have this sub folder).
> 
> The error remains the same when I try to use the language pack:
> 
> Exception in thread "main" java.lang.RuntimeException: Unable to
> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
> -lm_file model/lm.kenlm'!
> at
> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:642)
> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:638)
> ... 3 more
> Caused by: org.apache.joshua.decoder.ff.lm.KenLM$KenLMLoadException:
> java.lang.UnsatisfiedLinkError: no ken in java.library.path
> at
> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.java:107)
> at org.apache.joshua.decoder.ff.lm.KenLM.<init>(KenLM.java:58)
> at
> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.initializeLM(StateMinimizingLanguageModel.java:63)
> at
> org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(LanguageModelFF.java:132)
> at
> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.<init>(StateMinimizingLanguageModel.java:47)
> ... 8 more
> Caused by: java.lang.UnsatisfiedLinkError: no ken in java.library.path
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
> at java.lang.Runtime.loadLibrary0(Runtime.java:870)
> at java.lang.System.loadLibrary(System.java:1122)
> at
> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.java:103)
> ... 12 more
> 
> 
> On Wed, May 24, 2017 at 9:06 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Hi,
>> 
>> If you are using a language pack, you must have KenLM installed into the
>> language pack directory under lib/ since that is where it looks. Can you
>> copy libken.so to that directory and see if it works?
>> 
>> 
>> 
>> 
>>> On May 23, 2017, at 10:54 PM, Hoàng Đình Long <long@gmail.com>
>> wrote:
>>> 
>>> Hello,
>>> 
>>> I have followed the Fisher call home tutorial and I built 3 models based
>> on
>>> that document.
>>> 
>>> Then I use this script to build a language pack based on the model
>> number 3:
>>> 
>>> $JOSHUA/scripts/language-pack/build_lp.sh es-en
>> 3/tune/joshua.config.final
>>> 4g
>>> 
>>> The scripts ended well and I created a folder named releases.
>>> 
>>> In "releases" folder, there is a folder named
>>> "apache-joshua-es-en-2017-05-24". I cd into that folder.
>>> 
>>> I created a file named "example.es" with this sentence inside "común y
>>> corriente" and ran this command:
>>> 
>>> cat example.es | ./prepare.sh | ./joshua > output.en
>>> 
>>> It reported the following error:
>>> 
>>> Exception in thread "main" java.lang.RuntimeException: Unable to
>>> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
>>> -lm_file model/lm.kenlm'!
>>> at
>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>> Decoder.java:642)
>>> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
>>> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
>>> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(
>> NativeConstructorAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
>> DelegatingConstructorAcc

Re: java.lang.UnsatisfiedLinkError: no ken in java.library.path

2017-05-24 Thread Matt Post
Hi,

If you are using a language pack, you must have KenLM installed into the 
language pack directory under lib/ since that is where it looks. Can you copy 
libken.so to that directory and see if it works?




> On May 23, 2017, at 10:54 PM, Hoàng Đình Long  wrote:
> 
> Hello,
> 
> I have followed the Fisher call home tutorial and I built 3 models based on
> that document.
> 
> Then I use this script to build a language pack based on the model number 3:
> 
> $JOSHUA/scripts/language-pack/build_lp.sh es-en 3/tune/joshua.config.final
> 4g
> 
> The scripts ended well and I created a folder named releases.
> 
> In "releases" folder, there is a folder named
> "apache-joshua-es-en-2017-05-24". I cd into that folder.
> 
> I created a file named "example.es" with this sentence inside "común y
> corriente" and ran this command:
> 
> cat example.es | ./prepare.sh | ./joshua > output.en
> 
> It reported the following error:
> 
> Exception in thread "main" java.lang.RuntimeException: Unable to
> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
> -lm_file model/lm.kenlm'!
> at
> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:642)
> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:638)
> ... 3 more
> Caused by: org.apache.joshua.decoder.ff.lm.KenLM$KenLMLoadException:
> java.lang.UnsatisfiedLinkError: no ken in java.library.path
> at
> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.java:107)
> at org.apache.joshua.decoder.ff.lm.KenLM.<init>(KenLM.java:58)
> at
> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.initializeLM(StateMinimizingLanguageModel.java:63)
> at
> org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(LanguageModelFF.java:132)
> at
> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.<init>(StateMinimizingLanguageModel.java:47)
> ... 8 more
> Caused by: java.lang.UnsatisfiedLinkError: no ken in java.library.path
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
> at java.lang.Runtime.loadLibrary0(Runtime.java:870)
> at java.lang.System.loadLibrary(System.java:1122)
> at
> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.java:103)
> 
> 
> I think I have installed KenLM properly. If I hadn't installed it, I
> wouldn't have been able to follow the tutorial and build the language pack,
> would I? Did I miss something here?
> 
> -- 
> _Long HĐi_



Re: ping on RC4 vote

2017-03-31 Thread Matt Post
Yes, I've verified that those don't match, either.

I can't think of a reason that they *shouldn't* match. Tommaso, do you have any 
idea why they're different? Are these two locations out of sync?
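
A mismatch like this can be narrowed down by recomputing the digest of each downloaded tarball and comparing it with the published .md5 or .sha1 text from each location. The sketch below is a minimal Python illustration, not the project's release tooling: file_digest, published_digest, and matches are hypothetical names, and it assumes the checksum file holds either a bare hex digest or a digest followed by a file name.

```python
import hashlib


def file_digest(path: str, algorithm: str = "md5") -> str:
    """Hash a file in chunks and return the lowercase hex digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def published_digest(checksum_text: str) -> str:
    """Extract the hex digest from the contents of a checksum file.

    Assumption: the first whitespace-separated token is the digest
    ("<hash>" or "<hash>  <filename>"). Other layouts are not handled.
    """
    tokens = checksum_text.split()
    if not tokens:
        raise ValueError("checksum file is empty")
    return tokens[0].lower()  # published digests are sometimes upper case


def matches(path: str, checksum_text: str, algorithm: str = "md5") -> bool:
    """True if the file's digest equals the published one."""
    return file_digest(path, algorithm) == published_digest(checksum_text)
```

Running matches() once against the checksum text from the staging repository and once against the text from dist would show which of the two locations, if either, agrees with the tarball actually downloaded.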



> On Mar 29, 2017, at 12:58 PM, Michael A. Hedderich 
>  wrote:
> 
> Hi,
> 
> from my last mail:
> 
> "What does not match for me are the md5 or sha1 of the staging repo with
> those of the source release artifacts. E.g.
> https://repository.apache.org/content/repositories/orgapachejoshua-1005/org/apache/joshua/joshua-incubating/6.1/joshua-incubating-6.1-src.tar.gz.md5
> vs
> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/joshua-incubating-6.1-src.tar.gz.md5"
> 
> If this is the expected behavior, then it's a +1 from me, too.
> 
> Cheers,
> Michael
> 
> 2017-03-29 12:07 GMT-04:00 lewis john mcgibbney :
> 
>> Hi Folks,
>> I would also like to encourage people to take a look and VOTE as soon as
>> possible.
>> I'm in regular contact with some folks over at the Linguistic Data
>> Consortium [0] (as are several of us I'm sure) and they have tentatively
>> agreed to announce our release (should it be done by then) in their next
>> newsletter... which has a wide reader base.
>> 
>> Thank you Tommaso for hanging on here.
>> 
>> To clarify, I'm a +1
>> 
>> [0] https://www.ldc.upenn.edu/
>> 
>> On Wed, Mar 29, 2017 at 8:39 AM, <
>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>> 
>>> 
>>> 
>>> From: Tommaso Teofili 
>>> To: "dev@joshua.incubator.apache.org" 
>>> Cc:
>>> Bcc:
>>> Date: Wed, 29 Mar 2017 15:39:18 +
>>> Subject: Re: ping on RC4 vote
>>> ping
>>> 
>>> 
>> 



Re: [VOTE] Release Apache Joshua 6.1 (Incubating) RC4

2017-03-31 Thread Matt Post
+1

✓ MD5 sums (tar and zip)
✓ includes DISCLAIMER
✓ build from src distribution (zip and tgz): 168 tests run, 31 skipped
✓ verified both GPG signatures

I agree about Michael's earlier point: the file name is 
joshua-incubating-6.1-src.tar.gz but it unpacks to 
apache-joshua-incubating-6.1. This discrepancy is okay for now but should be 
fixed in the future.

(at some point when we're in person we should exchange GPG keys)

matt
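
The naming discrepancy noted above can be checked without unpacking the archive by listing its top-level entries. This is a minimal Python sketch using the standard tarfile module; top_level_entries is a hypothetical helper name, not part of the Joshua release process.

```python
import tarfile
from pathlib import PurePosixPath
from typing import Set


def top_level_entries(archive_path: str) -> Set[str]:
    """Return the distinct first path components of all members in a tar archive."""
    roots: Set[str] = set()
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(archive_path, "r:*") as archive:
        for name in archive.getnames():
            parts = [p for p in PurePosixPath(name).parts if p not in (".", "/")]
            if parts:
                roots.add(parts[0])
    return roots
```

A release tarball that unpacks into a single directory yields a one-element set, which can then be compared by eye (or in a script) with the archive's file name.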

> On Mar 20, 2017, at 9:53 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> Folks — This is still in my queue so let's keep this open.
> 
> matt
> 
> 
>> On Mar 16, 2017, at 8:56 PM, John Hewitt <john...@seas.upenn.edu> wrote:
>> 
>> Lewis is right about the week. Sorry, everyone. This week had a DARPA
>> meeting in Atlanta. I'll get my +/-1 out tomorrow.
>> 
>> -John
>> 
>> On Thu, Mar 16, 2017 at 8:53 PM, Michael A. Hedderich <
>> m...@michael-hedderich.de> wrote:
>> 
>>> Hi,
>>> 
>>> Thanks Tommaso for putting the release together!
>>> 
>>> I was traveling to the US, sorry for the delay from my side.
>>> 
>>> Here is my list:
>>> - build from tag: passed
>>> - build from staging repo (zip and gz): passed
>>> - build from source release artifacts (zip and gz): passed
>>> - md5, sha1 and asc match within the staging repo
>>> - md5 and asc match within the source release artifacts
>>> 
>>> What does not match for me are the md5 or sha1 of the staging repo with
>>> those of the source release artifacts. E.g.
>>> https://repository.apache.org/content/repositories/orgapachejoshua-1005/org/apache/joshua/joshua-incubating/6.1/joshua-incubating-6.1-src.tar.gz.md5
>>> vs
>>> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/joshua-incubating-6.1-src.tar.gz.md5
>>> 
>>> Is this the expected behavior?
>>> 
>>> The link to the check-list that Tom Barber had sent around in the past (
>>> http://incubator.apache.org/guides/releasemanagement.html#check-list) does
>>> not seem to be valid anymore. At least for me the anchor point does not
>>> work and I could not find the check-list on this page or one of its
>>> subpages. Does anyone know if this list still exists? If not, should we put
>>> such a list on the Joshua PPMC Wiki?
>>> 
>>> Regards,
>>> Michael
>>> 
>>> 
>>> 2017-03-16 20:11 GMT-04:00 lewis john mcgibbney <lewi...@apache.org>:
>>> 
>>>> Hi Tommaso,
>>>> It looks like you caught the PPMC on a bad week... we will get the VOTE
>>>> done, don't worry ;)
>>>> Thanks for putting the RC together.
>>>> Comments inline
>>>> 
>>>> On Mon, Mar 13, 2017 at 3:58 PM, <
>>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>>> 
>>>> SIGS look good so do tags and staging repos.
>>>> 
>>>> On primary release src at
>>>> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
>>>> joshua-incubating-6.1-src.tar.gz,
>>>> the compressed archive is called joshua-incubating-6.1-src, when I
>>>> decompress it, it is called apache-joshua-6.1-incubating. This is a minor
>>>> inconsistency which we may wish to address for next incubating release.
>>>> 
>>>> When I build (mvn clean install) I get the following... damn laptop. This
>>>> is the same issue I got when I tried to spin the original RC2 myself. This
>>>> is specific to my environment, so not a blocker.
>>>> 
>>>> [INFO]
>>>> 
>>>> [INFO] BUILD FAILURE
>>>> [INFO]
>>>> 
>>>> [INFO] Total time: 29.351 s
>>>> [INFO] Finished at: 2017-03-16T17:07:16-07:00
>>>> [INFO] Final Memory: 41M/697M
>>>> [INFO]
>>>> 
>>>> [ERROR] Failed to execute goal
>>>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single
>>>> (source-release-assembly) on project joshua-incubating: Execution
>>>> source-release-assembly of goal
>>>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single failed: user
>>>> id
>>>> '498339010' is too big ( > 2097151 ). -> [Help 1]
>>>> [ERROR]
>>>> [ERRO

Re: [VOTE] Release Apache Joshua 6.1 (Incubating) RC4

2017-03-20 Thread Matt Post
Folks — This is still in my queue so let's keep this open.

matt


> On Mar 16, 2017, at 8:56 PM, John Hewitt  wrote:
> 
> Lewis is right about the week. Sorry, everyone. This week had a DARPA
> meeting in Atlanta. I'll get my +/-1 out tomorrow.
> 
> -John
> 
> On Thu, Mar 16, 2017 at 8:53 PM, Michael A. Hedderich <
> m...@michael-hedderich.de> wrote:
> 
>> Hi,
>> 
>> Thanks Tommaso for putting the release together!
>> 
>> I was traveling to the US, sorry for the delay from my side.
>> 
>> Here is my list:
>> - build from tag: passed
>> - build from staging repo (zip and gz): passed
>> - build from source release artifacts (zip and gz): passed
>> - md5, sha1 and asc match within the staging repo
>> - md5 and asc match within the source release artifacts
>> 
>> What does not match for me are the md5 or sha1 of the staging repo with
>> those of the source release artifacts. E.g.
>> https://repository.apache.org/content/repositories/orgapachejoshua-1005/org/apache/joshua/joshua-incubating/6.1/joshua-incubating-6.1-src.tar.gz.md5
>> vs
>> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/joshua-incubating-6.1-src.tar.gz.md5
>> 
>> Is this the expected behavior?
>> 
>> The link to the check-list that Tom Barber had sent around in the past (
>> http://incubator.apache.org/guides/releasemanagement.html#check-list) does
>> not seem to be valid anymore. At least for me the anchor point does not
>> work and I could not find the check-list on this page or one of its
>> subpages. Does anyone know if this list still exists? If not, should we put
>> such a list on the Joshua PPMC Wiki?
>> 
>> Regards,
>> Michael
>> 
>> 
>> 2017-03-16 20:11 GMT-04:00 lewis john mcgibbney :
>> 
>>> Hi Tommaso,
>>> It looks like you caught the PPMC on a bad week... we will get the VOTE
>>> done, don't worry ;)
>>> Thanks for putting the RC together.
>>> Comments inline
>>> 
>>> On Mon, Mar 13, 2017 at 3:58 PM, <
>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>> 
>>> SIGS look good so do tags and staging repos.
>>> 
>>> On primary release src at
>>> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
>>> joshua-incubating-6.1-src.tar.gz,
>>> the compressed archive is called joshua-incubating-6.1-src, when I
>>> decompress it, it is called apache-joshua-6.1-incubating. This is a minor
>>> inconsistency which we may wish to address for next incubating release.
>>> 
>>> When I build (mvn clean install) I get the following... damn laptop. This
>>> is the same issue I got when I tried to spin the original RC2 myself. This
>>> is specific to my environment, so not a blocker.
>>> 
>>> [INFO]
>>> 
>>> [INFO] BUILD FAILURE
>>> [INFO]
>>> 
>>> [INFO] Total time: 29.351 s
>>> [INFO] Finished at: 2017-03-16T17:07:16-07:00
>>> [INFO] Final Memory: 41M/697M
>>> [INFO]
>>> 
>>> [ERROR] Failed to execute goal
>>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single
>>> (source-release-assembly) on project joshua-incubating: Execution
>>> source-release-assembly of goal
>>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single failed: user
>>> id
>>> '498339010' is too big ( > 2097151 ). -> [Help 1]
>>> [ERROR]
>>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>> -e
>>> switch.
>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>>> [ERROR]
>>> [ERROR] For more information about the errors and possible solutions,
>>> please read the following articles:
>>> [ERROR] [Help 1]
>>> http://cwiki.apache.org/confluence/display/MAVEN/
>> PluginExecutionException
>>> 
>>> A mvn clean test results in the following
>>> 
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time: 18.971 s
>>> [INFO] Finished at: 2017-03-16T17:09:25-07:00
>>> [INFO] Final Memory: 34M/608M
>>> [INFO]
>>> 
>>> 
>>> CHANGES, DISCLAIMER, LICENSE, NOTICE and README all look good. DOAP is
>>> slightly out of date, however it reflects the first RC.
>>> 
>>> 
>>> [X] +1, let's get it released!!!
 
>>> 
>>> Thank you Tommaso
>>> 
>>> --
>>> http://home.apache.org/~lewismc/
>>> @hectorMcSpector
>>> http://www.linkedin.com/in/lmcgibbney
>>> 
>> 



Re: Apache Joshua Question

2017-03-09 Thread Matt Post
Hi Guanzhong,

I suggest you look at the demo that is included with Joshua. That contains 
sample Javascript code that shows how to connect to a RESTful instance of 
Joshua. See the README file in this directory:

https://github.com/apache/incubator-joshua/tree/master/demo 

as well as the HTML / JS at

https://github.com/apache/incubator-joshua/blob/master/demo/index.html 
https://github.com/apache/incubator-joshua/blob/master/demo/demo.js 

The page for the RESTful server should also be of some help (I assume you have 
seen this):


https://cwiki.apache.org/confluence/display/JOSHUA/RESTful+API?src=contextnavpagetreemode
 

Sincerely,
Matt Post


> On Mar 9, 2017, at 9:58 AM, Guanzhong Wang <guanzhong.wang.w...@gmail.com> 
> wrote:
> 
> Dear Joshua Development Team,
> I am a web developer working in the DC area. In recent years I have become
> very interested in NLP, and I noticed that you just released the Apache
> Joshua REST API doc on the website. Recently I have been trying to integrate
> Joshua into my Java-based web service, but haven't made any progress. I have
> no idea how to pass args to Joshua, because I can't find any Joshua Java API
> or lib. I see Joshua can run a script to bring up an HTTP server, but instead
> of that, do you think I can integrate it into my own HTTP server? Any ideas?
> 
> 
> Thanks.



Re: [jira] [Commented] (JOSHUA-331) Address Apache Joshua 6.1 RC#3 Issues

2017-03-08 Thread Matt Post
Hi Tommaso,

I'm afraid I'm not at all familiar with the release process and am not sure 
what to do here. Can you simply retrace these steps and do it again correctly?

matt


> On Mar 7, 2017, at 8:31 AM, Tommaso Teofili (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899440#comment-15899440
>  ] 
> 
> Tommaso Teofili commented on JOSHUA-331:
> 
> 
> now all seems fine to me, one concern I have is that for RC3 I didn't have 
> .MD5 checksums in my target directory after having done _mvn release:prepare_ 
> and _mvn release:perform_ and therefore I took the ones from the staging repo 
> and copied them to _/dist_ assuming that they got generated using my key, 
> which of course was not the case.
> How should we proceed there ?
> 
> 
>> Address Apache Joshua 6.1 RC#3 Issues
>> -
>> 
>>Key: JOSHUA-331
>>URL: https://issues.apache.org/jira/browse/JOSHUA-331
>>Project: Joshua
>> Issue Type: Task
>> Components: release
>>   Reporter: Tommaso Teofili
>>   Assignee: Tommaso Teofili
>>Fix For: 6.1
>> 
>> 
>> Address the following issues:
>> {quote}
>> Every ASF release MUST contain one or more source packages, which MUST be
>> sufficient for a user to build and test the release provided they have
>> access to the appropriate platform and tools. - NO
>>-Not building due to failing test (BerkeleyLM failure).  I'm digging a
>> bit more into this.
>> {quote}
>> {quote}
>> Every artifact distributed to the public through Apache channels MUST be
>> accompanied by one file containing an OpenPGP compatible ASCII armored
>> detached signature and another file containing an MD5 checksum.
>>- .asc - NO
>>I get warning:
>>"gpg --verify joshua-incubating-6.1-src.tar.gz.asc
>> joshua-incubating-6.1-src.tar.gz
>>gpg: Signature made Thu Feb 23 09:15:17 2017 CET using RSA key ID
>> 891768A5
>>gpg: Good signature from "Tommaso Teofili "
>> [unknown]
>>gpg: WARNING: This key is not certified with a trusted signature!
>>gpg:  There is no indication that the signature belongs to the
>> owner."
>>- .md5 - NO
>>My md5 of joshua-incubating-6.1-src.tar.gz is
>> 504976876b01294811293aa45b5400f5, the joshua-incubating-6.1-src.tar.gz.md5
>> indicates it should be 22b738eeae45757715080702a5bd2789
>>- .sha - NO
>>My sha of joshua-incubating-6.1-src.tar.gz is
>> 4AB5BA24301590F36AE6452DACC3F21CBD8B3FEC, the
>> joshua-incubating-6.1-src.tar.gz.md5 indicates it should be
>> 2a55b6d341dddc5369b22a4802a86ec40accd0a1
>>- KEYS - YES
>> {quote}
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.15#6346)



Re: Dockerhub hosted images

2017-03-07 Thread Matt Post
FYI, I stress-tested the Joshua server with the following protocol: for both 
the TCP and HTTP servers, I started a six-thread server, and then sent five 
simultaneous 16k documents at each. The translation times were as follows:

TCP: (times: 8:07 8:06 8:06)

for x in 1 2 3 4; do for num in $(seq 1 5); do cat corpus.es | nc localhost 5674 > t.tcp.$num & done; time wait; done

HTTP: (times: 7:25 7:34 7:20)

for x in 1 2 3 4; do for num in $(seq 1 5); do /home/hltcoe/mpost/code/joshua/scripts/support/query_http.py -s localhost -p 5674 corpus.es > t.out.$num & done; time wait; done

The HTTP query takes 100 lines of the test set at a time, constructs the 
RESTful query string (with 100 url-encoded "q=..." lines), and sends it to the 
server.
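
The batching described above can be sketched as follows. This is a minimal Python illustration, not the actual query_http.py: the helper names batches and query_string are assumptions, and only the "q=" parameter mentioned in this message is used. How the resulting string is attached to a request (path, method, any other parameters) is left out because it is not stated here.

```python
from typing import Iterable, Iterator, List
from urllib.parse import urlencode


def batches(lines: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Group input lines into lists of at most `size` sentences."""
    batch: List[str] = []
    for line in lines:
        batch.append(line.rstrip("\n"))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def query_string(sentences: List[str]) -> str:
    """Build a query string with one url-encoded q= parameter per sentence."""
    return urlencode([("q", sentence) for sentence in sentences])
```

For a test set read from a file, "for group in batches(open('corpus.es', encoding='utf-8')): print(query_string(group))" would print one query string per 100 lines, each containing up to 100 "q=" parameters.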

So the bottom line is that the HTTP server both has an extended 
Google-translate API (which also supports other things like adding rules) and 
is a bit faster.

I'm documenting the RESTful API here: 
https://cwiki.apache.org/confluence/display/JOSHUA/RESTful+API

matt


> On Mar 3, 2017, at 11:24 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> Folks,
> 
> I've updated the code with a few changes that will support Dockerized 
> language packs. The nice thing is that this makes it easy to include KenLM.
> 
> Here are some changes that were made:
> 
> - Joshua now notes what directory the config file was found in and loads 
> relative paths found in the config file relative to that directory 
> automatically. This means you don't have to "cd" to the LP (language pack) 
> directory before running Joshua.
> 
> - I fixed the HTTP server to take multiple "q=" lines, just like the Google 
> translate API. Before, they only took one "q=" line. This should mean (I'll 
> test later today) that the HTTP server can handle throughput essentially at 
> the rates of the TCP server.
> 
> - I added (but haven't pushed yet) the KenLM model files to the language 
> packs. In addition, I added a file "joshua.config.kenlm". These are not used 
> except by Docker.
> 
> - I fixed the docker setup. See the new file:
> 
>   https://github.com/apache/incubator-joshua/blob/master/distribution/docker/kenlm/Dockerfile
> 
> This docker container builds KenLM. It then expects to be run with docker 
> mounting an existing language pack to /model. It then runs the 
> joshua.config.kenlm file, running it as a server in HTTP mode. See the README 
> file for information:
> 
>   https://github.com/apache/incubator-joshua/tree/master/distribution/docker/kenlm
> 
> If anyone wants to test this out, please do. You can grab an updated language 
> pack (version 3) here:
> 
>   http://cs.jhu.edu/~post/language-packs/apache-joshua-es-en-2017-03-03.tgz
> 
> (Warning: 9 GB)
> 
> matt
> 
> 
>> On Nov 23, 2016, at 10:14 AM, kellen sunderland 
>> <kellen.sunderl...@gmail.com> wrote:
>> 
>> Yeah it should just be 'docker pull kellens/apache-joshua-es-en-2016-10-05'
>> then 'docker run -it kellens/apache-joshua-es-en-2016-10-05 /bin/bash' or
>> something similar.  I think the default command should eventually be to run
>> the http server, so ideally we'd just do 'docker run -p 5674
>> kellens/apache-joshua-es-en-2016-10-05' and that would start up the http
>> server on port 5674.
>> 
>> Good point on Perl + Python, I can add them.
>> 
>> -Kellen
>> 
>> On Wed, Nov 23, 2016 at 3:22 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> Okay, I have this with
>>> 
>>>   docker run -it kellens/apache-joshua-es-en-2016-10-05 bash
>>> 
>>> It seems we are missing Perl (./prepare.sh fails), and we should replace
>>> the LanguageModel line with a KenLM instance and build that. I bet we'll
>>> need Python, too.
>>> 
>>> 
>>> 
>>> 
>>>> On Nov 23, 2016, at 8:15 AM, Matt Post <p...@cs.jhu.edu> wrote:
>>>> 
>>>> Kellen, can I bother you to post a few first steps? I've successfully
>>> pulled this down to my mac but now do not know how to find it, edit it, or
>>> run it. I'm poring through the documentation and will find it eventually
>>> but this would save me a bit of time.
>>>> 
>>>> 
>>>>> On Nov 23, 2016, at 8:07 AM, kellen sunderland <
>>> kellen.sunderl...@gmail.c

[jira] [Commented] (JOSHUA-331) Address Apache Joshua 6.1 RC#3 Issues

2017-03-06 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898102#comment-15898102
 ] 

Matt Post commented on JOSHUA-331:
--

All four files should be variations of each other. Alas, they are not.

lm.berkeleylm.gz uncompressed is lm.berkeleylm, so that's good.

lm.gz, unfortunately, when uncompressed, is missing a line from "lm". However, 
recompressing "lm" (with its additional line) into lm.gz results in no changes 
to the tests. 

However, I regenerated all the files from the "lm" file (with its additional 
 line, which is crucial for one of the tests). This was done in the 
following manner:

cat lm | gzip -9n > lm.gz
$JOSHUA/scripts/lm/compile_berkeleylm.py lm lm.berkeleylm
cat lm.berkeleylm | gzip -9n > lm.berkeleylm.gz

Running "mvn test" then succeeds, so I have done this and committed and pushed.
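
The -n flag matters here because it keeps the original file name and timestamp out of the gzip header, so regenerating the archive from an unchanged "lm" gives the same bytes each time. A small Python sketch of the same idea follows. It is not part of the Joshua scripts, the function name is made up, and the output is not guaranteed to be byte-identical to what the gzip command-line tool writes.

```python
import gzip


def gzip_reproducibly(data: bytes) -> bytes:
    """Compress at level 9 with a zeroed header timestamp, similar in spirit to gzip -9n."""
    return gzip.compress(data, compresslevel=9, mtime=0)


if __name__ == "__main__":
    # Hypothetical stand-in for the contents of the "lm" file.
    payload = b"\\data\\\nngram 1=3\n"
    first = gzip_reproducibly(payload)
    second = gzip_reproducibly(payload)
    print(first == second)
```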

All of these tests are important because they exercise BerkeleyLM's ability to 
read and properly recognize its different supported files.

Now, compile_berkeleylm.py is a fairly simple wrapper around a Java call. So it 
would not be difficult to modify the code and distribute only the 
human-readable "lm" file.

Thoughts?

> Address Apache Joshua 6.1 RC#3 Issues
> -
>
> Key: JOSHUA-331
> URL: https://issues.apache.org/jira/browse/JOSHUA-331
> Project: Joshua
>  Issue Type: Task
>  Components: release
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 6.1
>
>
> Address the following issues:
> {quote}
> Every ASF release MUST contain one or more source packages, which MUST be
> sufficient for a user to build and test the release provided they have
> access to the appropriate platform and tools. - NO
> -Not building due to failing test (BerkleyLM failure).  I'm digging a
> bit more into this.
> {quote}
> {quote}
> Every artifact distributed to the public through Apache channels MUST be
> accompanied by one file containing an OpenPGP compatible ASCII armored
> detached signature and another file containing an MD5 checksum.
> - .asc - NO
> I get warning:
> "gpg --verify joshua-incubating-6.1-src.tar.gz.asc
> joshua-incubating-6.1-src.tar.gz
> gpg: Signature made Thu Feb 23 09:15:17 2017 CET using RSA key ID
> 891768A5
> gpg: Good signature from "Tommaso Teofili <tomm...@apache.org>"
> [unknown]
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg:  There is no indication that the signature belongs to the
> owner."
> - .md5 - NO
> My md5 of joshua-incubating-6.1-src.tar.gz is
> 504976876b01294811293aa45b5400f5, the joshua-incubating-6.1-src.tar.gz.md5
> indicates it should be 22b738eeae45757715080702a5bd2789
> - .sha - NO
> My sha of joshua-incubating-6.1-src.tar.gz is
> 4AB5BA24301590F36AE6452DACC3F21CBD8B3FEC, the
> joshua-incubating-6.1-src.tar.gz.md5 indicates it should be
> 2a55b6d341dddc5369b22a4802a86ec40accd0a1
> - KEYS - YES
> {quote}
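
For anyone re-checking the digests quoted above, a sketch along the following lines may help. It hashes a file in chunks and compares the result against a published value, ignoring case and surrounding whitespace (the computed SHA above is upper-case while the published one is lower-case, and both are 40 hex characters, the length of a SHA-1 digest). The function names are made up for this example.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the lower-case hex digest of a file, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches(path, expected_hex, algorithm="md5"):
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

A call such as matches("joshua-incubating-6.1-src.tar.gz", published_value, "sha1") would then report whether the downloaded artifact agrees with the published checksum.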





Re: [VOTE] Release Apache Joshua 6.1 (Incubating)

2017-03-04 Thread Matt Post
Tommaso,

What's your timeline for fixing this? I just pushed in some changes that add 
docker support and provide multithreading for the HTTP server. It would be nice 
to include those, BUT if it's a lot of extra work, we can just add them later 
(or you could point me to the doc you followed, and I'll do it on Monday)

matt


> On Mar 1, 2017, at 1:09 PM, John Hewitt <john...@seas.upenn.edu> wrote:
> 
> Tommaso, thanks for the RC.
> Kellen, thanks for checking for the -1.
> 
> -John
> 
> On Wed, Mar 1, 2017 at 1:03 PM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> 
>> For a short term fix for the unit test we can delete lines 48 and 50 from
>> LMGrammarBerkeleyTest.java.
>> 
>> A bit of a longer term solution would be that we could have a @BeforeClass
>> setup method that simply zips the uncompressed files.
>> 
>> Thanks again for putting this together Tommaso.
>> 
>> 
>> On Wed, Mar 1, 2017 at 6:43 PM, Tommaso Teofili <tommaso.teof...@gmail.com
>>> 
>> wrote:
>> 
>>> thanks Kellen,
>>> 
>>> I get the very same issues.
>>> It's probably my fault having copied .md5 and .sha files from the staging
>>> repo as I didn't have them within my target directory.
>>> I also get the same test failure.
>>> 
>>> Hence -1 from me too.
>>> I'll roll it back, fix the issues and create RC4.
>>> 
>>> Regards,
>>> Tommaso
>>> 
>>> 
>>> 
>>> Il giorno mer 1 mar 2017 alle ore 17:54 kellen sunderland <
>>> kellen.sunderl...@gmail.com> ha scritto:
>>> 
>>>> I have to -1 this release for the time being.  For me the signatures
>> and
>>>> hashes don't seem to match the binaries downloaded.  Could you double
>>> check
>>>> that they match for you Tommaso?  I'm also getting a unit test that
>> fails
>>>> when I run 'mvn clean package'.  I'm digging a little more into this
>> one,
>>>> but suspect a missing file.
>>>> 
>>>> 
>>>> 
>>>> Here's what I've checked so far:
>>>> 
>>>> Release artifacts must include incubating in the final file name - YES
>>>> Release artifacts must include a disclaimer within the release
>>> artifact(s)
>>>> as noted - YES
>>>> Every ASF release MUST contain one or more source packages, which MUST
>> be
>>>> sufficient for a user to build and test the release provided they have
>>>> access to the appropriate platform and tools. - NO
>>>>-Not building due to failing test (BerkleyLM failure).  I'm
>> digging a
>>>> bit more into this.
>>>> 
>>>> Every artifact distributed to the public through Apache channels MUST
>> be
>>>> accompanied by one file containing an OpenPGP compatible ASCII armored
>>>> detached signature and another file containing an MD5 checksum.
>>>>- .asc - NO
>>>>I get warning:
>>>>"gpg --verify joshua-incubating-6.1-src.tar.gz.asc
>>>> joshua-incubating-6.1-src.tar.gz
>>>>gpg: Signature made Thu Feb 23 09:15:17 2017 CET using RSA key ID
>>>> 891768A5
>>>>gpg: Good signature from "Tommaso Teofili <tomm...@apache.org>"
>>>> [unknown]
>>>>gpg: WARNING: This key is not certified with a trusted signature!
>>>>gpg:  There is no indication that the signature belongs to
>>> the
>>>> owner."
>>>>- .md5 - NO
>>>>My md5 of joshua-incubating-6.1-src.tar.gz is
>>>> 504976876b01294811293aa45b5400f5, the joshua-incubating-6.1-src.tar.
>>> gz.md5
>>>> indicates it should be 22b738eeae45757715080702a5bd2789
>>>>- .sha - NO
>>>>My sha of joshua-incubating-6.1-src.tar.gz is
>>>> 4AB5BA24301590F36AE6452DACC3F21CBD8B3FEC, the
>>>> joshua-incubating-6.1-src.tar.gz.md5 indicates it should be
>>>> 2a55b6d341dddc5369b22a4802a86ec40accd0a1
>>>>- KEYS - YES
>>>> 
>>>> On Mon, Feb 27, 2017 at 3:55 AM, Matt Post <p...@cs.jhu.edu> wrote:
>>>> 
>>>>> Hi folks,
>>>>> 
>>>>> First, Tommaso, thank you for pulling this together!
>>>>> 
>>>>> I want to remind everyone that there's a checklist to go through
>> before
>>>>> sending your +1. Here's from an email from Tom Barber a while back:

Re: Dockerhub hosted images

2017-03-03 Thread Matt Post
Folks,

I've updated the code with a few changes that will support Dockerized language 
packs. The nice thing is that this makes it easy to include KenLM.

Here are some changes that were made:

- Joshua now notes what directory the config file was found in and loads 
relative paths found in the config file relative to that directory 
automatically. This means you don't have to "cd" to the LP (language pack) 
directory before running Joshua.

- I fixed the HTTP server to take multiple "q=" lines, just like the Google 
Translate API. Before, it only took one "q=" line. This should mean (I'll 
test later today) that the HTTP server can handle throughput essentially at the 
rates of the TCP server.

- I added (but haven't pushed yet) the KenLM model files to the language packs. 
In addition, I added a file "joshua.config.kenlm". These are not used except by 
Docker.

- I fixed the docker setup. See the new file:


https://github.com/apache/incubator-joshua/blob/master/distribution/docker/kenlm/Dockerfile

This docker container builds KenLM. It then expects to be run with docker 
mounting an existing language pack to /model. It then runs the 
joshua.config.kenlm file, running it as a server in HTTP mode. See the README 
file for information:


https://github.com/apache/incubator-joshua/tree/master/distribution/docker/kenlm
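
The relative-path change in the first item above amounts to something like the following. This is a Python illustration of the idea only, not the Joshua (Java) code, and the function and file names are made up for the example.

```python
import os


def resolve_config_path(config_file, path):
    """Resolve a path from the config relative to the config file's directory."""
    if os.path.isabs(path):
        # Absolute paths are used as given.
        return path
    config_dir = os.path.dirname(os.path.abspath(config_file))
    return os.path.normpath(os.path.join(config_dir, path))


if __name__ == "__main__":
    # Hypothetical model file name inside a language pack mounted at /model.
    print(resolve_config_path("/model/joshua.config.kenlm", "lm.kenlm"))
```

With this kind of resolution the result does not depend on the current working directory, which is why the "cd" into the language pack directory is no longer needed.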

If anyone wants to test this out, please do. You can grab an updated language 
pack (version 3) here:


http://cs.jhu.edu/~post/language-packs/apache-joshua-es-en-2017-03-03.tgz

(Warning: 9 GB)

matt


> On Nov 23, 2016, at 10:14 AM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> Yeah it should just be 'docker pull kellens/apache-joshua-es-en-2016-10-05'
> then 'docker run -it kellens/apache-joshua-es-en-2016-10-05 /bin/bash' or
> something similar.  I think the default command should eventually be to run
> the http server, so ideally we'd just do 'docker run -p 5674
> kellens/apache-joshua-es-en-2016-10-05' and that would start up the http
> server on port 5674.
> 
> Good point on Perl + Python, I can add them.
> 
> -Kellen
> 
> On Wed, Nov 23, 2016 at 3:22 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Okay, I have this with
>> 
>>docker run -it kellens/apache-joshua-es-en-2016-10-05 bash
>> 
>> It seems we are missing Perl (./prepare.sh fails), and we should replace
>> the LanguageModel line with a KenLM instance and build that. I bet we'll
>> need Python, too.
>> 
>> 
>> 
>> 
>>> On Nov 23, 2016, at 8:15 AM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>> Kellen, can I bother you to post a few first steps? I've successfully
>> pulled this down to my mac but now do not know how to find it, edit it, or
>> run it. I'm poring through the documentation and will find it eventually
>> but this would save me a bit of time.
>>> 
>>> 
>>>> On Nov 23, 2016, at 8:07 AM, kellen sunderland <
>> kellen.sunderl...@gmail.com> wrote:
>>>> 
>>>> Yes my next step was going to be getting it hosted officially.
>>>> 
>>>> I'll go ahead and open a ticket.  I think I'll hold off on pushing to
>> the
>>>> Apache account until I've done a little more testing though.
>>>> 
>>>> On Nov 23, 2016 5:22 AM, "lewis john mcgibbney" <lewi...@apache.org>
>> wrote:
>>>> 
>>>>> Hi Kellen,
>>>>> Nice :)
>>>>> Another option is for us to host these via the Apache account.
>>>>> https://hub.docker.com/r/apache/
>>>>> We could then add a badge to our README which points to the
>> Dockerfile(s).
>>>>> Do you want to open a ticket over on the INFRA Jira for this?
>>>>> 
>>>>> On Tue, Nov 22, 2016 at 1:57 PM, <
>>>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>>>> 
>>>>>> From: kellen sunderland <kellen.sunderl...@gmail.com>
>>>>>> To: "dev@joshua.incubator.apache.org" <dev@joshua.incubator.apache.
>> org>
>>>>>> Cc:
>>>>>> Date: Tue, 22 Nov 2016 22:56:56 +0100
>>>>>> Subject: Re: Dockerhub hosted images
>>>>>> Ok, the first image should be properly uploaded now.
>>>>>> 
>>>>>> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
>>>>>> 
>>>>>> -Kellen
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> 
>> 



Re: [VOTE] Release Apache Joshua 6.1 (Incubating)

2017-02-26 Thread Matt Post
Hi folks,

First, Tommaso, thank you for pulling this together!

I want to remind everyone that there's a checklist to go through before sending 
your +1. Here's from an email from Tom Barber a while back:

> Hello folks,
> 
> I see plenty of +1's going through the release vote,  which is great to see
> people taking an active role in getting the release shipped.
> 
> For those of you who are new to the ASF there are a bunch of requirements
> to sign off for a release which you can find here:
> 
> http://incubator.apache.org/guides/releasemanagement.html#check-list 
> 
> 
> My current concern is that people who are new to the incubator are +1'ing
> software for release without checking all or part of the release cycle. Whilst
> not mandatory, when you +1 a release please can you try to indicate what
> you've checked. The reason for this is, the tag Lewis has built off isn't
> the tip of master, so if you're basing your +1 on your day to day
> development and knowledge of the code base, that's not always what's
> shipped. Also in the branching process, it's possible merges or alterations
> were accidentally made that Lewis has missed (this is very unlikely I know
> but you know, code changes). Also people build software on different OS's,
> versions of OS's etc so just because it builds on Lewis's laptop doesn't
> mean it builds on mine, for example.
> 
> Also regarding licenses, disclaimers etc, people notice different things or
> interpret stuff differently. It's always possible that someone might miss a
> library etc so it's important multiple eyes run over the same stuff.
> 
> Cheers,
> 
> Tom

I'm hoping I'll have time to go through this tomorrow.

matt



> On Feb 25, 2017, at 2:41 AM, Tommaso Teofili  
> wrote:
> 
> Hi Folks,
> Please VOTE on the Apache Joshua 6.1 Release Candidate #3.
> 
> We solved 36 issues:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319720&version=12335049
> 
> Git source tag (3447715b3aa0a48ed79465d80618bd5a2f7a7558):
> https://s.apache.org/XIxJ
> 
> Staging repo:
> https://repository.apache.org/content/repositories/orgapachejoshua-1004
> 
> Source Release Artifacts:
> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
> 
> PGP release keys (signed using 891768A5):
> https://git1-us-west.apache.org/repos/asf?p=incubator-joshua.git;a=blob_plain;f=KEYS;h=aa18365bf5c8c8fb17b084f783a75c3a2460a98d;hb=HEAD
> 
> Vote will be open for 72 hours.
> Thank you to everyone that is able to VOTE as well as everyone that
> contributed to Apache Joshua 6.1.
> 
> [ ] +1, let's get it released!!!
> [ ] +/-0, fine, but consider to fix few issues before...
> [ ] -1, nope, because... (and please explain why)
> 
> Regards,
> Tommaso



Re: Cutting RC3

2017-02-23 Thread Matt Post
Thank you for heading this up, Tommaso! I'll be able to catch up on this after 
today.

matt


> On Feb 23, 2017, at 3:06 AM, Tommaso Teofili  
> wrote:
> 
> probably because of the mentioned network issues the artifacts ended up in
> two separate staging repositories in Nexus, which is undesired.
> I'll drop those repos, rollback the changes on the pom, delete the current
> tag in git and perform again mvn release:prepare / perform today.
> 
> Regards,
> Tommaso
> 
> Il giorno mer 22 feb 2017 alle ore 16:39 Tommaso Teofili <
> tommaso.teof...@gmail.com> ha scritto:
> 
>> Hi all,
>> 
>> Maven is in the extremely slow (because of my bandwidth) process of
>> deploying stuff on Nexus as part of the mvn release:perform phase.
>> In the meantime perhaps is a good idea not to commit to the master branch,
>> until we get the RC3 voted and hence approved / rejected.
>> 
>> Thanks and regards,
>> Tommaso
>> 



problems with BerkeleyLM

2017-02-01 Thread Matt Post
Hi folks,

I've found some problems with BerkeleyLM. I haven't diagnosed it yet, and am 
not going to have time for a week or two at least, but thought I'd bring it to 
everyone's attention because this affects our no-external-dependency releases.

As for the solution, in addition to trying to track down this problem, I've 
been working on a docker solution for helping people easily add KenLM to the 
language packs.

The problem can be seen in the following. I trained an English-German model 
using the state-minimizing KenLM (KenLM/Full). You can see the BLEU scores on a 
number of test sets below. If I then swap out the StateMinimizingLanguageModel 
for a regular LanguageModel that still uses KenLM as its representation 
(KenLM/LM), I get a drop, as expected. If I then swap out KenLM for BerkeleyLM, 
I get a further huge drop.

I wouldn't expect this large a drop in either situation, but the BerkeleyLM 
one is especially troubling.

Anyway, troubleshooting is forthcoming, but I am sharing this in case anyone is 
using BerkeleyLM somewhere.

matt

---
news-test2008
KenLM/Full:   => BLEU = 0.1464
KenLM/LM:     => BLEU = 0.1168
BerkeleyLM:   => BLEU = 0.0800

newstest2008-14.de-en
KenLM/Full:   => BLEU = 0.1524
KenLM/LM:     => BLEU = 0.1235
BerkeleyLM:   => BLEU = 0.0839

newstest2009
KenLM/Full:   => BLEU = 0.1372
KenLM/LM:     => BLEU = 0.1113
BerkeleyLM:   => BLEU = 0.0793

newstest2010
KenLM/Full:   => BLEU = 0.1487
KenLM/LM:     => BLEU = 0.1213
BerkeleyLM:   => BLEU = 0.0847

newstest2011
KenLM/Full:   => BLEU = 0.1473
KenLM/LM:     => BLEU = 0.1192
BerkeleyLM:   => BLEU = 0.0826

newstest2012
KenLM/Full:   => BLEU = 0.1488
KenLM/LM:     => BLEU = 0.1205
BerkeleyLM:   => BLEU = 0.0797

newstest2013
KenLM/Full:   => BLEU = 0.1692
KenLM/LM:     => BLEU = 0.1391
BerkeleyLM:   => BLEU = 0.0923

newstest2014.de-en
KenLM/Full:   => BLEU = 0.1669
KenLM/LM:     => BLEU = 0.1351
BerkeleyLM:   => BLEU = 0.0881

newstest2016.de-en
KenLM/Full:   => BLEU = 0.2177
KenLM/LM:     => BLEU = 0.1724
BerkeleyLM:   => BLEU = 0.1117
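
To put the two drops on a comparable footing, the scores above can be turned into relative drops per test set. The following is just a convenience sketch over the numbers listed above; the variable and function names are made up for the example.

```python
# BLEU scores copied from the list above: (KenLM/Full, KenLM/LM, BerkeleyLM).
SCORES = {
    "news-test2008": (0.1464, 0.1168, 0.0800),
    "newstest2008-14.de-en": (0.1524, 0.1235, 0.0839),
    "newstest2009": (0.1372, 0.1113, 0.0793),
    "newstest2010": (0.1487, 0.1213, 0.0847),
    "newstest2011": (0.1473, 0.1192, 0.0826),
    "newstest2012": (0.1488, 0.1205, 0.0797),
    "newstest2013": (0.1692, 0.1391, 0.0923),
    "newstest2014.de-en": (0.1669, 0.1351, 0.0881),
    "newstest2016.de-en": (0.2177, 0.1724, 0.1117),
}


def relative_drop(before, after):
    """Fraction of the 'before' score that is lost when moving to 'after'."""
    return (before - after) / before


if __name__ == "__main__":
    for name, (full, lm, berkeley) in SCORES.items():
        print(f"{name:24s} Full->LM {relative_drop(full, lm):6.1%}   "
              f"LM->BerkeleyLM {relative_drop(lm, berkeley):6.1%}")
```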

Re: Podling Report Reminder - February 2017

2017-02-01 Thread Matt Post
Folks,

I added the Joshua report.

https://wiki.apache.org/incubator/February2017 


It is due today. Feel free to make comments or initiate discussion here but 
otherwise what's there is what will be sent.

matt


> On Jan 25, 2017, at 7:21 PM, johndam...@apache.org wrote:
> 
> Dear podling,
> 
> This email was sent by an automated system on behalf of the Apache
> Incubator PMC. It is an initial reminder to give you plenty of time to
> prepare your quarterly board report.
> 
> The board meeting is scheduled for Wed, 22 February 2017, 10:30 am PDT.
> The report for your podling will form a part of the Incubator PMC
> report. The Incubator PMC requires your report to be submitted 2 weeks
> before the board meeting, to allow sufficient time for review and
> submission (Wed, February 08).
> 
> Please submit your report with sufficient time to allow the Incubator
> PMC, and subsequently board members to review and digest. Again, the
> very latest you should submit your report is 2 weeks prior to the board
> meeting.
> 
> Thanks,
> 
> The Apache Incubator PMC
> 
> Submitting your Report
> 
> --
> 
> Your report should contain the following:
> 
> *   Your project name
> *   A brief description of your project, which assumes no knowledge of
>the project or necessarily of its field
> *   A list of the three most important issues to address in the move
>towards graduation.
> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>aware of
> *   How has the community developed since the last report
> *   How has the project developed since the last report.
> 
> This should be appended to the Incubator Wiki page at:
> 
> https://wiki.apache.org/incubator/February2017
> 
> Note: This is manually populated. You may need to wait a little before
> this page is created from a template.
> 
> Mentors
> ---
> 
> Mentors should review reports for their project(s) and sign them off on
> the Incubator wiki page. Signing off reports shows that you are
> following the project - projects that are not signed may raise alarms
> for the Incubator PMC.
> 
> Incubator PMC



[jira] [Commented] (JOSHUA-329) A suspicious use of incrementer in for statement

2017-01-31 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847740#comment-15847740
 ] 

Matt Post commented on JOSHUA-329:
--

I think you are correct. Thanks for pointing this out!

> A suspicious use of incrementer in for statement
> 
>
> Key: JOSHUA-329
> URL: https://issues.apache.org/jira/browse/JOSHUA-329
> Project: Joshua
>  Issue Type: Bug
>Reporter: Jaechang Nam
>Priority: Trivial
>
> In a recent snapshot of the github mirror, I've found a suspicious 
> incrementer in 
> src/main/java/org/apache/joshua/decoder/ff/lm/LanguageModelFF.java.
> {code:java}
> 269   for (int i = 0; i < tokens.length; i++) {
> 270 if (tokens[i] > 0) { // skip nonterminals
> 271   for (int j = 0; j < alignments.length; j += 2) {
> 272 if (alignments[j] == i) {
> 273   String annotation = 
> sentence.getAnnotation((int)alignments[i] + begin, "class");
> 274   if (annotation != null) {
> 275 //System.err.println(String.format("  
> word %d source %d abs %d annotation %d/%s",
> 276 //i, alignments[i], alignments[i] + 
> begin, annotation, Vocabulary.word(annotation)));
> 277 tokens[i] = Vocabulary.id(annotation);
> 278 break;
> 279   }
> 280 }
> 281   }
> 282 }
> 283   }
> {code}
> In Line 273, alignments[i] should be alignments[j] if tokens.length is not 
> same as alignments.length? Since I don't have domain knowledge, this may not 
> be correct but just wanted to report this in case.





Re: Podling Report Reminder - February 2017

2017-01-30 Thread Matt Post
Folks — I'll take care of this next week, after February 6.

matt

> On Jan 30, 2017, at 10:18 PM, johndam...@apache.org wrote:
> 
> Dear podling,
> 
> This email was sent by an automated system on behalf of the Apache
> Incubator PMC. It is an initial reminder to give you plenty of time to
> prepare your quarterly board report.
> 
> The board meeting is scheduled for Wed, 22 February 2017, 10:30 am PDT.
> The report for your podling will form a part of the Incubator PMC
> report. The Incubator PMC requires your report to be submitted 2 weeks
> before the board meeting, to allow sufficient time for review and
> submission (Wed, February 08).
> 
> Please submit your report with sufficient time to allow the Incubator
> PMC, and subsequently board members to review and digest. Again, the
> very latest you should submit your report is 2 weeks prior to the board
> meeting.
> 
> Thanks,
> 
> The Apache Incubator PMC
> 
> Submitting your Report
> 
> --
> 
> Your report should contain the following:
> 
> *   Your project name
> *   A brief description of your project, which assumes no knowledge of
>the project or necessarily of its field
> *   A list of the three most important issues to address in the move
>towards graduation.
> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>aware of
> *   How has the community developed since the last report
> *   How has the project developed since the last report.
> 
> This should be appended to the Incubator Wiki page at:
> 
> https://wiki.apache.org/incubator/February2017
> 
> Note: This is manually populated. You may need to wait a little before
> this page is created from a template.
> 
> Mentors
> ---
> 
> Mentors should review reports for their project(s) and sign them off on
> the Incubator wiki page. Signing off reports shows that you are
> following the project - projects that are not signed may raise alarms
> for the Incubator PMC.
> 
> Incubator PMC



[jira] [Resolved] (JOSHUA-327) travis build fails

2017-01-25 Thread Matt Post (JIRA)

 [ 
https://issues.apache.org/jira/browse/JOSHUA-327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post resolved JOSHUA-327.
--
Resolution: Fixed

> travis build fails
> --
>
> Key: JOSHUA-327
> URL: https://issues.apache.org/jira/browse/JOSHUA-327
> Project: Joshua
>  Issue Type: Bug
>    Reporter: Matt Post
>
> Travis builds with KenLM fail because the "download-deps.sh" script pauses 
> indefinitely for a response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (JOSHUA-327) travis build fails

2017-01-25 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838962#comment-15838962
 ] 

Matt Post commented on JOSHUA-327:
--

Fixed in commit f581881631ffa8f68d9f4e864909da5002b6067f.

> travis build fails
> --
>
> Key: JOSHUA-327
> URL: https://issues.apache.org/jira/browse/JOSHUA-327
> Project: Joshua
>  Issue Type: Bug
>    Reporter: Matt Post
>
> Travis builds with KenLM fail because the "download-deps.sh" script pauses 
> indefinitely for a response.





[jira] [Created] (JOSHUA-328) failure when glue grammar is listed first

2017-01-25 Thread Matt Post (JIRA)
Matt Post created JOSHUA-328:


 Summary: failure when glue grammar is listed first
 Key: JOSHUA-328
 URL: https://issues.apache.org/jira/browse/JOSHUA-328
 Project: Joshua
  Issue Type: Bug
Affects Versions: 6.1
Reporter: Matt Post
 Fix For: 6.1


If doing CKY-decoding (-search cky), listing the glue grammar before the packed 
grammar results in a parsing failure. E.g., the following lines in the config 
file:

tm = thrax -maxspan -1 -owner glue -path model/glue.grammar
tm = thrax -maxspan 20 -path model/grammar.packed -owner pt

will result in failed decoding every time, and a printing of the following 
error message:

ERROR - the goal_bin does not have exactly one item
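
Until this is fixed, a config file could be checked for the problematic ordering with something like the sketch below. It assumes that "tm" lines follow the format shown above and that the glue grammar is the one whose -owner is "glue", as in the example; the function names are made up and this is not part of Joshua.

```python
import shlex


def tm_owners(config_text):
    """Return the -owner value of each 'tm = ...' line, in file order."""
    owners = []
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep or key.strip() != "tm":
            continue
        args = shlex.split(value)
        owner = None
        # Look for "-owner NAME" anywhere in the argument list.
        for i, arg in enumerate(args[:-1]):
            if arg == "-owner":
                owner = args[i + 1]
        owners.append(owner)
    return owners


def glue_listed_first(config_text):
    """True if there are several grammars and the first one is owned by 'glue'."""
    owners = tm_owners(config_text)
    return len(owners) > 1 and owners[0] == "glue"
```

For the two lines quoted above, glue_listed_first() returns True; with the two lines swapped it returns False.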






[jira] [Commented] (JOSHUA-325) Warning about non-ASF licensed downloads

2017-01-25 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838650#comment-15838650
 ] 

Matt Post commented on JOSHUA-325:
--

This was fixed with commit c65f70fc427945a2b18a9b2ee77b8614be7fc051, and 
further addressed with commit f581881631ffa8f68d9f4e864909da5002b6067f.

> Warning about non-ASF licensed downloads
> 
>
> Key: JOSHUA-325
> URL: https://issues.apache.org/jira/browse/JOSHUA-325
> Project: Joshua
>  Issue Type: Task
>Affects Versions: 6.1
>Reporter: Matt Post
>    Assignee: Matt Post
>Priority: Minor
>
> Via Tom Barber:
> The download-deps.sh file obviously downloads and builds stuff with non-ASF
> licenses. I realise this is for model training purposes only, and 99.9%
> won't care, but should we consider putting a prompt into that script warning
> people? I ask because a company might add in the training modules blindly,
> assuming that because the script is distributed by the ASF the modules are also
> ASL2.0.





[jira] [Resolved] (JOSHUA-325) Warning about non-ASF licensed downloads

2017-01-25 Thread Matt Post (JIRA)

 [ 
https://issues.apache.org/jira/browse/JOSHUA-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post resolved JOSHUA-325.
--
Resolution: Fixed

> Warning about non-ASF licensed downloads
> 
>
> Key: JOSHUA-325
> URL: https://issues.apache.org/jira/browse/JOSHUA-325
> Project: Joshua
>  Issue Type: Task
>Affects Versions: 6.1
>Reporter: Matt Post
>    Assignee: Matt Post
>Priority: Minor
>
> Via Tom Barber:
> The download-deps.sh file obviously downloads and builds stuff with non-ASF
> licenses. I realise this is for model training purposes only, and 99.9%
> won't care, but should we consider putting a prompt into that script warning
> people? I ask because a company might add in the training modules blindly,
> assuming that because the script is distributed by the ASF the modules are also
> ASL2.0.





[jira] [Created] (JOSHUA-327) travis build fails

2017-01-25 Thread Matt Post (JIRA)
Matt Post created JOSHUA-327:


 Summary: travis build fails
 Key: JOSHUA-327
 URL: https://issues.apache.org/jira/browse/JOSHUA-327
 Project: Joshua
  Issue Type: Bug
Reporter: Matt Post


Travis builds with KenLM fail because the "download-deps.sh" script pauses 
indefinitely for a response.





[jira] [Commented] (JOSHUA-324) Address Apache Joshua 6.1 RC#2 Issues

2017-01-25 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837799#comment-15837799
 ] 

Matt Post commented on JOSHUA-324:
--

Lewis — Does this mean we're good to go? Is there something I can do?

> Address Apache Joshua 6.1 RC#2 Issues
> -
>
> Key: JOSHUA-324
> URL: https://issues.apache.org/jira/browse/JOSHUA-324
> Project: Joshua
>  Issue Type: Task
>Affects Versions: 6.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 6.1
>
>
> Feedback from [~jmclean] (thank you Justin) on our RC#2 is as follows
> {code}
> ==
> - You're missing incubating in the release artifacts name. [1]
> - There are a number of binary files in the source release that look to be
> compiled source code.
> I checked:
> - name doesn’t include incubating
> - signatures and hashes correct
> - DISCLAIMER exists
> - LICENSE is missing a few things (see below)
> - a source file is missing an Apache header [7]
> - Several unexpected binary files are contained in the source release
> [8][9][10][11]
> - Can compile from source
> License is missing:
> - MIT licensed normalize.css v3.0.3 bundled in [5]
> - glyph icon fonts [6]
> Not an issue but it's a little odd to have LICENSE and NOTICE.txt - usually
> both are bare or both have .txt extension.
> Also while looking at your site I noticed that the download links of your
> incubating site [2] point to github, please change to point to the official
> release area.
> Also the 6.1 release has already been tagged and is available for public
> download on github [4] before this vote is finished. This is IMO against
> Apache release policy [3] please remove.
> I also notice you recently released the language packs (18th Nov) but there
> doesn’t seem to have been a vote for that? Any reason for this?
> ===
> [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
> [2] 
> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
> [3] http://www.apache.org/dev/release.html#what
> [4] https://github.com/apache/incubator-joshua/releases
> [5] ./demo/bootstrap/css/bootstrap.min.css
> [6] apache-joshua-6.1/demo/bootstrap/fonts/*
> [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
> [8] ./bin/GIZA++
> [9] ./bin/mkcls
> [10] ./bin/snt2cooc.out
> [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
> [12] http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
> [13] http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
> {code}
> This is a blocking issue and until addressed we cannot release 6.1-incubating





Re: Plugging self-hosted Joshua into mailman?

2017-01-19 Thread Matt Post

> On Jan 17, 2017, at 11:55 AM, Karel Novotný <ka...@apc.org> wrote:
> 
> Hello Matt,
> 
> Thanks for responding...
> 
> On 17.1.2017 17:31, Matt Post wrote:
>> Hello,
>> 
>> Joshua would be suitable for this. We have models built for FR→EN and ES→EN. 
>> I want to improve these because certain data was left out. I could also 
>> build ones for the other direction.
> That's excellent news. Can you please tell me a bit more about what you
> mean by having models for FR→EN and ES→EN ? Does this mean that the tool
> is ready to be used by other applications (e.g. mailman) to auto-translate?
> 
> Have you had any previous experience with similar implementation as I
> described?

This just means we have pre-built models (which we call "language packs") that 
you can just download and immediately use to translate from French to English 
and from Spanish to English. For the complete list of language packs, along 
with instructions for how to use them, see this page:

https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

You can just download any of these, unpack them, and start translating. The 
quality will vary, but for these two language pairs it should be reasonable.

To translate, the data you send to Joshua has to have already been 
sentence-split, because Joshua expects to receive input one sentence at a time. 
Joshua provides an API that you can make use of. Do you have any kind of 
expectations about your volume requirements? How many sentences will you be 
translating per day?

matt
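
The one-sentence-per-line requirement described above can be sketched in Python. This is a minimal, naive splitter and not part of Joshua: the name `split_sentences` and the rule it applies (break after '.', '!' or '?' when followed by whitespace) are assumptions made for illustration. Abbreviations such as "Dr." will be split incorrectly, so a trained sentence detector would be more robust in practice.

```python
import re

# Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the sentences found in text, in order, without surrounding whitespace."""
    pieces = _BOUNDARY.split(text.strip())
    return [piece.strip() for piece in pieces if piece.strip()]


# Emit one sentence per line, the shape of input the decoder is said to expect.
for sentence in split_sentences("Bonjour à tous. Comment allez-vous ? Merci !"):
    print(sentence)
```

Each printed line could then be sent to the translator separately and the translations joined back together in the original order.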


>> 
>> One question — What do you mean about 3rd party services being 
>> "untrustworthy"?
> 
> We wish to auto-translate lists with private conversations, so we can
> not run those by systems where we don't know (don't have control of)
> what happens with the data. That's all, I didn't want to accuse anyone.

Oh, that makes perfect sense. For some reason I assumed you were translating 
public mailing lists, but if you're doing private ones, it is reasonable to 
want to keep the data entirely in-house.


> thanks
> 
> karel
> 
>> 
>> matt
>> 
>> 
>>> On Jan 16, 2017, at 12:27 PM, Karel Novotný <ka...@apc.org> wrote:
>>> 
>>> Hello developers,
>>> 
>>> I am new to this list, so I am missing a lot of background. Apologies
>>> in advance for any dumb questions...
>>> 
>>> We would like to build a self-hosted machine translation system that
>>> could be plugged into our mailman installs. The objective is that the
>>> members of our multicultural network would be able to send email in
>>> their mother language and it would be delivered to the list
>>> machine-translated (and vice versa). The translation pairs we care about
>>> most are EN<->FR and EN<->ES
>>> 
>>> Our dream scenario is:
>>> 
>>> 1. A translator machine is installed on our server, so the messages
>>> don't need to be run through untrustworthy 3rd party services (googletrans)
>>> 2. Mailman (or similar) is connected to such a translator
>>> 3. Mailing list users can opt to receive messages sent to the mailing
>>> list in the following format:
>>> 
>>> 
>>> Message body
>>> --
>>> Message body translated
>>> -
>>> 
>>> 4. Similarly, the system can be configured so that when receiving
>>> messages from specific senders the messages get translated from FR or ES
>>> into EN
>>> 
>>> Our default language used on lists is EN
>>> 
>>> Is Joshua relevant for this? Any previous experience with similar setup?
>>> I suppose that a lot of configuration would be needed, but at this point
>>> I want to know if I am not completely mistaken when considering your
>>> Joshua for this.
>>> 
>>> Thanks
>>> 
>>> karel
>>> 
>>> ---
>>> 
>>> -- 
>>> ~~~
>>> Karel Novotny 
>>> Knowledge Sharing & Network Development Coordinator
>>> APC - The Association for Progressive Communications 
>>> https://www.apc.org
>>> GSM: +420 605 243 246 (GMT +1)
>>> jabber: ka...@riseup.net
>>> Working/online: Monday - Thursday
>>> ~~~
>>> My public OpenPGP key: 
>>> https://pgp.mit.edu/pks/lookup?op=get=0x7FDEF502377E4FCA
>>> 
>>> 
>> 
> 
> -- 
> ~~~
> Karel Novotny 
> Knowledge Sharing & Network Development Coordinator
> APC - The Association for Progressive Communications 
> https://www.apc.org <https://www.apc.org/>
> GSM: +420 605 243 246 (GMT +1)
> jabber: ka...@riseup.net
> Working/online: Monday - Thursday
> ~~~
> My public OpenPGP key: 
> https://pgp.mit.edu/pks/lookup?op=get=0x7FDEF502377E4FCA 
> <https://pgp.mit.edu/pks/lookup?op=get=0x7FDEF502377E4FCA>


Re: Plugging self-hosted Joshua into mailman?

2017-01-19 Thread Matt Post
Karel — On this point, I don't think you should have to use the tutorials, 
which tell you how to identify training data and build new translation models 
yourself. I imagine that you would be more interested in downloading pre-built 
models that don't really require you to be an expert in MT. See this page:

https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

matt


> On Jan 17, 2017, at 12:07 PM, lewis john mcgibbney  wrote:
> 
> Hi Karel,
> The short answer is yes.
> I would advise you to start at the Tutorial
> https://cwiki.apache.org/confluence/display/JOSHUA/Getting+Started
> If you find anything which causes you problems then please write back here.
> Once you have skipped through the tutorial then you will have a much better
> feel for the workflow required.
> I can see the Apache Tika language identification and translate APIs being
> of particular use here when considered in a runtime context. We have a
> Joshua implementation over in Tika which can aid you in this task however
> try the Joshua tutorial first.
> Lewis
> 
> On Mon, Jan 16, 2017 at 7:41 AM, Chris Mattmann  wrote:
> 
>> Hi Karel,
>> 
>> I would recommend moving this thread to dev@joshua.incubator.apache.org
>> instead of the private list. I’ve moved private to BCC.
>> 
>> Thank you.
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> On 1/16/17, 6:58 AM, wrote:
>> 
>>Hello,
>> 
>>We would like to build a self-hosted machine translation system that
>>could be plugged into our mailman installs. The objective is that the
>>members of our multicultural network would be able to send email in
>>their mother language and it would be delivered to the list
>>machine-translated (and vice versa).
>> 
>>Are we on the right track with Joshua? I suppose that a lot of
>>configuration would be needed, but at this point I want to know if I am
>>not completely mistaken when considering your sw for this.
>> 
>>Thanks
>> 
>>karel
>> 
>> 
>>--
>>~~~
>>Karel Novotny
>>Knowledge Sharing & Network Development Coordinator
>>APC - The Association for Progressive Communications
>>https://www.apc.org
>>GSM: +420 605 243 246 (GMT +1)
>>jabber: ka...@riseup.net
>>Working/online: Monday - Thursday
>>~~~
>>My public OpenPGP key: https://pgp.mit.edu/pks/lookup?op=get=0x7FDEF502377E4FCA
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: mvn assembly issues

2017-01-19 Thread Matt Post
I have never seen this error before! It seems like this must have something to 
do with the build environment where this is being done? Maybe there are tar 
options to not store the userid or to set it to something?


> On Jan 18, 2017, at 9:08 PM, David Meikle  wrote:
> 
> Hey Lewis,
> 
>> On 18 Jan 2017, at 22:02, lewis john mcgibbney  wrote:
>> 
>> Hi Folks,
>> Anyone know how to work through this issue? The code in question can be
>> found at
>> https://github.com/apache/incubator-joshua/blob/master/pom.xml#L287-L309
>> Lewis
>> 
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time: 16.222 s
>> [INFO] Finished at: 2017-01-18T13:59:41-08:00
>> [INFO] Final Memory: 37M/639M
>> [INFO]
>> 
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single
>> (source-release-assembly) on project joshua-incubating: Execution
>> source-release-assembly of goal
>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single failed: user id
>> '498339010' is too big ( > 2097151 ). -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
>> switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
>> 
>> -- 
>> http://home.apache.org/~lewismc/
>> @hectorMcSpector
>> http://www.linkedin.com/in/lmcgibbney
> 
> 
> Switching tar to posix mode normally does the trick when I have had this 
> before, usually when logged into an AD domain on my Mac.  What is the full 
> log with -X saying?
> 
> Cheers,
> Dave
> 



Re: Pluggable preprocessing and OpenNLP

2017-01-18 Thread Matt Post
Hi,

Sorry, what file format are you talking about? Can you point me to an example 
of the Moses file format? Is this just plain text, one sentence per line?

In general the Moses format is the standard, to the extent that there are any 
standards in MT (they are all mostly informal).

matt

PS. Are you on dev@joshua, or do I need to keep CC'ing you at your address?


> On Jan 16, 2017, at 5:42 PM, Joern Kottmann <kottm...@gmail.com> wrote:
> 
> Hello,
> 
> we came to the conclusion that it would make sense to add direct
> format support for letsmt and moses files.
> 
> Here our two issues:
> https://issues.apache.org/jira/browse/OPENNLP-938
> https://issues.apache.org/jira/browse/OPENNLP-939
> 
> Does it make sense for you if we support those formats?
> Did we miss an important format?
> 
> The training works quite fine, but it will take me a bit more time to
> get the evaluation to return something useful. The OpenNLP Sentence
> Detector can only split on end-of-sentence (eos) chars. And if there is
> a sentence without an eos char, it gets treated as a mistake by the
> evaluation.
> 
> Do you have a specific language which would be good for testing for
> you?
> 
> The tokenizer can probably be trained as well, I saw a couple of tokenized
> data sets. Maybe that makes sense for you too.
> 
> Jörn
> 
> 
> 
> On Fri, 2017-01-13 at 09:48 -0500, Matt Post wrote:
>> Hi Jörn,
>> 
>> [Sent again without the picture since Apache rejects those,
>> unfortunately...]
>> 
>> You just need monolingual text, so I suggest downloading either the
>> tokenized or untokenized versions. Unfortunately, Opus doesn't make
>> it easy to provide direct links to individual languages. But do
>> this:
>> 
>> 1. Go to http://opus.lingfil.uu.se
>> 
>> 2. Choose de → en (or some other language pair)
>> 
>> 3. In the "mono" or "raw" columns (depending on whether you want
>> tokenized or untokenized text), click the language file for the
>> dataset you want.
>> 
>> matt
>> 
>> 
>>> On Jan 12, 2017, at 6:07 AM, Joern Kottmann <kottm...@gmail.com>
>>> wrote:
>>> 
>>> Do you have a pointer to an actual file? Or download package?
>>> 
>>> Jörn
>>> 
>>> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili <tommaso.teofili@
>>> gmail.com
>>>> wrote:
>>>> I think the parallel corpuses are taken from [1], so we could
>>>> start with
>>>> training sentdetect for language packs at [2].
>>>> 
>>>> Regards,
>>>> Tommaso
>>>> 
>>>> [1] : http://opus.lingfil.uu.se/
>>>> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>>>> 
>>>> On Mon, Jan 9, 2017 at 11:39, Joern Kottmann <kottmann@gmail.com> wrote:
>>>> 
>>>>> Sorry for the late reply, can you point me to a link for the
>>>>> parallel corpus?
>>>>> We might just want to add formats support for it to OpenNLP.
>>>>> 
>>>>> Do you use tokenize.pl for all languages or do you have
>>>>> language
>>>> specific
>>>>> heuristics?
>>>>> It would be great to have an additional more capable rule based
>>>>> tokenizer
>>>>> in OpenNLP.
>>>>> 
>>>>> The sentence splitter can be trained on a few thousand
>>>>> sentences or so, I
>>>>> think that will work out nicely.
>>>>> 
>>>>> Jörn
>>>>> 
>>>>> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post <p...@cs.jhu.edu>
>>>>> wrote:
>>>>> 
>>>>>>> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <kottmann@gmai
>>>>>>> l.com>
>>>>> wrote:
>>>>>>> I am happy to support a bit with this, we can also see if
>>>>>>> things in
>>>>>> OpenNLP
>>>>>>> need to be changed to make this work smoothly.
>>>>>> 
>>>>>> Great!
>>>>>> 
>>>>>> 
>>>>>>> One challenge is to train OpenNLP on all the languages you
>>>>>>> support.
>>>> Do
>>>>>> you
>>>>>>> have training data that could be used to train the
>>>>>>> tokenizer and
>>>>> sentence
>>>>>>> detector?
>>>>>> 
>>>>>> For the sentence-splitter, I imagine you could make use of
>>>>>> the source
>>>>> side
>>>>>> of our parallel corpus, which has thousands to millions of
>>>>>> sentences,
>>>> one
>>>>>> per line.
>>>>>> 
>>>>>> For tokenization (and normalization), we don't typically
>>>>>> train models
>>>> but
>>>>>> instead use a set of manually developed heuristics, which may
>>>>>> or may
>>>> not
>>>>> be
>>>>>> sentence-specific. See
>>>>>> 
>>>>>> https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
>>>>>> 
>>>>>> How much training data do you generally need for each task?
>>>>>> 
>>>>>> 
>>>>>>> Jörn
>>>>>>> 
>> 
>> 



Re: Plugging self-hosted Joshua into mailman?

2017-01-17 Thread Matt Post
Hello,

Joshua would be suitable for this. We have models built for FR→EN and ES→EN. I 
want to improve these because certain data was left out. I could also 
build ones for the other direction.

One question — What do you mean about 3rd party services being "untrustworthy"?

matt


> On Jan 16, 2017, at 12:27 PM, Karel Novotný  wrote:
> 
> Hello developers,
> 
> I am new to this list, so I am missing a lot of background. Apologies
> in advance for any dumb questions...
> 
> We would like to build a self-hosted machine translation system that
> could be plugged into our mailman installs. The objective is that the
> members of our multicultural network would be able to send email in
> their mother language and it would be delivered to the list
> machine-translated (and vice versa). The translation pairs we care about
> most are EN<->FR and EN<->ES
> 
> Our dream scenario is:
> 
> 1. A translator machine is installed on our server, so the messages
> don't need to be run through untrustworthy 3rd party services (googletrans)
> 2. Mailman (or similar) is connected to such a translator
> 3. Mailing list users can opt to receive messages sent to the mailing
> list in the following format:
> 
> 
> Message body
> --
> Message body translated
> -
> 
> 4. Similarly, the system can be configured so that when receiving
> messages from specific senders the messages get translated from FR or ES
> into EN
> 
> Our default language used on lists is EN
> 
> Is Joshua relevant for this? Any previous experience with similar setup?
> I suppose that a lot of configuration would be needed, but at this point
> I want to know if I am not completely mistaken when considering your
> Joshua for this.
> 
> Thanks
> 
> karel
> 
> ---
> 
> -- 
> ~~~
> Karel Novotny 
> Knowledge Sharing & Network Development Coordinator
> APC - The Association for Progressive Communications 
> https://www.apc.org
> GSM: +420 605 243 246 (GMT +1)
> jabber: ka...@riseup.net
> Working/online: Monday - Thursday
> ~~~
> My public OpenPGP key: 
> https://pgp.mit.edu/pks/lookup?op=get=0x7FDEF502377E4FCA
> 
> 
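
The message layout requested in point 3 of the quoted message could be produced along the following lines. This is only a sketch: `format_bilingual`, the `translate` callable (a stand-in for whatever call reaches a self-hosted translator) and the separator string are all hypothetical, and no Mailman or Joshua API is used here.

```python
def format_bilingual(body, translate, separator="-" * 20):
    """Return the original body followed by its translation.

    translate is any callable mapping source text to translated text; here it
    is a placeholder for a request to a self-hosted translation service.
    Each part is closed by a separator line, as in the requested layout.
    """
    translated = translate(body)
    return "\n".join([body.rstrip(), separator, translated.rstrip(), separator])


# Example with a dummy translator that returns a fixed string.
print(format_bilingual("Bonjour à tous\n", lambda text: "Hello everyone"))
```

Keeping the translator behind a plain callable means the list software only needs to know how to call one function, and the translation backend can be swapped without touching the formatting.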



Re: Rebase on Release

2017-01-13 Thread Matt Post
Hi Lewis,

Welcome back!

I think we have checked off all the things on your list, and are ready any time 
for the release. Do you have the time to double-check, and then to head up this 
effort?

matt


> On Jan 13, 2017, at 11:59 AM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> Where are we with the release? I need to apologize for disappearing. Phone
> off and Laptop off for close to 3 weeks.
> Can someone bring me up-to-date with where we are?
> Thanks
> Lewis
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: Pluggable preprocessing and OpenNLP

2017-01-13 Thread Matt Post
Hi Jörn,

[Sent again without the picture since Apache rejects those, unfortunately...]

You just need monolingual text, so I suggest downloading either the tokenized 
or untokenized versions. Unfortunately, Opus doesn't make it easy to provide 
direct links to individual languages. But do this:

1. Go to http://opus.lingfil.uu.se

2. Choose de → en (or some other language pair)

3. In the "mono" or "raw" columns (depending on whether you want tokenized or 
untokenized text), click the language file for the dataset you want.

matt


> On Jan 12, 2017, at 6:07 AM, Joern Kottmann <kottm...@gmail.com 
> <mailto:kottm...@gmail.com>> wrote:
> 
> Do you have a pointer to an actual file? Or download package?
> 
> Jörn
> 
> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili <tommaso.teof...@gmail.com 
> <mailto:tommaso.teof...@gmail.com>
>> wrote:
> 
>> I think the parallel corpuses are taken from [1], so we could start with
>> training sentdetect for language packs at [2].
>> 
>> Regards,
>> Tommaso
>> 
>> [1] : http://opus.lingfil.uu.se/
>> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> 
>> On Mon, Jan 9, 2017 at 11:39, Joern Kottmann <kottm...@gmail.com> wrote:
>> 
>>> Sorry for the late reply, can you point me to a link for the parallel
>>> corpus?
>>> We might just want to add formats support for it to OpenNLP.
>>> 
>>> Do you use tokenize.pl for all languages or do you have language
>> specific
>>> heuristics?
>>> It would be great to have an additional more capable rule based tokenizer
>>> in OpenNLP.
>>> 
>>> The sentence splitter can be trained on a few thousand sentences or so, I
>>> think that will work out nicely.
>>> 
>>> Jörn
>>> 
>>> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post <p...@cs.jhu.edu 
>>> <mailto:p...@cs.jhu.edu>> wrote:
>>> 
>>>> 
>>>>> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <kottm...@gmail.com 
>>>>> <mailto:kottm...@gmail.com>>
>>> wrote:
>>>>> 
>>>>> I am happy to support a bit with this, we can also see if things in
>>>> OpenNLP
>>>>> need to be changed to make this work smoothly.
>>>> 
>>>> Great!
>>>> 
>>>> 
>>>>> One challenge is to train OpenNLP on all the languages you support.
>> Do
>>>> you
>>>>> have training data that could be used to train the tokenizer and
>>> sentence
>>>>> detector?
>>>> 
>>>> For the sentence-splitter, I imagine you could make use of the source
>>> side
>>>> of our parallel corpus, which has thousands to millions of sentences,
>> one
>>>> per line.
>>>> 
>>>> For tokenization (and normalization), we don't typically train models
>> but
>>>> instead use a set of manually developed heuristics, which may or may
>> not
>>> be
>>>> sentence-specific. See
>>>> 
>>>> https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
>>>> 
>>>> How much training data do you generally need for each task?
>>>> 
>>>> 
>>>>> 
>>>>> Jörn
>>>>>
>>>> 
>>>> 
>>> 
>> 



Re: Any symal experts?

2017-01-03 Thread Matt Post
John, any updates on this?


> On Nov 23, 2016, at 12:28 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> I think it will be much less of a headache. The GIZA++ code is notorious for 
> being unreadable, and the Perl piece of that pipeline only hurts (even though 
> Philipp's Perl is unusually clear). I think adding atools to your port is the 
> way to go, and that it's written in C++ should facilitate that.
> 
> 
> 
> 
>> On Nov 23, 2016, at 12:25 PM, John Hewitt <john...@seas.upenn.edu> wrote:
>> 
>> It'll be a headache because it also has no documentation, but to be fair it
>> may be less of a headache / a better long-term solution than trying to move
>> forward with this hackier solution.
>> 
>> I'll keep the symal use on the backburner and start putting together an
>> atools port.
>> 
>> -John
>> 
>> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
>>> indeed replaced them with "atools"; how much work would it be to port that?
>>> 
>>> 
>>>> On Nov 23, 2016, at 12:11 PM, John Hewitt <john...@seas.upenn.edu>
>>> wrote:
>>>> 
>>>> Hey everyone,
>>>> 
>>>> I'm packaging up a Java port of Fast Align for Joshua and integrating it
>>> into
>>>> the pipeline.
>>>> Fast Align does not produce symmetrical alignments -- it relies on a tool
>>>> that I haven't ported to Java.
>>>> We package symal (which symmetricizes alignments) with Joshua right now
>>> for
>>>> GIZA++, so I'm attempting to re-use that.
>>>> However, symal uses the .bal format, which it fails to describe.
>>>> It gets away with this because files from GIZA++ are piped through
>>>> giza2bal.pl, which itself is not well documented.
>>>> I'm attempting to write, say, fastalign2bal.py.
>>>> With a bit of tinkering, I got at the .bal format:
>>>> 
>>>> 1
>>>> 
>>>> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
>>>> 
>>>> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
>>>> 
>>>> A template for which would be
>>>> 
>>>> 1
>>>> 
>>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
>>>> alignment2 ... alignmentN]
>>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
>>>> alignment2 ... alignmentN]
>>>> 
>>>> 
>>>> However, I'm hitting some pretty nasty errors with symal when I pipe in
>>>> some fastalign2bal.py output.
>>>> A few hours with gdb made some progress (as far as I can tell, the
>>>> formats are identical), but if anyone has experience with symal, I would
>>>> greatly appreciate some consultation.
>>>> 
>>>> -John
>>> 
>>> 
> 
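
The .bal template John inferred in the quoted message can be written out as a small Python sketch of the fastalign2bal.py idea. The assumptions are significant and unconfirmed in the thread: fast_align output is taken to be 0-based `i-j` (source-target) pairs, the forward run is assumed to give at most one source position per target token and the reverse run at most one target position per source token, positions in .bal are assumed to be 1-based, and 0 is assumed to mark an unaligned token. The names `parse_links` and `to_bal` are hypothetical. John reported symal errors with his own converter, so this is a starting point rather than a verified format.

```python
def parse_links(line):
    """Parse a line such as '0-0 1-2' into a list of (source, target) index pairs."""
    return [tuple(int(index) for index in pair.split("-")) for pair in line.split()]


def to_bal(src_tokens, tgt_tokens, forward_links, reverse_links):
    """Build one .bal record following the template from the thread.

    forward_links: (source, target) pairs, at most one source per target token.
    reverse_links: (source, target) pairs, at most one target per source token.
    """
    # Assumption: 0 means "unaligned"; aligned positions are 1-based.
    tgt_align = [0] * len(tgt_tokens)
    for src_index, tgt_index in forward_links:
        tgt_align[tgt_index] = src_index + 1

    src_align = [0] * len(src_tokens)
    for src_index, tgt_index in reverse_links:
        src_align[src_index] = tgt_index + 1

    def row(tokens, alignment):
        # Token count, tokens, then '#' and one alignment position per token.
        return "{} {}  # {}".format(
            len(tokens), " ".join(tokens), " ".join(str(a) for a in alignment)
        )

    return "\n".join(["1", row(tgt_tokens, tgt_align), row(src_tokens, src_align)])


record = to_bal(
    ["a", "b"],
    ["x", "y", "z"],
    parse_links("0-0 1-1 1-2"),  # forward run
    parse_links("0-0 1-2"),      # reverse run
)
print(record)
```

Comparing such output byte for byte with what giza2bal.pl produces for the same sentence pair would be one way to narrow down why symal rejects a hand-built file.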



[jira] [Commented] (JOSHUA-326) Make preprocessing phase pluggable

2016-12-22 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770280#comment-15770280
 ] 

Matt Post commented on JOSHUA-326:
--

+1

> Make preprocessing phase pluggable
> --
>
> Key: JOSHUA-326
> URL: https://issues.apache.org/jira/browse/JOSHUA-326
> Project: Joshua
>  Issue Type: Improvement
>  Components: pipeline
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 6.2
>
>
> It'd be nice to have the data preprocessing phase pluggable, with a default 
> simple Java implementation and eventually other more advanced ones based on 
> external tools like Apache OpenNLP.
> That should replace our script-based preprocessing:
> - tokenization: 
> https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl





Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post

> On Dec 21, 2016, at 10:36 AM, Joern Kottmann  wrote:
> 
> I am happy to support a bit with this, we can also see if things in OpenNLP
> need to be changed to make this work smoothly.

Great!


> One challenge is to train OpenNLP on all the languages you support. Do you
> have training data that could be used to train the tokenizer and sentence
> detector?

For the sentence-splitter, I imagine you could make use of the source side of 
our parallel corpus, which has thousands to millions of sentences, one per line.

For tokenization (and normalization), we don't typically train models but 
instead use a set of manually developed heuristics, which may or may not be 
sentence-specific. See


https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl

How much training data do you generally need for each task?


> 
> Jörn
>



Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post
Sure, that'd be nice to do. I'd love to get rid of the Perl scripts. Are you 
just throwing out an idea or are you interested in doing this? I think the way 
to go would be to set this up on a branch (off 7), and then I could test it on 
some languages.


> On Dec 21, 2016, at 5:33 AM, Tommaso Teofili  
> wrote:
> 
> Hi all,
> 
> I was talking to Joern (Apache OpenNLP committer) recently and the idea came
> up that we could use OpenNLP for the data preprocessing phase in
> Joshua to handle tokenization, sentence detection, etc.
> As I was reading through our doc [1] this is currently done with dedicated
> scripts; we could make that part pluggable (with a default simple Java
> implementation) and allow more fine-grained control over it using libraries
> like OpenNLP.
> 
> What would people think?
> 
> Regards,
> Tommaso
> 
> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Project+Ideas



Re: Problem on web page

2016-12-20 Thread Matt Post
Thanks! Fixed.


> On Dec 19, 2016, at 9:35 AM, Fabrizio Gotti  wrote:
> 
> Hi,
> 
> On the page
> 
> https://cwiki.apache.org/confluence/display/JOSHUA/Notes+on+Language+Pack+Creation
> 
> the link "Here are a number of things that may be useful for you to know in 
> using them." points to a corpus, not to the expected page.
> 
> Best,
> 
> Fabrizio Gotti
> Université de Montréal



Re: Apache Joshua Project

2016-12-16 Thread Matt Post
There is not enough information for me to answer your question. I don't see any 
problems.

$ echo "i'll give you 10% of the asking price" | ./prepare.sh | ./joshua
I'll give you 10 % of the asking price
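
The change visible in the output above ("10%" becoming "10 %") is the kind of separation a tokenization step performs. A rough Python illustration follows. It is not the logic of prepare.sh or tokenize.pl; it applies one assumed rule, padding with spaces every character that is not a letter, digit, underscore, whitespace or apostrophe, and it does no recasing. The name `tokenize` is hypothetical.

```python
import re

# Assumed rule: anything that is not a word character, whitespace or an
# apostrophe is treated as a separate token.
_PUNCTUATION = re.compile(r"([^\w\s'])")


def tokenize(text):
    """Return text with punctuation split off and whitespace normalized."""
    padded = _PUNCTUATION.sub(r" \1 ", text)
    return " ".join(padded.split())


print(tokenize("i'll give you 10% of the asking price"))
```

Because the pattern relies on Unicode-aware `\w`, accented letters such as those in "inglés, francés" stay attached to their words while the commas are split off.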


> On Dec 16, 2016, at 3:22 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
> 
> Also, there is a problem with parsing the percent sign (%) in sentences. Do you know how 
> to solve this?
> 
> 2016-12-15 10:57 GMT+03:00 Aliaksei Rudak <alru...@gmail.com 
> <mailto:alru...@gmail.com>>:
> Hi Matt,
> 
> The English-Russian language pack has a broken link:
> https://cwiki.apache.org/confluence/home.apache.org/~lewismc/language-pack-en-ru-2016-10-28.tar.gz
> 
> When do you plan to create and upload other languages?
> 
> Regards,
> Alexei
> 
> 2016-12-14 21:50 GMT+03:00 Matt Post <p...@cs.jhu.edu 
> <mailto:p...@cs.jhu.edu>>:
> 1. If you download Joshua from GitHub, and run "download_dependencies.sh", it 
> builds KenLM and the KenLM library. If you can do that, that is all you need 
> to do.
> 
> 2. http://opus.lingfil.uu.se is a great place to 
> get parallel data; it's where we got all the data we use.
> 
> 3. Joshua has a Java API (undocumented) but not a C++ one.
> 
> 
>> On Dec 14, 2016, at 10:30 AM, Aliaksei Rudak <alru...@gmail.com 
>> <mailto:alru...@gmail.com>> wrote:
>> 
>> 1) Can you estimate an approximate release date for language packs with a KenLM 
>> model? I have a teammate who knows C++ well, so if we had more information 
>> (or a tutorial) on how to do that ourselves, we could share the result with 
>> others. That would benefit everyone.
>> 
>> 2) Where can I get or buy parallel corpora for other languages? Where did 
>> you get the data for the current large language packs? I found several sources, but 
>> they are very small.
>> 
>> 3) Is there any documentation on how to create an offline translation system based 
>> on Joshua and package it as, for example, a C++ library?
>> 
>> 
>> 
>> 
>> 2016-12-14 14:33 GMT+03:00 Matt Post <p...@cs.jhu.edu 
>> <mailto:p...@cs.jhu.edu>>:
>> 1. The LM cannot be used with Moses. Ours is in BerkeleyLM format; you need 
>> KenLM. We are releasing KenLM soon. KenLM is better, but it requires the user 
>> to compile C++ code, which can be tricky. 
>> 
>> 2/3. Please see the README in each language pack. You need to pass input 
>> text through "prepare.sh", which does tokenization. 
>> 
>> matt (from my phone)
>> 
>> On Dec 14, 2016, at 06:16, Aliaksei Rudak <alru...@gmail.com> wrote:
>> 
>>> Hi Matt, 
>>> Thanks for answers.
>>> 
>>> 1) Can the language models inside Joshua language packs work with Moses MT? If 
>>> so, can you point me to instructions for running them with it? 
>>> 
>>> 2) I installed several instances (German, Spanish, Russian) and all of them 
>>> have the same strange issue. Trying to translate one sentence. 
>>> 
>>> For example from Spanish to English
>>> "Además podrás encontrar las audiciones de los textos con distintos acentos 
>>> del español. "
>>> 
>>> Translates as
>>> "Also auditions, you'll find texts with different accents of español"
>>> 
>>> One word in the sentence (español) is not translated correctly, although 
>>> it is fine when translated as a single word (español).
>>> 
>>> The same happens for other languages (German, Russian): all words except one or 
>>> sometimes two are translated. Do you know how to fix this?
>>> 
>>> 3) How do I translate sentences with punctuation marks (commas, exclamation 
>>> marks, question marks, etc.)?
>>> 
>>> Translating from Spanish to English gives error
>>> "¿Se puede aprender a escribir? ¿El escritor nace o se hace? La vieja 
>>> pregunta."
>>> 
>>> If you try to translate words separated by commas, some of the words are 
>>> not translated:
>>> "inglés, francés, alemán y portugués"
>>> 
>>> output
>>> "Inglés, francés, german and portuguese"
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 2016-12-13 17:44 GMT+03:00 Matt Post <p...@cs.jhu.edu 
>>> <mailto:p...@cs.jhu.edu>>:
>>> 
>>>> On Dec 12, 2016, at 3:04 PM,

Re: Apache Joshua Project

2016-12-14 Thread Matt Post
1. If you download Joshua from GitHub, and run "download_dependencies.sh", it 
builds KenLM and the KenLM library. If you can do that, that is all you need to 
do.

2. http://opus.lingfil.uu.se is a great place to get parallel data; it's where 
we got all the data we use.

3. Joshua has a Java API (undocumented) but not a C++ one.


> On Dec 14, 2016, at 10:30 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
> 
> 1) Can you estimate an approximate release date for language packs with a KenLM 
> model? I have a teammate who knows C++ well, so if we had more information 
> (or a tutorial) on how to do that ourselves, we could share the result with 
> others. That would benefit everyone.
> 
> 2) Where can I get or buy parallel corpora for other languages? Where did 
> you get the data for the current large language packs? I found several sources, but 
> they are very small.
> 
> 3) Is there any documentation on how to create an offline translation system based on 
> Joshua and package it as, for example, a C++ library?
> 
> 
> 
> 
> 2016-12-14 14:33 GMT+03:00 Matt Post <p...@cs.jhu.edu 
> <mailto:p...@cs.jhu.edu>>:
> 1. The LM cannot be used with Moses. Ours is in BerkeleyLM format; you need 
> KenLM. We are releasing KenLM soon. KenLM is better, but it requires the user 
> to compile C++ code, which can be tricky. 
> 
> 2/3. Please see the README in each language pack. You need to pass input text 
> through "prepare.sh", which does tokenization. 
> 
> matt (from my phone)
> 
> On Dec 14, 2016, at 06:16, Aliaksei Rudak <alru...@gmail.com> wrote:
> 
>> Hi Matt, 
>> Thanks for answers.
>> 
>> 1) Can language models inside Joshua language packs work with Moses MT ? If 
>> yes - can you give me the link how to run them on it ? 
>> 
>> 2) I installed several instances (German, Spanish, Russian) and all of them 
>> have the same strange issue. Trying to translate one sentence. 
>> 
>> For example from Spanish to English
>> "Además podrás encontrar las audiciones de los textos con distintos acentos 
>> del español. "
>> 
>> Translates as
>> "Also auditions, you'll find texts with different accents of español"
>> 
>> It means that one word in the sentence (español) is not translated correctly. But 
>> it is fine if you translate the single word (español).
>> 
>> Same for other languages (German, Russian). All words (except one or 
>> sometimes 2 words) are not translated. Do you know how to fix this ?
>> 
>> 3) How to translate sentences with punctuation marks (comma, exclamation, 
>> question marks etc) ?
>> 
>> Translating from Spanish to English gives error
>> "¿Se puede aprender a escribir? ¿El escritor nace o se hace? La vieja 
>> pregunta."
>> 
>> If you try to translate words separated by commas, it does not translate these 
>> words
>> "inglés, francés, alemán y portugués"
>> 
>> output
>> "Inglés, francés, german and portuguese"
>> 
>> Regards,
>> Alexei
>> 
>> 
>> 
>> 
>> 
>> 2016-12-13 17:44 GMT+03:00 Matt Post <p...@cs.jhu.edu>:
>> 
>>> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak <alru...@gmail.com> wrote:
>>> 
>>> 1) If English-German pair will be recompiled to German-English (vice-versa) 
>>> do I need a separate instance to process back translation ? Or they can 
>>> work in one instance in both directions ?
>>> 
>> A whole new model needs to be trained. You need a separate model for each 
>> direction.
>>> 2) Are there any documents about how to recompile model to work vice-versa 
>>> from German-English to English-German ?
>>> 
>>> At this page under the “Project Info” title links “Community page” and 
>>> “Current Documentation” not working
>>> 
>>> http://incubator.apache.org/projects/joshua.html
>> This document on running the pipeline:
>> 
>>  
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630
>>> 3) Are there ways of increasing translation quality without changing 
>>> (extending) language model?  
>>> 
>>> At this page under “How do I make Joshua produce better results? at second 
>>> option (Joshua directly) link not working
>>>  
>>> http://joshua.incuba

Re: Apache Joshua Project

2016-12-14 Thread Matt Post
1. The LM cannot be used with Moses. We have BerkeleyLM format; you need KenLM. 
We are releasing KenLM soon. KenLM is better, but it requires the user to 
compile C++ code, which can be tricky. 

2/3. Please see the README in each language pack. You need to pass input text 
through "prepare.sh", which does tokenization. 
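
To illustrate why tokenization matters here, a rough sketch in Python follows. This is not what prepare.sh actually runs, and the tokenize helper is hypothetical. The point is only that "español." with its trailing period is a different string from "español", so a word-level lookup will likely treat it as unknown and pass it through untranslated.

```python
import re

# Hypothetical helper, for illustration only; the real prepare.sh in each
# language pack does its own normalization and tokenization.
def tokenize(text):
    # Split into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

raw = "del español."
print(raw.split())     # whitespace splitting keeps the period glued to the word
print(tokenize(raw))   # punctuation becomes its own token
```

The same applies to "¿", "?", and commas: once they are separate tokens, the words next to them match the forms the model saw in training.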

matt (from my phone)

> On Dec 14, 2016, at 06:16, Aliaksei Rudak <alru...@gmail.com> wrote:
> 
> Hi Matt, 
> Thanks for answers.
> 
> 1) Can language models inside Joshua language packs work with Moses MT ? If 
> yes - can you give me the link how to run them on it ? 
> 
> 2) I installed several instances (German, Spanish, Russian) and all of them 
> have the same strange issue. Trying to translate one sentence. 
> 
> For example from Spanish to English
> "Además podrás encontrar las audiciones de los textos con distintos acentos 
> del español. "
> 
> Translates as
> "Also auditions, you'll find texts with different accents of español"
> 
> It means that one word in the sentence (español) is not translated correctly. But 
> it is fine if you translate the single word (español).
> 
> Same for other languages (German, Russian). All words (except one or 
> sometimes 2 words) are not translated. Do you know how to fix this ?
> 
> 3) How to translate sentences with punctuation marks (comma, exclamation, 
> question marks etc) ?
> 
> Translating from Spanish to English gives error
> "¿Se puede aprender a escribir? ¿El escritor nace o se hace? La vieja 
> pregunta."
> 
> If you try to translate words separated by commas, it does not translate these 
> words
> "inglés, francés, alemán y portugués"
> 
> output
> "Inglés, francés, german and portuguese"
> 
> Regards,
> Alexei
> 
> 
> 
> 
> 
> 2016-12-13 17:44 GMT+03:00 Matt Post <p...@cs.jhu.edu>:
>> 
>>> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak <alru...@gmail.com> wrote:
>>> 
>>> 1) If English-German pair will be recompiled to German-English (vice-versa) 
>>> do I need a separate instance to process back translation ? Or they can 
>>> work in one instance in both directions ?
>>> 
>> A whole new model needs to be trained. You need a separate model for each 
>> direction.
>>> 2) Are there any documents about how to recompile model to work vice-versa 
>>> from German-English to English-German ?
>>> 
>>> At this page under the “Project Info” title links “Community page” and 
>>> “Current Documentation” not working
>>> 
>>> http://incubator.apache.org/projects/joshua.html
>>> 
>> 
>> This document on running the pipeline:
>> 
>>  
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630
>>> 3) Are there ways of increasing translation quality without changing 
>>> (extending) language model?  
>>> 
>>> At this page under “How do I make Joshua produce better results? at second 
>>> option (Joshua directly) link not working
>>>  
>>> http://joshua.incubator.apache.org/6.0/faq.html
>>> 
>> 
>> Yes but it's complicated. The best way is to add data, but there are lots of 
>> other models and parameter variations that could be tried.
>> 
>>> 4) How can I reduce the amount of memory each language pair instance use 
>>> without losing process speed and quality?
>>> 
>> If you can find German–French parallel data, use that. Otherwise, pivot 
>> through another language.
>>> 5) To make translation from German to French do I need to make translation 
>>> via English conversion ? (like German to English first and then English to 
>>> French) 
>>> 
>>> I mean for the case without German-French parallel data.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Regards,
>>> 
>>> Alexei
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 2016-12-12 17:58 GMT+03:00 Matt Post <p...@cs.jhu.edu>:
>>>> No, each has to be run separately. But not all are equally good, so I 
>>>> suggest starting with a few and building up.
>>>> 
>>>> If you get KenLM working in place of BerkeleyLM, the language models will 
>>>> be shared between them if they are on the same machine. I will post 
>>>> instructions soon.
>>>> 
>>>> Yes, each one has two language models that are interpolated.
>>>> 
>>>> 
>>>> 
>>>>> On Dec 12, 2016, at 9:20 AM, Aliaksei Rudak <alru.

Re: Apache Joshua Project

2016-12-13 Thread Matt Post

> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak <alru...@gmail.com> wrote:
> 
> 1) If English-German pair will be recompiled to German-English (vice-versa) 
> do I need a separate instance to process back translation ? Or they can work 
> in one instance in both directions ?
> 
A whole new model needs to be trained. You need a separate model for each 
direction.
> 2) Are there any documents about how to recompile model to work vice-versa 
> from German-English to English-German ?
> 
> At this page under the “Project Info” title links “Community page” and 
> “Current Documentation” not working
> 
> http://incubator.apache.org/projects/joshua.html
This document on running the pipeline:


https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630
> 3) Are there ways of increasing translation quality without changing (extending) 
> language model?  
> 
> At this page under “How do I make Joshua produce better results? at second 
> option (Joshua directly) link not working
>  
> http://joshua.incubator.apache.org/6.0/faq.html
Yes but it's complicated. The best way is to add data, but there are lots of 
other models and parameter variations that could be tried.

> 4) How can I reduce the amount of memory each language pair instance use 
> without losing process speed and quality?
> 
If you can find German–French parallel data, use that. Otherwise, pivot through 
another language.
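
A minimal sketch of the pivoting idea, in Python, assuming two already-working systems. The translator functions here are hypothetical stand-ins built from toy dictionaries; they are not Joshua APIs.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]

def make_pivot(src_to_pivot: Translator, pivot_to_tgt: Translator) -> Translator:
    """Chain two translators through a pivot language (e.g. English)."""
    def translate(sentence: str) -> str:
        return pivot_to_tgt(src_to_pivot(sentence))
    return translate

# Toy stand-ins for real German-English and English-French systems.
DE_EN = {"hallo": "hello", "welt": "world"}
EN_FR = {"hello": "bonjour", "world": "monde"}

def word_by_word(table: Dict[str, str]) -> Translator:
    # Unknown words are passed through unchanged.
    return lambda s: " ".join(table.get(w, w) for w in s.split())

de_to_fr = make_pivot(word_by_word(DE_EN), word_by_word(EN_FR))
print(de_to_fr("hallo welt"))
```

Any mistake made in the first step is fed into the second, which is why direct German–French data is preferable when it exists.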
> 5) To make translation from German to French do I need to make translation 
> via English conversion ? (like German to English first and then English to 
> French) 
> 
> I mean for the case without German-French parallel data.
> 
> 
> 
> 
> 
> Regards,
> 
> Alexei
> 
> 
> 
> 
> 
> 
> 2016-12-12 17:58 GMT+03:00 Matt Post <p...@cs.jhu.edu>:
> No, each has to be run separately. But not all are equally good, so I suggest 
> starting with a few and building up.
> 
> If you get KenLM working in place of BerkeleyLM, the language models will be 
> shared between them if they are on the same machine. I will post instructions 
> soon.
> 
> Yes, each one has two language models that are interpolated.
> 
> 
> 
>> On Dec 12, 2016, at 9:20 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
>> 
>> Hi Matt,
>> 
>> You were right about increasing memory. Spanish works fine now but needs about 
>> 16GB to run. Is it possible to use one Joshua instance for all language 
>> pairs simultaneously? Right now I use one instance for each pair, and it 
>> takes about 4GB, so for all 60 languages I need 240 GB of RAM and 60 
>> running instances. But maybe it's possible to process all language 
>> translation with one instance and use, for example, 32 GB?
>> 
>> Also I found that every language pair archive has 2 language models ( 
>> Berkeley and KenLM ) Do I need them two at once ? Or Joshua selects one of 
>> them depending on some parameters ?
>> 
>> Regards,
>> Alexei
>> 
>> 
>> 
>> 
>> 2016-12-07 15:51 GMT+03:00 Matt Post <p...@cs.jhu.edu>:
>> I fixed the Czech link.
>> 
>> For Spanish–English, what is the error? I imagine you have to provide more 
>> memory. Edit the "joshua" script and double or triple the amount of memory.
>> 
>> 
>>> On Dec 7, 2016, at 7:14 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
>>> 
>>> Hi Matt,
>>> 
>>> Can you check Czech-English language pack, it has broken link. 
>>> Spanish-English pair not works, throws exceptions
>>> 
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> 2016-11-28 17:30 GMT+03:00 <alru...@gmail.com>:
>>> Hi Matt, what time (total price ) will be to record video of how to make 
>>> translation vice-versa (from german to english)  to english to german pair
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> On Nov 28, 2016, at 17:59, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Inline below:
>>>> 
>>>>> On Nov 26, 2016, at 11:12 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
>>

Re: Apache Joshua Project

2016-12-13 Thread Matt Post

> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak <alru...@gmail.com> wrote:
> 
> 1) If English-German pair will be recompiled to German-English (vice-versa) 
> do I need a separate instance to process back translation ? Or they can work 
> in one instance in both directions ?
> 
A whole new model needs to be trained. You need a separate model for each 
direction.
> 2) Are there any documents about how to recompile model to work vice-versa 
> from German-English to English-German ?
> 
> At this page under the “Project Info” title links “Community page” and 
> “Current Documentation” not working
> 
> http://incubator.apache.org/projects/joshua.html
This document on running the pipeline:


https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630
> 3) Are there ways of increasing translation quality without changing 
> (extending) language model?  
> 
> At this page under “How do I make Joshua produce better results? at second 
> option (Joshua directly) link not working
>  
> http://joshua.incubator.apache.org/6.0/faq.html
Yes but it's complicated. The best way is to add data, but there are lots of 
other models and parameter variations that could be tried.

> 4) How can I reduce the amount of memory each language pair instance use 
> without losing process speed and quality?
> 
If you can find German–French parallel data, use that. Otherwise, pivot through 
another language.
> 5) To make translation from German to French do I need to make translation 
> via English conversion ? (like German to English first and then English to 
> French) 
> 
> I mean for the case without German-French parallel data.
> 
> 
> 
> 
> 
> Regards,
> 
> Alexei
> 
> 
> 
> 
> 
> 
> 2016-12-12 17:58 GMT+03:00 Matt Post <p...@cs.jhu.edu>:
> No, each has to be run separately. But not all are equally good, so I suggest 
> starting with a few and building up.
> 
> If you get KenLM working in place of BerkeleyLM, the language models will be 
> shared between them if they are on the same machine. I will post instructions 
> soon.
> 
> Yes, each one has two language models that are interpolated.
> 
> 
> 
>> On Dec 12, 2016, at 9:20 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
>> 
>> Hi Matt,
>> 
>> You were right about increasing memory. Spanish works fine now but needs about 
>> 16GB to run. Is it possible to use one Joshua instance for all language 
>> pairs simultaneously? Right now I use one instance for each pair, and it 
>> takes about 4GB, so for all 60 languages I need 240 GB of RAM and 60 
>> running instances. But maybe it's possible to process all language 
>> translation with one instance and use, for example, 32 GB?
>> 
>> Also I found that every language pair archive has 2 language models ( 
>> Berkeley and KenLM ) Do I need them two at once ? Or Joshua selects one of 
>> them depending on some parameters ?
>> 
>> Regards,
>> Alexei
>> 
>> 
>> 
>> 
>> 2016-12-07 15:51 GMT+03:00 Matt Post <p...@cs.jhu.edu>:
>> I fixed the Czech link.
>> 
>> For Spanish–English, what is the error? I imagine you have to provide more 
>> memory. Edit the "joshua" script and double or triple the amount of memory.
>> 
>> 
>>> On Dec 7, 2016, at 7:14 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
>>> 
>>> Hi Matt,
>>> 
>>> Can you check Czech-English language pack, it has broken link. 
>>> Spanish-English pair not works, throws exceptions
>>> 
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> 2016-11-28 17:30 GMT+03:00 <alru...@gmail.com>:
>>> Hi Matt, what time (total price ) will be to record video of how to make 
>>> translation vice-versa (from german to english)  to english to german pair
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> On Nov 28, 2016, at 17:59, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Inline below:
>>>> 
>>>>> On Nov 26, 2016, at 11:12 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
>>>>> 
>>>>> Hi Matt,
>>>>> 
>>>>> 
>>>>> 
>>&

[jira] [Created] (JOSHUA-325) Warning about non-ASF licensed downloads

2016-12-12 Thread Matt Post (JIRA)
Matt Post created JOSHUA-325:


 Summary: Warning about non-ASF licensed downloads
 Key: JOSHUA-325
 URL: https://issues.apache.org/jira/browse/JOSHUA-325
 Project: Joshua
  Issue Type: Task
Affects Versions: 6.1
Reporter: Matt Post
Assignee: Matt Post
Priority: Minor


Via Tom Barber:
The download-deps.sh file obviously downloads and builds stuff with non-ASF
licenses. I realise this is for model training purposes only, and 99.9%
won't care, but should we consider putting a prompt into that script warning
people? I ask because a company might add in the training modules blindly,
assuming that because the script is distributed by the ASF the modules are also
ASL 2.0.
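
A prompt along these lines might look like the following sketch. It is written in Python purely for illustration (the actual script is a shell script), and the function name and the wording of the warning are hypothetical.

```python
import sys

# Hypothetical wording; the real text would need review.
WARNING = (
    "This script downloads and builds third-party tools that are NOT "
    "covered by the Apache License 2.0.\n"
    "Check each tool's license before using it."
)

def confirm_download(ask=input, out=sys.stderr):
    """Print the license warning and return True only on an explicit yes."""
    print(WARNING, file=out)
    answer = ask("Continue? [y/N] ")
    return answer.strip().lower() in ("y", "yes")

# Intended use at the top of a download script:
#     if not confirm_download():
#         sys.exit("Aborted; nothing was downloaded.")
```

Defaulting to "no" means a user who just presses Enter downloads nothing, which seems the safer choice for a licensing warning.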



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Issues to Fix with Apache Joshua 6.1 RC#2

2016-12-12 Thread Matt Post
Lewis, do you have time to pick this up again? It'd be great to get this out 
before Christmas.

Or is there something you need from me?

matt


> On Dec 2, 2016, at 5:09 AM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> [7] has been fixed.
> 
> Tom's comments lead me to think that [8][9][10] can be removed from the
> release.
> 
> I'm not totally clear on what we need to do to resolve the licensing issues
> [5] and [6].  Do we simply need to give attribution to these projects in
> our LICENSE.txt file?
> 
> 
> 
> On Thu, Dec 1, 2016 at 10:44 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Hi folks,
>> 
>> What's the status of this? Can we check off items from the list below that
>> have been completed?
>> 
>> matt
>> 
>> 
>>> On Nov 29, 2016, at 4:24 PM, lewis john mcgibbney <lewi...@apache.org> wrote:
>>> 
>>> Hi Folks,
>>> We have a number of issues to fix which were picked up over on general@. In
>>> particular, we received excellent feedback from my good friend Justin [12]
>>> [13]. As the general@ VOTE has not had 72 hours to stew I am not going to
>>> close it, however we should take this time to fix the issues with master
>>> before we spin an RC#3. These can be summarized as follows.
>>> I've opened a Jira issue to track all of this.
>>> https://issues.apache.org/jira/browse/JOSHUA-324
>>> Let's track the progress on the Jira ticket.
>>> 
>>> ==
>>> - You're missing "incubating" in the release artifact's name. [1]
>>> - There are a number of binary files in the source release that look to be
>>> compiled source code.
>>> 
>>> I checked:
>>> - name doesn’t include incubating
>>> - signatures and hashes correct
>>> - DISCLAIMER exists
>>> - LICENSE is missing a few things (see below)
>>> - a source file is missing an Apache header [7]
>>> - Several unexpected binary files are contained in the source release
>>> [8][9][10][11]
>>> - Can compile from source
>>> 
>>> License is missing:
>>> - MIT licensed normalize.css v3.0.3 bundled in [5]
>>> - glyph icon fonts [6]
>>> 
>>> Not an issue but it's a little odd to have LICENSE and NOTICE.txt - usually
>>> both are bare or both have .txt extension.
>>> 
>>> Also while looking at your site I noticed that the download links of your
>>> incubating site [2] point to GitHub; please change them to point to the official
>>> release area.
>>> Also the 6.1 release has already been tagged and is available for public
>>> download on GitHub [4] before this vote is finished. This is IMO against
>>> Apache release policy [3]; please remove it.
>>> 
>>> I also notice you recently released the language packs (18th Nov) but there
>>> doesn’t seem to have been a vote for that? Any reason for this?
>>> ===
>>> 
>>> [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
>>> [2]
>>> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
>>> [3] http://www.apache.org/dev/release.html#what
>>> [4] https://github.com/apache/incubator-joshua/releases
>>> [5] ./demo/bootstrap/css/bootstrap.min.css
>>> [6] apache-joshua-6.1/demo/bootstrap/fonts/*
>>> [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
>>> [8] ./bin/GIZA++
>>> [9] ./bin/mkcls
>>> [10] ./bin/snt2cooc.out
>>> [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
>>> [12]
>>> http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
>>> [13]
>>> http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
>>> 
>>> 
>>> --
>>> http://home.apache.org/~lewismc/
>>> @hectorMcSpector
>>> http://www.linkedin.com/in/lmcgibbney
>> 
>> 



Re: Issues to Fix with Apache Joshua 6.1 RC#2

2016-12-01 Thread Matt Post
Hi folks,

What's the status of this? Can we check off items from the list below that have 
been completed?

matt


> On Nov 29, 2016, at 4:24 PM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> We have a number of issues to fix which were picked up over on general@. In
> particular, we received excellent feedback from my good friend Justin [12]
> [13]. As the general@ VOTE has not had 72 hours to stew I am not going to
> close it, however we should take this time to fix the issues with master
> before we spin an RC#3. These can be summarized as follows.
> I've opened a Jira issue to track all of this.
> https://issues.apache.org/jira/browse/JOSHUA-324
> Let's track the progress on the Jira ticket.
> 
> ==
> - You're missing "incubating" in the release artifact's name. [1]
> - There are a number of binary files in the source release that look to be
> compiled source code.
> 
> I checked:
> - name doesn’t include incubating
> - signatures and hashes correct
> - DISCLAIMER exists
> - LICENSE is missing a few things (see below)
> - a source file is missing an Apache header [7]
> - Several unexpected binary files are contained in the source release
> [8][9][10][11]
> - Can compile from source
> 
> License is missing:
> - MIT licensed normalize.css v3.0.3 bundled in [5]
> - glyph icon fonts [6]
> 
> Not an issue but it's a little odd to have LICENSE and NOTICE.txt - usually
> both are bare or both have .txt extension.
> 
> Also while looking at your site I noticed that the download links of your
> incubating site [2] point to GitHub; please change them to point to the official
> release area.
> Also the 6.1 release has already been tagged and is available for public
> download on GitHub [4] before this vote is finished. This is IMO against
> Apache release policy [3]; please remove it.
> 
> I also notice you recently released the language packs (18th Nov) but there
> doesn’t seem to have been a vote for that? Any reason for this?
> ===
> 
> [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
> [2]
> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
> [3] http://www.apache.org/dev/release.html#what
> [4] https://github.com/apache/incubator-joshua/releases
> [5] ./demo/bootstrap/css/bootstrap.min.css
> [6] apache-joshua-6.1/demo/bootstrap/fonts/*
> [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
> [8] ./bin/GIZA++
> [9] ./bin/mkcls
> [10] ./bin/snt2cooc.out
> [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
> [12]
> http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
> [13]
> http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
> 
> 
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: modernmt

2016-12-01 Thread Matt Post
John,

Thanks for sharing, this is really helpful. I didn't realize that Marcello was 
involved.

I think we can identify with the NMT danger. I still think there is a big niche 
that deep learning approaches won't reach for a few years, until GPUs become 
super prevalent. Which is why I like ModernMT's approaches, which overlap with 
many of the things I've been thinking. One thing I really like is their 
automatic context-switching approach. This is a great way to build 
general-purpose models, and I'd like to mimic it. I have some general ideas 
about how this should be implemented but am also looking into the literature 
here.

matt


> On Dec 1, 2016, at 1:46 PM, John Hewitt <john...@seas.upenn.edu> wrote:
> 
> I had a few good conversations over dinner with this team at AMTA in Austin
> in October.
> They seem to be in the interesting position where their work is good, but
> is in danger of being superseded by neural MT as they come out of the gate.
> Clearly, it has benefits over NMT, and is easier to adopt, but may not be
> the winner over the long run.
> 
> Here's the link
> <https://amtaweb.org/wp-content/uploads/2016/11/MMT_Tutorial_FedericoTrombetti_wide-cover.pdf>
> to their AMTA tutorial.
> 
> -John
> 
> On Thu, Dec 1, 2016 at 10:17 AM, Mattmann, Chris A (3010) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> 
>> Wow seems like this kind of overlaps with BigTranslate as well.. thanks
>> for passing
>> along Matt
>> 
>> ++
>> Chris Mattmann, Ph.D.
>> Principal Data Scientist, Engineering Administrative Office (3010)
>> Manager, Open Source Projects Formulation and Development Office (8212)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 180-503E, Mailstop: 180-503
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++
>> 
>> 
>> On 12/1/16, 4:47 AM, "Matt Post" <p...@cs.jhu.edu> wrote:
>> 
>>Just came across this, and it's really cool:
>> 
>>https://github.com/ModernMT/MMT
>> 
>>See the README for some great use cases. I'm surprised I'd never heard
>> of this before as it's EU funded and associated with U Edinburgh.
>> 
>> 



Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-12-01 Thread Matt Post
It wouldn't be hard to add some TMX-like features, no. There are some technical 
challenges, though — for example, the current demo lets you add phrases, but 
that doesn't affect the language model at all.

Ideally, we'd also allow people to add whole sentences, and would then run 
John's fast_align implementation (with a saved model) to break down that new 
sentence, and do proper incremental updating.

How do you imagine Lucene fitting into this? 
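
For concreteness, the fuzzy lookup at the heart of a translation memory could be sketched roughly as below. This uses Python's standard difflib in place of an IR library like Lucene, and the stored sentence pairs and the 0.7 threshold are made up for illustration.

```python
from difflib import SequenceMatcher

# Made-up past translations (source -> target), for illustration only.
MEMORY = {
    "The contract ends in May.": "Der Vertrag endet im Mai.",
    "Please sign the contract.": "Bitte unterschreiben Sie den Vertrag.",
}

def fuzzy_lookup(query, memory=MEMORY, threshold=0.7):
    """Return (score, source, target) for the closest stored sentence,
    or None if nothing is similar enough."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    return best if best and best[0] >= threshold else None

print(fuzzy_lookup("The contract ends in June."))
```

A linear scan like this would not scale to a large memory; an inverted index would presumably be used to fetch a small candidate set first, with the character-level comparison applied only to those candidates.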

matt


> On Dec 1, 2016, at 9:22 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> 
> Matt,
> 
> really nice list of very useful features, thanks for this!
> One comment only on the translation memories one: as seen by one that had
> never heard about it, it sounds not too complicated to implement on top of
> current Joshua (with IR library like Apache Lucene), is my understanding
> correct ?
> 
> My 2 cents,
> Tommaso
> 
> 
> On Tue, Nov 29, 2016 at 04:08, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> One project I think could be interesting for Joshua's future is sketched
>> here.
>> 
>> - Dynamic phrase tables. Joshua currently lets people add custom phrases
>> to the existing models that then get used. There is a research topic here
>> for how to make it better (particularly, how to set the weights of rules
>> that are added at runtime instead of learned from bitext), but it works
>> really well for adding words that are OOV (since it's always cheaper to use
>> the OOV). Here's a demo of how this works (this feature is included in the
>> language packs).
>> 
>> 
>> https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables
>> 
>> - Translation memories. There is a large commercial market (billions) for
>> tools called "translation memories", where translators are translating
>> documents, and the sentences get queried against their past translations
>> and matched in a fuzzy fashion. The big tool on the market for this is SDL
>> Trados
>> <http://www.sdl.com/solution/language/translation-productivity/trados-studio/>.
>> I'm not talking about selling a product, but in a space that big, there
>> have got to be a lot of people who'd rather just run their own system, than
>> shell out for an expensive (and ugly) tool. So there is a big niche for an
>> open source tool, and currently nothing really filling it. The "dynamic
>> phrase table" feature above provides the beginnings of offering a TM
>> competitor, but one that is "seeded" with a regular statistical machine
>> translation model.
>> 
>> - Dynamic re-tuning. One thing that'd be *really* cool is to revamp the
>> tuning infrastructure in Joshua. The use-case I imagine is that Joshua
>> could sit on top of a large tuning set across diverse domains (e.g, formal
>> news, informal web logs, spoken dialogue, etc). You could then add new
>> phrases in sentences as above, which would get automatically aligned, and
>> then everything could be retuned at the user's request (or perhaps at
>> night). This way, when people added new data to their models, Joshua would
>> automatically find the best weights, either immediately or on some
>> schedule. There'd be less worry about bit rot.
>> 
>> - Data collection and sharing. Another cool idea would be to allow people
>> to easily send us data. If we get to a place where people are building
>> custom dynamic phrase tables, a cool ability would be to make it easy for
>> people to upload the data they have added to their private systems, which
>> we could then collect and further distribute. So Joshua could become an
>> easy means for people to crowdsource data used for translation systems.
>> This is obviously just a high-level idea that would require a lot of
>> details to be figured out, but it would be super cool.
>> 
>> matt



modernmt

2016-12-01 Thread Matt Post
Just came across this, and it's really cool:

https://github.com/ModernMT/MMT

See the README for some great use cases. I'm surprised I'd never heard of this 
before as it's EU funded and associated with U Edinburgh.

[jira] [Commented] (JOSHUA-324) Address Apache Joshua 6.1 RC#2 Issues

2016-11-29 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706660#comment-15706660
 ] 

Matt Post commented on JOSHUA-324:
--

I don't see any of the binaries [8] [9] [10]: 
https://github.com/apache/incubator-joshua/tree/master/bin

We discussed the language packs on the mailing list, but I didn't call a vote — 
it didn't cross my mind.

> Address Apache Joshua 6.1 RC#2 Issues
> -
>
> Key: JOSHUA-324
> URL: https://issues.apache.org/jira/browse/JOSHUA-324
> Project: Joshua
>  Issue Type: Task
>Affects Versions: 6.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 6.1
>
>
> Feedback from [~jmclean] (thank you Justin) on our RC#2 is as follows
> {code}
> ==
> - You're missing "incubating" in the release artifact's name. [1]
> - There are a number of binary files in the source release that look to be
> compiled source code.
> I checked:
> - name doesn’t include incubating
> - signatures and hashes correct
> - DISCLAIMER exists
> - LICENSE is missing a few things (see below)
> - a source file is missing an Apache header [7]
> - Several unexpected binary files are contained in the source release
> [8][9][10][11]
> - Can compile from source
> License is missing:
> - MIT licensed normalize.css v3.0.3 bundled in [5]
> - glyph icon fonts [6]
> Not an issue but it's a little odd to have LICENSE and NOTICE.txt - usually
> both are bare or both have .txt extension.
> Also while looking at your site I noticed that the download links of your
> incubating site [2] point to GitHub; please change them to point to the official
> release area.
> Also the 6.1 release has already been tagged and is available for public
> download on GitHub [4] before this vote is finished. This is IMO against
> Apache release policy [3]; please remove it.
> I also notice you recently released the language packs (18th Nov) but there
> doesn’t seem to have been a vote for that? Any reason for this?
> ===
> [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
> [2] 
> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
> [3] http://www.apache.org/dev/release.html#what
> [4] https://github.com/apache/incubator-joshua/releases
> [5] ./demo/bootstrap/css/bootstrap.min.css
> [6] apache-joshua-6.1/demo/bootstrap/fonts/*
> [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
> [8] ./bin/GIZA++
> [9] ./bin/mkcls
> [10] ./bin/snt2cooc.out
> [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
> [12] http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
> [13] http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
> {code}
> This is a blocking issue and until addressed we cannot release 6.1-incubating
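
The "unexpected binary files" item in the feedback above is the kind of thing that can be checked mechanically before an RC is cut. The following is a minimal sketch in Python, assuming that a NUL byte in the first few kilobytes is a good-enough signal for "binary". The function name `find_binary_files` and the 8 KB threshold are illustrative choices, not part of any Apache tooling.

```python
import os


def find_binary_files(root, sniff_bytes=8192):
    """Return relative paths under `root` whose leading bytes contain a NUL.

    Heuristic only (an assumption of this sketch): compiled executables
    usually contain NUL bytes early on, plain-text sources normally do not.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                # Only the first `sniff_bytes` bytes are inspected.
                if b"\0" in handle.read(sniff_bytes):
                    suspects.append(os.path.relpath(path, root))
    return sorted(suspects)
```

Calling it on the unpacked source release directory returns a sorted list of paths to review by hand. A hit is a prompt to look, not proof of a policy problem, and compressed files may or may not be caught by this heuristic.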



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Signing off a Joshua Release

2016-11-29 Thread Matt Post
Same here, thanks, Tom.


> On Nov 27, 2016, at 3:38 AM, kellen sunderland  
> wrote:
> 
> Definitely guilty of this.  I'll check the release checklist in the
> future.  Thanks for the reminder Tom.
> 
> On Nov 26, 2016 1:27 PM, "Tom Barber"  wrote:
> 
> Hello folks,
> 
> I see plenty of +1's going through the release vote,  which is great to see
> people taking an active role in getting the release shipped.
> 
> For those of you who are new to the ASF there are a bunch of requirements
> to sign off for a release which you can find here:
> 
> http://incubator.apache.org/guides/releasemanagement.html#check-list
> 
> My current concern is that people who are new to the incubator are +1'ing
> software for release without checking all or part of the release cycle. Whilst
> not mandatory, when you +1 a release please can you try to indicate what
> you've checked. The reason for this is, the tag Lewis has built off isn't
> the tip of master, so if you're basing your +1 on your day to day
> development and knowledge of the code base, that's not always what's
> shipped. Also in the branching process, it's possible merges or alterations
> were accidentally made that Lewis has missed (this is very unlikely I know
> but you know, code changes). Also people build software on different OS's,
> versions of OS's etc so just because it builds on Lewis's laptop doesn't
> mean it builds on mine, for example.
> 
> Also regarding licenses, disclaimers etc, people notice different things or
> interpret stuff differently. It's always possible that someone might miss a
> library etc so it's important multiple eyes run over the same stuff.
> 
> Cheers,
> 
> Tom
> 
> --
> Tom Barber
> CTO Spicule LTD
> t...@spicule.co.uk
> 
> http://spicule.co.uk
> 
> GB: +44(0)5603641316
> US: +18448141689



★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-11-28 Thread Matt Post
One project I think could be interesting for Joshua's future is sketched here.

- Dynamic phrase tables. Joshua currently lets people add custom phrases to the 
existing models that then get used. There is a research topic here for how to 
make it better (particularly, how to set the weights of rules that are added at 
runtime instead of learned from bitext), but it works really well for adding 
words that are OOV (since it's always cheaper to use the OOV). Here's a demo of 
how this works (this feature is included in the language packs). 


https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables

- Translation memories. There is a large commercial market (billions) for tools 
called "translation memories", where translators are translating documents, and 
the sentences get queried against their past translations and matched in a 
fuzzy fashion. The big tool on the market for this is SDL Trados. 
I'm not talking about selling a product, but in a space that big, there have 
got to be a lot of people who'd rather just run their own system, than shell 
out for an expensive (and ugly) tool. So there is a big niche for an open 
source tool, and currently nothing really filling it. The "dynamic phrase 
table" feature above provides the beginnings of offering a TM competitor, but 
one that is "seeded" with a regular statistical machine translation model.

- Dynamic re-tuning. One thing that'd be *really* cool is to revamp the tuning 
infrastructure in Joshua. The use-case I imagine is that Joshua could sit on 
top of a large tuning set across diverse domains (e.g, formal news, informal 
web logs, spoken dialogue, etc). You could then add new phrases in sentences as 
above, which would get automatically aligned, and then everything could be 
retuned at the user's request (or perhaps at night). This way, when people 
added new data to their models, Joshua would automatically find the best 
weights, either immediately or on some schedule. There'd be less worry about 
bit rot.

- Data collection and sharing. Another cool idea would be to allow people to 
easily send us data. If we get to a place where people are building custom 
dynamic phrase tables, a cool ability would be to make it easy for people to 
upload the data they have added to their private systems, which we could then 
collect and further distribute. So Joshua could become an easy means for people 
to crowdsource data used for translation systems. This is obviously just a 
high-level idea that would require a lot of details to be figured out, but it 
would be super cool.

matt

Re: Downloading of non ASF licensed code

2016-11-28 Thread Matt Post
This would be easy to do. Maybe just a simple prompt that alerts the user? 
Something like

echo "Warning: this script downloads many tools used in building and running"
echo "Joshua. Not all of them are Apache Licensed. If you wish to continue, hit Enter."
read -r j
# Anything other than a bare Enter aborts before any download starts.
if [[ -n "$j" ]]; then
    echo "Quitting."
    exit 1
fi



> On Nov 25, 2016, at 10:41 AM, Tom Barber  wrote:
> 
> This may have come up before in the whole licensing chat so apologies if
> I'm just going over old ground.
> 
> The download-deps.sh file obviously downloads and builds stuff with non ASF
> licenses, I realise this is for model training purposes only, and 99.9%
> wont care, but should we consider putting a prompt into that script warning
> people. I ask because a company might add in the training modules blindly
> assuming because the script is distributed by the ASF the modules are also
> ASL2.0.
> 
> Just a thought.
> 
> Tom
> 
> -- 
> Tom Barber
> CTO Spicule LTD
> t...@spicule.co.uk
> 
> http://spicule.co.uk
> 
> GB: +44(0)5603641316
> US: +18448141689



Re: Any symal experts?

2016-11-23 Thread Matt Post
I think it will be much less of a headache. The GIZA++ code is notorious for 
being unreadable, and the Perl piece of that pipeline only hurts (even though 
Philipp's Perl is unusually clear). I think adding atools to your port is the 
way to go, and that it's written in C++ should facilitate that.




> On Nov 23, 2016, at 12:25 PM, John Hewitt <john...@seas.upenn.edu> wrote:
> 
> It'll be a headache because it also has no documentation, but to be fair it
> may be less of a headache / a better long-term solution than trying to move
> forward with this hackier solution.
> 
> I'll keep the symal use on the backburner and start putting together an
> atools port.
> 
> -John
> 
> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
>> indeed replaced them with "atools"; how much work would it be to port that?
>> 
>> 
>>> On Nov 23, 2016, at 12:11 PM, John Hewitt <john...@seas.upenn.edu>
>> wrote:
>>> 
>>> Hey everyone,
>>> 
>>> I'm packaging up a Java port of Fast Align for Joshua and integrating it
>> into
>>> the pipeline.
>>> Fast Align does not produce symmetrical alignments -- it relies on a tool
>>> that I haven't ported to Java.
>>> We package symal (which symmetricizes alignments) with Joshua right now
>> for
>>> GIZA++, so I'm attempting to re-use that.
>>> However, symal uses the .bal format, which it fails to describe.
>>> It gets away with this because files from GIZA++ are piped through
>>> giza2bal.pl, which itself is not well documented.
>>> I'm attempting to write, say, fastalign2bal.py.
>>> With a bit of tinkering, I got at the .bal format:
>>> 
>>> 1
>>> 
>>> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
>>> 
>>> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
>>> 
>>> A template for which would be
>>> 
>>> 1
>>> 
>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
>>> alignment2 ... alignmentN]
>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
>>> alignment2 ... alignmentN]
>>> 
>>> 
>>> However, I'm hitting some pretty nasty errors with symal when I pipe in
>>> some fastalign2bal.py output.
>>> A few hours with gdb made some progress (for as far as I can tell, the
>>> formats are identical) but if anyone has experience with symal, I would
>>> greatly appreciate some consultation.
>>> 
>>> -John
>> 
>> 



Re: Any symal experts?

2016-11-23 Thread Matt Post
John — I suggest trying to ditch those GIZA++ tools entirely. fast_align indeed 
replaced them with "atools"; how much work would it be to port that?


> On Nov 23, 2016, at 12:11 PM, John Hewitt  wrote:
> 
> Hey everyone,
> 
> I'm packaging up a Java port of Fast Align for Joshua and integrating it into
> the pipeline.
> Fast Align does not produce symmetrical alignments -- it relies on a tool
> that I haven't ported to Java.
> We package symal (which symmetricizes alignments) with Joshua right now for
> GIZA++, so I'm attempting to re-use that.
> However, symal uses the .bal format, which it fails to describe.
> It gets away with this because files from GIZA++ are piped through
> giza2bal.pl, which itself is not well documented.
> I'm attempting to write, say, fastalign2bal.py.
> With a bit of tinkering, I got at the .bal format:
> 
> 1
> 
> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
> 
> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
> 
> A template for which would be
> 
> 1
> 
> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
> alignment2 ... alignmentN]
> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
> alignment2 ... alignmentN]
> 
> 
> However, I'm hitting some pretty nasty errors with symal when I pipe in
> some fastalign2bal.py output.
> A few hours with gdb made some progress (for as far as I can tell, the
> formats are identical) but if anyone has experience with symal, I would
> greatly appreciate some consultation.
> 
> -John
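
The .bal template John describes can be written down as a small formatter. This is only a sketch of the layout as stated in his message. The helper names `bal_line` and `bal_record`, the use of 0 for an unaligned token, and the omission of the blank lines that appear in his example are all assumptions, and nothing here has been checked against symal itself.

```python
def bal_line(tokens, links):
    """Format one side of a .bal record.

    `links` maps a 0-based token position to the 0-based position it is
    aligned to on the other side. Positions are written 1-based, and a
    token with no link is written as 0 (an assumption of this sketch).
    """
    aligned = [str(links.get(i, -1) + 1) for i in range(len(tokens))]
    return "{} {}  # {}".format(len(tokens), " ".join(tokens), " ".join(aligned))


def bal_record(tgt_tokens, tgt_links, src_tokens, src_links):
    """Build one record: a literal '1', then the target line, then the source line."""
    return "\n".join([
        "1",
        bal_line(tgt_tokens, tgt_links),
        bal_line(src_tokens, src_links),
    ])
```

Feeding it the tokens and alignment indices from the example in the message reproduces the two data lines character for character, which at least confirms the sketch matches the template as written.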



Re: Dockerhub hosted images

2016-11-23 Thread Matt Post
Okay, I have this with

docker run -it kellens/apache-joshua-es-en-2016-10-05 bash

It seems we are missing Perl (./prepare.sh fails), and we should replace the 
LanguageModel line with a KenLM instance and build that. I bet we'll need 
Python, too.




> On Nov 23, 2016, at 8:15 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> Kellen, can I bother you to post a few first steps? I've successfully pulled 
> this down to my mac but now do not know how to find it, edit it, or run it. 
> I'm poring through the documentation and will find it eventually but this 
> would save me a bit of time.
> 
> 
>> On Nov 23, 2016, at 8:07 AM, kellen sunderland <kellen.sunderl...@gmail.com> 
>> wrote:
>> 
>> Yes my next step was going to be getting it hosted officially.
>> 
>> I'll go ahead and open a ticket.  I think I'll hold off on pushing to the
>> Apache account until I've done a little more testing though.
>> 
>> On Nov 23, 2016 5:22 AM, "lewis john mcgibbney" <lewi...@apache.org> wrote:
>> 
>>> Hi Kellen,
>>> Nice :)
>>> Another option is for us to host these via the Apache account.
>>> https://hub.docker.com/r/apache/
>>> We could then add a badge to our README which points to the Dockerfile(s).
>>> Do you want to open a ticket over on the INFRA Jira for this?
>>> 
>>> On Tue, Nov 22, 2016 at 1:57 PM, <
>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>> 
>>>> From: kellen sunderland <kellen.sunderl...@gmail.com>
>>>> To: "dev@joshua.incubator.apache.org" <dev@joshua.incubator.apache.org>
>>>> Cc:
>>>> Date: Tue, 22 Nov 2016 22:56:56 +0100
>>>> Subject: Re: Dockerhub hosted images
>>>> Ok, the first image should be properly uploaded now.
>>>> 
>>>> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
>>>> 
>>>> -Kellen
>>>> 
>>>> 
>>> 
> 



Re: [VOTE] Release Apache Joshua 6.1 RC#2

2016-11-23 Thread Matt Post
+1 Thanks, Lewis!


> On Nov 23, 2016, at 12:15 AM, lewis john mcgibbney  wrote:
> 
> Hello user@ and dev,
> Please VOTE on the Apache Joshua 6.1 Release Candidate #2.
> 
> We solved 50 issues: https://s.apache.org/joshua6.1
> 
> Git source tag (29c8be650d53216f779a340d33f8f61af4d45629):
> https://s.apache.org/pk2t 
> 
> Staging repo:
> https://repository.apache.org/content/repositories/orgapachejoshua-1001/
> 
> 
> Source Release Artifacts: https://dist.apache.org/repos/
> dist/dev/incubator/joshua/
> 
> PGP release keys (signed using 48BAEBF6): https://dist.apache.org/repos/
> dist/release/incubator/joshua/KEYS
> 
> Vote will be open for 72 hours.
> Thank you to everyone that is able to VOTE as well as everyone that
> contributed to Apache Joshua 6.1.
> 
> [ ] +1, let's get it released!!!
> [ ] +/-0, fine, but consider to fix few issues before...
> [ ] -1, nope, because... (and please explain why)
> 
> P.S. here is my +1
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: Dockerhub hosted images

2016-11-23 Thread Matt Post
Kellen, can I bother you to post a few first steps? I've successfully pulled 
this down to my mac but now do not know how to find it, edit it, or run it. I'm 
poring through the documentation and will find it eventually but this would 
save me a bit of time.


> On Nov 23, 2016, at 8:07 AM, kellen sunderland  
> wrote:
> 
> Yes my next step was going to be getting it hosted officially.
> 
> I'll go ahead and open a ticket.  I think I'll hold off on pushing to the
> Apache account until I've done a little more testing though.
> 
> On Nov 23, 2016 5:22 AM, "lewis john mcgibbney"  wrote:
> 
>> Hi Kellen,
>> Nice :)
>> Another option is for us to host these via the Apache account.
>> https://hub.docker.com/r/apache/
>> We could then add a badge to our README which points to the Dockerfile(s).
>> Do you want to open a ticket over on the INFRA Jira for this?
>> 
>> On Tue, Nov 22, 2016 at 1:57 PM, <
>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>> 
>>> From: kellen sunderland 
>>> To: "dev@joshua.incubator.apache.org" 
>>> Cc:
>>> Date: Tue, 22 Nov 2016 22:56:56 +0100
>>> Subject: Re: Dockerhub hosted images
>>> Ok, the first image should be properly uploaded now.
>>> 
>>> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
>>> 
>>> -Kellen
>>> 
>>> 
>> 



test non apache account

2016-11-23 Thread Matt Post


matt (from my phone)


Re: Dockerhub hosted images

2016-11-22 Thread Matt Post
How do I clone this? Docker tells me there is no tag "latest", using "-a" tells 
me the repo is not found, and I can't seem to figure out how to tell Docker to 
use hub.docker.com...


> Here's a link to the first image I've been playing with, es-en.
> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/




Re: language packs blog post

2016-11-21 Thread Matt Post
That's better, fixed.


> On Nov 21, 2016, at 3:14 PM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> Looks good to me, no objection to tweeting it.  Nice work putting them all
> together.
> 
> On Mon, Nov 21, 2016 at 9:00 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Hi folks,
>> 
>> I just drafted this; any objections to tweeting it?
>> 
>>https://cwiki.apache.org/confluence/display/JOSHUA/
>> 2016/11/21/Apache+Joshua+Language+Packs
>> 
>> matt



language packs blog post

2016-11-21 Thread Matt Post
Hi folks,

I just drafted this; any objections to tweeting it?


https://cwiki.apache.org/confluence/display/JOSHUA/2016/11/21/Apache+Joshua+Language+Packs

matt

Re: Updating Incubator summary

2016-11-21 Thread Matt Post
I've put up a private roadmap page on Confluence. Feel free to add / edit / 
comment.

https://cwiki.apache.org/confluence/display/JOSHUA/Private+Roadmap

Johns, the hook is that there are a lot of NLP tools that one can download and 
use (Berkeley Parser, spacy, etc), which have prebuilt models and just work — 
black box. But there is nothing like that for MT! Nothing you can just download 
and easily run, without caring about how it works or wanting to improve it. We 
want to build MT as a tool that people can easily include in their projects. 
Making MT available in this way will lead to uses that we haven't imagined.

matt


> On Nov 18, 2016, at 2:14 AM, Henri Yandell <bay...@apache.org> wrote:
> 
> Given the community-over-code meme, incubator graduation is more about the
> people than the code :)
> 
> So I'm more interested in your roadmap being discussed and consensus
> existing, than on how many releases are done etc. I like your proposed
> roadmap and it seems like nice timing on request-for-graduation.
> 
> Hen
> 
> On Thu, Nov 17, 2016 at 2:44 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> My thinking on that roadmap was a comment Lewis made a while ago about
>> incubator graduation being judged by the number of releases. If you think
>> we can get out sooner, then I'm all for it! Maybe we can get the docker
>> containers out and then push for it after that?
>> 
>> I like your idea about a more concerted advertising effort. We could also
>> try to pull together a demo paper for ACL <http://acl2017.org/>  which is
>> due in February. I think I might have a hook that would appeal to reviewers
>> there.
>> 
>> 
>>> On Nov 17, 2016, at 2:12 AM, Henri Yandell <bay...@apache.org> wrote:
>>> 
>>> Sounds good :)
>>> 
>>> My basic mantra is 'get the summary page all signed off, then start
>> asking
>>> "when graduate?"'. Projects can tend to linger in the Incubator awaiting
>>> perfection.
>>> 
>>> I wonder how you could take the 3rd item (Linux.com article) and make
>> that
>>> bigger. Perhaps encourage every committer to write a blog post so you end
>>> up with the article as an intro, and then each committer's blog entry or
>>> website hosted article as a personal "how I got into this" or "what I
>> work
>>> on" or "a commit I recently did, a commit I keep meaning to getting
>> around
>>> to working on". Random thought :)
>>> 
>>> Hen
>>> 
>>> On Tue, Nov 15, 2016 at 11:09 AM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> We're still waiting on our first software release, so it seems to me a
>> bit
>>>> premature to graduate? Though I don't know how these decisions are made
>> —
>>>> what goes into it?
>>>> 
>>>> Here is the roadmap that I have in mind:
>>>> 
>>>> - 6.1 release (imminent)
>>>> - Large-scale release of language packs (imminent)
>>>> - Linux.com article introducing people to MT, Joshua, language packs,
>> and
>>>> adding custom rules
>>>> - Release of docker-based language packs (including KenLM)
>>>> - 7.0 release (spring)
>>>> - Graduate
>>>> 
>>>> If we keep that rough schedule, we'll have incubated a year and have a
>> lot
>>>> to show for it.
>>>> 
>>>> matt
>>>> 
>>>> 
>>>>> On Nov 15, 2016, at 12:13 PM, Henri Yandell <bay...@apache.org> wrote:
>>>>> 
>>>>> Thanks :)
>>>>> 
>>>>> Reason for asking being that it felt that the standard checklist things
>>>>> were complete and I was wondering what the path to graduation is?
>>>>> 
>>>>> Any reason not to start thinking about a vote?
>>>>> 
>>>>> On Tue, Nov 15, 2016 at 04:02 Matt Post <p...@cs.jhu.edu> wrote:
>>>>> 
>>>>>> Thanks, Lewis, and Henri, for pointing this out.
>>>>>> 
>>>>>> 
>>>>>>> On Nov 15, 2016, at 1:18 AM, lewis john mcgibbney <
>> lewi...@apache.org>
>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi Henri,
>>>>>>> I just pushed the update to SVN. Should update asynch reasonably
>> soon.
>>>>>>> 
>>>>>>> http://incubator.apache.org/projects/joshua.html
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> On Sun, Nov 13, 2016 at 1:22 PM, <
>>>>>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> From: Henri Yandell <bay...@apache.org>
>>>>>>>> To: dev@joshua.incubator.apache.org
>>>>>>>> Cc:
>>>>>>>> Date: Sun, 13 Nov 2016 01:17:57 -0800
>>>>>>>> Subject: Updating Incubator summary
>>>>>>>> Would be useful to update this page:
>>>>>>>> 
>>>>>>>> http://incubator.apache.org/projects/joshua.html
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Are there any of the checklist items that are still open?
>>>>>>>> 
>>>>>>>> 
>>>>>>> As far as I am aware no :)
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 



Re: Reduce number of branches in the source repo

2016-11-21 Thread Matt Post
I just cleaned out a bunch of old branches, but I can't write directly to our 
Github mirror, and I'm not sure that deleted branches get pushed down.

From this page you can see all the deletable branches:

https://github.com/apache/incubator-joshua/branches/stale

They are ones that are way behind and 0 commits ahead.

matt


> On Nov 19, 2016, at 1:46 AM, Tommaso Teofili  
> wrote:
> 
> +1
> 
> Tommaso
> 
> Il giorno sab 19 nov 2016 alle ore 06:15 Henry Saputra <
> henry.sapu...@gmail.com> ha scritto:
> 
>> Hi All,
>> 
>> I think we have a bit too many branches in our source repo.
>> 
>> Would it be easier to keep the branches to just master and release
>> branches?
>> 
>> 
>> 
>> - Henry
>> 



Re: [VOTE] Release Apache Joshua (Incubating) 6.1

2016-11-21 Thread Matt Post
That would be great to fix!


> On Nov 21, 2016, at 4:47 AM, kellen sunderland  
> wrote:
> 
> I'd vote for a respin fixing Henry's changes.  There's also a small break
> in the Dockerfile which I can fix at some point today (we need to switch it
> to use mvn package I believe).
> 
> On Mon, Nov 21, 2016 at 10:40 AM, Tommaso Teofili wrote:
> 
>> I think this settles for a respin of a new RC, doesn't it ?
>> 
>> Regards,
>> Tommaso
>> 
>> Il giorno sab 19 nov 2016 alle ore 06:20 Henry Saputra <
>> henry.sapu...@gmail.com> ha scritto:
>> 
>>> Sorry I was late on dev@ list for this Vote, Lewis
>>> 
>>> Looks like I have to -1 for this one:
>>> Missing DISCLAIMER file for the source release artifact
>>> NOTICE.txt file contains Apache HTrace instead of Apache Joshua
>>> 
>>> Minor issue:
>>> Extra file "pom.xml.release.releaseBackup"
>>> 
>>> - Henry
>>> 
>>> On Fri, Nov 18, 2016 at 2:11 PM, lewis john mcgibbney <
>> lewi...@apache.org>
>>> wrote:
>>> 
>>>> Hello general@incubator,
>>>> Please VOTE on the Apache Joshua 6.1 Release Candidate #1. The release VOTE
>>>> has passed over on user@ and dev@joshua with the following results
>>>> http://www.mail-archive.com/dev%40joshua.incubator.apache.org/msg01884.html.
>>>> 
>>>> We solved 44 issues: https://s.apache.org/joshua6.1
>>>> 
>>>> Git source tag (167489bbd78526b9833fe7c88646bf96101d5d2b):
>>>> https://s.apache.org/joshua6.1tag
>>>> 
>>>> Staging repo:
>>>> https://repository.apache.org/content/repositories/orgapachejoshua-1000/
>>>> 
>>>> Source Release Artifacts:
>>>> https://dist.apache.org/repos/dist/dev/incubator/joshua/
>>>> 
>>>> PGP release keys (signed using 48BAEBF6):
>>>> https://dist.apache.org/repos/dist/release/incubator/joshua/KEYS
>>>> 
>>>> Vote will be open for 72 hours.
>>>> Thank you to everyone that is able to VOTE as well as everyone that
>>>> contributed to Apache Joshua 6.1.
>>>> 
>>>> [ ] +1, let's get it released!!!
>>>> [ ] +/-0, fine, but consider to fix few issues before...
>>>> [ ] -1, nope, because... (and please explain why)
>>>> 
>>>> P.S. here is my +1
>>>> 
>>>> --
>>>> http://home.apache.org/~lewismc/
>>>> @hectorMcSpector
>>>> http://www.linkedin.com/in/lmcgibbney
>>>> 
>>> 
>> 



Re: package-info.java

2016-11-18 Thread Matt Post
Ah, thanks, Thamme! The first question was whether this was the right choice 
(it seems it was), and now I can go spend some time fixing this in Eclipse.

matt


> On Nov 18, 2016, at 3:43 PM, Thamme Gowda  wrote:
> 
> Hi Matt,
> 
> In the recent days, package-info.java is preferred over package.html.
> 
> I may have read the official word from Oracle sometime ago:
> http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/javadoc.html#packagecomment
> 
> I guess the issue you're facing is due to multiple  info files for the same
> package, one in src/main/java and another one is in src/test/java.
> 
> I would recommend removing the package info from the conflicting test
> packages .
> 
> Thanks Lewis for sharing the link.
> 
> Best,
> Thamme
> 
> On Nov 16, 2016 10:34 PM, "lewis john mcgibbney"  wrote:
> 
>> Hi Matt,
>> I get digest email, however I saw your email on the remote list.
>> The answer is here
>> https://www.intertech.com/Blog/whats-package-info-java-for/
>> I couldn't find any Oracle or Open JDK-level documentation.
>> package.html is deprecated in current JDK.
>> 
>> --
>> http://home.apache.org/~lewismc/
>> @hectorMcSpector
>> http://www.linkedin.com/in/lmcgibbney
>> 
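
Thamme's diagnosis (a package-info.java for the same package in both src/main/java and src/test/java) can be checked with a short script before touching anything in Eclipse. A minimal sketch in Python follows; the helper names `package_info_dirs` and `conflicting_packages` are made up for illustration, and it assumes the two source roots are passed in explicitly.

```python
import os


def package_info_dirs(tree_root):
    """Return package directories (relative to `tree_root`) holding package-info.java."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(tree_root):
        if "package-info.java" in filenames:
            found.add(os.path.relpath(dirpath, tree_root))
    return found


def conflicting_packages(main_root, test_root):
    """Packages that carry a package-info.java in both source trees."""
    return sorted(package_info_dirs(main_root) & package_info_dirs(test_root))
```

Called as `conflicting_packages("src/main/java", "src/test/java")` from the project root, it lists the packages where the test-side copy would be the candidate for removal, following the recommendation above.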



language pack release!

2016-11-18 Thread Matt Post
Hi folks,

I am in the process of publishing the release of 62 language packs, constructed 
by my colleague here at the Johns Hopkins HLTCOE, Paul McNamee.

I am updating this page as they come out:

https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

Inevitably problems will be found. I know that some of these are probably not 
very well-performing. But I figured we should adopt a release-and-improve 
iterative model and see what people do with them.

It will be important to document for people how to use these and give them an 
idea of what is possible. I think the Linux.com article (something we were 
offered some time ago) might help. We should also — perhaps after a bit more 
testing — start advertising around. It would also be great just to create a 
Confluence page to give people some ideas. These are all things on my list that 
I would be happy to have others take the initiative on, if they wanted.

matt

[jira] [Commented] (JOSHUA-315) Thrax keeps all rules

2016-11-17 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673452#comment-15673452
 ] 

Matt Post commented on JOSHUA-315:
--

Yeah, I had expected a bigger savings, too. I should quantify it in terms of 
runtime, as well.

> Thrax keeps all rules
> -
>
> Key: JOSHUA-315
> URL: https://issues.apache.org/jira/browse/JOSHUA-315
> Project: Joshua
>  Issue Type: Bug
>    Reporter: Matt Post
> Fix For: 6.2
>
>
> When extracting rules, Thrax keeps *all* options for each target side. For 
> large bitexts and common source sides (e.g., "de" for Spanish–English), there 
> can be tens of thousands of translations, due to errors in the alignments and 
> phenomena like garbage collection. The decoder throws out all but the top 
> num_translation_options of these (default 20), but before doing so, it has to 
> score all the target side options with all feature functions, including the 
> language model. This slows down "warming up" of the model and means that the 
> first sentences to use these items are very slow to translate.
> I have updated scripts/training/filter-rules.pl to filter out using Thrax's 
> rarity penalty field, but it would be much better if Thrax were to keep only 
> the 100 most frequent translation options for each source side.
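
The pruning proposed in the issue, keeping only the most frequent translation options per source side, can be illustrated in a few lines. This is a sketch of the idea only, not Thrax code and not what filter-rules.pl does. The `(source, target, count)` tuple layout and the name `keep_top_options` are assumptions made for the example.

```python
import heapq
from collections import defaultdict


def keep_top_options(rules, limit=100):
    """Keep at most `limit` highest-count targets for each source side.

    `rules` is an iterable of (source, target, count) tuples. The result
    maps each source side to its kept (target, count) pairs, most
    frequent first.
    """
    by_source = defaultdict(list)
    for source, target, count in rules:
        by_source[source].append((target, count))
    return {
        source: heapq.nlargest(limit, options, key=lambda option: option[1])
        for source, options in by_source.items()
    }
```

With a limit in this range, a common source side such as "de" would reach the decoder with a bounded number of candidates, so the per-option feature scoring described above is capped no matter how noisy the alignments are.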



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: "mvn assembly" no longer works

2016-11-17 Thread Matt Post
Ah, thanks Lewis. I did update the README to mention the new package target.



> On Nov 17, 2016, at 1:36 AM, lewis john mcgibbney  wrote:
> 
> Hi Matt,
> Again, I am on digest and didn't receive but I'll reply here.
> No need to use the Maven assembly plugin anymore... simply execute mvn 
> package... you will then see 
> ./target/joshua-6.2-SNAPSHOT-jar-with-dependencies.jar the exact same, but 
> now a default Maven task rather than a custom plugin implementation.
> Do we need to update README?
> 
> -- 
> http://home.apache.org/~lewismc/ 
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney 



Re: Updating Incubator summary

2016-11-17 Thread Matt Post
My thinking on that roadmap was a comment Lewis made a while ago about 
incubator graduation being judged by the number of releases. If you think we 
can get out sooner, then I'm all for it! Maybe we can get the docker containers 
out and then push for it after that?

I like your idea about a more concerted advertising effort. We could also try 
to pull together a demo paper for ACL <http://acl2017.org/>  which is due in 
February. I think I might have a hook that would appeal to reviewers there.


> On Nov 17, 2016, at 2:12 AM, Henri Yandell <bay...@apache.org> wrote:
> 
> Sounds good :)
> 
> My basic mantra is 'get the summary page all signed off, then start asking
> "when graduate?"'. Projects can tend to linger in the Incubator awaiting
> perfection.
> 
> I wonder how you could take the 3rd item (Linux.com article) and make that
> bigger. Perhaps encourage every committer to write a blog post so you end
> up with the article as an intro, and then each committer's blog entry or
> website hosted article as a personal "how I got into this" or "what I work
> on" or "a commit I recently did, a commit I keep meaning to getting around
> to working on". Random thought :)
> 
> Hen
> 
> On Tue, Nov 15, 2016 at 11:09 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> We're still waiting on our first software release, so it seems to me a bit
>> premature to graduate? Though I don't know how these decisions are made —
>> what goes into it?
>> 
>> Here is the roadmap that I have in mind:
>> 
>> - 6.1 release (imminent)
>> - Large-scale release of language packs (imminent)
>> - Linux.com article introducing people to MT, Joshua, language packs, and
>> adding custom rules
>> - Release of docker-based language packs (including KenLM)
>> - 7.0 release (spring)
>> - Graduate
>> 
>> If we keep that rough schedule, we'll have incubated a year and have a lot
>> to show for it.
>> 
>> matt
>> 
>> 
>>> On Nov 15, 2016, at 12:13 PM, Henri Yandell <bay...@apache.org> wrote:
>>> 
>>> Thanks :)
>>> 
>>> Reason for asking being that it felt that the standard checklist things
>>> were complete and I was wondering what the path to graduation is?
>>> 
>>> Any reason not to start thinking about a vote?
>>> 
>>> On Tue, Nov 15, 2016 at 04:02 Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Thanks, Lewis, and Henri, for pointing this out.
>>>> 
>>>> 
>>>>> On Nov 15, 2016, at 1:18 AM, lewis john mcgibbney <lewi...@apache.org>
>>>> wrote:
>>>>> 
>>>>> Hi Henri,
>>>>> I just pushed the update to SVN. Should update asynch reasonably soon.
>>>>> 
>>>>> http://incubator.apache.org/projects/joshua.html
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> On Sun, Nov 13, 2016 at 1:22 PM, <
>>>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>>>> 
>>>>>> 
>>>>>> From: Henri Yandell <bay...@apache.org>
>>>>>> To: dev@joshua.incubator.apache.org
>>>>>> Cc:
>>>>>> Date: Sun, 13 Nov 2016 01:17:57 -0800
>>>>>> Subject: Updating Incubator summary
>>>>>> Would be useful to update this page:
>>>>>> 
>>>>>> http://incubator.apache.org/projects/joshua.html
>>>>>> 
>>>>>> 
>>>>>> Are there any of the checklist items that are still open?
>>>>>> 
>>>>>> 
>>>>> As far as I am aware no :)
>>>> 
>>>> 
>> 
>> 



package-info.java

2016-11-16 Thread Matt Post
Hi Thamme,

Eclipse is complaining about package-info.java files, e.g.,

The type package-info is already defined

for org.apache.joshua.decoder.package-info.java. I see that a while ago you 
replaced the package-info.html files with these. Is there a particular reason 
for this? Is .java preferred to .html? In researching solutions to this one of 
the suggestions was to go to .html.

matt

"mvn assembly" no longer works

2016-11-15 Thread Matt Post
Lewis — with the last two changes you pushed up,

mvn assembly:single

no longer produces the consolidated jar in $JOSHUA/target. Can you advise?

matt

Re: November 2016 Newsletter -- LDC

2016-11-15 Thread Matt Post
I think LDC newsletter contents are related to data that is published by the 
LDC. 

Once the language packs are out, there are a number of mailing lists we could 
advertise on, including corp...@uib.no and the moses users mailing list.

I am also working on a graphic that we can use to accompany the language pack 
releases. They are actually done and packed up; I'll send a note shortly to ask 
folks to test one or two of them.

matt




> On Nov 15, 2016, at 1:46 PM, lewis john mcgibbney wrote:
> 
> Hi Folks,
> LDC newsletter FYI.
> I wonder if we should start publishing our notices in their newsletter? I 
> think I'll make an inquiry about that exact topic.
> Lewis
> 
> -- Forwarded message --
> From: Mcgibbney, Lewis J (398M)
> Date: Tue, Nov 15, 2016 at 10:02 AM
> Subject: FW: November 2016 Newsletter -- LDC
> To: "lewis.mcgibb...@gmail.com"
> 
> 
>  
> 
>  
> 
> Dr. Lewis John McGibbney Ph.D., B.Sc.
> 
> Data Scientist II
> 
> Computer Science for Data Intensive Applications Group 398M
> 
> Jet Propulsion Laboratory
> 
> California Institute of Technology 
> 
> 4800 Oak Grove Drive
> 
> Pasadena, California 91109-8099
> 
> Mail Stop : 158-256C
> 
> Tel:  (+1) (818)-393-7402 
> Cell: (+1) (626)-487-3476 
> Fax:  (+1) (818)-393-1190 
> Email: lewis.j.mcgibb...@jpl.nasa.gov 
>  
> 
>
> 
>  
> 
>  Dare Mighty Things
> 
>  
> 
> From: Ldc-customers1 on behalf of Penn LDC
> Date: Tuesday, November 15, 2016 at 9:44 AM
> To: Penn LDC
> Subject: November 2016 Newsletter -- LDC
> 
>  
> 
> In this newsletter:
> 
> Join LDC for Membership Year 2017
> 
> Commercial use and LDC data
> 
> Spring 2017 Data Scholarship Program
> 
> LDC closed November 24-25 for US Thanksgiving Holiday
> 
>  
> 
> New publications:
> 
> JANA: A Human-Human Dialogues Corpus for Egyptian Dialect 
> 
>  
> 
> Multi-Language Conversational Telephone Speech 2011 – Slavic Group 
> 
>  
> 
> IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a 
> 
> GALE Phase 3 and 4 Chinese Newswire Parallel Text 
> 
>  
> 
> Join LDC for Membership Year 2017
> 
> Organizations engaged in language-related research, education and technology 
> development are invited to join LDC for Membership Year (MY) 2017. Consortium 
> members enjoy unparalleled access and continuing rights to new data releases 
> and to an archive of close to 700 holdings.
> 
> Membership fees have not increased for 2017. In addition, discounts are 
> available for organizations who keep their membership current and for those 
> who join before March 1, 2017. 
> 
> • MY 2016 members receive a 10% discount if they renew their 
> membership before March 1, 2017. After March 1, MY2016 members receive a 5% 
> discount if they renew their membership any time in 2017.
> 
> • New members and returning former members receive a 5% discount 
> off the membership fee if they join/renew before March 1, 2017.
> 
> Plans for MY2017 publications are in progress. Among the expected releases 
> are:
> 
> 2010 NIST Speaker Recognition Evaluation data set
> 
> Multilanguage conversational telephone speech: developed to support language 
> identification research in related languages
> 
> UCLA High Speed Laryngeal Database: audio recordings and high-speed 
> videoendoscopic images of the vocal folds while sustaining vowels
> 
> Noisy TIMIT: TIMIT with added artificial noise
> 
> CHiME shared task data: noisy read WSJ speech
> 
> First Year Law Students’ Memoranda: memos to a hypothetical court with 
> annotations
> 
> IARPA Babel Language Packs: languages include Vietnamese, Haitian Creole, 
> Zulu, Kazakh and Lithuanian
> 
> BOLT: source, parallel and word-aligned data in all languages
> 
> RATS Keyword Spotting data set
> 
> GALE Phases 3 and 4: all tasks and languages   
> 
> Visit Join LDC  for details on 
> membership, user accounts and payment.
> 
> Commercial use and LDC data
> 
> For-profit organizations are reminded that an LDC membership is a 
> pre-requisite for obtaining a commercial license to almost all LDC databases. 
> Non-member organizations, including non-member for-profit organizations, 
> cannot use LDC data to develop or test products for commercialization, nor 
> can they use LDC data in any 

Re: Updating Incubator summary

2016-11-15 Thread Matt Post
We're still waiting on our first software release, so it seems to me a bit 
premature to graduate? Though I don't know how these decisions are made — what 
goes into it?

Here is the roadmap that I have in mind:

- 6.1 release (imminent)
- Large-scale release of language packs (imminent)
- Linux.com article introducing people to MT, Joshua, language packs, and 
adding custom rules
- Release of docker-based language packs (including KenLM)
- 7.0 release (spring)
- Graduate

If we keep that rough schedule, we'll have incubated a year and have a lot to 
show for it.

matt


> On Nov 15, 2016, at 12:13 PM, Henri Yandell <bay...@apache.org> wrote:
> 
> Thanks :)
> 
> Reason for asking being that it felt that the standard checklist things
> were complete and I was wondering what the path to graduation is?
> 
> Any reason not to start thinking about a vote?
> 
> On Tue, Nov 15, 2016 at 04:02 Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Thanks, Lewis, and Henri, for pointing this out.
>> 
>> 
>>> On Nov 15, 2016, at 1:18 AM, lewis john mcgibbney <lewi...@apache.org>
>> wrote:
>>> 
>>> Hi Henri,
>>> I just pushed the update to SVN. Should update asynch reasonably soon.
>>> 
>>> http://incubator.apache.org/projects/joshua.html
>>> 
>>> Thanks
>>> 
>>> On Sun, Nov 13, 2016 at 1:22 PM, <
>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>> 
>>>> 
>>>> From: Henri Yandell <bay...@apache.org>
>>>> To: dev@joshua.incubator.apache.org
>>>> Cc:
>>>> Date: Sun, 13 Nov 2016 01:17:57 -0800
>>>> Subject: Updating Incubator summary
>>>> Would be useful to update this page:
>>>> 
>>>> http://incubator.apache.org/projects/joshua.html
>>>> 
>>>> 
>>>> Are there any of the checklist items that are still open?
>>>> 
>>>> 
>>> As far as I am aware no :)
>> 
>> 



[jira] [Commented] (JOSHUA-315) Thrax keeps all rules

2016-11-14 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15664649#comment-15664649
 ] 

Matt Post commented on JOSHUA-315:
--

This has been addressed in commit 885389d513b5d0f3f68b59c3b17a776584b3a208. If 
you add the word "count" to the list of thrax features in the thrax config 
file, a sixth field will be extracted with the rule count, e.g.,

[X] ||| de ||| of ||| 0.72572 0.29124 1 0 0.39357 0.17023 ||| 0-0 ||| 2565758
[X] ||| de ||| to ||| 2.89509 2.10811 1 0 2.87285 2.08282 ||| 0-0 ||| 215020
[X] ||| de ||| in ||| 3.11663 2.17583 1 0 2.91081 2.34837 ||| 0-0 ||| 207011
...

This is then used by the filter-rules.pl script (with the flag -t 100) to 
remove all rules except the 100 most frequent for each source side. This 
has been added to the pipeline. The grammars seem to be about 5% smaller and 
should have only a positive effect on running time.
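
As a rough illustration of the pruning step described above (not the actual 
filter-rules.pl code), the following Python sketch keeps only the most frequent 
rules for each source side. It assumes the six-field, "|||"-separated layout 
shown in the example rules, with the count in the last field; the function name 
top_rules_per_source and its top_n parameter are hypothetical.

```python
from collections import defaultdict
import heapq


def top_rules_per_source(lines, top_n=100):
    """Keep the top_n highest-count rules for each source side.

    Each line is assumed to look like:
    [X] ||| source ||| target ||| features ||| alignment ||| count
    """
    by_source = defaultdict(list)
    for index, line in enumerate(lines):
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 6:
            # Assumption: every rule carries the count as its sixth field.
            raise ValueError("rule has no count field: %r" % line)
        source, count = fields[1], int(fields[5])
        by_source[source].append((count, index, line))

    kept = []
    for rules in by_source.values():
        # Select the top_n rules with the largest counts for this source side.
        kept.extend(heapq.nlargest(top_n, rules, key=lambda rule: rule[0]))

    # Restore the original input order of the surviving rules.
    kept.sort(key=lambda rule: rule[1])
    return [line for _, _, line in kept]
```

Grouping by source side before selecting keeps a rare source phrase from being 
crowded out by a very common one such as "de".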

> Thrax keeps all rules
> -
>
> Key: JOSHUA-315
> URL: https://issues.apache.org/jira/browse/JOSHUA-315
> Project: Joshua
> Issue Type: Bug
> Reporter: Matt Post
> Fix For: 6.2
>
>
> When extracting rules, Thrax keeps *all* options for each target side. For 
> large bitexts and common source sides (e.g., "de" for Spanish–English), there 
> can be tens of thousands of translations, due to errors in the alignments and 
> phenomena like garbage collection. The decoder throws out all but the top 
> num_translation_options of these (default 20), but before doing so, it has to 
> score all the target side options with all feature functions, including the 
> language model. This slows down "warming up" of the model and means that the 
> first sentences to use these items are very slow to translate.
> I have updated scripts/training/filter-rules.pl to filter out using Thrax's 
> rarity penalty field, but it would be much better if Thrax were to keep only 
> the 100 most frequent translation options for each source side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (JOSHUA-315) Thrax keeps all rules

2016-11-14 Thread Matt Post (JIRA)

 [ 
https://issues.apache.org/jira/browse/JOSHUA-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post resolved JOSHUA-315.
--
Resolution: Fixed

> Thrax keeps all rules
> -
>
> Key: JOSHUA-315
> URL: https://issues.apache.org/jira/browse/JOSHUA-315
> Project: Joshua
>  Issue Type: Bug
>    Reporter: Matt Post
> Fix For: 6.2
>
>
> When extracting rules, Thrax keeps *all* options for each target side. For 
> large bitexts and common source sides (e.g., "de" for Spanish–English), there 
> can be tens of thousands of translations, due to errors in the alignments and 
> phenomena like garbage collection. The decoder throws out all but the top 
> num_translation_options of these (default 20), but before doing so, it has to 
> score all the target side options with all feature functions, including the 
> language model. This slows down "warming up" of the model and means that the 
> first sentences to use these items are very slow to translate.
> I have updated scripts/training/filter-rules.pl to filter out using Thrax's 
> rarity penalty field, but it would be much better if Thrax were to keep only 
> the 100 most frequent translation options for each source side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Release Apache Joshua (Incubating) 6.1

2016-11-14 Thread Matt Post
+1

Thanks for starting this off, Lewis!


> On Nov 14, 2016, at 12:54 PM, Ramirez, Paul M (398M) wrote:
> 
> +1, let's get it released!!!
> 
> --Paul
> 
> ==
> Paul Ramirez - Group Supervisor
> Computer Science for Data Intensive Applications (398M)
> NASA - Jet Propulsion Laboratory
> 4800 Oak Grove Dr.
> Pasadena, CA 91109 USA
> Mailstop: 158-242
> Office: 818-354-1015
> Cell: 818-395-8194
> ==
> 
> On 11/14/16, 9:16 AM, "lewis john mcgibbney"  wrote:
> 
>Hi Folks,
>Please VOTE on the Apache Joshua 6.1 Release Candidate #1.
> 
>We solved 44 issues: https://s.apache.org/joshua6.1
> 
>Git source tag (167489bbd78526b9833fe7c88646bf96101d5d2b):
>https://s.apache.org/joshua6.1tag
> 
>Staging repo:
>https://repository.apache.org/content/repositories/orgapachejoshua-1000/
> 
>Source Release Artifacts:
>https://dist.apache.org/repos/dist/dev/incubator/joshua/
> 
>PGP release keys (signed using 48BAEBF6):
>https://dist.apache.org/repos/dist/release/incubator/joshua/KEYS
> 
>Vote will be open for 72 hours.
>Thank you to everyone that is able to VOTE as well as everyone that
>contributed to Apache Joshua 6.1.
> 
>[ ] +1, let's get it released!!!
>[ ] +/-0, fine, but consider to fix few issues before...
>[ ] -1, nope, because... (and please explain why)
> 
>P.S. here is my +1
> 
>-- 
>http://home.apache.org/~lewismc/
>@hectorMcSpector
>http://www.linkedin.com/in/lmcgibbney
> 
> 



Re: Lewis Volunteering for 6.1 Release Manager

2016-11-10 Thread Matt Post
Just landing back in the states from Berlin. This sounds great Lewis!

matt (from my phone)

> Le 10 nov. 2016 à 12:02, lewis john mcgibbney  a écrit :
> 
> Hi Folks,
> I would like to put myself forward as release manager for 6.1.
> I've got a lot of experience working with Incubating releases and have been
> successful in the position of release manager resulting in the release of
> around 20-30 official incubating and top level projects here at Apache.
> I'll make sure to document the entire release procedure on our wiki for
> future reference.
> Does anyone object? If not then I will get to it today.
> Lewis
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: [jira] [Commented] (JOSHUA-318) scripts/training/run_tuner.py should enable configurable memory usage when invioking joshua-decoder

2016-11-02 Thread Matt Post
Not sure how to interpret that comment. 6.2 will be 7.


> On Nov 2, 2016, at 1:57 PM, Lewis John McGibbney (JIRA)  
> wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15629835#comment-15629835
>  ] 
> 
> Lewis John McGibbney commented on JOSHUA-318:
> -
> 
> Agreed, it's set for fix 6.2... if we ever release 6.2.
> 
>> scripts/training/run_tuner.py should enable configurable memory usage when 
>> invioking joshua-decoder
>> ---
>> 
>>Key: JOSHUA-318
>>URL: https://issues.apache.org/jira/browse/JOSHUA-318
>>Project: Joshua
>> Issue Type: Improvement
>> Components: tuner
>>   Affects Versions: 6.0.5
>>   Reporter: Lewis John McGibbney
>>Fix For: 6.2
>> 
>> 
>> When I run the run_tuner.py script I can easily run into the following
>> {code}
>> [mert-1] rebuilding...
>>  dep=/usr/local/joshua_resources/russian_experiments/exp3/data/tune/corpus.en
>>  dep=/usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config 
>> [CHANGED]
>>  dep=tune/model/grammar.gz.packed/slice_0.source [CHANGED]
>>  
>> dep=/usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config.final
>>  [NOT FOUND]
>>  cmd=/usr/local/incubator-joshua/scripts/training/run_tuner.py 
>> /usr/local/joshua_resources/russian_experiments/exp3/data/tune/corpus.en 
>> /usr/local/joshua_resources/russian_experiments/exp3/data/tune/corpus.ru 
>> --tunedir /usr/local/joshua_resources/russian_experiments/exp3/tune --tuner 
>> mert --decoder 
>> /usr/local/joshua_resources/russian_experiments/exp3/tune/decoder_command 
>> --decoder-config 
>> /usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config 
>> --decoder-output-file 
>> /usr/local/joshua_resources/russian_experiments/exp3/tune/output.nbest 
>> --decoder-log-file 
>> /usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.log 
>> --iterations 10 --metric 'BLEU 4 closest'
>>  JOB FAILED (return code 1)
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>  at 
>> org.apache.joshua.decoder.ff.tm.packed.PackedGrammar$PackedSlice.initializeFeatureStructures(PackedGrammar.java:385)
>>  at 
>> org.apache.joshua.decoder.ff.tm.packed.PackedGrammar$PackedSlice.<init>(PackedGrammar.java:368)
>>  at 
>> org.apache.joshua.decoder.ff.tm.packed.PackedGrammar.<init>(PackedGrammar.java:153)
>>  at 
>> org.apache.joshua.decoder.Decoder.initializeTranslationGrammars(Decoder.java:458)
>>  at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:389)
>>  at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
>>  at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
>> Traceback (most recent call last):
>>  File "/usr/local/incubator-joshua/scripts/training/run_tuner.py", line 553, 
>> in <module>
>>main(sys.argv)
>>  File "/usr/local/incubator-joshua/scripts/training/run_tuner.py", line 536, 
>> in main
>>run_zmert(opts.tunedir, opts.source, opts.target, opts.decoder, 
>> opts.decoder_config, opts.decoder_output_file, opts)
>>  File "/usr/local/incubator-joshua/scripts/training/run_tuner.py", line 417, 
>> in run_zmert
>>opts.metric, opts.iterations or 10)
>>  File "/usr/local/incubator-joshua/scripts/training/run_tuner.py", line 399, 
>> in setup_configs
>>for feature,weight in get_features(config):
>>  File "/usr/local/incubator-joshua/scripts/training/run_tuner.py", line 351, 
>> in get_features
>>output = check_output("%s/bin/joshua-decoder -c %s -show-weights -v 0" % 
>> (JOSHUA, config_file), shell=True)
>>  File "/Users/lmcgibbn/miniconda3/lib/python3.5/subprocess.py", line 626, in 
>> check_output
>>**kwargs).stdout
>>  File "/Users/lmcgibbn/miniconda3/lib/python3.5/subprocess.py", line 708, in 
>> run
>>output=stdout, stderr=stderr)
>> subprocess.CalledProcessError: Command 
>> '/usr/local/incubator-joshua/bin/joshua-decoder -c 
>> /usr/local/joshua_resources/russian_experiments/exp3/tune/joshua.config 
>> -show-weights -v 0' returned non-zero exit status 1
>> {code}
>> This is because, by default, the joshua-decoder script runs with 4g of 
>> memory. The run_tuner.py script should be flexible enough to continue with 
>> the memory allocation provided when the pipeline was initially invoked. This 
>> value should then be passed to the joshua-decoder script.
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
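
One way the request in JOSHUA-318 could be met is for the tuner to build the 
decoder command with an explicit memory setting instead of relying on the 4g 
default. The sketch below is only an illustration: the helper name 
build_show_weights_command is hypothetical, and it assumes the joshua-decoder 
wrapper script accepts a "-m <size>" option, which should be checked against 
the script itself.

```python
import re


def build_show_weights_command(joshua_home, config_file, memory="4g"):
    """Build the argument list for the decoder's -show-weights call."""
    # Accept sizes such as "4g", "800m", or "16G"; reject anything else early.
    if not re.fullmatch(r"\d+[kmgKMG]", memory):
        raise ValueError("unexpected memory size: %r" % memory)
    return [
        "%s/bin/joshua-decoder" % joshua_home,
        "-m", memory,  # assumed flag for the JVM heap size
        "-c", config_file,
        "-show-weights",
        "-v", "0",
    ]
```

Returning a list rather than a formatted string means the command can be passed 
to subprocess.check_output without shell=True, which avoids quoting problems 
with paths that contain spaces.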



Re: Community Review of New Language Pack

2016-11-02 Thread Matt Post
All right, trying again to download, will report back.


> On Nov 2, 2016, at 2:19 PM, lewis john mcgibbney <lewi...@apache.org> wrote:
> 
> Hi Matt,
> Thanks for looking into this.
> 
> On Wed, Nov 2, 2016 at 10:49 AM, <
> dev-digest-h...@joshua.incubator.apache.org> wrote:
> 
>> From: Matt Post <p...@cs.jhu.edu>
>> To: dev@joshua.incubator.apache.org
>> Cc:
>> Date: Tue, 1 Nov 2016 16:26:29 -0400
>> Subject: Re: Community Review of New Language Pack
>> Lewis, can I get an MD5 or SHA1 checksum? I'm getting errors unpacking.
>> 
> 
> Yes, please see
> http://home.apache.org/~lewismc/language-pack-ru-en-2016-10-28.tar.gz.md5
> 
> 
>> 
>> I do see that you built the LP with the old scripts. I'll write up
>> instructions on how to do it with the new set.
>> 
>> 
> Correct. I would greatly appreciate that thank you Matt.
> Lewis
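
For anyone else checking the download against the published .md5 file, a 
minimal Python sketch follows. It assumes the .md5 file starts with the hex 
digest (optionally followed by a file name); both function names are 
hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte archive is not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, md5_text):
    """Compare a file's digest with the contents of a .md5 file."""
    # Assumption: the first whitespace-separated token is the digest.
    expected = md5_text.split()[0].lower()
    return md5_of_file(path) == expected
```

A mismatch here would point to a truncated or corrupted download rather than a 
problem with the archive itself.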



Re: Community Review of New Language Pack

2016-11-01 Thread Matt Post
Checking this out now


> On Oct 28, 2016, at 4:25 PM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> I managed to generate my first language pack today based on a Hiero model.
> It's 4.8GB in size so I have made it available via my home.apache.org
> public space at [0]. Right now it is uploading and will take a wee while.
> I would like some community review so we can review the quality of what has
> been generated. In addition there are a number of immediate things I am
> struggling with.
> 
> Firstly, the following files were not present after running the bundler.py.
> 
>   -  prepare.sh, this is a baseline requirement for running the tests as
>   detailed within the auto-generated README.
>   - the entire 'scripts' directory!!! This means that no utility
>   processing can be undertaken at all.
> 
> I know that both of the above are essential requirements, I therefore added
> them from a different language pack, increased default maximum memory usage
> and also augmented the README with some details regarding the dataset used
> to generate the language pack.
> 
> In comparison to the es --> en language pack posted by Matt, due to the fact
> that no scripts directory was generated, this language pack does not have
> the scripts/release directory either. I am not sure how this was generated.
> 
> Over and above what I've detailed so far, there is one blocking issue for
> me... when I submit Russian text to the Joshua server, it just spits back
> out the same Russian text! I can see the decoder logging to std out however
> I can only assume that no decoding is actually taking place.
> 
> Can you guys please review the language pack, provide feedback on the
> configuration, some of the scores which have been generated and even the
> BLEU score? I have absolutely everything local and also backed up so I can
> provide absolutely everything as well as the exact commands I invoked to
> generate the entire thing from start to finish.
> Cheers troops.
> 
> [0] http://home.apache.org/~lewismc/language-pack-ru-en-2016-10-28.tar.gz
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: Podling Report Reminder - November 2016

2016-10-31 Thread Matt Post
Folks, I filled this out today. 

matt (from my phone)

> Le 31 oct. 2016 à 20:00, johndam...@apache.org a écrit :
> 
> Dear podling,
> 
> This email was sent by an automated system on behalf of the Apache
> Incubator PMC. It is an initial reminder to give you plenty of time to
> prepare your quarterly board report.
> 
> The board meeting is scheduled for Wed, 16 November 2016, 10:30 am PDT.
> The report for your podling will form a part of the Incubator PMC
> report. The Incubator PMC requires your report to be submitted 2 weeks
> before the board meeting, to allow sufficient time for review and
> submission (Wed, November 02).
> 
> Please submit your report with sufficient time to allow the Incubator
> PMC, and subsequently board members to review and digest. Again, the
> very latest you should submit your report is 2 weeks prior to the board
> meeting.
> 
> Thanks,
> 
> The Apache Incubator PMC
> 
> Submitting your Report
> 
> --
> 
> Your report should contain the following:
> 
> *   Your project name
> *   A brief description of your project, which assumes no knowledge of
>the project or necessarily of its field
> *   A list of the three most important issues to address in the move
>towards graduation.
> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>aware of
> *   How has the community developed since the last report
> *   How has the project developed since the last report.
> 
> This should be appended to the Incubator Wiki page at:
> 
> http://wiki.apache.org/incubator/November2016
> 
> Note: This is manually populated. You may need to wait a little before
> this page is created from a template.
> 
> Mentors
> ---
> 
> Mentors should review reports for their project(s) and sign them off on
> the Incubator wiki page. Signing off reports shows that you are
> following the project - projects that are not signed may raise alarms
> for the Incubator PMC.
> 
> Incubator PMC


