Re: Issues to Fix with Apache Joshua 6.1 RC#2

2016-12-01 Thread Matt Post
Hi folks,

What's the status of this? Can we check off items from the list below that have 
been completed?

matt


> On Nov 29, 2016, at 4:24 PM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> We have a number of issues to fix which were picked up over on general@. In
> particular, we received excellent feedback from my good friend Justin [12]
> [13]. As the general@ VOTE has not had 72 hours to stew, I am not going to
> close it; however, we should take this time to fix the issues with master
> before we spin an RC#3. These can be summarized as follows.
> I've opened a Jira issue to track all of this.
> https://issues.apache.org/jira/browse/JOSHUA-324
> Let's track the progress on the Jira ticket.
> 
> ==
> - You're missing "incubating" in the release artifact's name. [1]
> - There are a number of binary files in the source release that look to be
> compiled source code.
> 
> I checked:
> - name doesn’t include incubating
> - signatures and hashes correct
> - DISCLAIMER exists
> - LICENSE is missing a few things (see below)
> - a source file is missing an Apache header [7]
> - Several unexpected binary files are contained in the source release
> [8][9][10][11]
> - Can compile from source
> 
> License is missing:
> - MIT licensed normalize.css v3.0.3 bundled in [5]
> - glyph icon fonts [6]
> 
> Not an issue, but it's a little odd to have LICENSE and NOTICE.txt - usually
> both are bare or both have the .txt extension.
> 
> Also, while looking at your site I noticed that the download links on your
> incubating site [2] point to GitHub; please change them to point to the
> official release area.
> Also, the 6.1 release has already been tagged and is available for public
> download on GitHub [4] before this vote is finished. This is IMO against
> Apache release policy [3]; please remove it.
> 
> I also notice you recently released the language packs (18th Nov) but there
> doesn’t seem to have been a vote for that? Any reason for this?
> ===
> 
> [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
> [2]
> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
> [3] http://www.apache.org/dev/release.html#what
> [4] https://github.com/apache/incubator-joshua/releases
> [5] ./demo/bootstrap/css/bootstrap.min.css
> [6] apache-joshua-6.1/demo/bootstrap/fonts/*
> [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
> [8] ./bin/GIZA++
> [9] ./bin/mkcls
> [10] ./bin/snt2cooc.out
> [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
> [12]
> http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
> [13]
> http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
> 
> 
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney



Re: modernmt

2016-12-01 Thread Matt Post
John,

Thanks for sharing, this is really helpful. I didn't realize that Marcello was 
involved.

I think we can identify with the NMT danger. I still think there is a big niche 
that deep learning approaches won't reach for a few years, until GPUs become 
super prevalent. That's why I like ModernMT's approaches, which overlap with 
many of the things I've been thinking about. One thing I really like is their 
automatic context-switching approach. This is a great way to build 
general-purpose models, and I'd like to mimic it. I have some general ideas 
about how this should be implemented but am also looking into the literature 
here.
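To sketch what I mean at its very simplest (all names hypothetical; ModernMT's actual approach is more sophisticated than this), context switching could route each input to the domain model whose sample vocabulary it best overlaps:

```java
import java.util.*;

// Illustrative sketch of automatic context switching: pick the domain model
// whose sample vocabulary best overlaps the input sentence. Hypothetical
// names throughout; not ModernMT's actual implementation.
public class ContextSwitchDemo {

    // Jaccard overlap between the input's tokens and a domain's vocabulary.
    static double overlap(Set<String> input, Set<String> domainVocab) {
        Set<String> inter = new HashSet<>(input);
        inter.retainAll(domainVocab);
        Set<String> union = new HashSet<>(input);
        union.addAll(domainVocab);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Return the name of the best-matching domain for a sentence.
    static String pickDomain(String sentence, Map<String, Set<String>> domains) {
        Set<String> tokens =
            new HashSet<>(Arrays.asList(sentence.toLowerCase().split("\\s+")));
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Set<String>> e : domains.entrySet()) {
            double s = overlap(tokens, e.getValue());
            if (s > bestScore) { bestScore = s; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> domains = new HashMap<>();
        domains.put("news", new HashSet<>(Arrays.asList("president", "election", "reported")));
        domains.put("dialogue", new HashSet<>(Arrays.asList("hi", "thanks", "please")));
        System.out.println(pickDomain("the president spoke", domains)); // prints "news"
    }
}
```

A real system would use distributional similarity rather than raw vocabulary overlap, but the routing idea is the same.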

matt


> On Dec 1, 2016, at 1:46 PM, John Hewitt  wrote:
> 
> I had a few good conversations over dinner with this team at AMTA in Austin
> in October.
> They seem to be in the interesting position where their work is good, but
> is in danger of being superseded by neural MT as they come out of the gate.
> Clearly, it has benefits over NMT, and is easier to adopt, but may not be
> the winner over the long run.
> 
> Here's the link to their AMTA tutorial.
> 
> -John
> 
> On Thu, Dec 1, 2016 at 10:17 AM, Mattmann, Chris A (3010) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> 
>> Wow seems like this kind of overlaps with BigTranslate as well.. thanks
>> for passing
>> along Matt
>> 
>> ++
>> Chris Mattmann, Ph.D.
>> Principal Data Scientist, Engineering Administrative Office (3010)
>> Manager, Open Source Projects Formulation and Development Office (8212)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 180-503E, Mailstop: 180-503
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++
>> 
>> 
>> On 12/1/16, 4:47 AM, "Matt Post"  wrote:
>> 
>>Just came across this, and it's really cool:
>> 
>>https://github.com/ModernMT/MMT
>> 
>>See the README for some great use cases. I'm surprised I'd never heard
>> of this before as it's EU funded and associated with U Edinburgh.
>> 
>> 



Re: modernmt

2016-12-01 Thread John Hewitt
I had a few good conversations over dinner with this team at AMTA in Austin
in October.
They seem to be in the interesting position where their work is good, but
is in danger of being superseded by neural MT as they come out of the gate.
Clearly, it has benefits over NMT, and is easier to adopt, but may not be
the winner over the long run.

Here's the link to their AMTA tutorial.

-John

On Thu, Dec 1, 2016 at 10:17 AM, Mattmann, Chris A (3010) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Wow seems like this kind of overlaps with BigTranslate as well.. thanks
> for passing
> along Matt
>
> ++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
> On 12/1/16, 4:47 AM, "Matt Post"  wrote:
>
> Just came across this, and it's really cool:
>
> https://github.com/ModernMT/MMT
>
> See the README for some great use cases. I'm surprised I'd never heard
> of this before as it's EU funded and associated with U Edinburgh.
>
>


Re: modernmt

2016-12-01 Thread Mattmann, Chris A (3010)
Wow seems like this kind of overlaps with BigTranslate as well.. thanks for 
passing
along Matt

++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 

On 12/1/16, 4:47 AM, "Matt Post"  wrote:

Just came across this, and it's really cool:

https://github.com/ModernMT/MMT

See the README for some great use cases. I'm surprised I'd never heard of 
this before as it's EU funded and associated with U Edinburgh.



Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-12-01 Thread Matt Post
It wouldn't be hard to add some TMX-like features, no. There are some technical 
challenges, though — for example, the current demo lets you add phrases, but 
that doesn't affect the language model at all.

Ideally, we'd also allow people to add whole sentences, and would then run 
John's fast_align implementation (with a saved model) to break down that new 
sentence, and do proper incremental updating.

How do you imagine Lucene fitting into this?
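To make the fuzzy-matching side of a TM concrete, here is a minimal sketch (all names hypothetical, and not Joshua's actual API): score the input against stored translation pairs with word-level edit distance. In practice an index like Lucene would replace the linear scan below for candidate retrieval.

```java
import java.util.*;

// Sketch of a translation-memory fuzzy lookup. Illustrative only: a Lucene
// index would replace the linear scan for candidate retrieval at scale.
public class TmLookupDemo {

    // Word-level Levenshtein distance between two token arrays.
    static int editDistance(String[] a, String[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;
        for (int j = 0; j <= b.length; j++) d[0][j] = j;
        for (int i = 1; i <= a.length; i++)
            for (int j = 1; j <= b.length; j++) {
                int sub = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        return d[a.length][b.length];
    }

    // Similarity in [0,1]; 1.0 is an exact match, as TM tools report it.
    static double similarity(String s, String t) {
        String[] a = s.split("\\s+"), b = t.split("\\s+");
        int dist = editDistance(a, b);
        int len = Math.max(a.length, b.length);
        return len == 0 ? 1.0 : 1.0 - (double) dist / len;
    }

    // Return the stored sentence with the highest fuzzy score, or null if
    // nothing clears the threshold.
    static String bestMatch(String query, List<String> memory, double threshold) {
        String best = null;
        double bestScore = threshold;
        for (String s : memory) {
            double score = similarity(query, s);
            if (score >= bestScore) { bestScore = score; best = s; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> memory = Arrays.asList(
            "the committee approved the budget",
            "the cat sat on the mat");
        // prints "the committee approved the budget"
        System.out.println(bestMatch("the committee approved the new budget", memory, 0.7));
    }
}
```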

matt


> On Dec 1, 2016, at 9:22 AM, Tommaso Teofili  wrote:
> 
> Matt,
> 
> really nice list of very useful features, thanks for this!
> One comment only, on the translation memories one: to someone who had
> never heard about it, it sounds not too complicated to implement on top of
> current Joshua (with an IR library like Apache Lucene); is my understanding
> correct?
> 
> My 2 cents,
> Tommaso
> 
> 
> On Tue, 29 Nov 2016 at 04:08, Matt Post wrote:
> 
>> One project I think could be interesting for Joshua's future is sketched
>> here.
>> 
>> - Dynamic phrase tables. Joshua currently lets people add custom phrases
>> to the existing models that then get used. There is a research topic here
>> for how to make it better (particularly, how to set the weights of rules
>> that are added at runtime instead of learned from bitext), but it works
>> really well for adding words that are OOV (since it's always cheaper to use
>> the OOV). Here's a demo of how this works (this feature is included in the
>> language packs).
>> 
>> 
>> https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables
>> 
>> - Translation memories. There is a large commercial market (billions) for
>> tools called "translation memories", where translators are translating
>> documents, and the sentences get queried against their past translations
>> and matched in a fuzzy fashion. The big tool on the market for this is SDL
>> Trados <
>> http://www.sdl.com/solution/language/translation-productivity/trados-studio/ 
>> >.
>> I'm not talking about selling a product, but in a space that big, there
>> have got to be a lot of people who'd rather just run their own system, than
>> shell out for an expensive (and ugly) tool. So there is a big niche for an
>> open source tool, and currently nothing really filling it. The "dynamic
>> phrase table" feature above provides the beginnings of offering a TM
>> competitor, but one that is "seeded" with a regular statistical machine
>> translation model.
>> 
>> - Dynamic re-tuning. One thing that'd be *really* cool is to revamp the
>> tuning infrastructure in Joshua. The use-case I imagine is that Joshua
>> could sit on top of a large tuning set across diverse domains (e.g., formal
>> news, informal web logs, spoken dialogue, etc). You could then add new
>> phrases in sentences as above, which would get automatically aligned, and
>> then everything could be retuned at the user's request (or perhaps at
>> night). This way, when people added new data to their models, Joshua would
>> automatically find the best weights, either immediately or on some
>> schedule. There'd be less worry about bit rot.
>> 
>> - Data collection and sharing. Another cool idea would be to allow people
>> to easily send us data. If we get to a place where people are building
>> custom dynamic phrase tables, a cool ability would be to make it easy for
>> people to upload the data they have added to their private systems, which
>> we could then collect and further distribute. So Joshua could become an
>> easy means for people to crowdsource data used for translation systems.
>> This is obviously just a high-level idea that would require a lot of
>> details to be figured out, but it would be super cool.
>> 
>> matt
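To make the dynamic-phrase-table idea above concrete, here is a toy sketch (illustrative only, not Joshua's actual grammar API; the flat default weight stands in for the open research question of how to score rules added at runtime):

```java
import java.util.*;

// Toy in-memory dynamic phrase table: rules can be added at runtime with a
// default weight. Illustrative only; how to weight runtime rules relative to
// rules learned from bitext is the open question mentioned above.
public class DynamicPhraseTableDemo {

    // A translation option: target phrase plus a single score.
    static final class Rule {
        final String target;
        final double weight;
        Rule(String target, double weight) { this.target = target; this.weight = weight; }
    }

    private final Map<String, List<Rule>> table = new HashMap<>();
    private final double defaultWeight;

    DynamicPhraseTableDemo(double defaultWeight) { this.defaultWeight = defaultWeight; }

    // Add a custom phrase pair at runtime, e.g. for an OOV word.
    void addRule(String source, String target) {
        table.computeIfAbsent(source, k -> new ArrayList<>())
             .add(new Rule(target, defaultWeight));
    }

    // Look up translation options; an OOV has none, so any runtime rule is
    // always cheaper than leaving the word untranslated.
    List<Rule> lookup(String source) {
        return table.getOrDefault(source, Collections.emptyList());
    }

    public static void main(String[] args) {
        DynamicPhraseTableDemo pt = new DynamicPhraseTableDemo(1.0);
        pt.addRule("Zeitgeist", "spirit of the age");
        System.out.println(pt.lookup("Zeitgeist").get(0).target); // prints "spirit of the age"
    }
}
```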



Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-12-01 Thread Tommaso Teofili
Matt,

really nice list of very useful features, thanks for this!
One comment only, on the translation memories one: to someone who had
never heard about it, it sounds not too complicated to implement on top of
current Joshua (with an IR library like Apache Lucene); is my understanding
correct?

My 2 cents,
Tommaso


On Tue, 29 Nov 2016 at 04:08, Matt Post wrote:

> One project I think could be interesting for Joshua's future is sketched
> here.
>
> - Dynamic phrase tables. Joshua currently lets people add custom phrases
> to the existing models that then get used. There is a research topic here
> for how to make it better (particularly, how to set the weights of rules
> that are added at runtime instead of learned from bitext), but it works
> really well for adding words that are OOV (since it's always cheaper to use
> the OOV). Here's a demo of how this works (this feature is included in the
> language packs).
>
>
> https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables
>
> - Translation memories. There is a large commercial market (billions) for
> tools called "translation memories", where translators are translating
> documents, and the sentences get queried against their past translations
> and matched in a fuzzy fashion. The big tool on the market for this is SDL
> Trados <
> http://www.sdl.com/solution/language/translation-productivity/trados-studio/>.
> I'm not talking about selling a product, but in a space that big, there
> have got to be a lot of people who'd rather just run their own system, than
> shell out for an expensive (and ugly) tool. So there is a big niche for an
> open source tool, and currently nothing really filling it. The "dynamic
> phrase table" feature above provides the beginnings of offering a TM
> competitor, but one that is "seeded" with a regular statistical machine
> translation model.
>
> - Dynamic re-tuning. One thing that'd be *really* cool is to revamp the
> tuning infrastructure in Joshua. The use-case I imagine is that Joshua
> could sit on top of a large tuning set across diverse domains (e.g., formal
> news, informal web logs, spoken dialogue, etc). You could then add new
> phrases in sentences as above, which would get automatically aligned, and
> then everything could be retuned at the user's request (or perhaps at
> night). This way, when people added new data to their models, Joshua would
> automatically find the best weights, either immediately or on some
> schedule. There'd be less worry about bit rot.
>
> - Data collection and sharing. Another cool idea would be to allow people
> to easily send us data. If we get to a place where people are building
> custom dynamic phrase tables, a cool ability would be to make it easy for
> people to upload the data they have added to their private systems, which
> we could then collect and further distribute. So Joshua could become an
> easy means for people to crowdsource data used for translation systems.
> This is obviously just a high-level idea that would require a lot of
> details to be figured out, but it would be super cool.
>
> matt


Re: modernmt

2016-12-01 Thread Tommaso Teofili
wow I had never heard of it either, but I do know one of the committers :)

On Thu, 1 Dec 2016 at 13:49, Matt Post wrote:

> Just came across this, and it's really cool:
>
> https://github.com/ModernMT/MMT
>
> See the README for some great use cases. I'm surprised I'd never heard of
> this before as it's EU funded and associated with U Edinburgh.


modernmt

2016-12-01 Thread Matt Post
Just came across this, and it's really cool:

https://github.com/ModernMT/MMT

See the README for some great use cases. I'm surprised I'd never heard of this 
before as it's EU funded and associated with U Edinburgh.