★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-11-28 Thread Matt Post
One project I think could be interesting for Joshua's future is sketched here.

- Dynamic phrase tables. Joshua currently lets people add custom phrases to the 
existing models that then get used. There is a research topic here for how to 
make it better (particularly, how to set the weights of rules that are added at 
runtime instead of learned from bitext), but it works really well for adding 
words that are OOV (since it's always cheaper to use the OOV). Here's a demo of 
how this works (this feature is included in the language packs). 


https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables

- Translation memories. There is a large commercial market (billions) for tools 
called "translation memories", where translators are translating documents, and 
the sentences get queried against their past translations and matched in a 
fuzzy fashion. The big tool on the market for this is SDL Trados.
I'm not talking about selling a product, but in a space that big, there have 
got to be a lot of people who'd rather just run their own system, than shell 
out for an expensive (and ugly) tool. So there is a big niche for an open 
source tool, and currently nothing really filling it. The "dynamic phrase 
table" feature above provides the beginnings of offering a TM competitor, but 
one that is "seeded" with a regular statistical machine translation model.

- Dynamic re-tuning. One thing that'd be *really* cool is to revamp the tuning 
infrastructure in Joshua. The use-case I imagine is that Joshua could sit on 
top of a large tuning set across diverse domains (e.g., formal news, informal 
web logs, spoken dialogue, etc.). You could then add new phrases in sentences as 
above, which would get automatically aligned, and then everything could be 
retuned at the user's request (or perhaps at night). This way, when people 
added new data to their models, Joshua would automatically find the best 
weights, either immediately or on some schedule. There'd be less worry about 
bit rot.

- Data collection and sharing. Another cool idea would be to allow people to 
easily send us data. If we get to a place where people are building custom 
dynamic phrase tables, a cool ability would be to make it easy for people to 
upload the data they have added to their private systems, which we could then 
collect and further distribute. So Joshua could become an easy means for people 
to crowdsource data used for translation systems. This is obviously just a 
high-level idea that would require a lot of details to be figured out, but it 
would be super cool.

matt

[RESULT] WAS Re: [VOTE] Release Apache Joshua 6.1 RC#2

2016-11-28 Thread lewis john mcgibbney
Evening All,
OK, 72 hours have come and gone. I'm going to close off this VOTE thread. The
following VOTEs were cast.

[10] +1, let's get it released!!!
Lewis John McGibbney
Matt Post
Tommaso Teofili
John Hewitt
Kellen Sunderland
Tom Barber
Chris A. Mattmann
Henry Saputra
Michael A. Hedderich
Felix Hieber

[0] +/-0, fine, but consider to fix few issues before...
[0] -1, nope, because... (and please explain why)

Thank you to everyone that VOTE'd, I'll progress to general@ and see how we
get on.
Thanks
Lewis

On Tue, Nov 22, 2016 at 9:15 PM, lewis john mcgibbney wrote:

> Hello user@ and dev,
> Please VOTE on the Apache Joshua 6.1 Release Candidate #2.
>
> We solved 50 issues: https://s.apache.org/joshua6.1
>
> Git source tag (29c8be650d53216f779a340d33f8f61af4d45629):
> https://s.apache.org/pk2t 
>
> Staging repo:
> https://repository.apache.org/content/repositories/orgapachejoshua-1001/
>
> Source Release Artifacts:
> https://dist.apache.org/repos/dist/dev/incubator/joshua/
>
> PGP release keys (signed using 48BAEBF6):
> https://dist.apache.org/repos/dist/release/incubator/joshua/KEYS
>
> Vote will be open for 72 hours.
> Thank you to everyone that is able to VOTE as well as everyone that
> contributed to Apache Joshua 6.1.
>
> [ ] +1, let's get it released!!!
> [ ] +/-0, fine, but consider to fix few issues before...
> [ ] -1, nope, because... (and please explain why)
>
> P.S. here is my +1
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>



-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Downloading of non ASF licensed code

2016-11-28 Thread Matt Post
This would be easy to do. Maybe just a simple prompt that alerts the user? 
Something like

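    # Warn that some downloaded tools are not Apache-licensed, then wait
    # for the user to confirm. Requires bash (uses [[ ]]).
    echo "Warning: this script downloads many tools used in building and running"
    echo "Joshua. Not all of them are Apache Licensed. If you wish to continue,"
    echo "hit Enter."
    read -r j
    # Any input other than a bare Enter aborts the download.
    if [[ -n "$j" ]]; then
        echo "Quitting."
        exit 1
    fi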



> On Nov 25, 2016, at 10:41 AM, Tom Barber wrote:
> 
> This may have come up before in the whole licensing chat so apologies if
> I'm just going over old ground.
> 
> The download-deps.sh file obviously downloads and builds stuff with non-ASF
> licenses. I realise this is for model training purposes only, and 99.9%
> won't care, but should we consider putting a prompt into that script warning
> people? I ask because a company might add in the training modules blindly,
> assuming that because the script is distributed by the ASF the modules are
> also ASL2.0.
> 
> Just a thought.
> 
> Tom
> 
> -- 
> Tom Barber
> CTO Spicule LTD
> t...@spicule.co.uk
> 
> http://spicule.co.uk
> 
> GB: +44(0)5603641316
> US: +18448141689