[jira] [Assigned] (JOSHUA-312) Even though alignment is cached, it is always re-done in pipeline re-execution

2016-10-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/JOSHUA-312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned JOSHUA-312:
---

Assignee: Lewis John McGibbney

> Even though alignment is cached, it is always re-done in pipeline re-execution
> --
>
> Key: JOSHUA-312
> URL: https://issues.apache.org/jira/browse/JOSHUA-312
> Project: Joshua
> Issue Type: Improvement
> Components: alignment
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 6.2
>
>
> Say a pipeline fails after alignment. On re-execution the alignment result 
> is not picked up from the cache, and alignment has to be run... again!
> We should investigate how alignments are cached, as reusing them would really 
> speed up rerunning end-to-end pipelines for large input datasets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (JOSHUA-312) Even though alignment is cached, it is always re-done in pipeline re-execution

2016-10-13 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573593#comment-15573593
 ] 

Lewis John McGibbney commented on JOSHUA-312:
-

Okey-doke... I managed to reproduce this today.
One of my pipelines just failed because I screwed up my paths; however, the 
failure came after alignment with the Berkeley aligner.
When I went to re-run the pipeline as follows, alignment was not pulled from 
the cache; it was completely re-run.
{code}
lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments $ ls -al
total 8
drwxr-xr-x   7 lmcgibbn  wheel  238 Oct 13 16:48 .
drwxr-xr-x  22 lmcgibbn  wheel  748 Oct 13 12:09 ..
drwxr-xr-x  29 lmcgibbn  wheel  986 Oct 13 16:48 .cachepipe
-rw-r--r--   1 lmcgibbn  wheel   47 Oct 13 12:24 README
drwxr-xr-x   5 lmcgibbn  wheel  170 Oct 13 16:48 alignments
drwxr-xr-x  12 lmcgibbn  wheel  408 Oct 13 12:23 data
drwxr-xr-x   6 lmcgibbn  wheel  204 Oct 13 12:24 scripts
lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments $ /usr/local/incubator-joshua/bin/pipeline.pl --rundir . --type hiero \
  --corpus /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en \
  --tune /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.tune \
  --test /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.test \
  --source en --target ru \
  --readme "Experiment 1 Run 1 of ru --> en model training" \
  --aligner berkeley
[train-copy-and-filter] cached, skipping...
[train-tokenize-en] cached, skipping...
[train-tokenize-ru] cached, skipping...
[train-trim] cached, skipping...
[train-lowercase-en] cached, skipping...
[train-lowercase-ru] cached, skipping...
[train-vocab-en] cached, skipping...
[train-vocab-ru] cached, skipping...
[tune-copy-and-filter] cached, skipping...
[tune-tokenize-en] cached, skipping...
[tune-tokenize-ru] cached, skipping...
[tune-lowercase-en] cached, skipping...
[tune-lowercase-ru] cached, skipping...
[tune-vocab-en] cached, skipping...
[tune-vocab-ru] cached, skipping...
[test-copy-and-filter] cached, skipping...
[test-tokenize-en] cached, skipping...
[test-tokenize-ru] cached, skipping...
[test-lowercase-en] cached, skipping...
[test-lowercase-ru] cached, skipping...
[test-vocab-en] cached, skipping...
[test-vocab-ru] cached, skipping...
[source-numlines] cached, skipping...
[source-numlines] retrieved cached result =>   817962
[berkeley-aligner-chunk-0] rebuilding...
  dep=alignments/0/word-align.conf
  dep=/usr/local/joshua_resources/russian_experiments/data/train/splits/corpus.en.0 [NOT FOUND]
  dep=/usr/local/joshua_resources/russian_experiments/data/train/splits/corpus.ru.0 [NOT FOUND]
  dep=alignments/0/training.align [NOT FOUND]
  cmd=java -d64 -Xmx10g -jar /usr/local/incubator-joshua/ext/berkeleyaligner/distribution/berkeleyaligner.jar ++alignments/0/word-align.conf
{code}

The aligner log looks as follows:

{code}
lmcgibbn@LMC-056430 /usr/local $ tail -f joshua_resources/russian_experiments/alignments/0/log
main() {
  Execution directory: alignments/0
  Preparing Training Data {
ERROR: No files found at source /dev/null
  } [23s, cum. 23s]
  817962 training sentences, 0 test sentences
  Training models: 2 stages {
Training stage 1: MODEL1 and MODEL1 jointly for 5 iterations {
  Initializing forward model [1m16s, cum. 1m16s]
  Initializing reverse model [1m36s, cum. 2m53s]
  Joint Train: 817962 sentences, jointly {
Iteration 1/5 {
  Sentence 1/817962
  Sentence 2/817962
  Sentence 3/817962
  Sentence 11/817962
  Sentence 40/817962
  Sentence 146/817962
...
{code}

It would therefore appear to me that YES, the pipeline is cached; however, on 
re-runs the cache is not used for the alignment step, so alignment is repeated.
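
For anyone following along: the `[NOT FOUND]` markers in the output above suggest why the step is rebuilt. A dependency-tracking cache typically reruns a step whenever the command or any listed dependency is missing or has changed. Below is a minimal Python sketch of that general idea. It is not CachePipe's actual code, and `check_step`, `file_digest`, and the `recorded` dictionary are hypothetical names used only for illustration.

```python
import hashlib
import os
from typing import Dict, List, Tuple


def file_digest(path: str) -> str:
    """Return a SHA-1 hex digest of the file's contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def check_step(cmd: str, deps: List[str],
               recorded: Dict[str, str]) -> Tuple[bool, List[str]]:
    """Decide whether a pipeline step must be rebuilt.

    `recorded` (hypothetical) maps "cmd" and each dependency path to the
    digest stored when the step last succeeded. Returns (rebuild, report).
    """
    rebuild = False
    report = []
    cmd_digest = hashlib.sha1(cmd.encode("utf-8")).hexdigest()
    if recorded.get("cmd") != cmd_digest:
        rebuild = True
        report.append("cmd [CHANGED]")
    for dep in deps:
        if not os.path.exists(dep):
            # A missing dependency always forces a rebuild.
            rebuild = True
            report.append(f"dep={dep} [NOT FOUND]")
        elif recorded.get(dep) != file_digest(dep):
            rebuild = True
            report.append(f"dep={dep} [CHANGED]")
    return rebuild, report
```

Under this model, a step whose declared dependencies (here the `corpus.en.0` / `corpus.ru.0` splits and `training.align`) do not exist at check time would be rebuilt on every run, regardless of what the cache holds. Whether that is what happens in the pipeline is something to confirm against the real code.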






[jira] [Created] (JOSHUA-315) Thrax keeps all rules

2016-10-13 Thread Matt Post (JIRA)
Matt Post created JOSHUA-315:


Summary: Thrax keeps all rules
Key: JOSHUA-315
URL: https://issues.apache.org/jira/browse/JOSHUA-315
Project: Joshua
Issue Type: Bug
Reporter: Matt Post
Fix For: 6.2


When extracting rules, Thrax keeps *all* target-side options for each source 
side. For large bitexts and common source sides (e.g., "de" for 
Spanish–English), there can be tens of thousands of translations, due to errors 
in the alignments and phenomena like garbage collection. The decoder throws out 
all but the top num_translation_options of these (default 20), but before doing 
so, it has to score all the target-side options with all feature functions, 
including the language model. This slows down "warming up" of the model and 
means that the first sentences to use these items are very slow to translate.

I have updated scripts/training/filter-rules.pl to filter using Thrax's 
rarity penalty field, but it would be much better if Thrax were to keep only 
the 100 most frequent translation options for each source side.
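
To make the proposal concrete, here is a minimal Python sketch of "keep only the k most frequent translation options per source side". It assumes rules are available as `(source, target, count)` tuples, which is an illustrative layout and not Thrax's grammar format; `keep_top_k` is a hypothetical helper and does not reflect what filter-rules.pl does.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed layout for illustration: (source side, target side, count).
Rule = Tuple[str, str, int]


def keep_top_k(rules: Iterable[Rule], k: int = 100) -> List[Rule]:
    """Keep at most k rules per source side, preferring higher counts."""
    by_source: Dict[str, List[Rule]] = defaultdict(list)
    for rule in rules:
        by_source[rule[0]].append(rule)

    kept: List[Rule] = []
    for options in by_source.values():
        # Stable sort: ties keep their original relative order.
        options.sort(key=lambda r: r[2], reverse=True)
        kept.extend(options[:k])
    return kept


if __name__ == "__main__":
    rules = [("de", "of", 50), ("de", "from", 30),
             ("de", "the", 1), ("casa", "house", 10)]
    for rule in keep_top_k(rules, k=2):
        print(rule)
```

Pruning at extraction time like this would mean the decoder never has to score the long tail of options for common source sides.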





Joshua 6.1

2016-10-13 Thread Matt Post
Hi folks,

I think I'm going to do the 6.1 release tomorrow. Any objections?

Along with the release will be about 60 language packs for a large range of 
languages. These will be released early next week and will be built on 
BerkeleyLM, so that there are no external dependencies.

I'd like to push out the release quietly until the language packs are ready, 
uploaded, and linked.

Is there anything I need to know to do an Apache release?

matt




[jira] [Commented] (JOSHUA-311) Improve pipeline logging to indicate location on berkeley alignment log(s)

2016-10-13 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572497#comment-15572497
 ] 

Matt Post commented on JOSHUA-311:
--

In any case, I'm going to move this to 6.2.

> Improve pipeline logging to indicate location on berkeley alignment log(s)
> --
>
> Key: JOSHUA-311
> URL: https://issues.apache.org/jira/browse/JOSHUA-311
> Project: Joshua
> Issue Type: Improvement
> Components: alignment, logging, pipeline
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Fix For: 6.2
>
>
> When one runs a pipeline using --aligner berkeley, no log location is 
> provided for the user to follow the progress of alignment.
> {code}
> [berkeley-aligner-chunk-0] rebuilding...
>   dep=alignments/0/word-align.conf [CHANGED]
>   dep=/usr/local/jpl/xdata/joshua_experiments/russian_experiments/0/data/train/splits/corpus.en.0 [NOT FOUND]
>   dep=/usr/local/jpl/xdata/joshua_experiments/russian_experiments/0/data/train/splits/corpus.ru.0 [NOT FOUND]
>   dep=alignments/0/training.align [NOT FOUND]
>   cmd=java -d64 -Xmx10g -jar /usr/local/jpl/xdata/joshua_experiments/incubator-joshua/ext/berkeleyaligner/distribution/berkeleyaligner.jar ++alignments/0/word-align.conf
> {code}
> We could add something like
> {code}
> [berkeley-aligner-chunk-0] rebuilding...
>   dep=alignments/0/word-align.conf [CHANGED]
>   dep=/usr/local/jpl/xdata/joshua_experiments/russian_experiments/0/data/train/splits/corpus.en.0 [NOT FOUND]
>   dep=/usr/local/jpl/xdata/joshua_experiments/russian_experiments/0/data/train/splits/corpus.ru.0 [NOT FOUND]
>   dep=alignments/0/training.align [NOT FOUND]
>   cmd=java -d64 -Xmx10g -jar /usr/local/jpl/xdata/joshua_experiments/incubator-joshua/ext/berkeleyaligner/distribution/berkeleyaligner.jar ++alignments/0/word-align.conf logs being written to /path/to/log
> {code}





[jira] [Resolved] (JOSHUA-280) Existing Language packs not compatible with Joshua master

2016-10-13 Thread Matt Post (JIRA)

 [ 
https://issues.apache.org/jira/browse/JOSHUA-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post resolved JOSHUA-280.
--
Resolution: Fixed

> Existing Language packs not compatible with Joshua master
> -
>
> Key: JOSHUA-280
> URL: https://issues.apache.org/jira/browse/JOSHUA-280
> Project: Joshua
> Issue Type: Bug
> Components: language packs
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Assignee: Matt Post
> Priority: Critical
> Fix For: 6.1
>
>
> When I work with the existing Spanish --> English language pack at 
> http://cs.jhu.edu/~post/language-packs/language-pack-es-en-phrase-2015-03-06.tgz,
>  I get the following error
> {code}
> lmcgibbn@LMC-032857 /usr/local/Cellar/joshua/HEAD/libexec/language-pack-es-en-phrase-2015-03-06(NUTCH-2089) $ ./run-joshua-server.sh
> INFO - Parameters read from configuration file: joshua.config
> INFO - tm = 'moses -owner pt -maxspan 0 -path phrase-table.packed -max-source-len 5'
> INFO - defaultnonterminal = 'X'
> INFO - goalsymbol = 'GOAL'
> INFO - featurefunction = 'StateMinimizingLanguageModel -lm_type kenlm -lm_order 5 -lm_file lm.kenlm'
> INFO - markoovs = 'false'
> INFO - search = 'stack'
> INFO - pop-limit: 100
> INFO - poplimit = '100'
> INFO - topn = '0'
> INFO - useuniquenbest = 'true'
> INFO - outputformat = '%s'
> INFO - includealignindex = 'false'
> INFO - featurefunction = 'OOVPenalty'
> INFO - featurefunction = 'WordPenalty'
> INFO - featurefunction = 'Distortion'
> INFO - featurefunction = 'PhrasePenalty'
> INFO - c = 'joshua.config'
> INFO - server-port: 5674
> INFO - serverport = '5674'
> INFO - Read 9 weights (0 of them dense)
> INFO - Reading vocabulary: phrase-table.packed/vocabulary
> INFO - Read 191983 entries from the vocabulary
> INFO - Reading packed config: phrase-table.packed/config
> 102030405060708090.100%
> Exception in thread "main" java.lang.RuntimeException: The grammar at phrase-table.packed was packed with packer version 0, but the earliest supported version is 3
>   at org.apache.joshua.decoder.ff.tm.packed.PackedGrammar.readConfig(PackedGrammar.java:1061)
>   at org.apache.joshua.decoder.ff.tm.packed.PackedGrammar.<init>(PackedGrammar.java:143)
>   at org.apache.joshua.decoder.phrase.PhraseTable.<init>(PhraseTable.java:65)
>   at org.apache.joshua.decoder.Decoder.initializeTranslationGrammars(Decoder.java:603)
>   at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:514)
>   at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:126)
>   at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
> {code}
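
The exception above reads like a version guard: the loader takes a packer version from the packed grammar's config and refuses anything older than the earliest supported version. A minimal Python sketch of that kind of check follows. The `key = value` config layout, the `version` key, and both function names are assumptions made for illustration; this is not Joshua's `PackedGrammar.readConfig`.

```python
# Earliest supported version, taken from the error message above.
MIN_SUPPORTED_VERSION = 3


def read_packer_version(config_text: str) -> int:
    """Parse a hypothetical 'key = value' config; return 0 if no version is set."""
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "version":
            return int(value.strip())
    return 0


def check_packer_version(config_text: str, path: str) -> int:
    """Return the packer version, or raise if the grammar is too old."""
    version = read_packer_version(config_text)
    if version < MIN_SUPPORTED_VERSION:
        raise RuntimeError(
            f"The grammar at {path} was packed with packer version {version}, "
            f"but the earliest supported version is {MIN_SUPPORTED_VERSION}"
        )
    return version
```

Under these assumptions, a grammar packed before any version field existed would be read as version 0, which matches the message reported for the 2015 language pack.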





[jira] [Commented] (JOSHUA-280) Existing Language packs not compatible with Joshua master

2016-10-13 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572431#comment-15572431
 ] 

Matt Post commented on JOSHUA-280:
--

This is all fixed with the new language packer. Language packs will now include 
the runtime and have no external dependencies (including on Joshua or $JOSHUA).






[jira] [Commented] (JOSHUA-299) Move regression tests to proper unit tests

2016-10-13 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572425#comment-15572425
 ] 

Matt Post commented on JOSHUA-299:
--

This was almost entirely completed, and we are marking it done. It has been 
completed on the 7 branch.

> Move regression tests to proper unit tests
> --
>
> Key: JOSHUA-299
> URL: https://issues.apache.org/jira/browse/JOSHUA-299
> Project: Joshua
> Issue Type: Bug
> Reporter: Matt Post
> Assignee: Lewis John McGibbney
> Fix For: 6.1
>
> Time Spent: 2m
> Remaining Estimate: 0h
>
> Many of the regression tests (test*.sh under src/test/resources) have been 
> moved to proper unit tests, but this move should be completed, and the 
> regression tests should be deleted. This should be done for 6.1.





[jira] [Resolved] (JOSHUA-299) Move regression tests to proper unit tests

2016-10-13 Thread Matt Post (JIRA)

 [ 
https://issues.apache.org/jira/browse/JOSHUA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post resolved JOSHUA-299.
--
Resolution: Fixed



