[jira] [Assigned] (JOSHUA-312) Even though alignment is cached, it is always re-done in pipeline re-execution
[ https://issues.apache.org/jira/browse/JOSHUA-312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned JOSHUA-312:
---

    Assignee: Lewis John McGibbney

> Even though alignment is cached, it is always re-done in pipeline re-execution
> --
>
> Key: JOSHUA-312
> URL: https://issues.apache.org/jira/browse/JOSHUA-312
> Project: Joshua
> Issue Type: Improvement
> Components: alignment
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 6.2
>
> Say if a pipeline fails after alignment. The alignment result is never cached
> and it becomes necessary to undertake alignment... again!
> We should investigate the process for caching alignments as it would really
> speed up rerunning end-to-end pipelines for large input datasets.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (JOSHUA-312) Even though alignment is cached, it is always re-done in pipeline re-execution
[ https://issues.apache.org/jira/browse/JOSHUA-312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573593#comment-15573593 ]

Lewis John McGibbney commented on JOSHUA-312:
-

Okey doke... I managed to reproduce this today. One of my pipelines just failed because I screwed up my paths; however, the failure came after alignment with the Berkeley aligner. When I re-ran the pipeline as follows, alignment was not pulled from the cache; it was completely re-run.
{code}
lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments $ ls -al
total 8
drwxr-xr-x 7 lmcgibbn wheel 238 Oct 13 16:48 .
drwxr-xr-x 22 lmcgibbn wheel 748 Oct 13 12:09 ..
drwxr-xr-x 29 lmcgibbn wheel 986 Oct 13 16:48 .cachepipe
-rw-r--r-- 1 lmcgibbn wheel 47 Oct 13 12:24 README
drwxr-xr-x 5 lmcgibbn wheel 170 Oct 13 16:48 alignments
drwxr-xr-x 12 lmcgibbn wheel 408 Oct 13 12:23 data
drwxr-xr-x 6 lmcgibbn wheel 204 Oct 13 12:24 scripts
lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments $ /usr/local/incubator-joshua/bin/pipeline.pl --rundir . --type hiero --corpus /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en --tune /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.tune --test /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.test --source en --target ru --readme "Experiment 1 Run 1 of ru --> en model training" --aligner berkeley
[train-copy-and-filter] cached, skipping...
[train-tokenize-en] cached, skipping...
[train-tokenize-ru] cached, skipping...
[train-trim] cached, skipping...
[train-lowercase-en] cached, skipping...
[train-lowercase-ru] cached, skipping...
[train-vocab-en] cached, skipping...
[train-vocab-ru] cached, skipping...
[tune-copy-and-filter] cached, skipping...
[tune-tokenize-en] cached, skipping...
[tune-tokenize-ru] cached, skipping...
[tune-lowercase-en] cached, skipping...
[tune-lowercase-ru] cached, skipping...
[tune-vocab-en] cached, skipping...
[tune-vocab-ru] cached, skipping...
[test-copy-and-filter] cached, skipping...
[test-tokenize-en] cached, skipping...
[test-tokenize-ru] cached, skipping...
[test-lowercase-en] cached, skipping...
[test-lowercase-ru] cached, skipping...
[test-vocab-en] cached, skipping...
[test-vocab-ru] cached, skipping...
[source-numlines] cached, skipping...
[source-numlines] retrieved cached result => 817962
[berkeley-aligner-chunk-0] rebuilding...
  dep=alignments/0/word-align.conf
  dep=/usr/local/joshua_resources/russian_experiments/data/train/splits/corpus.en.0 [NOT FOUND]
  dep=/usr/local/joshua_resources/russian_experiments/data/train/splits/corpus.ru.0 [NOT FOUND]
  dep=alignments/0/training.align [NOT FOUND]
  cmd=java -d64 -Xmx10g -jar /usr/local/incubator-joshua/ext/berkeleyaligner/distribution/berkeleyaligner.jar ++alignments/0/word-align.conf
{code}
The aligner log looks as follows:
{code}
lmcgibbn@LMC-056430 /usr/local $ tail -f joshua_resources/russian_experiments/alignments/0/log
main() {
  Execution directory: alignments/0
  Preparing Training Data {
    ERROR: No files found at source /dev/null
  } [23s, cum. 23s]
  817962 training sentences, 0 test sentences
  Training models: 2 stages {
    Training stage 1: MODEL1 and MODEL1 jointly for 5 iterations {
      Initializing forward model [1m16s, cum. 1m16s]
      Initializing reverse model [1m36s, cum. 2m53s]
      Joint Train: 817962 sentences, jointly {
        Iteration 1/5 {
          Sentence 1/817962
          Sentence 2/817962
          Sentence 3/817962
          Sentence 11/817962
          Sentence 40/817962
          Sentence 146/817962
          ...
{code}
It therefore appears to me that, yes, the pipeline is cached; however, on re-runs the cache is not consulted for the alignment step, so alignment is repeated.
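For readers following the log above: three of the aligner step's dependencies are flagged [NOT FOUND], which may be what forces the rebuild. Below is a minimal Python sketch of a dependency-signature check of this general kind. It is not the pipeline's actual caching code; the function names, the `recorded` dictionary layout, and the use of SHA-1 file digests are all assumptions made for illustration.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-1 hex digest of a file, or None if it does not exist."""
    if not os.path.exists(path):
        return None
    sha = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            sha.update(block)
    return sha.hexdigest()


def check_step(cmd, deps, recorded):
    """Compare the current command and dependencies with a recorded signature.

    `recorded` is a hypothetical dict {"cmd": str, "deps": {path: digest}}
    saved by the previous successful run (None if the step never ran).
    Returns (needs_rebuild, report_lines).
    """
    report = []
    needs_rebuild = recorded is None or recorded.get("cmd") != cmd
    old_digests = (recorded or {}).get("deps", {})
    for dep in deps:
        digest = file_digest(dep)
        if digest is None:
            # A missing dependency always forces a rebuild.
            report.append(f"dep={dep} [NOT FOUND]")
            needs_rebuild = True
        elif old_digests.get(dep) != digest:
            report.append(f"dep={dep} [CHANGED]")
            needs_rebuild = True
        else:
            report.append(f"dep={dep}")
    return needs_rebuild, report
```

Under this scheme a step whose listed outputs or intermediate inputs have been deleted, or were never written where the step expects them, is rebuilt even if the command itself is unchanged, so checking whether the [NOT FOUND] paths exist on disk after a successful alignment would be a reasonable first diagnostic.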
[jira] [Created] (JOSHUA-315) Thrax keeps all rules
Matt Post created JOSHUA-315:

    Summary: Thrax keeps all rules
    Key: JOSHUA-315
    URL: https://issues.apache.org/jira/browse/JOSHUA-315
    Project: Joshua
    Issue Type: Bug
    Reporter: Matt Post
    Fix For: 6.2

When extracting rules, Thrax keeps *all* target-side options for each source side. For large bitexts and common source sides (e.g., "de" for Spanish–English), there can be tens of thousands of translations, due to errors in the alignments and phenomena like garbage collection. The decoder throws out all but the top num_translation_options of these (default 20), but before doing so, it has to score all the target-side options with all feature functions, including the language model. This slows down "warming up" of the model and means that the first sentences to use these items are very slow to translate.

I have updated scripts/training/filter-rules.pl to filter using Thrax's rarity penalty field, but it would be much better if Thrax were to keep only the 100 most frequent translation options for each source side.
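The proposed behaviour (keeping only the most frequent translation options per source side) can be sketched roughly as follows. This is an illustration in Python rather than Thrax code: the `(source, target, count)` tuple layout and the function name are assumptions, and it is not how Thrax or filter-rules.pl represent rules.

```python
from collections import defaultdict
import heapq


def keep_top_translations(rules, limit=100):
    """Keep the `limit` most frequent target sides for each source side.

    `rules` is an iterable of (source, target, count) tuples; this layout is
    an assumption for illustration, not Thrax's grammar format.
    """
    by_source = defaultdict(list)
    for source, target, count in rules:
        by_source[source].append((count, target))

    kept = {}
    for source, options in by_source.items():
        # nlargest avoids fully sorting very long option lists.
        best = heapq.nlargest(limit, options)
        kept[source] = [(target, count) for count, target in best]
    return kept
```

Pruning at extraction time in this way would mean the decoder never has to score the long tail of options that it would discard anyway when applying num_translation_options.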
Joshua 6.1
Hi folks,

I think I'm going to do the 6.1 release tomorrow. Any objections?

Along with the release will be about 60 language packs for a large range of languages. These will be released early next week and will be built on BerkeleyLM, so that there are no external dependencies. I'd like to push out the release quietly until the language packs are ready, uploaded, and linked.

Is there anything I need to know to do an Apache release?

matt
[jira] [Commented] (JOSHUA-311) Improve pipeline logging to indicate location on berkeley alignment log(s)
[ https://issues.apache.org/jira/browse/JOSHUA-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572497#comment-15572497 ]

Matt Post commented on JOSHUA-311:
--

In any case, I'm going to move this to 6.2.

> Improve pipeline logging to indicate location on berkeley alignment log(s)
> --
>
> Key: JOSHUA-311
> URL: https://issues.apache.org/jira/browse/JOSHUA-311
> Project: Joshua
> Issue Type: Improvement
> Components: alignment, logging, pipeline
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Fix For: 6.2
>
> When one runs a pipeline using --aligner berkeley, no log location is
> provided for user to follow progress of alignment.
> {code}
> [berkeley-aligner-chunk-0] rebuilding...
>   dep=alignments/0/word-align.conf [CHANGED]
>   dep=/usr/local/jpl/xdata/joshua_experiments/russian_experiments/0/data/train/splits/corpus.en.0 [NOT FOUND]
>   dep=/usr/local/jpl/xdata/joshua_experiments/russian_experiments/0/data/train/splits/corpus.ru.0 [NOT FOUND]
>   dep=alignments/0/training.align [NOT FOUND]
>   cmd=java -d64 -Xmx10g -jar /usr/local/jpl/xdata/joshua_experiments/incubator-joshua/ext/berkeleyaligner/distribution/berkeleyaligner.jar ++alignments/0/word-align.conf
> {code}
> We could add something like
> {code}
> [berkeley-aligner-chunk-0] rebuilding...
>   dep=alignments/0/word-align.conf [CHANGED]
>   dep=/usr/local/jpl/xdata/joshua_experiments/russian_experiments/0/data/train/splits/corpus.en.0 [NOT FOUND]
>   dep=/usr/local/jpl/xdata/joshua_experiments/russian_experiments/0/data/train/splits/corpus.ru.0 [NOT FOUND]
>   dep=alignments/0/training.align [NOT FOUND]
>   cmd=java -d64 -Xmx10g -jar /usr/local/jpl/xdata/joshua_experiments/incubator-joshua/ext/berkeleyaligner/distribution/berkeleyaligner.jar ++alignments/0/word-align.conf logs being written to /path/to/log
> {code}
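As a rough illustration of the suggestion in this issue, the rebuild notice could be assembled with the log path appended. The sketch below is in Python and is not taken from pipeline.pl; the function name, its parameters, and the assumption that each step writes to a file named `log` inside its run directory are all hypothetical.

```python
import os


def rebuild_message(step, cmd, rundir, log_name="log"):
    """Format a rebuild notice that also names the step's log file.

    Assumes the log lives at <rundir>/<log_name>; the function and its
    parameters are hypothetical and only illustrate the proposed output.
    """
    log_path = os.path.join(rundir, log_name)
    return (
        f"[{step}] rebuilding...\n"
        f"  cmd={cmd}\n"
        f"  logs being written to {log_path}"
    )


if __name__ == "__main__":
    print(rebuild_message(
        "berkeley-aligner-chunk-0",
        "java -jar berkeleyaligner.jar ++alignments/0/word-align.conf",
        "alignments/0",
    ))
```

Printing the path on its own line, rather than at the end of the `cmd=` line, would keep the command easy to copy and paste; either placement would satisfy the request.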
[jira] [Resolved] (JOSHUA-280) Existing Language packs not compatible with Joshua master
[ https://issues.apache.org/jira/browse/JOSHUA-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Post resolved JOSHUA-280.
--
    Resolution: Fixed

> Existing Language packs not compatible with Joshua master
> -
>
> Key: JOSHUA-280
> URL: https://issues.apache.org/jira/browse/JOSHUA-280
> Project: Joshua
> Issue Type: Bug
> Components: language packs
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Assignee: Matt Post
> Priority: Critical
> Fix For: 6.1
>
> When I work with the existing Spanish --> English language pack at
> http://cs.jhu.edu/~post/language-packs/language-pack-es-en-phrase-2015-03-06.tgz,
> I get the following error
> {code}
> lmcgibbn@LMC-032857 /usr/local/Cellar/joshua/HEAD/libexec/language-pack-es-en-phrase-2015-03-06(NUTCH-2089) $ ./run-joshua-server.sh
> INFO - Parameters read from configuration file: joshua.config
> INFO - tm = 'moses -owner pt -maxspan 0 -path phrase-table.packed -max-source-len 5'
> INFO - defaultnonterminal = 'X'
> INFO - goalsymbol = 'GOAL'
> INFO - featurefunction = 'StateMinimizingLanguageModel -lm_type kenlm -lm_order 5 -lm_file lm.kenlm'
> INFO - markoovs = 'false'
> INFO - search = 'stack'
> INFO - pop-limit: 100
> INFO - poplimit = '100'
> INFO - topn = '0'
> INFO - useuniquenbest = 'true'
> INFO - outputformat = '%s'
> INFO - includealignindex = 'false'
> INFO - featurefunction = 'OOVPenalty'
> INFO - featurefunction = 'WordPenalty'
> INFO - featurefunction = 'Distortion'
> INFO - featurefunction = 'PhrasePenalty'
> INFO - c = 'joshua.config'
> INFO - server-port: 5674
> INFO - serverport = '5674'
> INFO - Read 9 weights (0 of them dense)
> INFO - Reading vocabulary: phrase-table.packed/vocabulary
> INFO - Read 191983 entries from the vocabulary
> INFO - Reading packed config: phrase-table.packed/config
> 102030405060708090.100%
> Exception in thread "main" java.lang.RuntimeException: The grammar at phrase-table.packed was packed with packer version 0, but the earliest supported version is 3
>     at org.apache.joshua.decoder.ff.tm.packed.PackedGrammar.readConfig(PackedGrammar.java:1061)
>     at org.apache.joshua.decoder.ff.tm.packed.PackedGrammar.<init>(PackedGrammar.java:143)
>     at org.apache.joshua.decoder.phrase.PhraseTable.<init>(PhraseTable.java:65)
>     at org.apache.joshua.decoder.Decoder.initializeTranslationGrammars(Decoder.java:603)
>     at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:514)
>     at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:126)
>     at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
> {code}
[jira] [Commented] (JOSHUA-280) Existing Language packs not compatible with Joshua master
[ https://issues.apache.org/jira/browse/JOSHUA-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572431#comment-15572431 ]

Matt Post commented on JOSHUA-280:
--

This is all fixed with the new language packer. Language packs will now include the runtime and have no external dependencies (including on Joshua or $JOSHUA).
[jira] [Commented] (JOSHUA-299) Move regression tests to proper unit tests
[ https://issues.apache.org/jira/browse/JOSHUA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572425#comment-15572425 ]

Matt Post commented on JOSHUA-299:
--

This was almost entirely completed, and we are marking it done. It has been completed on the 7 branch.

> Move regression tests to proper unit tests
> --
>
> Key: JOSHUA-299
> URL: https://issues.apache.org/jira/browse/JOSHUA-299
> Project: Joshua
> Issue Type: Bug
> Reporter: Matt Post
> Assignee: Lewis John McGibbney
> Fix For: 6.1
>
> Time Spent: 2m
> Remaining Estimate: 0h
>
> Many of the regression tests (test*.sh under src/test/resources) have been
> moved to proper unit tests, but this move should be completed, and the
> regression tests should be deleted. This should be done for 6.1
[jira] [Resolved] (JOSHUA-299) Move regression tests to proper unit tests
[ https://issues.apache.org/jira/browse/JOSHUA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Post resolved JOSHUA-299.
--
    Resolution: Fixed