Matt Post created JOSHUA-315:
--------------------------------

             Summary: Thrax keeps all rules
                 Key: JOSHUA-315
                 URL: https://issues.apache.org/jira/browse/JOSHUA-315
             Project: Joshua
          Issue Type: Bug
            Reporter: Matt Post
             Fix For: 6.2


When extracting rules, Thrax keeps *all* target-side options for each source 
side. For large bitexts and common source sides (e.g., "de" for 
Spanish–English), there can be tens of thousands of translations, due to 
alignment errors and phenomena like garbage collection. The decoder throws out 
all but the top num_translation_options of these (default 20), but before 
doing so, it has to score every target-side option with all feature functions, 
including the language model. This slows down "warming up" of the model and 
means that the first sentences to use these rules are very slow to translate.

I have updated scripts/training/filter-rules.pl to filter rules using Thrax's 
rarity penalty field, but it would be much better if Thrax were to keep only 
the 100 most frequent translation options for each source side.
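
To make the proposal concrete, here is a minimal sketch of per-source-side 
pruning. It is not Thrax code: the function name keep_top_k and the 
(source, target, count) tuple layout are hypothetical, and it assumes raw 
co-occurrence counts are available for each rule.

```python
from collections import defaultdict
import heapq


def keep_top_k(rules, k=100):
    """Keep the k most frequent target sides for each source side.

    `rules` is an iterable of (source, target, count) tuples. This layout is
    an assumption for illustration, not Thrax's actual rule representation.
    Returns a dict mapping each source side to a list of (target, count)
    pairs, sorted by descending count.
    """
    # Group every translation option under its source side.
    by_source = defaultdict(list)
    for source, target, count in rules:
        by_source[source].append((target, count))

    # For each source side, retain only the k options with the highest count.
    pruned = {}
    for source, options in by_source.items():
        pruned[source] = heapq.nlargest(k, options, key=lambda opt: opt[1])
    return pruned
```

Pruning at extraction time, rather than in the decoder, would mean the 
discarded options never need to be scored by the feature functions at all.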



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
