Hi folks,

I thought I'd let you know about a problem I discovered with Thrax. Can you 
spot it?

$ ls -lh grammar.gz
-rw-r--r-- 1 mpost staff 2.2G Oct  6 13:55 grammar.gz
$ gzip -cd 9/grammar.gz | cut -d\| -f4 | uniq -c | sort -n | tail
   8448  las 
   8643  a 
   9440  que 
   9595  se 
   9696  , 
  10617  los 
  10885  el 
  11687  en 
  11932  de 
  12738  la 

As you can see, for lots of source sides, there are tons of target options. The 
first time any rule is used, all the target sides are scored with 
estimateRule() in order to sort them (including a call to the LM), and then all 
but the top 20 (configurable with -num_translation_options) are discarded. This 
is a big waste: the useless rules are stored on disk, and while the scoring 
cost is paid only once per source side, it does slow down "warming up" the 
decoder and, of course, increases memory usage.

The problem is that Thrax takes all target sides it finds during training. It 
would be good to add an option to Thrax that only keeps the top X translation 
options for each source side (where X is maybe 100).
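
To make the idea concrete, here is a rough sketch of that kind of pruning as a 
standalone filter. It is not Thrax code. It assumes rules look like 
"lhs ||| source ||| target ||| features" and that rules sharing a source side 
are adjacent in the file (the same assumption the uniq -c pipeline above 
makes). The names prune, source_side and first_feature_cost are made up for 
illustration, and ranking by the first feature value as a cost is only a 
placeholder for whatever score Thrax would really use.

```python
import heapq
from itertools import groupby


def source_side(rule):
    # Assumed rule format: "lhs ||| source ||| target ||| features"
    return rule.split(" ||| ")[1]


def first_feature_cost(rule):
    # Hypothetical score: read the first feature value as a cost
    # (lower is better). Handles both "0.5" and "name=0.5" styles.
    first = rule.split(" ||| ")[3].split()[0]
    return float(first.split("=")[-1])


def prune(rules, top_x=100, cost=first_feature_cost):
    """Yield at most top_x lowest-cost rules for each source side.

    Assumes rules with the same source side are adjacent in the input.
    Within a source side, rules come out ordered by increasing cost.
    """
    for _, group in groupby(rules, key=source_side):
        yield from heapq.nsmallest(top_x, group, key=cost)


if __name__ == "__main__":
    import sys

    # Example: gzip -cd grammar.gz | python prune.py | gzip > pruned.gz
    for kept in prune(sys.stdin):
        sys.stdout.write(kept)
```

Only one source side's rules are held in memory at a time, so this should 
cope with a multi-gigabyte grammar as long as the input really is grouped by 
source side.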

matt

