Thrax bug with rarity penalty
Hi folks,

There is a bug in Thrax related to floating point underflow and the computation of the rarity penalty. I'm training large models over Europarl and other datasets for the Spanish–English language pack, and in an attempt to filter the models down to the hundred most frequent candidates, I am finding that the rarity penalty is often 0. For example:

    [X] ||| australia . ||| australia . ||| Lex(e|f)=0.49798 Lex(f|e)=0.45459 PhrasePenalty=1 RarityPenalty=0 p(e|f)=0.05919 p(f|e)=0.09309 ||| 0-0 1-1

"australia" occurs many times in the training corpus, so there is no reason that RarityPenalty should be 0. Note that the rarity penalty is not a raw count, but is computed as

    @Override
    public Writable score(RuleWritable r, Annotation annotation) {
      return new FloatWritable((float) Math.exp(1 - annotation.count()));
    }

https://github.com/joshua-decoder/thrax/blob/master/src/edu/jhu/thrax/hadoop/features/annotation/RarityPenaltyFeature.java

So the problem seems to be that, for very highly attested word pairs, the counts are so high that the argument to Math.exp() is a large negative number, so the result is vanishingly small and gets truncated to 0 when only five decimal places are requested.

I wonder, why the Math.exp(1 - x) dance on this value? Why not just have the rarity penalty return the log count?

matt
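To make the effect concrete, here is a minimal sketch that evaluates the same Math.exp(1 - count) expression for a few counts and prints it with five decimal places. The class name RarityPenaltyDemo, the helper names, and the "%.5f" format are assumptions for illustration only; they are not taken from Thrax, whose actual output formatting may round or truncate differently. The logCount helper is a hypothetical version of the "just return the log count" alternative raised above.

```java
import java.util.Locale;

public class RarityPenaltyDemo {
    // Same expression as the score() method quoted in the message.
    static float rarityPenalty(int count) {
        return (float) Math.exp(1 - count);
    }

    // Hypothetical alternative: report the natural log of the count instead.
    static float logCount(int count) {
        return (float) Math.log(count);
    }

    // Assumption: values are written with five decimal places.
    static String format(float value) {
        return String.format(Locale.ROOT, "%.5f", value);
    }

    public static void main(String[] args) {
        int[] counts = {1, 2, 5, 14, 1000};
        System.out.println("count\texp(1-count)\tlog(count)");
        for (int c : counts) {
            System.out.println(c + "\t" + format(rarityPenalty(c))
                    + "\t" + format(logCount(c)));
        }
    }
}
```

Under these assumptions, the exponential form already prints as 0.00000 for a count of 14, and for a count of 1000 the float value itself underflows to exactly zero, while the log count stays distinguishable across the whole range.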
Re: Joshua 6.1
I don't see why not?

> On Oct 14, 2016, at 3:36 AM, Tommaso Teofili wrote:
>
> Hi Matt,
>
> thanks for pushing this forward, +1 from me.
> One concern I have is related to the language packs licensing, can we
> distribute them under AL2 license? (as "convenience" binaries as the
> official release consists of the Joshua source code).
> I'm asking this because in OpenNLP we have had this long time issue of the
> models licensing.
>
> Regards,
> Tommaso
>
> On Thu, Oct 13, 2016 at 18:58, Matt Post wrote:
>
>> Hi folks,
>>
>> I think I'm going to do the 6.1 release tomorrow. Any objections?
>>
>> Along with the release will be about 60 language packs for a large range
>> of languages. These will be released early next week and will be built on
>> BerkeleyLM, so that there are no external dependencies.
>>
>> I'd like to push out the release quietly until the language packs are
>> ready, uploaded, and linked.
>>
>> Is there anything I need to know to do an Apache release?
>>
>> matt
Re: Joshua 6.1
Hi Matt,

thanks for pushing this forward, +1 from me.
One concern I have is related to the language packs licensing, can we distribute them under AL2 license? (as "convenience" binaries as the official release consists of the Joshua source code).
I'm asking this because in OpenNLP we have had this long time issue of the models licensing.

Regards,
Tommaso

On Thu, Oct 13, 2016 at 18:58, Matt Post wrote:

> Hi folks,
>
> I think I'm going to do the 6.1 release tomorrow. Any objections?
>
> Along with the release will be about 60 language packs for a large range
> of languages. These will be released early next week and will be built on
> BerkeleyLM, so that there are no external dependencies.
>
> I'd like to push out the release quietly until the language packs are
> ready, uploaded, and linked.
>
> Is there anything I need to know to do an Apache release?
>
> matt