thrax bug with rarity penalty

2016-10-14 Thread Matt Post
Hi folks,

There is a bug in Thrax related to floating point underflow and the computation 
of the rarity penalty. I'm training large models over Europarl and other 
datasets for the Spanish–English language pack, and in an attempt to filter the 
models down to the hundred most frequent candidates, am finding that often the 
rarity penalty is 0. For example:

[X] ||| australia . ||| australia . ||| Lex(e|f)=0.49798 Lex(f|e)=0.45459 
PhrasePenalty=1 RarityPenalty=0 p(e|f)=0.05919 p(f|e)=0.09309 ||| 0-0 1-1

"australia" occurs many times in the training corpus, so there is no reason 
that RarityPenalty should be 0.

Note that the rarity penalty is not a raw count, but is computed as

  @Override
  public Writable score(RuleWritable r, Annotation annotation) {
    return new FloatWritable((float) Math.exp(1 - annotation.count()));
  }

https://github.com/joshua-decoder/thrax/blob/master/src/edu/jhu/thrax/hadoop/features/annotation/RarityPenaltyFeature.java

So the problem seems to be that, for very highly attested word pairs, the 
counts are so high that the exponent 1 - count is a large negative number. 
Math.exp() then returns a tiny positive value, which is truncated to 0 when 
only five decimal places are requested.
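
A minimal sketch of the effect, using the same formula on plain int counts. 
The class and method names are hypothetical, and the "%.5f" formatting is an 
assumption standing in for however the grammar writer prints five decimal 
places; this is not Thrax code.

```java
import java.util.Locale;

class RarityPenaltyDemo {
    // Same formula as the score() method quoted above, on a plain int count.
    static float rarityPenalty(int count) {
        return (float) Math.exp(1 - count);
    }

    // Assumption: feature values are printed with five decimal places.
    static String fiveDecimals(float value) {
        return String.format(Locale.ROOT, "%.5f", value);
    }

    public static void main(String[] args) {
        // Low counts keep a visible penalty; high counts print as zero.
        for (int count : new int[] {1, 2, 5, 14, 200}) {
            float penalty = rarityPenalty(count);
            System.out.println(count + " -> " + penalty + " -> " + fiveDecimals(penalty));
        }
    }
}
```

With this formula a count of 14 already prints as 0.00000 even though the 
float is still positive, and for a count in the hundreds the cast to float 
underflows to exactly 0.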

I wonder, why the Math.exp(1-x) dance on this value? Why not just have the 
rarity penalty return the log count?
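
For comparison, a rough sketch of what a log-count feature might look like. 
The class and method names are hypothetical, and the five-decimal formatting 
is again an assumption. Note that this would change the feature's meaning: 
it grows with frequency instead of shrinking, so any tuned weights would 
presumably need to be relearned.

```java
import java.util.Locale;

class LogCountSketch {
    // Hypothetical alternative: the natural log of the rule count.
    // Assumes count >= 1, so the result is never negative.
    static float logCount(int count) {
        return (float) Math.log(count);
    }

    // Assumption: feature values are printed with five decimal places.
    static String fiveDecimals(float value) {
        return String.format(Locale.ROOT, "%.5f", value);
    }

    public static void main(String[] args) {
        // Large counts stay distinguishable after formatting.
        for (int count : new int[] {1, 10, 1000, 1000000}) {
            System.out.println(count + " -> " + fiveDecimals(logCount(count)));
        }
    }
}
```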

matt

Re: Joshua 6.1

2016-10-14 Thread Matt Post
I don't see why not?


> On Oct 14, 2016, at 3:36 AM, Tommaso Teofili wrote:
> 
> Hi Matt,
> 
> Thanks for pushing this forward, +1 from me.
> One concern I have is the licensing of the language packs: can we
> distribute them under the AL2 license (as "convenience" binaries, since
> the official release consists of the Joshua source code)?
> I'm asking because in OpenNLP we have had a long-standing issue with
> model licensing.
> 
> Regards,
> Tommaso
> 
> 
> 
> On Thu, Oct 13, 2016 at 18:58, Matt Post wrote:
> 
>> Hi folks,
>> 
>> I think I'm going to do the 6.1 release tomorrow. Any objections?
>> 
>> Along with the release will be about 60 language packs for a large range
>> of languages. These will be released early next week and will be built on
>> BerkeleyLM, so that there are no external dependencies.
>> 
>> I'd like to push out the release quietly until the language packs are
>> ready, uploaded, and linked.
>> 
>> Is there anything I need to know to do an Apache release?
>> 
>> matt
>> 
>> 
>> 



Re: Joshua 6.1

2016-10-14 Thread Tommaso Teofili
Hi Matt,

Thanks for pushing this forward, +1 from me.
One concern I have is the licensing of the language packs: can we
distribute them under the AL2 license (as "convenience" binaries, since
the official release consists of the Joshua source code)?
I'm asking because in OpenNLP we have had a long-standing issue with
model licensing.

Regards,
Tommaso



On Thu, Oct 13, 2016 at 18:58, Matt Post wrote:

> Hi folks,
>
> I think I'm going to do the 6.1 release tomorrow. Any objections?
>
> Along with the release will be about 60 language packs for a large range
> of languages. These will be released early next week and will be built on
> BerkeleyLM, so that there are no external dependencies.
>
> I'd like to push out the release quietly until the language packs are
> ready, uploaded, and linked.
>
> Is there anything I need to know to do an Apache release?
>
> matt
>
>
>