Re: Abbreviations for Tokenisation Training

William Colen Thu, 14 Mar 2013 19:47:05 -0700

Andreas,

Include the punctuation marks, like in
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=markup
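
As a rough sketch of what "include the punctuation marks" means in practice, the snippet below builds dictionary XML in which every token keeps its trailing period. It uses only the Java standard library, not the OpenNLP API. The entry/token layout and the case_sensitive attribute are assumptions modelled on the linked abb.xml, so check them against that file. The class and method names are made up for illustration.

```java
import java.util.List;

// Sketch only: builds the XML text of an abbreviation dictionary by hand.
// Element and attribute names are ASSUMED to match the linked abb.xml.
class AbbreviationDictionarySketch {

    // Escape the characters that are not allowed in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Each abbreviation becomes one <entry> holding a single <token>,
    // written exactly as given, including the trailing period.
    static String toXml(List<String> abbreviations) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<dictionary case_sensitive=\"false\">\n");
        for (String abbreviation : abbreviations) {
            sb.append("<entry>\n<token>")
              .append(escape(abbreviation))
              .append("</token>\n</entry>\n");
        }
        sb.append("</dictionary>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three examples from the question, each with its period.
        System.out.print(toXml(List.of("e.g.", "usw.", "Dr.")));
    }
}
```

The point is only that the entries are "Dr." and "usw.", not "Dr" and "usw".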


In my experiments, the abbreviation dictionary improved accuracy by only
0.01% when combined with a model trained on a Brazilian Portuguese
corpus, but for the final system the dictionary had a positive impact
because you can add abbreviations that are not common in the training data.

William

On Thu, Mar 14, 2013 at 2:29 PM, Jörn Kottmann <[email protected]> wrote:

> The abbreviation list has almost no impact on the accuracy of the
> tokenizer. It might help if you have data with very rare abbreviations,
> but it's not a feature you should use when you are just getting started
> with the training.
>
> My recommendation is to first get a good baseline tokenizer model, and
> then, if you are not happy with it, experiment with more advanced
> features or customization.
>
> I don't know how the dots are handled in the lookup code. Maybe somebody
> else here does; otherwise I can have a look at the code.
>
> Jörn
>
>
> On 03/14/2013 05:24 PM, Andreas Niekler wrote:
>
>> Dear List,
>>
>> Do the abbreviations for the token trainer include the trailing "." or do
>> they just come in the form of the bare string,
>>
>> like
>>
>> e.g. vs. e.g
>>
>> or
>>
>> usw. vs. usw
>>
>> or
>>
>> Dr. vs. Dr
>>
>> Thank you
>>
>> Andreas
>>
>> Am 14.03.2013 14:50, schrieb Jörn Kottmann:
>>
>>> On 03/14/2013 02:15 PM, Andreas Niekler wrote:
>>>
>>>> Hello,
>>>>
>>>> It seems that this issue has already been opened by you:
>>>> https://issues.apache.org/jira/browse/OPENNLP-501
>>>>
>>>> Should I assign it to 1.6.0 or just the trunk?
>>>>
>>> Leave the version open. It would probably be nice to pull that
>>> fix into 1.5.3, but it depends on how quickly we get it and what
>>> the other committers think about it, so I can't promise anything here.
>>> If it does not go into 1.5.3, it will definitely go into the version
>>> after.
>>>
>>> Jörn
>>>
>>>
>
