Andreas,

Include the punctuation marks, like in
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=markup

In my experiments the abbreviation dictionary, combined with a model trained on a Brazilian Portuguese corpus, only improved accuracy by 0.01%, but for the final system the dictionary still had a positive impact, because you can add abbreviations that are not common in the training data.

William
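The abb.xml linked above uses the generic OpenNLP Dictionary XML format. As a sketch, a minimal abbreviation dictionary with the punctuation included, built from the example tokens in Andreas's question below, could look like this (the case_sensitive setting here is an assumption, not taken from abb.xml):

  <?xml version="1.0" encoding="UTF-8"?>
  <dictionary case_sensitive="true">
    <!-- entries keep the trailing dot, per William's advice above -->
    <entry>
      <token>e.g.</token>
    </entry>
    <entry>
      <token>usw.</token>
    </entry>
    <entry>
      <token>Dr.</token>
    </entry>
  </dictionary>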
On Thu, Mar 14, 2013 at 2:29 PM, Jörn Kottmann <[email protected]> wrote:
> The abbreviation list has almost no impact on the accuracy of the
> tokenizer; it might help if you have data with very rare abbreviations,
> but it's not a feature you should use when you are just getting started
> with the training.
>
> My recommendation is to first get a good baseline tokenizer model, and
> then, if you are not happy with it, experiment with more advanced
> features or customization.
>
> I don't know how the dots are handled in the lookup code; maybe somebody
> else here does, otherwise I can have a look at the code.
>
> Jörn
>
>
> On 03/14/2013 05:24 PM, Andreas Niekler wrote:
>
>> Dear List,
>>
>> do the abbreviations for the token trainer include the trailing "." or
>> do they just come in the form of the bare string,
>>
>> like
>>
>> e.g. vs. e.g
>>
>> or
>>
>> usw. vs. usw
>>
>> or
>>
>> Dr. vs. Dr
>>
>> Thank you
>>
>> Andreas
>>
>> On 14.03.2013 14:50, Jörn Kottmann wrote:
>>
>>> On 03/14/2013 02:15 PM, Andreas Niekler wrote:
>>>
>>>> Hello,
>>>>
>>>> it seems that this issue was already opened by you:
>>>> https://issues.apache.org/jira/browse/OPENNLP-501
>>>>
>>>> Should I include that in 1.6.0 or just the trunk?
>>>>
>>> Leave the version open. It would probably be nice to pull that
>>> fix into 1.5.3, but it depends on how quickly we get it and what
>>> the other committers think about it, so I can't promise anything here.
>>> If it does not go into 1.5.3 it will definitely go into the version
>>> after.
>>>
>>> Jörn
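For reference, an abbreviation dictionary like the one above can be passed into tokenizer training through the TokenizerFactory API. This is only a sketch, not the exact setup William used; it assumes OpenNLP 1.5.3, and the file names (abb.xml, train.tok) and the language code "de" are placeholders:

  import java.io.FileInputStream;
  import java.io.InputStreamReader;

  import opennlp.tools.dictionary.Dictionary;
  import opennlp.tools.tokenize.TokenSample;
  import opennlp.tools.tokenize.TokenSampleStream;
  import opennlp.tools.tokenize.TokenizerFactory;
  import opennlp.tools.tokenize.TokenizerME;
  import opennlp.tools.tokenize.TokenizerModel;
  import opennlp.tools.util.ObjectStream;
  import opennlp.tools.util.PlainTextByLineStream;
  import opennlp.tools.util.TrainingParameters;

  public class AbbrevTokenizerTrainer {

      public static void main(String[] args) throws Exception {
          // Load the abbreviation dictionary; entries keep the trailing dot, e.g. "Dr."
          Dictionary abbDict = new Dictionary(new FileInputStream("abb.xml"));

          // Training data in the OpenNLP token sample format, one sample per line
          ObjectStream<String> lines = new PlainTextByLineStream(
                  new InputStreamReader(new FileInputStream("train.tok"), "UTF-8"));
          ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

          // The factory carries the abbreviation dictionary into feature generation
          TokenizerFactory factory = new TokenizerFactory("de", abbDict, false, null);

          TokenizerModel model = TokenizerME.train(samples, factory,
                  TrainingParameters.defaultParams());
      }
  }

How the lookup treats the trailing dot inside the feature generation is exactly the question tracked in OPENNLP-501 above.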
