As Jörn wrote, you should tag ALL person names in your corpus, not just the famous ones.
Then, Polish is a highly inflected language. How do you deal with all the case forms of a person name? Do you have them in the list? If you don't, that is one of the problems as well.

Why do you need to stem the articles? Is it to account for the inflections? If so, you would have to do exactly the same with your test data. However, I would strongly advise you not to use the stemmer: you lose a lot of valuable information that can help distinguish whether a word is a name or not. Just tag the texts as they are (maybe with some proper tokenization and sentence splitting); this should improve the results.

Svetoslav

________________________________________
From: Jörn Kottmann <[email protected]>
Sent: 20 August 2013 09:56
To: [email protected]
Subject: Re: OpenNLP NER for Polish

On 08/20/2013 09:47 AM, Tomasz Sobczak wrote:
> Could you suggest what I have missed or what I can do better in my input
> text file to improve my entity recognition?

It's hard to tell without seeing your training data, but I suspect your tagging is too inconsistent, e.g. many person names are not tagged.

Try to use a linguistic annotation tool to annotate at least a few hundred articles with all mentioned person names.

Jörn
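
To make the point about inconsistent tagging concrete, here is a minimal Python sketch of a consistency check. It assumes the training file uses the OpenNLP name finder format (whitespace-separated tokens, one sentence per line, names wrapped in `<START:person> ... <END>`); the function names and the sample sentences are made up for illustration and are not part of OpenNLP or of the original data.

```python
import re

# Assumed format: "<START:person> Jan Kowalski <END> mieszka w Krakowie ."
SPAN = re.compile(r"<START:person>\s+(.*?)\s+<END>")


def tagged_names(lines):
    """Collect every surface form that is tagged as a person at least once."""
    names = set()
    for line in lines:
        names.update(match.group(1) for match in SPAN.finditer(line))
    return names


def untagged_mentions(lines):
    """Return (line_number, name) pairs where a known name occurs outside a tag."""
    names = tagged_names(lines)
    problems = []
    for number, line in enumerate(lines, start=1):
        # Replace tagged spans with a placeholder token so that only the
        # unannotated text remains and neighbouring tokens are not joined.
        rest = SPAN.sub(" <NAME> ", line)
        padded = " " + " ".join(rest.split()) + " "
        for name in sorted(names):
            # Match on whole tokens only, exactly as written (no stemming).
            if " " + name + " " in padded:
                problems.append((number, name))
    return problems


# Hypothetical sample sentences, already tokenized.
sample = [
    "<START:person> Jan Kowalski <END> mieszka w Krakowie .",
    "Wczoraj spotkałem Jana Kowalskiego .",
    "Jan Kowalski wrócił do domu .",
]
print(untagged_mentions(sample))  # [(3, 'Jan Kowalski')]
```

Note that the check only finds exact surface forms: the inflected mention "Jana Kowalskiego" on line 2 is also untagged but goes unreported, which is why the case forms have to be annotated in the text itself rather than recovered from a list of base forms.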
