Re: OpenNLP NER for Polish

Svetoslav Marinov Tue, 20 Aug 2013 01:51:52 -0700

As Jörn wrote, you should tag ALL person names in your corpus, not just the 
famous ones.

Also, Polish is a highly inflected language. How do you deal with all the case 
forms of a person name? Do you have them in your list? If not, that is another 
source of errors. Why do you need to stem the articles? Is it to account for 
the inflections? If so, you would have to apply exactly the same stemming to 
your test data. However, I would strongly advise against using the stemmer: 
you lose a lot of valuable information that can help distinguish whether a 
word is a name or not. Just tag the texts as they are (perhaps after proper 
tokenization and sentence splitting) - this should improve the results.
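
To make "tag the texts as they are" concrete, here is a minimal sketch that 
writes one tokenized sentence in the OpenNLP name finder training format 
(one sentence per line, whitespace-separated tokens, names wrapped in 
<START:person> ... <END>). The function name and the (start, end, type) span 
layout are my own assumptions for illustration, not part of OpenNLP. Note 
that the tokens keep their inflected surface forms.

```python
def to_opennlp_line(tokens, spans):
    """Render one tokenized sentence as an OpenNLP name finder training line.

    tokens: list of surface tokens, left unstemmed.
    spans:  list of (start, end, entity_type) token index triples, end exclusive.
            (Hypothetical input layout, chosen for this sketch.)
    """
    starts = {start: etype for start, end, etype in spans}
    ends = {end for start, end, etype in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in ends:                      # close a span that ended before token i
            out.append("<END>")
        if i in starts:                    # open a span that begins at token i
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
    if len(tokens) in ends:                # span running to the end of the sentence
        out.append("<END>")
    return " ".join(out)


# "Jana Kowalskiego" is the inflected form of "Jan Kowalski"; it is tagged as is.
tokens = ["Spotkałem", "Jana", "Kowalskiego", "w", "Warszawie", "."]
print(to_opennlp_line(tokens, [(1, 3, "person")]))
```

The same unmodified tokenization would then have to be applied to the test 
data, so that training and evaluation see the same kind of input.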

Svetoslav
________________________________________
From: Jörn Kottmann <[email protected]>
Sent: 20 August 2013 09:56
To: [email protected]
Subject: Re: OpenNLP NER for Polish

On 08/20/2013 09:47 AM, Tomasz Sobczak wrote:
> Could you suggest what I have missed or what I could do better in my input
> text file to improve my entity recognition?

It's hard to tell without seeing your training data, but I suspect your
tagging is too inconsistent, e.g. many person names are not tagged.

Try using a linguistic annotation tool to annotate at least a few
hundred articles with all the person names they mention.
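
One rough way to spot this kind of inconsistency is sketched below: it
collects every name tagged anywhere in the training file and reports lines
where the same string also appears untagged. It assumes the data is already
in the <START:type> ... <END> format with whitespace-separated tokens; the
function name is hypothetical. It only matches exact strings, so inflected
variants of a name will not be caught and still need manual review.

```python
import re

# Matches one tagged entity, e.g. "<START:person> Jan Kowalski <END>".
TAG = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")


def untagged_mentions(lines):
    """Return (line_number, name) pairs where a name tagged elsewhere is untagged."""
    tagged = set()
    for line in lines:
        for match in TAG.finditer(line):
            tagged.add(match.group(2))

    problems = []
    for lineno, line in enumerate(lines, start=1):
        # Blank out the tagged spans so only untagged text is searched.
        plain = TAG.sub(" <ENT> ", line)
        padded = " " + " ".join(plain.split()) + " "
        for name in sorted(tagged):
            if f" {name} " in padded:
                problems.append((lineno, name))
    return problems


sample = [
    "<START:person> Jan Kowalski <END> przyjechał .",
    "Wczoraj Jan Kowalski wyjechał .",
    "Wczoraj <START:person> Anna Nowak <END> wyjechała .",
]
print(untagged_mentions(sample))
```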

Jörn
