Hello everybody,
I have just joined this mailing list! Thank you in advance for your help.

I am studying a simple analyzer that extracts specific information from a
text. The items I would like to extract are:

1. Person
2. Company
3. Email address
4. Zipcode
5. Home address

For email addresses and zipcodes I use RegexNameFinder directly: emails have
a specific format, so a regex should work without problems, and so do
zipcodes (5 digits long, only numbers). In these cases RegexNameFinder works
perfectly.
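
For reference, the two patterns I have in mind look roughly like the sketch
below. It uses plain java.util.regex rather than the OpenNLP API, and the
class name, the exact patterns and the sample text are only illustrative
assumptions (the email pattern is deliberately loose, not a complete email
grammar).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: shows the kind of patterns
// that could be handed to a regex-based finder.
class RegexExtractSketch {
    // Loose email pattern: local part, '@', domain with at least one dot.
    static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Exactly five digits, not embedded in a longer run of digits.
    static final Pattern ZIP = Pattern.compile("(?<!\\d)\\d{5}(?!\\d)");

    // Collect every non-overlapping match of the pattern in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample sentence, already whitespace-tokenized.
        String text = "Write to mario.rossi@example.com , Via Roma 12 , 00184 Roma .";
        System.out.println(findAll(EMAIL, text));
        System.out.println(findAll(ZIP, text));
    }
}
```

The lookarounds on the zipcode pattern are there so that a longer number,
such as a phone number, does not produce a false five-digit match.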

The problems are with Person, Company and Home address. I read the
documentation for Named Entity Recognition, but I have the following doubts:

1. I have a complete database of Italian names and surnames (CSV), and I
would like to understand how to create the training data correctly. I see
that I have to use specific tags, like <START:person> Person name here <END>,
inside a sentence context. As I wrote, I only have names and surnames (one
per line), so how can I create the model in this case? Do I have to create
fake sentences and put the names there?

2. Suppose we have those sentences: do I have to write out all the
name/surname combinations so that OpenNLP understands when a token (or
several tokens) is a Person? Example:


<START:person> Barack <END> , 61 years old , will join the board as a
nonexecutive director Nov. 29 .

<START:person> Barack Obama <END> , 61 years old , will join the board as a
nonexecutive director Nov. 29 .

<START:person> Bill <END> , 61 years old , will join the board as a
nonexecutive director Nov. 29 .

<START:person> Bill Clinton <END> , 61 years old , will join the board as a
nonexecutive director Nov. 29 .

...and so on?

3. Same doubt for companies: I have a very big database with around 1M
company names. What is the best way to train OpenNLP on those names?
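
To make questions 1 to 3 concrete, this is roughly how I imagine generating
the fake sentences from a list of names. It is only a sketch under my own
assumptions: the class, the {ENTITY} placeholder and the template sentence
are invented for illustration, nothing here calls OpenNLP, and I do not know
whether such artificial context produces a usable model. The only part taken
from the documentation is the <START:type> ... <END> markup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical generator of name-finder training lines (one sentence per line).
class TrainingDataSketch {
    // Wrap an entity in the name-finder markup: <START:type> tokens <END>
    static String annotate(String type, String entity) {
        return "<START:" + type + "> " + entity.trim() + " <END>";
    }

    // One training sentence per (entity, template) pair.
    // "{ENTITY}" is a placeholder convention invented for this sketch.
    static List<String> generate(String type, List<String> entities,
                                 List<String> templates) {
        List<String> lines = new ArrayList<>();
        for (String entity : entities) {
            for (String template : templates) {
                lines.add(template.replace("{ENTITY}", annotate(type, entity)));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // In practice the names would be read from the CSV, one per line.
        List<String> names = Arrays.asList("Barack Obama", "Bill Clinton");
        List<String> templates = Arrays.asList(
            "{ENTITY} , 61 years old , will join the board as a nonexecutive director Nov. 29 .");
        for (String line : generate("person", names, templates)) {
            System.out.println(line);
        }
    }
}
```

With 1M company names and several templates this multiplies quickly, which
is part of why I am asking whether this is the right approach at all.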

Last but not least...

4. What is the best way to train OpenNLP for home addresses? In Italy, for
example, the "format" is:

Name Surname
address, number, zip-code
City
Country
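
To show what I mean by this format, here is a small sketch that splits such
a four-line block into fields with plain java.util.regex. It is not an
OpenNLP model and not a proposal for the final solution: the class and field
names are hypothetical, the sample address is invented, and it assumes the
block is already isolated from the surrounding text and that the zipcode has
five digits.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for an isolated four-line address block.
class AddressBlockSketch {
    static final class Address {
        final String name, street, number, zip, city, country;
        Address(String name, String street, String number,
                String zip, String city, String country) {
            this.name = name; this.street = street; this.number = number;
            this.zip = zip; this.city = city; this.country = country;
        }
    }

    // Second line of the block: "address, number, zip-code".
    static final Pattern STREET_LINE =
        Pattern.compile("^\\s*(.+?)\\s*,\\s*([^,]+?)\\s*,\\s*(\\d{5})\\s*$");

    // Returns empty when the block does not have exactly four lines
    // or the second line does not match the expected layout.
    static Optional<Address> parse(String block) {
        String[] lines = block.trim().split("\\R");
        if (lines.length != 4) {
            return Optional.empty();
        }
        Matcher m = STREET_LINE.matcher(lines[1]);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Address(lines[0].trim(), m.group(1), m.group(2),
                                       m.group(3), lines[2].trim(), lines[3].trim()));
    }

    public static void main(String[] args) {
        String block = "Mario Rossi\nVia Roma, 12, 00184\nRoma\nItalia";
        parse(block).ifPresent(a ->
            System.out.println(a.name + " | " + a.street + " | " + a.number
                + " | " + a.zip + " | " + a.city + " | " + a.country));
    }
}
```

The hard part, and the reason for my question, is finding such blocks inside
free text in the first place.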

Thank you so much!
