Hello everybody, I have just joined this mailing list! Thank you in advance for your help.
I am studying a simple analyzer that extracts specific information from a text. The information I would like to extract is:

1. Person
2. Company
3. Email address
4. Zipcode
5. Home address

For email addresses and zipcodes I use *RegexNameFinder* directly: emails have a specific format, so a regex should work without problems, and the same goes for zipcodes (5 digits long, only numbers). In these cases RegexNameFinder works perfectly.

The problems are with Person, Company and home addresses. I read the documentation for Named Entity Recognition, but I have the following doubts:

1. I have a complete Italian name/surname database (CSV) and I would like to understand how to create the training data for the model correctly. I see that I have to use a specific tag like <START:person> Person name here <END> inside a sentence that gives it context. As I wrote, I only have names and surnames (one per line), so how can I create the model in this case? Do I have to create fake sentences and put the names there?

2. Let's suppose we have those sentences: do I have to write all the name/surname combinations to let OpenNLP understand when a token (or several tokens) is a Person? Example:

   <START:person> Barack <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
   <START:person> Barack Obama <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
   <START:person> Bill <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
   <START:person> Bill Clinton <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .

   ...and so on?

3. Same doubt for companies: I have a very big database with around 1M company names. What is the best way to train OpenNLP on those names?

Last but not least...

4. What is the best way to train OpenNLP for home addresses? In Italy, for example, the "format" is:

   Name Surname address, number, zip-code City Country

Thank you so much!
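
P.S. To make questions 1 and 2 more concrete, here is a minimal sketch of what I mean by "creating fake sentences". It is only my guess at an approach, not something taken from the OpenNLP documentation. It uses plain Java with no OpenNLP calls, and the class name, the method names and the second template are all made up by me for illustration. It takes each name from the list, wraps it in the <START:person> ... <END> tags, and drops it into every sentence template, producing one training sentence per line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (my own naming): builds tagged training sentences
// from a plain list of names. It does not call any OpenNLP API.
class TrainingDataSketch {

    // Made-up sentence templates; %s marks where the tagged name goes.
    // Tokens are separated by spaces, as in the examples above.
    static final List<String> TEMPLATES = List.of(
        "%s , 61 years old , will join the board as a nonexecutive director Nov. 29 .",
        "Yesterday %s signed the contract ."
    );

    // Wraps a name in the person tags, with spaces around the name.
    static String tagPerson(String name) {
        return "<START:person> " + name.trim() + " <END>";
    }

    // Produces one sentence per (name, template) pair, skipping blank names.
    static List<String> buildSentences(List<String> names, List<String> templates) {
        List<String> sentences = new ArrayList<>();
        for (String name : names) {
            if (name.trim().isEmpty()) {
                continue;
            }
            for (String template : templates) {
                sentences.add(String.format(template, tagPerson(name)));
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // In my case the names would come from the CSV, one per line.
        List<String> names = List.of("Barack Obama", "Bill Clinton");
        buildSentences(names, TEMPLATES).forEach(System.out::println);
    }
}
```

With 1M company names and even a handful of templates this would generate a very large file, which is part of why I am asking whether this is really the intended way to do it.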