Hi
I suggest using OpenNLP with the default models available here:
http://opennlp.sourceforge.net/models-1.5/
These models can recognize person, location (not address), and
organization names.
If they do not perform satisfactorily (which is most often the case),
train your own model as you described in point 2 of your mail.
Yes, creating the training data is very time consuming. The OpenNLP
documentation recommends at least 15,000 training sentences for
reasonable model performance.
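
As a rough illustration of the "fake sentences" idea from point 1 of your
mail, here is a minimal Python sketch that fills a name list into sentence
templates and writes the names in the <START:person> ... <END> training
format. The templates, the function names and the per_name parameter are
all hypothetical and only for illustration. A model trained mostly on a
few templates will probably learn the templates rather than real context,
so treat this as a way to bootstrap data, not as a replacement for real
annotated sentences.

```python
import random

# Hypothetical templates (assumption, not from OpenNLP); "{name}" marks the
# slot for a person name. Tokens are separated by single spaces.
TEMPLATES = [
    "{name} , 61 years old , will join the board as a nonexecutive director Nov. 29 .",
    "Yesterday {name} signed the contract in Rome .",
]


def annotate(name: str) -> str:
    """Wrap a name in the name finder training tags."""
    return f"<START:person> {name} <END>"


def make_training_lines(names, templates, per_name=2, seed=0):
    """Return one annotated sentence per (name, sampled template) pair."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    lines = []
    for name in names:
        # Pick up to per_name distinct templates for this name.
        for template in rng.sample(templates, k=min(per_name, len(templates))):
            lines.append(template.format(name=annotate(name)))
    return lines


if __name__ == "__main__":
    for line in make_training_lines(["Barack Obama", "Bill Clinton"], TEMPLATES):
        print(line)  # one sentence per line
```

Each output line is one sentence, so the result can be written straight
to a training file.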

If you want to recognize only addresses and are not interested in
locations in general, I suggest you recognize entities of the three types
and then apply regular-expression-like pattern matching over them. For
example: <Person
Name>(\\W+)<Location>(\\W+)<NUMBER>(\\W+)<ZIPCODE> etc.
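
One possible reading of this idea, sketched in Python: turn the tagger
output into a string of type tags and run an ordinary regex over that
string. The input format (a list of (text, type) pairs covering the whole
text, with "O" for anything untagged), the tag names and the function
name are assumptions made for this sketch, not something OpenNLP
provides in this form.

```python
import re

# Assumed tag names; NUMBER and ZIPCODE would come from regex-based finders,
# PERSON and LOCATION from the statistical name finders.
ADDRESS_PATTERN = re.compile(r"<PERSON>\W+<LOCATION>\W+<NUMBER>\W+<ZIPCODE>")


def find_addresses(spans):
    """spans: list of (text, type) pairs in document order.

    Returns the text of every run of spans whose types match
    ADDRESS_PATTERN. Type names must not contain '<'.
    """
    # One tag per span, e.g. "<O> <PERSON> <LOCATION> <NUMBER> <ZIPCODE>".
    tag_string = " ".join(f"<{label}>" for _, label in spans)
    results = []
    for match in ADDRESS_PATTERN.finditer(tag_string):
        # Each span contributes exactly one '<', so counting '<' converts
        # character offsets in tag_string back to span indices.
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        results.append(" ".join(text for text, _ in spans[start:end]))
    return results


if __name__ == "__main__":
    example = [
        ("Spett.", "O"),
        ("Mario Rossi", "PERSON"),
        ("Via Roma", "LOCATION"),
        ("12", "NUMBER"),
        ("00100", "ZIPCODE"),
        ("Roma", "LOCATION"),
    ]
    print(find_addresses(example))
```

Because untagged text is kept as "O" entries, a person and a location
that are far apart in the text do not match by accident.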


On Mon, Aug 17, 2015 at 2:55 AM, Damiano Porta <damianopo...@gmail.com>
wrote:

> Hello everybody,
> I have just joined this mailing list! Thank you in advance for your help.
>
> I am working on a simple analyzer that extracts specific information
> from a text. The items I would like to extract are:
>
> 1. Person
> 2. Company
> 3. Email address
> 4. Zipcode
> 5. Home address
>
> For email address and zipcode I directly use RegexNameFinder: emails have
> a specific format, so a regex should work without problems, and zipcodes
> too (5 digits long, only numbers). In this case RegexNameFinder works
> perfectly.
>
> The problems are with Person, Company and home addresses. I read the
> documentation for Named Entity Recognition but I have the following
> questions:
>
> 1. I have a complete Italian name/surname database (CSV) and I would like
> to understand how to create the training data correctly. I see that I have
> to use a specific tag like <START:person> Person name here <END> in a
> context! As I wrote, I only have names and surnames (one per line), so in
> this case how can I create the model? Do I have to create fake sentences
> and put the names there?
>
> 2. Let's suppose we have those sentences: do I have to write all the
> name/surname combinations to let OpenNLP understand when a token (or
> several tokens) is a Person? Example:
>
>
> <START:person> Barack <END> , 61 years old , will join the board as a
> nonexecutive director Nov. 29 .
>
> <START:person> Barack Obama <END> , 61 years old , will join the board
> as a nonexecutive director Nov. 29 .
>
> <START:person> Bill <END> , 61 years old , will join the board as a
> nonexecutive director Nov. 29 .
>
> <START:person> Bill Clinton <END> , 61 years old , will join the board
> as a nonexecutive director Nov. 29 .
>
> ...and so on.. ?
>
> 3. Same question for companies: I have a very big database with around 1M
> company names. What is the best way to train OpenNLP for those names?
>
> Last but not least...
>
> 4. What is the best way to train OpenNLP for home addresses? In Italy, for
> example, the "format" is:
>
> Name Surname
> address, number, zip-code
> City
> Country
>
> Thank you so much!
>



-- 
V
