On Apr 8, 2016, at 5:13 PM, Jenn C <jen...@gmail.com> wrote:

> I worked on a text mining project last semester where I had a bunch of
> magazines with text that was totally unstructured (from IA). I would have
> really liked to know how to work entity matching into such a project. Are
> there text mining projects out there that demonstrate doing this?

If I understand your question correctly, then the Stanford Named Entity
Recognizer (NER) library/application may be one solution. [1]

Given text as input, a named entity recognition library/application returns a
list of the entities it finds: names of people, places, and things. The things
can be all sorts of stuff such as organizations, dates, times, monetary
amounts, etc. Stanford’s NER is really a Java library, but it has a
command-line interface. Feed it a text, and you can get back an XML stream.
The stream contains elements, and each element is expected to be some sort of
entity. Be forewarned. For the best performance, it is necessary to “train”
the library/application. Frankly, I’ve never done that, and consequently, I
guess I’ve never been optimal.* You also might want to read the information
extraction chapter of the Natural Language Toolkit (NLTK) book. [2] The noted
chapter gives a pretty good overview of the subject.
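
To make the "XML stream" idea a bit more concrete, here is a minimal Python
sketch of what you might do with such output once you have it. It assumes the
recognizer was asked for inline-tagged output, where each entity is wrapped in
an element named after its type. The SAMPLE string is made up for illustration
and is not real tool output, and the function names are my own. The sketch
pulls out each (type, text) pair and tallies them, which is often the first
step toward matching entities across a pile of magazines.

```python
import re
from collections import Counter

# Hypothetical sample that mimics inline-tagged NER output.
# It was written by hand for illustration, not produced by a real tool.
SAMPLE = (
    "<PERSON>Jane Austen</PERSON> was published in "
    "<LOCATION>London</LOCATION> by <ORGANIZATION>John Murray</ORGANIZATION>. "
    "<PERSON>Jane Austen</PERSON> died in <LOCATION>Winchester</LOCATION>."
)

# Match <TAG>text</TAG> where the opening and closing tag names agree.
ENTITY = re.compile(r"<([A-Z]+)>(.*?)</\1>", re.DOTALL)


def extract_entities(tagged_text):
    """Return (entity_type, entity_text) pairs in document order."""
    return [
        # Collapse internal whitespace so line-wrapped names compare equal.
        (match.group(1), " ".join(match.group(2).split()))
        for match in ENTITY.finditer(tagged_text)
    ]


def count_entities(tagged_text):
    """Tally how often each (entity_type, entity_text) pair occurs."""
    return Counter(extract_entities(tagged_text))


if __name__ == "__main__":
    # Print a simple frequency table: count, type, text.
    for (entity_type, text), count in count_entities(SAMPLE).most_common():
        print(f"{count}\t{entity_type}\t{text}")
```

A regular expression is enough for flat, non-nested tags like these. If the
output you get is a full XML document instead, a real XML parser would be the
safer choice.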

[1] NER - http://nlp.stanford.edu/software/CRF-NER.shtml
[2] NLTK chapter - http://www.nltk.org/book/ch07.html

* ‘Story of my life.

—
Eric Lease Morgan
