That sounds like Named Entity Recognition (NER). It's typically done with a sequence model such as a Conditional Random Field (CRF). You could take a look at Stanford's CRF-based NER tagger: http://nlp.stanford.edu/software/CRF-NER.shtml
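
In case it helps with the indexing side, here is a minimal sketch of turning per-token labels into entity tags you could store as fields in your index. It assumes the tagger emits one label per token, with "O" for tokens outside any entity. The function name `group_entities` and the sample labels are made up for illustration; they are not real tagger output or part of any library.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_label="O"):
    """Merge runs of consecutive tokens that share a label into (text, label) pairs.

    tagged_tokens is a list of (token, label) tuples; tokens labelled
    outside_label are treated as non-entities and skipped.
    """
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == outside_label:
            continue
        text = " ".join(token for token, _ in run)
        entities.append((text, label))
    return entities


# Hand-written sample labels (hypothetical, not produced by a real tagger).
sample = [
    ("Alex", "PERSON"), ("McLintock", "PERSON"),
    ("asked", "O"), ("about", "O"),
    ("OpenCalais", "ORGANIZATION"),
    ("from", "O"),
    ("Thomson", "ORGANIZATION"), ("Reuters", "ORGANIZATION"),
]

entities = group_entities(sample)
```

One caveat with this simple grouping: two distinct entities of the same type that sit directly next to each other get merged into one, so a scheme with explicit begin/inside labels would be needed if that matters for your data.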

On Fri, Jul 2, 2010 at 10:53 AM, Alex McLintock <[email protected]> wrote:

> I'm quite interested in OpenCalais - a Reuters/Thompson initiative. It
> is a web service to take your free text and identify important terms
> in it like people, businesses, places, and so on. If you are the
> document owner you can submit your document to their web site and get
> back important tags saying what this document is about. I'd like to
> tag this sort of data and feed it into a Lucene style index so that it
> can be used in searches AND in focussed/topical crawls.
>
> Now, here comes the problem. When we crawl the web we don't own the
> documents we are crawling so we don't really have permission to use
> Reuters' servers to do this analysis. (Maybe we could cut a deal
> though if we were a big enough company).
>
> So has anyone else looked at alternatives to OpenCalais which takes
> free text and tries to understand what it is about? I've been looking
> for software to do this but nothing seems suitable.
>
> Alex

