Mark,

Your three-step plan looks good. Here are some answers:
- NLTK is a toolkit aimed at teaching NLP to university students (and there's an O'Reilly book coming out this summer, which you can read for free online at http://www.nltk.org/book ) -> can help you with (1.)

- LingPipe is a toolkit for actually building (Java) systems for particular NLP tasks -> can help you with (2.+3.)

- In NLP, there are two ways of doing things: either you convert between lots of idiosyncratic data formats, or you submit to a framework (UIMA, GATE) that manages annotations for you (see also the section on Standards in http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html ). UIMA is such a framework for meta-data (such as annotations of text) that saves you writing a lot of conversion code by standardizing one way of storing and handling it.

- The difference between components and frameworks doesn't just confuse you; people (wrongly) use the two terms interchangeably. But there is a difference, which is discussed in Section 2.6 of a paper of mine:

  Leidner, Jochen L. (2003). Current Issues in Software Engineering for Natural Language Processing. Proceedings of the Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS), held at the Joint Conference for Human Language Technology and the Annual Meeting of the North American Chapter of the Association for Computational Linguistics 2003 (HLT/NAACL'03), Edmonton, Alberta, Canada, pp. 45-50.
  http://www.iccs.inf.ed.ac.uk/~jleidner/documents/Leidner-2003-SEALTS.pdf

How can you tell a toolkit from a framework? They say a framework is like Hollywood: "Don't call us, we'll call you." If it never calls you, it's probably a toolkit. ;-)

As far as resources at the University of Texas go, try to connect with Jason Baldridge, who is a professor there (he's one of the authors of the OpenNLP package).

Regards,
Jochen

--
Dr.
Jochen Leidner
Research Scientist
Thomson Reuters Research & Development
610 Opperman Drive
Eagan, MN 55123
USA
http://www.ThomsonReuters.com

-----Original Message-----
From: news [mailto:[email protected]] On Behalf Of Mark Ettinger
Sent: Monday, April 06, 2009 8:01 PM
To: [email protected]
Subject: New to NLP and navigating the options.

Hello all,

I am a trained mathematician/computer scientist/programmer jumping into NLP, excited by the challenge but intimidated by the algorithm and software options. Specifically, I am at the University of Texas and am charged with putting to good use our large database of (more-or-less unused) clinical notes. My strategy is roughly:

1. Learn the theory of NLP and information extraction.
2. Understand the publicly available software packages so as to avoid reinventing the wheel.
3. Apply #2 to our database and begin experimenting.

My question in this post centers on #2. Not being a software engineer (though having lots of scientific programming experience), I am sometimes puzzled by "frameworks" and "components". I think of everything as libraries of functions. Yes, I know this view is outdated. I can wrap my head around NLP packages like LingPipe and NLTK but am unclear what a package like UIMA offers over and above these types of pure libraries.

Given what I've told you about my background (scientist, programmer, but NOT software engineer), can someone explain to me how investing the time to learn UIMA will pay off in the long run? I've started to dig into the UIMA API but thought I'd throw this rather basic question out there, hoping someone wouldn't think it too naive for this forum.

Thanks in advance!

Mark Ettinger
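P.S. In case a concrete illustration of the "it calls you" test helps: below is a minimal sketch in plain Python. All names here are made up for illustration; this is not the real UIMA, LingPipe, or NLTK API, just the shape of the two styles. With a toolkit, your code owns the control flow and calls library functions; with a framework, you register components and the framework's own loop calls your code back (inversion of control), managing a shared annotation structure much as UIMA does.

```python
# Toolkit style: you call the functions; your code owns the control flow.
def tokenize(text):
    """A toy tokenizer you invoke directly, toolkit-style."""
    return text.split()


# Framework style: you register components; the framework calls YOU.
class Pipeline:
    """A toy annotation framework in the spirit of UIMA/GATE (hypothetical)."""

    def __init__(self):
        self.components = []

    def register(self, component):
        self.components.append(component)

    def run(self, document):
        # The framework owns the loop and invokes each component's
        # process() method: inversion of control ("we'll call you").
        # All components read and write one shared annotation store,
        # so no per-pair format conversion code is needed.
        annotations = {"text": document}
        for component in self.components:
            component.process(annotations)
        return annotations


class Tokenizer:
    def process(self, annotations):
        annotations["tokens"] = annotations["text"].split()


class TokenCounter:
    def process(self, annotations):
        annotations["count"] = len(annotations["tokens"])


# Toolkit usage: your code drives.
tokens = tokenize("clinical notes are messy")

# Framework usage: the pipeline drives; your components get called back.
pipeline = Pipeline()
pipeline.register(Tokenizer())
pipeline.register(TokenCounter())
result = pipeline.run("clinical notes are messy")
```

Note that your code never calls Tokenizer.process() directly; the pipeline does. That is the Hollywood test in miniature, and it is why learning a framework like UIMA is an investment in its control model, not just in another function library.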
