Hi Ted,

thanks for your comments! Regarding the differences between DictionaryAnnotator and ConceptMapper, there is a previous thread [1] that should help with that comparison.
2010/12/27 Ted Pedersen <[email protected]>

> Anyway, assuming that I specify entities using both Regular
> Expressions and Dictionary entries, is there a preferred way to use
> and/or combine the above (or anything else?) The goal at this point is
> simply to identify those entities in text for later downstream
> processing.
>

You probably have to put the "dictionary" analysis engine (either DA or
CM) in the pipeline along with the RegularExpression Annotator, and then
combine the generated annotations inside a third, custom annotator or
via the Configurable Feature Extractor.
Note that you can also build named entity recognition blocks using
OpenNLP (see, for example, [2]), either with existing models or by
creating your own.
Hope this helps.
Cheers,
Tommaso

[1] : http://markmail.org/thread/oyhct2lh4uj2ow2h
[2] : http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Name_Finder

>
> Thanks!
> Ted
>
> On Mon, Dec 27, 2010 at 9:59 AM, Ted Pedersen <[email protected]> wrote:
> > Thanks to Tommaso for a very interesting posting, and to Darren for
> > the question that generated it.
> >
> > As a kind of follow-on question to one of the suggestions made by
> > Tommaso....
> >
> > I'm particularly interested in the functionality provided by Concept
> > Mapper, or maybe Dictionary Annotator (that is having the ability to
> > create a dictionary and then be able to recognize when a dictionary
> > term occurs in my text). From reading over the documentation it seems
> > like Concept Mapper and Dictionary Annotator are fairly similar. To be
> > honest I don't know much about UIMA, but am trying to learn, so there
> > might be some subtleties here I don't see (that would make one want to
> > prefer one of these over the other).
> >
> > Is there a short summary of the differences between Concept Mapper and
> > Dictionary Annotator, and does anyone have any strong feelings about
> > when you should use one over the other?
> >
> > Cordially,
> > Ted
> >
> > On Mon, Dec 27, 2010 at 2:45 AM, Tommaso Teofili
> > <[email protected]> wrote:
> >> Hi Darren,
> >>
> >> 2010/12/23 Darren Cruse <[email protected]>
> >>
> >>> Hi guys I apologize for a newbie question but I'm quite new to
> >>> UIMA and the whole area of information extraction/entity
> >>> extraction. And I'm hoping someone can tell me if UIMA is a proper
> >>> tool for a project that I've been working on (with other tools)
> >>> that I've been having trouble with.
> >>>
> >>> Basically the task is to extract meta data from html in the form
> >>> of RDF. Where the html represents books/articles/papers/etc. that
> >>> typically have an "outline" or "table of contents", and part of
> >>> the task involves extracting the entities "behind" (so to speak)
> >>> the table of contents.
> >>>
> >>
> >> This is perfectly aligned with UIMA's scope, as it deals with
> >> discovering hidden knowledge.
> >>
> >>>
> >>> So e.g. if the "corpus" of html pages are from a book, and the book
> >>> has Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1
> >>> has 6 Sections, Section 1 has three Parts, etc. Then my resulting
> >>> RDF has to model these things (entities/classes/whatever you'd
> >>> call them) and understand the "hierarchy" of what contains what.
> >>>
> >>> The real challenging part is that it's a pretty large volume of
> >>> material with many different books/articles/papers/etc. And there
> >>> is a lot of variability, as each were authored by different people
> >>> not following any particular template.
> >>>
> >>
> >> On the "large volume of material" topic, I think that UIMA-AS [1]
> >> can help you, as you need to scale.
> >>
> >>>
> >>> For example what I called a "table of contents" is rarely a single
> >>> page but more often it's exploded across multiple "outline" pages
> >>> where e.g. a high level table of contents page goes to the level
> >>> of chapter links.
> >>> And then each chapter may have its own "outline" breaking down the
> >>> sections within that chapter. Or it might not, different books can
> >>> differ. For example the pages making up the chapter may just have
> >>> headings referring to the titles/names of the sections without
> >>> being organized into a chapter "outline" at all. Yet I'm still
> >>> responsible for identifying what the sections are.
> >>>
> >>> Somewhat helpful is that headings often indicate the kind of thing
> >>> they are, e.g. "Section 3: The Life of the Spleen, Wrap-Up". Not
> >>> always though, e.g. I may only get the "The Life of the Spleen,
> >>> Wrap-Up" part (without "Section 3:" on the front).
> >>>
> >>> Or I may get both forms in different places in the book, where
> >>> ideally I should relate the two references as being the same thing.
> >>>
> >>> And where different places can refer to the same thing with other
> >>> differences too. Possibly the case of the letters differ, or in
> >>> this example there could be one heading with "Wrap-Up" and another
> >>> with "Wrap Up" (one with the dash the other without the dash).
> >>>
> >>> As far as understanding the relationships between things i.e. that
> >>> Chapter 3 contains Sections 1 through 3 and Section 1 contains two
> >>> "Parts", where the things do appear in a "table of contents" or
> >>> "outline" page, it seems like the arrangement/formatting of those
> >>> pages give the clue as to "what contains what". i.e. Things
> >>> "contained" typically follow what they're contained by, and are
> >>> often indented (but not necessarily, it can just be that the
> >>> "parent" is bolded, yet they might not be indented beneath their
> >>> "parent").
> >>>
> >>> Apologize for the long winded description but hopefully it will
> >>> help to clarify my question since I'm new to UIMA:
> >>>
> >>> a. Does it sound like a "UIMA kind of problem"?
> >>> :)
> >>>
> >>
> >> I recently worked on a similar use case and yes, I think this
> >> sounds like a UIMA kind of problem.
> >> My very abstract advice is to use a bottom-up approach: first
> >> recognize words, then sentences, then sections; after that you can
> >> "play" with sections and work out their relationships with
> >> chapters and so on.
> >>
> >>>
> >>> i.e. These "things" I'm trying to understand like
> >>> Volume/Chapter/Section/etc. - would you call those "entities" in
> >>> the way I've heard the term "entity extraction"?
> >>>
> >>> b. And I gave so much detail so I could also ask: Does this sound
> >>> like a straightforward use for UIMA, or does it sound like a
> >>> *difficult* use for UIMA?
> >>>
> >>
> >> It sounds to me like a straightforward use of UIMA, but this
> >> doesn't mean it'll be that easy :)
> >>
> >>>
> >>> c. Regarding b, I can imagine me giving UIMA regular expressions
> >>> to look for "Chapter (.*): (.*)" kind of stuff, or giving it lists
> >>> ahead of time like of the chapters I know the book has (this is
> >>> the idea of a "Gazetteer" yes?), but I'm unclear: does UIMA also
> >>> address this thing where I'm trying to understand "what *contains*
> >>> what"?
> >>>
> >>
> >> I'd recommend regular expressions as the last thing to rely on, as
> >> they are not so easy to maintain over time and also not so
> >> efficient; however, they can really help sometimes.
> >> I'd go through simple NLP phases such as tokenizing and POS tagging,
> >> along with "Gazetteers" (see DictionaryAnnotator[2] and
> >> ConceptMapper[3]), and maybe introduce OpenNLP[4] tools to use
> >> chunkers.
> >>
> >>>
> >>> d. i.e. Does UIMA support the need to look at the relationship
> >>> between things e.g. "does this heading follow another heading, and
> >>> was that other heading identified as a "Section", and is this
> >>> heading indented further to the right than that one, so I guess
> >>> this must be a "Part" within that "Section". Does UIMA support
> >>> that kind of thing? If so does that have a name I can search on?
> >>> :)
> >>>
> >>
> >> What you have to do to support that in UIMA is define an annotator
> >> that recognizes headings and creates, for example,
> >> HeadingAnnotations, and then use, for example, the
> >> ConfigurableFeatureExtractor[5] to see what follows what and that
> >> kind of thing.
> >>
> >>>
> >>> e. When I mentioned the slight inconsistencies in how things are
> >>> referenced (the case being different, a dash being omitted, etc).
> >>> I think I've heard the phrase "fuzzy matching". I'm guessing
> >>> that's part of what UIMA provides?
> >>>
> >>
> >> "Fuzzy matching" is more likely to be part of IR systems (such as
> >> Lucene/Solr); however, you can plug in your own tokenizer to parse
> >> the text as you need. In UIMA you can take the simple tokenizer and
> >> also place the stemmer block (SnowballAnnotator[6]) in the pipeline
> >> to get "matches" on the root (stem) of a word only.
> >>
> >>>
> >>> Thanks for any tips I apologize for such a long question I'd been
> >>> looking at the UIMA docs but I was new enough I decided I needed
> >>> to appeal to those of you with greater experience. :)
> >>>
> >>
> >> Finally, regarding RDF: there is no RDF CAS consumer in UIMA, but
> >> one can easily be built using the Apache Clerezza UIMA Utils
> >> module[7]; I'll write a separate email about this as soon as
> >> possible.
> >>
> >> Thanks to you, hope my small hints can help you.
> >> Cheers,
> >> Tommaso
> >>
> >> [1] : http://uima.apache.org/doc-uimaas-what.html
> >> [2] : http://uima.apache.org/sandbox.html#dict.annotator
> >> [3] : http://uima.apache.org/sandbox.html#concept.mapper.annotator
> >> [4] : http://incubator.apache.org/opennlp/
> >> [5] : http://uima.apache.org/sandbox.html#configurable.feature.extractor.annotator
> >> [6] : http://uima.apache.org/sandbox.html#snowball.annotator
> >> [7] : http://svn.apache.org/repos/asf/incubator/clerezza/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.utils/
> >>
> >>>
> >>> (is there any kind of "Text Extraction for Dummies" kind of
> >>> introduction anybody would recommend for a newbie btw?)
> >>>
> >>> Thanks again,
> >>>
> >>> Darren
> >>>
> >>
> >
> > --
> > Ted Pedersen
> > http://www.d.umn.edu/~tpederse
> >
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
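The suggestion at the top of this message (run a dictionary-based engine next to the regular expression annotator, then merge their output in a third step) can be illustrated outside UIMA with a rough Python sketch. Everything here is an assumption made for illustration: Span, regex_spans, dictionary_spans and combine are hypothetical names, not UIMA, DictionaryAnnotator or ConceptMapper APIs, and the "longest span wins" rule is just one possible way to resolve overlaps.

```python
import re
from typing import Dict, List, NamedTuple


class Span(NamedTuple):
    """A stand-in for an annotation: character offsets plus a label."""
    begin: int
    end: int
    label: str
    source: str  # "regex" or "dictionary"


def regex_spans(text: str, patterns: Dict[str, str]) -> List[Span]:
    """Collect every match of each labelled regular expression."""
    spans = []
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            spans.append(Span(m.start(), m.end(), label, "regex"))
    return spans


def dictionary_spans(text: str, entries: Dict[str, str]) -> List[Span]:
    """Collect every case-insensitive, whole-word occurrence of each term."""
    spans = []
    for term, label in entries.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append(Span(m.start(), m.end(), label, "dictionary"))
    return spans


def combine(spans: List[Span]) -> List[Span]:
    """Merge both sources, keeping the longest span wherever spans overlap."""
    kept: List[Span] = []
    # Visit longest spans first; ties go to the earliest start offset.
    for span in sorted(spans, key=lambda s: (-(s.end - s.begin), s.begin)):
        if all(span.end <= k.begin or span.begin >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.begin)


# Illustrative input only; the patterns and dictionary entries are made up.
text = "Chapter 3: The Life of the Spleen. The spleen filters blood."
patterns = {"ChapterRef": r"Chapter \d+"}
entries = {"Life of the Spleen": "Title", "spleen": "Organ"}

result = combine(regex_spans(text, patterns) + dictionary_spans(text, entries))
for span in result:
    print(span.label, span.source, repr(text[span.begin:span.end]))
```

In this sketch the "Spleen" inside the title is dropped in favour of the longer "Life of the Spleen" entry, while the later stand-alone "spleen" is kept. In a real pipeline the same decision would live in the custom annotator (or feature extractor configuration) that reads the annotations produced by the two upstream engines.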
