Hi Darren,

2010/12/23 Darren Cruse <[email protected]>

> Hi guys I apologize for a newbie question but I'm quite new to UIMA and the
> whole area of information extraction/entity extraction. And I'm hoping
> someone can tell me if UIMA is a proper tool for a project that I've been
> working on (with other tools) that I've been having trouble with.
>
> Basically the task is to extract meta data from html in the form of RDF.
> Where the html represents books/articles/papers/etc. that typically have an
> "outline" or "table of contents", and part of the task involves extracting
> the entities "behind" (so to speak) the table of contents.

This is perfectly aligned with UIMA's scope, as it deals with discovering
hidden knowledge.

> So e.g. if the "corpus" of html pages are from a book, and the book has
> Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6 Sections,
> Section 1 has three Parts, etc. Then my resulting RDF has to model these
> things (entities/classes/whatever you'd call them) and understand the
> "hierarchy" of what contains what.
>
> The real challenging part is that it's a pretty large volume of material
> with many different books/articles/papers/etc. And there is a lot of
> variability, as each were authored by different people not following any
> particular template.

On the "large volume of material" topic, I think UIMA-AS [1] can help you,
since you need to scale.

> For example what I called a "table of contents" is rarely a single page but
> more often it's exploded across multiple "outline" pages where e.g. a high
> level table of contents page goes to the level of chapter links. And then
> each chapter may have it's own "outline" breaking down the sections within
> that chapter. Or it might not, different books can differ. For example the
> pages making up the chapter may just have headings referring to the
> titles/names of the sections without being organized into a chapter
> "outline" at all. Yet I'm still responsible for identifying what the
> sections are.
>
> Somewhat helpful is that headings often indicate the kind of thing they are,
> e.g. "Section 3: The Life of the Spleen, Wrap-Up". Not always though, e.g.
> I may only get the "The Life of the Spleen, Wrap-Up" part (without "Section
> 3:" on the front).
>
> Or I may get both forms in different places in the book, where ideally I
> should relate the two references as being the same thing.
>
> And where different places can refer to the same thing with other
> differences too. Possibly the case of the letters differ, or in this
> example there could be one heading with "Wrap-Up" and another with "Wrap
> Up" (one with the dash the other without the dash).
>
> As far as understanding the relationships between things i.e. that Chapter 3
> contains Sections 1 through 3 and Section 1 contains two "Parts", where the
> things do appear in a "table of contents" or "outline" page, it seems like
> the arrangement/formatting of those pages give the clue as to "what contains
> what". i.e. Things "contained" typically follow what they're contained by,
> and are often indented (but not necessarily, it can just be that the
> "parent" is bolded, yet they might not be indented beneath their "parent").
>
> Apologize for the long winded description but hopefully it will help to
> clarify my question since I'm new to UIMA:
>
> a. Does it sound like a "UIMA kind of problem"? :)

I recently worked on a similar use case and yes, I think this sounds like a
UIMA kind of problem. My very abstract advice is to use a bottom-up approach:
first recognize words, then sentences, then sections; after that you can
"play" with the sections and work out their relationships with chapters and
so on.

> i.e. These "things" I'm trying to understand like
> Volume/Chapter/Section/etc. - would you call those "entities" in the way
> I've heard the term "entity extraction"?
>
> b. And I gave so much detail so I could also ask: Does this sound like a
> straightforward use for UIMA, or does it sound like a *difficult* use for
> UIMA?

It sounds to me like a straightforward use of UIMA, but this doesn't mean
it'll be that easy :)

> c. Regarding b, I can imagine me giving UIMA regular expressions to look
> for "Chapter (.*): (.*)" kind of stuff, or giving it lists ahead of time
> like of the chapters I know the book has (this is the idea of a "Gazeteer"
> yes?), but I'm unclear: does UIMA also address this thing where I'm trying
> to understand "what *contains* what"?

I'd recommend regular expressions as the last thing to rely on, as they are
not so easy to maintain over time and also not so efficient; however, they
can really help sometimes. I'd go through simple NLP phases such as
tokenizing and POS tagging, along with gazetteers (see DictionaryAnnotator [2]
and ConceptMapper [3]), and maybe introduce OpenNLP [4] tools to use chunkers.

> d. i.e. Does UIMA support the need to look at the relationship between
> things e.g. "does this heading follow another heading, and was that other
> heading identified as a "Section", and is this heading indented further to
> the right than that one, so I guess this must be a "Part" within that
> "Section". Does UIMA support that kind of thing? If so does that have a
> name I can search on? :)

What you have to do to support that in UIMA is define an annotator that
recognizes headings, creating, for example, HeadingAnnotations, and then use,
for example, the ConfigurableFeatureExtractor [5] to see what follows what
and that kind of thing.

> e. When I mentioned the slight inconsistencies in how things are referenced
> (the case being different, a dash being omitted, etc). I think I've heard
> the phrase "fuzzy matching". I'm guessing that's part of what UIMA
> provides?

"Fuzzy matching" is more likely to be part of IR systems (such as
Lucene/Solr); however, you can plug in your own tokenizer to parse the text
as you need. In UIMA you can take the simple tokenizer and also place the
stemmer block (SnowballAnnotator [6]) in the pipeline, so that "matches" are
made only on the stem of a word.

> Thanks for any tips I apologize for such a long question I'd been looking at
> the UIMA docs but I was new enough I decided I needed to appeal to those of
> you with greater experience. :)

Finally, regarding RDF: there is no RDF CAS consumer in UIMA, but one can be
built simply using the Apache Clerezza UIMA Utils module [7]; I'll write a
separate email about this as soon as possible.

Thanks to you, hope my small hints can help you.
Cheers,
Tommaso

[1] : http://uima.apache.org/doc-uimaas-what.html
[2] : http://uima.apache.org/sandbox.html#dict.annotator
[3] : http://uima.apache.org/sandbox.html#concept.mapper.annotator
[4] : http://incubator.apache.org/opennlp/
[5] : http://uima.apache.org/sandbox.html#configurable.feature.extractor.annotator
[6] : http://uima.apache.org/sandbox.html#snowball.annotator
[7] : http://svn.apache.org/repos/asf/incubator/clerezza/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.utils/

> (is there any kind of "Text Extraction for Dummies" kind of introduction
> anybody would recommend for a newbie btw?)
>
> Thanks again,
>
> Darren
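
To make points (c) and (e) above a little more concrete, here is a minimal sketch in plain Python (not UIMA code) of recognizing a heading such as "Section 3: The Life of the Spleen, Wrap-Up" and normalizing its title so that variants differing only in case, dashes or punctuation compare equal. The names HEADING_RE, parse_heading, normalize_title and same_heading are hypothetical, the list of heading kinds is an assumption taken from the examples in the question, and the normalization only handles ASCII letters and digits. In a UIMA pipeline the same logic would live inside an annotator, and a stemmer could replace the simple normalization.

```python
import re

# Assumed heading shape: an optional "<Kind> <number>:" prefix, then the title.
# The set of kinds is illustrative only.
HEADING_RE = re.compile(
    r"^\s*(?:(?P<kind>Volume|Chapter|Section|Part)\s+(?P<number>\d+)\s*:\s*)?"
    r"(?P<title>.+?)\s*$",
    re.IGNORECASE,
)


def parse_heading(text):
    """Split a heading into kind, number and title (kind/number may be None)."""
    match = HEADING_RE.match(text)
    if match is None:
        raise ValueError("empty heading")
    kind = match.group("kind")
    number = match.group("number")
    return {
        "kind": kind.capitalize() if kind else None,
        "number": int(number) if number else None,
        "title": match.group("title"),
    }


def normalize_title(title):
    """Lower-case the title and replace runs of non-alphanumerics with a space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def same_heading(a, b):
    """True when two headings have the same title after normalization."""
    return normalize_title(parse_heading(a)["title"]) == normalize_title(
        parse_heading(b)["title"]
    )
```

With this sketch, same_heading("Section 3: The Life of the Spleen, Wrap-Up", "the life of the spleen, wrap up") is true, while headings with different titles stay distinct. It is exact matching after normalization, not true fuzzy matching; an edit-distance comparison would be a further step.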

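As a rough illustration of the "what contains what" question in point (d), the sketch below (again plain Python, not UIMA) nests outline entries using a stack: each entry becomes a child of the nearest preceding entry with a smaller level. The function name build_outline and the (level, title) input format are assumptions made for this example. The level could come from indentation, or from any other cue such as bolding or a kind ranking (Volume, Chapter, Section, Part), since the question notes that indentation is not always present.

```python
def build_outline(entries):
    """Nest (level, title) pairs, given in document order, into a tree.

    An entry is attached to the closest earlier entry whose level is
    strictly smaller. Levels are assumed to be non-negative integers.
    """
    root = {"title": None, "level": -1, "children": []}
    stack = [root]  # path from the root to the most recent entry
    for level, title in entries:
        node = {"title": title, "level": level, "children": []}
        # Pop siblings and deeper entries until a real parent is on top.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


outline = build_outline([
    (0, "Chapter 3"),
    (1, "Section 1"),
    (2, "Part 1"),
    (2, "Part 2"),
    (1, "Section 2"),
])
```

Here "Chapter 3" ends up containing "Section 1" and "Section 2", and "Section 1" contains the two parts. The resulting tree is the kind of structure that could then be serialized as RDF containment statements.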