Re: document structure (was: Discussion of next UIMA release)

Greg Holmberg Tue, 19 May 2009 10:04:43 -0700

Indeed, the structure is important to linguistic analysis. For example, imagine you have a table with three cells, containing the text "1996", "Honda", and "Camry". If the cells are properly treated as sentence or paragraph boundaries, then entity extraction would produce a year, a company, and a vehicle. If the structure is stripped and just the plain text is analyzed, then you get one entity, a vehicle, "1996 Honda Camry". Which is not exactly the same thing.

I feel that the lack of any standard in UIMA regarding the structure of the document being analyzed (that is, beyond simply plain text) makes it pretty much impossible to combine annotators from different sources--one of the primary justifications of UIMA, in my opinion.

I sketched a possible solution to this on the wiki (http://cwiki.apache.org/UIMA/uima-sandbox-components.html, see "Document model") back in 2007, but it didn't generate much interest. There's also a proposal for document properties, beyond the simple SourceDocumentInformation class.



Greg Holmberg

On Tue, 19 May 2009 09:34:14 -0700, Manuel Fiorelli wrote:

I would like to see a well-established way to analyze semi-structured
documents, such as (X)HTML pages. UIMA shouldn't provide its own
parser, but at least a type system (like uima.cas) to represent a DOM
Document within a CAS instance (the simplest solution is to represent
element nodes as feature structures and text nodes as annotations on
the plain text, but I suspect there are more convenient solutions).
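
To make the "simplest solution" concrete, here is a minimal sketch in plain Python rather than against the UIMA API. It assumes a hypothetical ElementAnnotation record standing in for a CAS annotation type, and it records each element as a begin/end offset pair over the extracted plain text. The names ElementAnnotation, StandoffBuilder and to_standoff are illustrative only and are not part of UIMA.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ElementAnnotation:
    """Hypothetical stand-off annotation: one element's span in the plain text."""
    tag: str
    begin: int
    end: int


class StandoffBuilder(HTMLParser):
    """Collects the plain text and the character span covered by each element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.length = 0            # number of plain-text characters seen so far
        self.open_elements = []    # stack of (tag, begin offset)
        self.annotations = []

    def handle_starttag(self, tag, attrs):
        self.open_elements.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; anything opened
        # after it and never closed (e.g. <br>) is discarded.
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i][0] == tag:
                begin = self.open_elements[i][1]
                del self.open_elements[i:]
                self.annotations.append(ElementAnnotation(tag, begin, self.length))
                break

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def to_standoff(html):
    """Return (plain_text, annotations) for an HTML fragment."""
    builder = StandoffBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.text_parts), builder.annotations


text, annotations = to_standoff("<p>First</p><p>Second</p>")
# text is "FirstSecond"; the two <p> spans are (0, 5) and (5, 11)
```

The text itself stays untouched, and the markup survives as offsets that a later annotator can consult or ignore.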

When the analysis function doesn't rely upon the document structure,
there should be a way to skip most of the markup and iterate on the
blocks. I think that we cannot work directly on the plain text, since
the loss of information could lead to misinterpretations. For example,
in the following fragment

<p>First paragrapher</p><p>Second paragrapher</p>

the plain text would be

First paragrapherSecond paragrapher

where "paragrapherSecond" is an error in the interpretation of the document.
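
One way to "skip most of the markup and iterate on the blocks" is sketched below with Python's standard html.parser module. It is only an illustration: the BLOCK_TAGS set is an assumed, incomplete list of block-level elements, and BlockTextExtractor and extract_blocks are hypothetical names, not an existing UIMA component.

```python
from html.parser import HTMLParser

# Assumed (incomplete) set of tags that start or end a text block.
BLOCK_TAGS = {"p", "div", "li", "td", "th", "tr", "br", "h1", "h2", "h3"}


class BlockTextExtractor(HTMLParser):
    """Collects text block by block instead of as one concatenated string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self.current = []

    def flush(self):
        # Finish the block in progress, dropping blocks that are empty.
        text = "".join(self.current).strip()
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self.current.append(data)


def extract_blocks(html):
    """Return the list of text blocks found in an HTML fragment."""
    extractor = BlockTextExtractor()
    extractor.feed(html)
    extractor.close()
    extractor.flush()  # keep any trailing text outside a block element
    return extractor.blocks


blocks = extract_blocks("<p>First paragrapher</p><p>Second paragrapher</p>")
# blocks == ["First paragrapher", "Second paragrapher"]
```

The same function separates table cells, so a row holding "1996", "Honda" and "Camry" comes back as three blocks rather than one run of text.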


Manuel Fiorelli
