Indeed, the structure is important to linguistic analysis. For example,
imagine you have a table with three cells, containing the text "1996",
"Honda", and "Camry". If the cells are properly treated as sentence or
paragraph boundaries, then entity extraction would produce a year, a
company, and a vehicle. If the structure is stripped and just the plain
text is analyzed, then you get one entity, a vehicle, "1996 Honda Camry",
which is not the same thing.
I feel that the lack of any standard in UIMA regarding the structure of
the document being analyzed (that is, beyond simply plain text) makes it
pretty much impossible to combine annotators from different sources, which
is one of the primary justifications for UIMA, in my opinion.
I sketched a possible solution to this on the wiki
(http://cwiki.apache.org/UIMA/uima-sandbox-components.html, see "Document
model") back in 2007, but it didn't generate much interest. There's also
a proposal for document properties, beyond the simple
SourceDocumentInformation class.
Greg Holmberg
On Tue, 19 May 2009 09:34:14 -0700, Manuel Fiorelli wrote:
I would like to see a well-established way to analyze semi-structured
documents, such as (X)HTML pages. UIMA shouldn't provide its own
parser, but at least a type system (like uima.cas) to represent a DOM
Document within a CAS instance (the simplest solution is to represent
element nodes as feature structures and text nodes as annotations on
the plain text, but I suspect there are more convenient solutions).
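To make the "simplest solution" concrete, here is a rough Python sketch of the idea, not UIMA code. It assumes each element node becomes a record holding begin/end character offsets into the plain text, much as a CAS annotation would. The names ElementSpan, StandoffBuilder and to_standoff are hypothetical, and the input fragment is made up for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ElementSpan:
    """Hypothetical stand-in for a CAS annotation covering one element."""
    tag: str
    begin: int  # offset of the first character covered in the plain text
    end: int    # offset just past the last character covered


class StandoffBuilder(HTMLParser):
    """Builds the plain text plus one ElementSpan per closed element."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # pieces of plain text, in document order
        self.length = 0       # number of plain-text characters seen so far
        self.spans = []       # completed element spans
        self._open = []       # stack of (tag, begin) for open elements

    def handle_starttag(self, tag, attrs):
        self._open.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the nearest open element with the same name.
        # Elements that are never closed (e.g. <br>) produce no span here.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, begin = self._open.pop(i)
                self.spans.append(ElementSpan(tag, begin, self.length))
                break

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def to_standoff(html):
    """Return (plain_text, spans) for an HTML fragment."""
    builder = StandoffBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.text_parts), builder.spans


text, spans = to_standoff("<td>1996</td><td>Honda</td><td>Camry</td>")
for span in spans:
    # Each span recovers the text of one cell from the offsets alone.
    print(span.tag, span.begin, span.end, text[span.begin:span.end])
```

The plain text stays untouched, so existing annotators can still run on it, while the spans keep the cell boundaries available to any annotator that wants them.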
When the analysis function doesn't rely upon the document structure,
there should be a way to skip most of the markup and iterate on the
blocks. I think that we cannot work directly on the plain text, since
the loss of information could lead to misinterpretations. For example,
in the following fragment
<p>First paragrapher</p><p>Second paragrapher</p>
the plain text would be
First paragrapherSecond paragrapher
where "paragrapherSecond" is an error in the interpretation of the
document.
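As a rough illustration of "skip most of the markup and iterate on the blocks", the following Python sketch contrasts naive tag stripping with block-aware extraction. It uses only the standard library and is not tied to UIMA. BLOCK_TAGS is an assumed, incomplete list of block-level elements, and the names strip_markup, BlockTextExtractor and extract_blocks are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed (incomplete) set of tags that should act as text boundaries.
BLOCK_TAGS = {"p", "div", "td", "th", "tr", "li", "br", "h1", "h2", "h3"}


def strip_markup(html):
    """Naive approach: delete every tag, keeping no boundary information."""
    return re.sub(r"<[^>]+>", "", html)


class BlockTextExtractor(HTMLParser):
    """Collects text, starting a new block at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks of text
        self._current = []  # text pieces of the block being built

    def _flush(self):
        text = "".join(self._current).strip()
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text after the last tag


def extract_blocks(html):
    """Return the list of text blocks found in an HTML fragment."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


fragment = "<p>First paragrapher</p><p>Second paragrapher</p>"
print(strip_markup(fragment))    # First paragrapherSecond paragrapher
print(extract_blocks(fragment))  # ['First paragrapher', 'Second paragrapher']
```

An annotator that does not care about structure could then iterate over the blocks and never see the fused token.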
Manuel Fiorelli