[EMAIL PROTECTED] wrote:
[...]
Longer term, I'd like to modify the NLP components I'm using to understand
these ElementAnnotations directly, bypassing the HTML but getting the same
benefits. It would then return offsets against the plain text, and I wouldn't
have the translation problem.
What I'd like to see happen in the community would be an agreement on something
like the ElementAnnotation above as a standard type definition and with
expected (required?) instances in UIMA. Existing plain text annotators would
continue to work, while more up-to-date annotators would use the
ElementAnnotations and the plain text.
So, are there any annotator suppliers out there who'd like to work on Type
System standards with me? The lack of them seems to be breaking the promise of
UIMA as an integration platform. The lowest common denominator data
representations in UIMA seem a bit too low from my point of view. UIMA's
mechanism is great--now we need some policy.
[...]
I like your approach of converting mark-up to annotations; we have done
similar things in the past. We should be able to leverage the new Tika
project for parsing HTML and other formats (such as PDF or office
formats). Of course, that still leaves the question of annotation type
standards, and it would be great if you and others could come up with a
proposal.
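
To make the mark-up-to-annotations idea concrete, here is a rough sketch of
what such a conversion might look like. It is not the ElementAnnotation type
discussed in this thread (that definition is not reproduced here), and it is
not UIMA or Tika code: the class name, its fields (name, begin, end) and the
helper strip_markup are all assumptions made for illustration, using only
Python's standard html.parser. The point is just that offsets are recorded
against the plain text, so no translation back to HTML offsets is needed.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ElementAnnotation:
    # Hypothetical stand-in for the type discussed in the thread;
    # the field names are assumptions, not an agreed standard.
    name: str   # element name, e.g. "p" or "b"
    begin: int  # offset into the plain text where the element starts
    end: int    # offset into the plain text where the element ends


class MarkupToAnnotations(HTMLParser):
    """Collects plain text and element spans measured against that text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.length = 0            # length of the plain text seen so far
        self.open_elements = []    # stack of (name, begin offset)
        self.annotations = []

    def handle_starttag(self, tag, attrs):
        self.open_elements.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; unclosed children
        # (e.g. <br>) are dropped, and stray end tags are ignored.
        for i in range(len(self.open_elements) - 1, -1, -1):
            name, begin = self.open_elements[i]
            if name == tag:
                del self.open_elements[i:]
                self.annotations.append(
                    ElementAnnotation(name, begin, self.length))
                break

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def strip_markup(html):
    """Return (plain_text, annotations) for an HTML fragment."""
    parser = MarkupToAnnotations()
    parser.feed(html)
    parser.close()
    return "".join(parser.text_parts), parser.annotations


text, annotations = strip_markup("<p>Hello <b>world</b>!</p>")
print(text)  # Hello world!
for a in annotations:
    print(a.name, a.begin, a.end, repr(text[a.begin:a.end]))
```

A plain text annotator could run on the returned text unchanged, while a
mark-up-aware annotator could also consult the element spans.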
I'm hoping to meet up with some of the Tika folks at ApacheCon next
week, and I'll report when I get back.
--Thilo