[EMAIL PROTECTED] wrote:
[...]
Longer term, I'd like to modify the NLP components I'm using to understand 
these ElementAnnotations directly, bypassing the HTML but getting the same 
benefits.  It would then return offsets against the plain text, and I wouldn't 
have the translation problem.

What I'd like to see happen in the community would be an agreement on something 
like the ElementAnnotation above as a standard type definition and with 
expected (required?) instances in UIMA.  Existing plain text annotators would 
continue to work, while more up-to-date annotators would use the 
ElementAnnotations and the plain text.

So, are there any annotator suppliers out there who'd like to work on Type 
System standards with me?  The lack of them seems to be breaking the promise of 
UIMA as an integration platform.  The lowest common denominator data 
representations in UIMA seem a bit too low from my point of view.  UIMA's 
mechanism is great--now we need some policy.
[...]

I like your approach of converting markup to annotations; we have done similar things in the past. We should be able to leverage the new Tika project for parsing HTML and other formats (such as PDF or office formats). Of course, that still leaves the question of annotation type standards, and it would be great if you and others could come up with a proposal.
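
To make the markup-to-annotations idea concrete, here is a rough sketch in plain Python rather than UIMA code. It strips the tags from an HTML string and records each element as a span whose offsets refer to the resulting plain text. The `ElementAnnotation` fields (`name`, `begin`, `end`) and the names `MarkupToAnnotations` and `strip_markup` are assumptions made for illustration only; they are not the type definition discussed in this thread, and a real implementation would likely sit on top of a proper parser such as Tika.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ElementAnnotation:
    # Hypothetical fields; the type definition from the thread is not shown here.
    name: str
    begin: int  # offset into the plain text (inclusive)
    end: int    # offset into the plain text (exclusive)


class MarkupToAnnotations(HTMLParser):
    """Collects plain text and one annotation per closed element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []      # pieces of plain text, in document order
        self.length = 0           # current length of the plain text
        self.open_elements = []   # stack of (tag name, begin offset)
        self.annotations = []

    def handle_starttag(self, tag, attrs):
        # Remember where this element starts in the plain text.
        self.open_elements.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; stray end tags are
        # ignored, and unclosed elements nested inside it are dropped.
        for i in range(len(self.open_elements) - 1, -1, -1):
            name, begin = self.open_elements[i]
            if name == tag:
                del self.open_elements[i:]
                self.annotations.append(
                    ElementAnnotation(name, begin, self.length))
                break

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def strip_markup(html):
    """Return (plain_text, annotations) for an HTML string."""
    parser = MarkupToAnnotations()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.text_parts), parser.annotations
```

For example, `strip_markup("<p>Hello <b>world</b></p>")` yields the text `Hello world` plus a `b` annotation covering offsets 6 to 11 and a `p` annotation covering 0 to 11, so an annotator that only sees the plain text needs no offset translation afterwards.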

I'm hoping to meet up with some of the Tika folks at ApacheCon next week, and I'll report when I get back.

--Thilo
