Re: plain text or HTML in the CAS?

Thilo Goetz Thu, 26 Apr 2007 12:18:03 -0700

[EMAIL PROTECTED] wrote:
[...]

Longer term, I'd like to modify the NLP components I'm using to understand 
these ElementAnnotations directly, bypassing the HTML but getting the same 
benefits.  It would then return offsets against the plain text, and I wouldn't 
have the translation problem.


What I'd like to see happen in the community would be an agreement on something 
like the ElementAnnotation above as a standard type definition and with 
expected (required?) instances in UIMA.  Existing plain text annotators would 
continue to work, while more up-to-date annotators would use the 
ElementAnnotations and the plain text.

So, are there any annotator suppliers out there who'd like to work on Type 
System standards with me?  The lack of them seems to be breaking the promise of 
UIMA as an integration platform.  The lowest common denominator data 
representations in UIMA seem a bit too low from my point of view.  UIMA's 
mechanism is great--now we need some policy.

[...]

I like your approach of converting mark-up to annotations; we have done similar things in the past. We should be able to leverage the new Tika project for parsing HTML and other formats (such as PDF or office formats). Of course that still leaves the question of annotation type standards, and it would be great if you and others could come up with a proposal.
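
To make the mark-up-to-annotation idea concrete, here is a minimal sketch using only the Python standard library. It is not UIMA or Tika code: the ElementAnnotation class, its fields, and the strip_markup helper are assumptions made up for illustration, standing in for whatever type definition a proposal would settle on. The point it shows is that each element ends up with begin/end offsets against the plain text rather than the HTML, so a plain text annotator can run unchanged on the same text.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class ElementAnnotation:
    # Hypothetical stand-in for the ElementAnnotation type discussed above.
    # begin/end are character offsets into the plain text, not the HTML.
    name: str
    begin: int
    end: int
    attributes: dict = field(default_factory=dict)


class MarkupToAnnotations(HTMLParser):
    """Collects plain text and records one annotation per closed element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []      # pieces of plain text seen so far
        self._length = 0       # current offset into the plain text
        self._open = []        # stack of (tag, begin offset, attributes)
        self.annotations = []

    def handle_starttag(self, tag, attrs):
        # Remember where this element starts in the plain text.
        self._open.append((tag, self._length, dict(attrs)))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                name, begin, attrs = self._open.pop(i)
                self.annotations.append(
                    ElementAnnotation(name, begin, self._length, attrs))
                break

    def handle_data(self, data):
        self._chunks.append(data)
        self._length += len(data)

    @property
    def text(self):
        return "".join(self._chunks)


def strip_markup(html):
    """Return (plain_text, annotations) for an HTML fragment."""
    parser = MarkupToAnnotations()
    parser.feed(html)
    parser.close()
    return parser.text, parser.annotations


if __name__ == "__main__":
    text, annotations = strip_markup("<p>Hello <b>world</b>!</p>")
    print(text)
    for a in annotations:
        print(a.name, a.begin, a.end, repr(text[a.begin:a.end]))
```

Elements that are never closed (for example a bare br) produce no annotation in this sketch; a real converter would need a policy for those, and for whitespace normalization, which is exactly the kind of thing a type system proposal would have to pin down.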

I'm hoping to meet up with some of the Tika folks at ApacheCon next week, and I'll report when I get back.


--Thilo
