Re: plain text or HTML in the CAS?

[EMAIL PROTECTED] Thu, 26 Apr 2007 14:00:00 -0700

 -------------- Original message ----------------------
From: Thilo Goetz <[EMAIL PROTECTED]>
> [EMAIL PROTECTED] wrote:
> [...]
> > Longer term, I'd like to modify the NLP components I'm using to understand
> > these ElementAnnotations directly, bypassing the HTML but getting the same
> > benefits.  It would then return offsets against the plain text, and I
> > wouldn't have the translation problem.
> >
> > What I'd like to see happen in the community would be an agreement on
> > something like the ElementAnnotation above as a standard type definition
> > and with expected (required?) instances in UIMA.  Existing plain text
> > annotators would continue to work, while more up-to-date annotators would
> > use the ElementAnnotations and the plain text.
> >
> > So, are there any annotator suppliers out there who'd like to work on Type
> > System standards with me?  The lack of them seems to be breaking the
> > promise of UIMA as an integration platform.  The lowest common denominator
> > data representations in UIMA seem a bit too low from my point of view.
> > UIMA's mechanism is great--now we need some policy.
> [...]
> 
> I like your approach to convert mark-up to annotations, we have done 
> similar things in the past.  We should be able to leverage the new Tika 
> project for parsing html and other formats (such as pdf or office 
> formats).  Of course that still leaves the question of annotation type 
> standards, and it would be great if you and others could come up with a 
> proposal.
> 
> I'm hoping to meet up with some of the Tika folks at ApacheCon next 
> week, and I'll report when I get back.
> 
> --Thilo


Wow.  Interesting.  I wasn't aware of the Tika project.  It could be really 
useful to me.

To complete the picture above:

Type / Feature       Super Type / Range    Element Type / Allowed Values
ElementAnnotation    Annotation
   attributes        FSArray               AttributeFS
   children          FSArray               ElementAnnotation
   name              String
   parent            ElementAnnotation
   qualifiedName     String
   uri               String

AttributeFS          TOP
   localName         String
   qualifiedName     String
   type              String                CDATA, ID, IDREF, IDREFS, NMTOKEN,
                                           NMTOKENS, ENTITY, ENTITIES, NOTATION
   uri               String
   value             String

These are based on the callback methods from the SAX ContentHandler interface.  
I'm using the Neko HTML parser, as is the Tika project, I see.
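
To make the mapping from SAX callbacks to these types concrete, here is a minimal sketch in plain Java.  It is not UIMA code: Element and Attribute are hypothetical stand-ins for ElementAnnotation and AttributeFS, the begin/end fields stand in for the offsets inherited from Annotation, and it uses the JDK's built-in SAX parser rather than Neko, so it assumes well-formed (XHTML-like) input.  The handler accumulates the plain text from characters() and records each element's offsets against that text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ElementOffsets {

    /** Hypothetical stand-in for AttributeFS. */
    static final class Attribute {
        final String uri, localName, qualifiedName, type, value;

        Attribute(String uri, String localName, String qualifiedName,
                  String type, String value) {
            this.uri = uri;
            this.localName = localName;
            this.qualifiedName = qualifiedName;
            this.type = type;
            this.value = value;
        }
    }

    /** Hypothetical stand-in for ElementAnnotation. */
    static final class Element {
        final String uri, name, qualifiedName;
        final Element parent;
        final List<Attribute> attributes = new ArrayList<>();
        final List<Element> children = new ArrayList<>();
        final int begin;   // offset into the plain text where the element starts
        int end;           // offset into the plain text where the element ends

        Element(String uri, String name, String qualifiedName,
                Element parent, int begin) {
            this.uri = uri;
            this.name = name;
            this.qualifiedName = qualifiedName;
            this.parent = parent;
            this.begin = begin;
            this.end = begin;
        }
    }

    /** Collects the plain text and one Element per start/end tag pair. */
    static final class Handler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        final List<Element> elements = new ArrayList<>();  // document order
        private Element current;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            Element e = new Element(uri, localName, qName, current, text.length());
            for (int i = 0; i < atts.getLength(); i++) {
                e.attributes.add(new Attribute(atts.getURI(i), atts.getLocalName(i),
                        atts.getQName(i), atts.getType(i), atts.getValue(i)));
            }
            if (current != null) {
                current.children.add(e);
            }
            elements.add(e);
            current = e;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            current.end = text.length();
            current = current.parent;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static Handler parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);  // so uri and localName are populated
        Handler handler = new Handler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        Handler h = parse("<p>Hello <b class=\"x\">world</b>!</p>");
        System.out.println(h.text);
        for (Element e : h.elements) {
            System.out.println(e.qualifiedName + " [" + e.begin + "," + e.end + ")");
        }
    }
}
```

For the sample input in main, the plain text is "Hello world!" and the b element covers offsets 6 to 11 of it, so an annotator working on the plain text and one working on the elements share the same coordinates.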


In addition, it would be good to address standardizing document properties.  I 
see Tika is also looking at these, and thinking about using the Dublin Core 
standard.  I'm also using Dublin Core.

Properties could be represented like this:

PropertyFS           TOP
   name              String    A Dublin Core name
   scheme            String    A Dublin Core scheme; tells how to interpret
                               the value
   value             String

This could be used to represent any number of arbitrary properties, including 
charset, creator, created, modified, format, identifier, language, offset, 
size, and title.  This would obviate the need for the DocumentAnnotation and 
the SourceDocumentInformation types.
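
As a rough illustration of the name/scheme/value idea, here is a small plain-Java sketch.  Again this is not UIMA code: Property is a hypothetical stand-in for PropertyFS, the DocumentProperties holder is invented for the example, and the sample names, schemes, and values are only illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class DocumentProperties {

    /** Hypothetical stand-in for PropertyFS. */
    static final class Property {
        final String name;    // a Dublin Core name, e.g. "language"
        final String scheme;  // how to interpret the value; may be null
        final String value;

        Property(String name, String scheme, String value) {
            this.name = name;
            this.scheme = scheme;
            this.value = value;
        }
    }

    private final List<Property> properties = new ArrayList<>();

    /** Adds a property; the same name may appear more than once. */
    DocumentProperties add(String name, String scheme, String value) {
        properties.add(new Property(name, scheme, value));
        return this;
    }

    /** Returns every value recorded under the given name, in insertion order. */
    List<String> values(String name) {
        List<String> result = new ArrayList<>();
        for (Property p : properties) {
            if (p.name.equals(name)) {
                result.add(p.value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample values only; the schemes shown are assumptions for illustration.
        DocumentProperties props = new DocumentProperties()
                .add("title", null, "plain text or HTML in the CAS?")
                .add("language", "RFC3066", "en")
                .add("format", "IMT", "text/html")
                .add("created", "W3CDTF", "2007-04-26");
        System.out.println(props.values("format"));
    }
}
```

Because every property is just a name, scheme, and value, new properties need no type system change, which is the point of replacing the fixed DocumentAnnotation and SourceDocumentInformation features.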


Another thing would be standard types for extraction results, such as 
paragraphs, sentences, tokens, parts of speech, named entities, etc.


Greg
