-------------- Original message ----------------------
From: Thilo Goetz <[EMAIL PROTECTED]>
> [EMAIL PROTECTED] wrote:
> [...]
> > Longer term, I'd like to modify the NLP components I'm using to understand
> these ElementAnnotations directly, bypassing the HTML but getting the same
> benefits. It would then return offsets against the plain text, and I
> wouldn't
> have the translation problem.
> >
> > What I'd like to see happen in the community would be an agreement on
> something like the ElementAnnotation above as a standard type definition and
> with expected (required?) instances in UIMA. Existing plain text annotators
> would continue to work, while more up-to-date annotators would use the
> ElementAnnotations and the plain text.
> >
> > So, are there any annotator suppliers out there who'd like to work on Type
> System standards with me? The lack of them seems to be breaking the promise
> of
> UIMA as an integration platform. The lowest common denominator data
> representations in UIMA seem a bit too low from my point of view. UIMA's
> mechanism is great--now we need some policy.
> [...]
>
> I like your approach to convert mark-up to annotations, we have done
> similar things in the past. We should be able to leverage the new Tika
> project for parsing html and other formats (such as pdf or office
> formats). Of course that still leaves the question of annotation type
> standards, and it would be great if you and others could come up with a
> proposal.
>
> I'm hoping to meet up with some of the Tika folks at ApacheCon next
> week, and I'll report when I get back.
>
> --Thilo
Wow. Interesting. I wasn't aware of the Tika project. It could be really
useful to me.
To complete the picture above:
Type / Feature Name    Super Type / Range    Element Type / Allowed Values
ElementAnnotation      Annotation
  attributes           FSArray               AttributeFS
  children             FSArray               ElementAnnotation
  name                 String
  parent               ElementAnnotation
  qualifiedName        String
  uri                  String
AttributeFS            TOP
  localName            String
  qualifiedName        String
  type                 String                CDATA, ID, IDREF, IDREFS, NMTOKEN,
                                             NMTOKENS, ENTITY, ENTITIES, NOTATION
  uri                  String
  value                String
These are based on the callback methods of the SAX ContentHandler interface.
I'm using the Neko HTML parser, as is the Tika project, I see.
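To make the mapping from SAX callbacks to these types concrete, here is a rough sketch in plain Java. It is not my actual annotator: ElementSpan and MarkupToSpans are made-up names standing in for ElementAnnotation and the code that fills it, and it uses the JDK's own SAX parser on well-formed XML rather than Neko, so it runs without extra jars. Each element records begin/end offsets against the accumulated plain text, which is the point of the whole exercise.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for ElementAnnotation; not a UIMA class. */
class ElementSpan {
    final String uri;
    final String name;
    final String qualifiedName;
    final ElementSpan parent;
    final List<ElementSpan> children = new ArrayList<>();
    /** One entry per attribute: {uri, localName, qualifiedName, type, value}. */
    final List<String[]> attributes = new ArrayList<>();
    final int begin; // offset into the plain text where the element starts
    int end;         // offset into the plain text where the element ends

    ElementSpan(String uri, String name, String qualifiedName,
                ElementSpan parent, int begin) {
        this.uri = uri;
        this.name = name;
        this.qualifiedName = qualifiedName;
        this.parent = parent;
        this.begin = begin;
    }
}

/** Collects plain text and element spans from SAX callbacks. */
class MarkupToSpans extends DefaultHandler {
    final StringBuilder text = new StringBuilder();
    final List<ElementSpan> spans = new ArrayList<>(); // in document order
    private ElementSpan current;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        ElementSpan e = new ElementSpan(uri, localName, qName, current, text.length());
        for (int i = 0; i < atts.getLength(); i++) {
            e.attributes.add(new String[] {
                atts.getURI(i), atts.getLocalName(i), atts.getQName(i),
                atts.getType(i), atts.getValue(i)
            });
        }
        if (current != null) {
            current.children.add(e);
        }
        spans.add(e);
        current = e;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        current.end = text.length();
        current = current.parent;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    /** Parses well-formed XML; an HTML parser such as Neko would replace this step. */
    static MarkupToSpans parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            MarkupToSpans handler = new MarkupToSpans();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("could not parse markup", ex);
        }
    }
}
```

Calling MarkupToSpans.parse on a small fragment gives you the plain text in handler.text and one ElementSpan per element, each with offsets that a plain text annotator's results can be compared against directly. In a real annotator the spans would be written into the CAS as ElementAnnotation instances instead of being kept in a list.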
In addition, it would be good to address standardizing document properties. I
see Tika is also looking at these, and thinking about using the Dublin Core
standard. I'm also using Dublin Core.
Properties could be represented like this:
PropertyFS             TOP
  name                 String                A Dublin Core name
  scheme               String                A Dublin Core scheme; tells how to
                                             interpret the value
  value                String
This could be used to represent any number of arbitrary properties, including
charset, creator, created, modified, format, identifier, language, offset,
size, and title. This would obviate the need for the DocumentAnnotation and
the SourceDocumentInformation types.
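As a rough illustration of how such properties might be held and queried, here is a small plain-Java sketch. DocProperty and DocProperties are made-up names standing in for PropertyFS and whatever would hold the instances in the CAS, and the property values are placeholders. W3CDTF and IMT are Dublin Core encoding schemes for dates and media types; treat the exact scheme strings as an assumption about how they would be spelled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the proposed PropertyFS; not a UIMA class. */
final class DocProperty {
    final String name;   // a Dublin Core name, e.g. "created"
    final String scheme; // tells how to interpret the value; null if none
    final String value;

    DocProperty(String name, String scheme, String value) {
        this.name = name;
        this.scheme = scheme;
        this.value = value;
    }
}

/** Holds any number of properties; a name may occur more than once. */
final class DocProperties {
    private final List<DocProperty> props = new ArrayList<>();

    DocProperties add(String name, String scheme, String value) {
        props.add(new DocProperty(name, scheme, value));
        return this;
    }

    /** Returns every value recorded under the given name, in insertion order. */
    List<String> values(String name) {
        List<String> out = new ArrayList<>();
        for (DocProperty p : props) {
            if (p.name.equals(name)) {
                out.add(p.value);
            }
        }
        return out;
    }
}
```

For example, new DocProperties().add("format", "IMT", "text/html").add("created", "W3CDTF", "2007-11-05") records a media type and a creation date (both placeholder values), and values("format") returns the one recorded format. Keeping the properties as a flat list of name/scheme/value triples is what lets the same type carry charset, language, title and so on without a dedicated feature for each.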
Another thing would be standard types for extraction results, such as
paragraphs, sentences, tokens, parts of speech, named entities, etc.
Greg