Re: Document "properties" and SourceDocumentInformation

Ekaterina Buyko Wed, 07 Mar 2007 06:28:27 -0800

Dear Greg,

You certainly raise a good point in deploring the lack of a commonly shared (standard) UIMA annotation scheme for NLP purposes. Such a scheme would enable flexible plug-in of components developed at many different sites worldwide.

We encountered the same problem in the context of BOOTStrep (www.bootstrep.org), a European STREP project in which we are heavily involved together with six international partners. We use UIMA as a common platform for developing NLP software for text mining in biology. As part of our project activities we have, in the meantime, developed a multi-layered UIMA annotation type system. This type system currently contains six spec layers: document meta information (author, title, etc.), document structure and style information, morpho-syntax, syntax, and semantics (discourse to come). In our work, we integrated existing annotation schemes from the NLP community (such as TEI, Dublin Core, Penn Treebank, etc.) as much as possible. The scheme is designed with domain independence in mind, though some portions (e.g., document structure and, of course, semantics) introduce bits of domain dependence. Coverage of general-language applications (e.g., newspapers) should, however, not be a big problem.

Please see our paper at the upcoming UIMA workshop at GLDV 2007: http://incubator.apache.org/uima/downloads/gldv/gldv07-uima-hahn.pdf

We are aware that other teams are working on the same challenges, and we very much like the idea of coordinating these efforts in order to work out a common UIMA annotation scheme for NLP. Correspondingly, we find the idea of building a consortium within the Apache UIMA project really fascinating. It is certainly one way to speed up consensus on a commonly shared UIMA standard annotation scheme and to create an international UIMA community.


Best regards from Jena

Ekaterina Buyko & Udo Hahn

--

Ekaterina Buyko
Jena University Language and Information Engineering (JULIE) Lab
Phone: +49-3641-944322
Fax:   +49-3641-944321
email: [EMAIL PROTECTED]
URL:   http://www.coling.uni-jena.de


Thilo Goetz wrote:

[EMAIL PROTECTED] wrote:
What is the recommended way of storing document properties, such as "author", "date created", "title", etc.?
I also need some data for internal uses, such as the document size and URI.
One other requirement: this is not a closed vertical solution with a known set of annotators designed to inter-operate. This is an application platform that will use some known annotators but allow plugging in arbitrary unknown annotators from other companies (that's why one uses UIMA, of course!). Also, some of our annotators may be used in UIMA containers from other companies with unknown annotators. So my code can't depend on either the UIMA container providing, or all of the other annotators (but possibly our own) knowing about, any data structure containing these properties.
I see a few possibilities:

1. Add features to DocumentAnnotation
2. Add features to SourceDocumentInformation
3. Create my own annotation or TOP FS.
The documentation recommends not adding features to DocumentAnnotation if you are using JCas (I am). I agree--what if both my annotators and someone else's annotator have added features to DA? It just wouldn't work, right?
It's the same with SDI, if two annotators both add features to it. They are in conflict, and they can't be merged.
SDI is useful, however, since it has the document size and URI. Despite it being in a package called "examples", in truth it has become a standard. All the annotators that ship with UIMA use it. If you want to use the semantic search (Juru) indexing CAS Consumer, you have to use SDI. I'm sure many annotators in the world have used SDI.
I would like my annotators and UIMA container to be compatible with all those annotators. Therefore, I think I have to use SDI for size and URI, but not modify it.
Creating my own annotation (or is extending TOP FS better?) seems like the best answer. My UIMA container and set of annotators would know about it, and others' annotators wouldn't be affected. My annotators would have to gracefully degrade when running in a UIMA container that doesn't provide this new annotation.
What are people's thoughts?  1, 2 or 3?
If you use the JCas, as you say you do, definitely 3. There is no need to use an annotation; extending TOP would be sufficient.
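
To illustrate the "gracefully degrade" requirement of option 3, here is a minimal plain-Java sketch. It deliberately does not use the UIMA API: FakeCas is a hypothetical stand-in for a CAS, and the type name com.example.DocumentProperties and the "title" feature are assumptions made up for the example. The point is only that the annotator asks whether the custom type is present and falls back to a default when it is not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a CAS: maps a type name to one bag of feature values. */
final class FakeCas {
    private final Map<String, Map<String, String>> byType = new HashMap<>();

    void put(String typeName, Map<String, String> features) {
        byType.put(typeName, features);
    }

    /** Empty when the container never supplied a structure of this type. */
    Optional<Map<String, String>> get(String typeName) {
        return Optional.ofNullable(byType.get(typeName));
    }
}

final class TitleReader {
    // Assumed name for a custom type extending TOP (option 3); not a real UIMA type.
    static final String PROPS_TYPE = "com.example.DocumentProperties";

    /** Uses the custom type when present; falls back if the type or its title is missing. */
    static String titleOrDefault(FakeCas cas, String fallback) {
        return cas.get(PROPS_TYPE)
                  .map(features -> features.get("title"))
                  .orElse(fallback);
    }
}
```

In a real annotator the same check would be made against the type system or index of the actual CAS; the stand-in only shows the control flow.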
================
Longer term, I think we as a community need to define Type Systems that allow inter-operability of annotators and CAS Consumers. For example, we could create an official SourceDocumentInformation that allows arbitrary sets of document properties as simple name-value pairs. In other words, add this feature to SourceDocumentInformation:
        properties    uima.cas.FSArray    PropertyFS

    uima.PropertyFS    uima.cas.TOP
        name          uima.cas.String
        value         uima.cas.String
        scheme        uima.cas.String
I'm personally not a big fan of arbitrary attribute-value schemes like this. You need yet another place (outside the type system) where you document what the properties are that you define and expect.
And define that names, values, and schemes conform to the Dublin Core Metadata Initiative standards.
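
The proposed name/value/scheme triples can be pictured with a small plain-Java sketch. This is not UIMA code: DocProperty and DocProperties are hypothetical stand-ins for the proposed PropertyFS and for a SourceDocumentInformation that carries a "properties" array. The scheme and name in the usage note are taken from the Dublin Core element set as an example; which schemes would actually be used is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Plain-Java stand-in for the proposed PropertyFS (name, value, scheme); not a UIMA class. */
final class DocProperty {
    final String name;
    final String value;
    final String scheme;

    DocProperty(String name, String value, String scheme) {
        this.name = name;
        this.value = value;
        this.scheme = scheme;
    }
}

/** Stand-in for a SourceDocumentInformation that holds an array of properties. */
final class DocProperties {
    private final List<DocProperty> properties = new ArrayList<>();

    void add(String scheme, String name, String value) {
        properties.add(new DocProperty(name, value, scheme));
    }

    /** Returns the first value stored under the given scheme and name, if any. */
    Optional<String> find(String scheme, String name) {
        for (DocProperty p : properties) {
            if (p.scheme.equals(scheme) && p.name.equals(name)) {
                return Optional.of(p.value);
            }
        }
        return Optional.empty();
    }
}
```

For example, after props.add("http://purl.org/dc/elements/1.1/", "creator", "Jane Doe"), a lookup with the same scheme and name yields "Jane Doe", while a lookup for a property nobody stored yields an empty Optional. The sketch also makes the objection above concrete: nothing in the structure itself says which names a consumer may expect, so that has to be documented elsewhere.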
Similarly, I think we need to create Type System standards for representing document structure. For example, how could HTML elements and attributes be stored in the CAS such that all annotators could depend on them being there and therefore make intelligent use of them?
And finally, we need some Type System standards for representing certain common result annotations, such as lexical markup and named entities. How can we combine two annotators from different companies if they don't have a shared definition of the data flowing between them?
And isn't this the whole point of UIMA? It appears to me that the UIMA dream won't come true until we create these standards for data exchange or data transformation within the CAS.
In my opinion, the current situation really limits the usefulness of UIMA as a platform for text processing (unless you control every piece of code in the system, of course).
How do we start such a consortium?
This mailing list is a good start ;-). I know there are others who work on similar things, but I'll let them speak for themselves.
One issue of course is that it is difficult to agree on any common type system. It's hard enough to even agree on what an annotation is, let alone specific types of annotations. We could try to define a certain base set on Apache. I would hesitate to put more built-in types into UIMA itself, though. I'd rather have a type system repository where we modularly define certain kinds of type systems (such as HTML markup, for example), and that people can use, or not.
--Thilo
Thanks for listening,


Greg Holmberg
