Dear Greg,
You certainly raise a good point in deploring the lack of a commonly
shared (standard) UIMA annotation scheme for NLP purposes. Such a scheme
would allow components developed at many different sites worldwide to be
plugged together flexibly.
We encountered the same problem in the context of BOOTStrep
(www.bootstrep.org), a European STREP project in which we are heavily
involved together with six international partners. We use UIMA as a
common platform for developing NLP software for text mining in biology.
As part of our project activities, we have meanwhile developed a
multi-layered UIMA annotation type system. This type system currently
contains six spec layers: document meta information (author, title
etc.), document structure and style information, morpho-syntax, syntax
and semantics (discourse to come). In our work, we integrated existing
annotation schemes from the NLP community (such as TEI, Dublin Core,
Penn Treebank etc.) as far as possible. The scheme is designed with
domain independence in mind, though some portions (e.g., document
structure and semantics, of course) introduce bits of domain dependence.
Covering general-language applications (e.g., newspapers) should,
however, not be a big problem.
Please see our paper at the upcoming UIMA workshop at GLDV 2007:
http://incubator.apache.org/uima/downloads/gldv/gldv07-uima-hahn.pdf
We are aware that other teams are working on the same challenges, and we
very much like the idea of coordinating these efforts in order to work
out a common UIMA annotation scheme for NLP. Correspondingly, we find
the idea of building a consortium within the Apache UIMA project really
fascinating. It is certainly one way to speed up consensus on a commonly
shared standard UIMA annotation scheme and to create an international
UIMA community.
Best regards from Jena
Ekaterina Buyko & Udo Hahn
--
Ekaterina Buyko
Jena University Language and Information Engineering (JULIE) Lab
Phone: +49-3641-944322
Fax: +49-3641-944321
email: [EMAIL PROTECTED]
URL: http://www.coling.uni-jena.de
Thilo Goetz wrote:
[EMAIL PROTECTED] wrote:
What is the recommended way of storing document properties, such as
"author", "date created", "title", etc?
I also need some data for internal uses, such as the document size
and URI.
One other requirement: this is not a closed vertical solution with a
known set of annotators designed to inter-operate. This is an
application platform that will use some known annotators but allow
plugging in arbitrary unknown annotators from other companies (that's
why one uses UIMA, of course!). Also, some of our annotators may be
used in UIMA containers from other companies with unknown
annotators. So my code can't depend on either the UIMA container
providing, or all of the other annotators (but possibly our own)
knowing about, any data structure containing these properties.
I see a few possibilities:
1. Add features to DocumentAnnotation
2. Add features to SourceDocumentInformation
3. Create my own annotation or TOP FS.
The documentation recommends not adding features to
DocumentAnnotation if you are using JCas (I am). I agree--what if
both my annotators and someone else's annotator have added features
to DA? It just wouldn't work, right?
It's the same with SDI, if two annotators both add features to it.
They are in conflict, and they can't be merged.
SDI is useful however, since it has the document size and URI.
Despite it being in a package called "examples", in truth it's become
a standard. All the annotators that ship with UIMA use it. If you
want to use the semantic search (Juru) indexing CAS Consumer, you
have to use SDI. I'm sure many annotators in the world have used SDI.
I would like my annotators and UIMA container to be compatible with
all those annotators. Therefore, I think I have to use SDI for size
and URI, but not modify it.
Creating my own annotation (or is extending TOP FS better?) seems
like the best answer. My UIMA container and set of annotators would
know about it, and others' annotators wouldn't be affected. My
annotators would have to gracefully degrade when running in a UIMA
container that doesn't provide this new annotation.
What are people's thoughts? 1, 2 or 3?
If you use the JCas, as you say you do, definitely 3. There is no
need to use an annotation, extending TOP would be sufficient.
================
Longer term, I think we as a community need to define Type Systems
that allow inter-operability of annotators and CAS Consumers. For
example, we could create an official SourceDocumentInformation that
allows arbitrary sets of document properties as simple name-value
pairs. In other words, add this feature to SourceDocumentInformation:
    properties         uima.cas.FSArray    PropertyFS

    uima.PropertyFS    uima.cas.TOP
        name           uima.cas.String
        value          uima.cas.String
        scheme         uima.cas.String
I'm personally not a big fan of arbitrary attribute-value schemes like
this. You need yet another place (outside the type system) where you
document what the properties are that you define and expect.
And define that names, values, and schemes conform to the Dublin Core
Metadata Initiative standards.
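
To make the name/value/scheme idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the UIMA API: DocumentProperty and DocumentProperties are hypothetical stand-ins for the proposed uima.PropertyFS and for a SourceDocumentInformation carrying a "properties" array, and the scheme label "DC" for Dublin Core is an assumption, not something the proposal fixes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the proposed uima.PropertyFS (name, value, scheme). */
final class DocumentProperty {
    final String name;
    final String value;
    final String scheme;

    DocumentProperty(String name, String value, String scheme) {
        this.name = name;
        this.value = value;
        this.scheme = scheme;
    }
}

/** Hypothetical stand-in for a document-level object holding the "properties" array. */
final class DocumentProperties {
    // Assumed label for Dublin Core element names; the proposal does not fix one.
    static final String DUBLIN_CORE = "DC";

    private final List<DocumentProperty> properties;

    DocumentProperties(List<DocumentProperty> properties) {
        // Defensive copy so later changes to the caller's list are not visible here.
        this.properties = new ArrayList<>(properties);
    }

    /** Returns the first value stored under the given scheme and name, or empty if none. */
    Optional<String> find(String scheme, String name) {
        for (DocumentProperty p : properties) {
            if (p.scheme.equals(scheme) && p.name.equals(name)) {
                return Optional.ofNullable(p.value);
            }
        }
        return Optional.empty();
    }
}
```

An annotator could call find(DocumentProperties.DUBLIN_CORE, "title") and fall back to its own default when the result is empty, which is one way to degrade gracefully in a container that does not supply these properties.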
Similarly, I think we need to create Type System standards for
representing document structure. For example, how could HTML
elements and attributes be stored in the CAS such that all annotators
could depend on them being there and therefore make intelligent use
of them?
And finally, we need some Type System standards for representing
certain common result annotations, such as lexical markup and named
entities. How can we combine two annotators from different companies
if they don't have a shared definition of the data flowing between them?
And isn't this the whole point of UIMA? It appears to me that the
UIMA dream won't come true until we create these standards for data
exchange or data transformation within the CAS.
In my opinion, the current situation really limits the usefulness of
UIMA as a platform for text processing (unless you control every
piece of code in the system, of course).
How do we start such a consortium?
This mailing list is a good start ;-). I know there are others who
work on similar things, but I'll let them speak for themselves.
One issue of course is that it is difficult to agree on any common
type system. It's hard enough to even agree on what an annotation is,
let alone specific types of annotations. We could try to define a
certain base set on Apache. I would hesitate to put more built-in
types into UIMA itself, though. I'd rather have a type system
repository where we modularly define certain kinds of type systems
(such as html markup, for example), and that people can use, or not.
--Thilo
Thanks for listening,
Greg Holmberg