On 2/28/07, Thilo Goetz <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > 1. Add features to DocumentAnnotation
> > 2. Add features to SourceDocumentInformation
> > 3. Create my own annotation or TOP FS.
>
> If you use the JCas, as you say you do, definitely 3. There is no need to use an annotation, extending TOP would be sufficient.
I agree. Adding features to existing types should generally be avoided if there's an acceptable alternative, especially if you're using JCas. BTW, in 2.1 we've added the ability to index and retrieve non-Annotation FeatureStructures without having to define a custom index in your component descriptor, which should make it much more convenient to use a document metadata Type that extends TOP.
> > Longer term, I think we as a community need to define Type Systems that allow inter-operability of annotators and CAS Consumers. For example, we could create an official SourceDocumentInformation that allows arbitrary sets of document properties as simple name-value pairs. In other words, add this feature to SourceDocumentInformation:
> >
> >   properties   uima.cas.FSArray   PropertyFS
> >
> >   uima.PropertyFS   uima.cas.TOP
> >     name     uima.cas.String
> >     value    uima.cas.String
> >     scheme   uima.cas.String
>
> I'm personally not a big fan of arbitrary attribute-value schemes like this. You need yet another place (outside the type system) where you document what the properties are that you define and expect.
Agreed. Our hope is that the type system would be used for declaring these things. There can be more than one Type declared for holding different kinds of document metadata (e.g. a DublinCoreMetadata type, in addition to other types with different properties). It might be useful if these all extended some base DocumentMetadata type that did not define any features, just so it would be clear that they all represent some kind of document metadata.
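To make the base-type idea concrete, here is a rough sketch in plain Java. It does not use the UIMA API: DocumentMetadata, DublinCoreMetadata, CrawlMetadata and SimpleIndex are hypothetical stand-ins for types that would be declared in a type system (with the base type extending uima.cas.TOP) and for a CAS index. The feature names title, creator and sourceUri are also just assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical feature-less base type. In a real type system this would
// be declared as a subtype of uima.cas.TOP; here it is only a marker.
abstract class DocumentMetadata { }

// Hypothetical metadata type whose properties are declared as features,
// so they are documented by the type itself.
final class DublinCoreMetadata extends DocumentMetadata {
    final String title;
    final String creator;

    DublinCoreMetadata(String title, String creator) {
        this.title = title;
        this.creator = creator;
    }
}

// A second hypothetical metadata type with a different set of properties.
final class CrawlMetadata extends DocumentMetadata {
    final String sourceUri;

    CrawlMetadata(String sourceUri) {
        this.sourceUri = sourceUri;
    }
}

// Stand-in for an index over arbitrary objects that can be queried by type.
final class SimpleIndex {
    private final List<Object> items = new ArrayList<>();

    void add(Object item) {
        items.add(item);
    }

    // Returns every indexed item that is an instance of the given type,
    // including instances of its subtypes.
    <T> List<T> select(Class<T> type) {
        List<T> matches = new ArrayList<>();
        for (Object item : items) {
            if (type.isInstance(item)) {
                matches.add(type.cast(item));
            }
        }
        return matches;
    }
}

class MetadataSketch {
    public static void main(String[] args) {
        SimpleIndex index = new SimpleIndex();
        index.add(new DublinCoreMetadata("Annual report", "J. Doe"));
        index.add(new CrawlMetadata("http://example.org/report.html"));
        index.add("some unrelated object");

        // Asking for the base type finds all kinds of document metadata.
        System.out.println(index.select(DocumentMetadata.class).size());

        // Asking for a specific type gives access to its declared features.
        for (DublinCoreMetadata dc : index.select(DublinCoreMetadata.class)) {
            System.out.println(dc.title + " / " + dc.creator);
        }
    }
}
```

The point of the sketch is only that a consumer can ask for "any document metadata" through the base type, while each concrete type still declares its own properties, rather than relying on name-value pairs documented somewhere else.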
> > Similarly, I think we need to create Type System standards for representing document structure. For example, how could HTML elements and attributes be stored in the CAS such that all annotators could depend on them being there and therefore make intelligent use of them?
> >
> > And finally, we need some Type System standards for representing certain common result annotations, such as lexical markup and named entities. How can we combine two annotators from different companies if they don't have a shared definition of the data flowing between them?
> >
> > And isn't this the whole point of UIMA? It appears to me that the UIMA dream won't come true until we create these standards for data exchange or data transformation within the CAS.
> >
> > In my opinion, the current situation really limits the usefulness of UIMA as a platform for text processing (unless you control every piece of code in the system, of course).
> >
> > How do we start such a consortium?
>
> This mailing list is a good start ;-). I know there are others who work on similar things, but I'll let them speak for themselves. One issue of course is that it is difficult to agree on any common type system. It's hard enough to even agree on what an annotation is, let alone specific types of annotations. We could try to define a certain base set on Apache. I would hesitate to put more built-in types into UIMA itself, though. I'd rather have a type system repository where we modularly define certain kinds of type systems (such as html markup, for example), and that people can use, or not.
Right. I think largely it's the UIMA users, not the framework developers, who would have to participate in forging agreements on common type systems. So uima-user seems like a good place to have discussions like that, at least once this list has a larger number of subscribers, which we hope will happen after the Apache release is out and people have migrated to it. Possibly, type systems that have gotten significant support on uima-user might be included in the Apache UIMA release, initially as part of the "sandbox".

-Adam
