I played around with the existing Python support for UIMA and wasn't
really satisfied with it.  It's all done through a SWIG interface to
C++, and the result isn't exactly easy to use.  So I put together a
pure-Python package that provides support for reading and writing UIMA
CAS data files.  The main motivation behind writing this package was to
allow UIMA data to be read and written by Python programs in a manner
that is natural to the Python language.  Here's a very simple example
use case:

>>> import pycas
>>> # Load a CAS from an XMI or an XCAS file:
>>> cas = pycas.xml.load_cas('myDocument.xml', 'myTypeSystem.xml')
>>> # Look up a type object from myTypeSystem.xml:
>>> Token = cas.type_system['org.mydomain.Token']
>>> # Iterate over all instances of that type, and perform some work:
>>> for token in cas.get_annotation_index(Token):
...     token.someProperty = func(token.someOtherProperty)
>>> # Write the modified CAS to an XMI file:
>>> pycas.xml.save_cas(cas, 'myModifiedDocument.xml')

I put up a temporary webpage for it:

http://www.cis.upenn.edu/~edloper/pycas/

I'd like to release it as an open source project, but wanted to get
feedback from the good UIMA folks at Apache & IBM first.  Some
possibilities include: (a) releasing it as a standalone project; (b)
incorporating it into the main UIMA project; and (c) adding it under
the "corpus reader" subpackage of NLTK (http://nltk.org).  (The name
"pycas" could be changed as well -- I picked it by analogy with JCas.)

n.b.: pycas does not attempt to provide support for many of the
"framework" features of UIMA, including the ability to combine
processing components together to create applications. It focuses only
on providing access to the data structures that UIMA uses to manage
annotations.  If someone else wants to extend what I've done, that's
fine, but all I really wanted was convenient read/write access to UIMA
data files.
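
To give a rough idea of what "read/write access to UIMA data files"
involves at the XML level, here is a minimal sketch that uses only the
standard library's xml.etree.ElementTree, not pycas.  The sample XMI
document, the org.mydomain.Token namespace, the "upper" attribute, and
the uppercase_tokens helper are all made up for illustration; pycas
itself may well work quite differently.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, modeled on the general layout of a UIMA XMI file.
# The type org.mydomain.Token is hypothetical.
SAMPLE_XMI = (
    '<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"'
    ' xmlns:cas="http:///uima/cas.ecore"'
    ' xmlns:mydomain="http:///org/mydomain.ecore" xmi:version="2.0">'
    '<cas:NULL xmi:id="0"/>'
    '<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"'
    ' mimeType="text" sofaString="Hello world"/>'
    '<mydomain:Token xmi:id="2" sofa="1" begin="0" end="5"/>'
    '<mydomain:Token xmi:id="3" sofa="1" begin="6" end="11"/>'
    '<cas:View sofa="1" members="2 3"/>'
    '</xmi:XMI>'
)

SOFA_TAG = "{http:///uima/cas.ecore}Sofa"
TOKEN_TAG = "{http:///org/mydomain.ecore}Token"


def uppercase_tokens(xmi_text):
    """Parse an XMI string, annotate each Token, and serialize it again.

    Hypothetical helper: it stores the upper-cased covered text of each
    Token in an illustrative "upper" attribute.
    """
    root = ET.fromstring(xmi_text)
    # The document text lives on the Sofa element.
    text = root.find(SOFA_TAG).get("sofaString")
    for token in root.iter(TOKEN_TAG):
        begin, end = int(token.get("begin")), int(token.get("end"))
        token.set("upper", text[begin:end].upper())
    return ET.tostring(root, encoding="unicode")


modified = uppercase_tokens(SAMPLE_XMI)
```

Working on raw elements like this means handling namespaces, offsets,
and string-valued attributes by hand, which is the kind of bookkeeping
a typed, Pythonic wrapper is meant to hide.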

-Edward
