On Thu, 28 Jan 2010 00:17:40 -0800, Thilo Goetz <[email protected]> wrote:

On 01/28/2010 03:38, Greg Holmberg wrote:
[...]
Any thoughts on going in this direction (EXI)?  Can you think of any
alternatives (where the recipient is Java, but not running UIMA)?

I don't know what your requirements are exactly in terms of
memory overhead, are your FSs full graphs or just trees etc.,

Well, I want this to work for any AggregateAnalyzer, so I'm not designing to a particular application. I'm designing a general infrastructure to deploy UIMA apps.

It happens that the application I'm initially targeting has a complex type system (12 types with one-to-many and many-to-many relationships) that results in a large network of objects with cycles. And we'll be adding more types in the future. And it's a massive amount of data--every token gets annotated, possibly more than once. Hence my concern about the cost of sending that data over the network to a single repository in a massively scaled system.

but one alternative I've been using lately is JSON.  It's
a light-weight, self-describing format, human readable, and
there are easy to use Java libraries available.  It's also
way more compact than xmi.  Just a thought.  The downside is,
there's no (freely available) UIMA integration that I know of,
you would have to do that yourself.

I don't know much about JSON. The above sounds good--I'll have to look into it. Could it handle arbitrary UIMA TypeSystems with cyclic graphs?
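
To make the cyclic-graph question concrete: JSON itself is tree-shaped, so a cycle cannot be nested directly. One common workaround, similar in spirit to the xmi:id references in XMI, is to give every object an id and refer to other objects by id. The sketch below is only an illustration of that idea; FsNode and CyclicJsonSketch are hypothetical stand-ins, not UIMA classes, and no JSON library is assumed.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a feature structure: a type name plus references
// to other nodes. This is NOT a UIMA class.
final class FsNode {
    final String type;
    final List<FsNode> refs = new ArrayList<>();
    FsNode(String type) { this.type = type; }
}

final class CyclicJsonSketch {
    // Emits a flat JSON array. Each object carries an "id" and points at other
    // objects by id, so cycles never require nesting.
    // Assumption: type names need no JSON escaping (no quotes or backslashes).
    static String toJson(FsNode root) {
        Map<FsNode, Integer> ids = new IdentityHashMap<>();
        List<FsNode> order = new ArrayList<>();
        assignIds(root, ids, order);
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < order.size(); i++) {
            FsNode n = order.get(i);
            if (i > 0) sb.append(',');
            sb.append("{\"id\":").append(i)
              .append(",\"type\":\"").append(n.type).append("\",\"refs\":[");
            for (int j = 0; j < n.refs.size(); j++) {
                if (j > 0) sb.append(',');
                sb.append(ids.get(n.refs.get(j)));
            }
            sb.append("]}");
        }
        return sb.append(']').toString();
    }

    // Depth-first numbering. The identity map stops the walk when a cycle closes.
    private static void assignIds(FsNode n, Map<FsNode, Integer> ids, List<FsNode> order) {
        if (ids.containsKey(n)) return;
        ids.put(n, order.size());
        order.add(n);
        for (FsNode r : n.refs) assignIds(r, ids, order);
    }
}
```

The recursive numbering is fine for a sketch, but a very deep graph would need an explicit stack, and a real encoder would also have to write feature values and array types.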

But I'm always concerned about any text-based data format--both the data size and the CPU/memory usage during generation and parsing. Also, since I can't just plug a ContentHandler into XmiCasSerializer to produce JSON (or can I?), it would be a lot more work. I'd have to write code that does something similar to what XmiCasSerializer does--about 1400 lines. And it requires intimate knowledge of the low-level CAS implementation and use of some non-public UIMA APIs.

Since the code in XmiCasSerializer is so complicated, I was wondering if it could be generalized, to make it easier to write new CAS serializers that don't use SAX or XML? For example, could we write some sort of general CAS traverser to hide the low-level details, so that someone who wants to implement a serializer for some new data format could just plug in a callback? Similar to the concept of a ContentHandler, but not using SAX or XML.
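
To illustrate the kind of traverser I have in mind, here is a rough sketch. Everything in it is hypothetical: GraphNode stands in for a feature structure, GraphHandler is the plug-in callback (the role ContentHandler plays for SAX), and GraphTraverser hides the graph walk and the id bookkeeping. None of these are existing UIMA APIs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a feature structure. This is NOT a UIMA class.
final class GraphNode {
    final String type;
    final List<GraphNode> refs = new ArrayList<>();
    GraphNode(String type) { this.type = type; }
}

// Hypothetical callback that a format-specific serializer would implement.
interface GraphHandler {
    void startNode(int id, String type);
    void reference(int fromId, int toId);
    void endNode(int id);
}

final class GraphTraverser {
    // Visits every reachable node exactly once, in breadth-first order, and
    // reports each node and each outgoing reference to the handler.
    static void traverse(GraphNode root, GraphHandler handler) {
        Map<GraphNode, Integer> ids = new IdentityHashMap<>();
        Deque<GraphNode> queue = new ArrayDeque<>();
        ids.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            GraphNode n = queue.remove();
            int id = ids.get(n);
            handler.startNode(id, n.type);
            for (GraphNode r : n.refs) {
                Integer rid = ids.get(r);
                if (rid == null) {
                    // First time this node is seen: number it and schedule a visit.
                    rid = ids.size();
                    ids.put(r, rid);
                    queue.add(r);
                }
                handler.reference(id, rid);
            }
            handler.endNode(id);
        }
    }
}
```

A JSON writer, a binary writer, or a database loader would then each be one GraphHandler implementation, with no knowledge of how the graph is stored.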

Or am I throwing the baby out with the bathwater? Is ContentHandler already capable of that? For example, could a ContentHandler produce JSON or DataObject-style output?

Here's another thought: what would it take to make FeatureStructure (and related classes) implement Serializable by writing writeObject() and readObject() methods? Then I could just use RMI to send the CAS over the network.
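
For reference, standard Java serialization (which RMI uses to marshal arguments) already preserves shared references and cycles within a single stream, so the open question is the CAS internals rather than the graph shape. The sketch below shows this with a hypothetical SerNode class standing in for a feature structure; it says nothing about what FeatureStructure would actually need in its writeObject() and readObject() methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a feature structure. This is NOT a UIMA class.
final class SerNode implements Serializable {
    private static final long serialVersionUID = 1L;
    final String type;
    final List<SerNode> refs = new ArrayList<>();
    SerNode(String type) { this.type = type; }
}

final class SerializableSketch {
    // Writes the graph to a byte array and reads it back, which is roughly
    // what happens to an argument passed through an RMI call.
    static SerNode roundTrip(SerNode root) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(root);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerNode) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }
}
```

One caveat: default serialization walks the graph recursively, so a very deep chain of references can overflow the stack, and the recipient still needs the same classes on its classpath.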

So far, EXI looks like the best solution--fast, network-efficient, recipient doesn't need UIMA, and a low cost of development. The GPL license is the only problem for me.

Thoughts?


Greg Holmberg
