Re: XMI parsing?

Greg Holmberg Thu, 28 Jan 2010 12:03:42 -0800

On Thu, 28 Jan 2010 00:17:40 -0800, Thilo Goetz <[email protected]> wrote:

On 01/28/2010 03:38, Greg Holmberg wrote:
[...]

Any thoughts on going in this direction (EXI)?  Can you think of any
alternatives (where the recipient is Java, but not running UIMA)?


I don't know what your requirements are exactly in terms of
memory overhead, are your FSs full graphs or just trees etc.,

Well, I want this to work for any AggregateAnalyzer, so I'm not designing to a particular application. I'm designing a general infrastructure to deploy UIMA apps.

It happens that the application I'm initially targeting has a complex type system (12 types with one-to-many and many-to-many relationships) that results in a large network of objects with cycles. And we'll be adding more types in the future. And it's a massive amount of data--every token gets annotated, possibly more than once. Hence my concern for the cost of communicating that data over the network in a massively scaled system to a single repository.

but one alternative I've been using lately is JSON.  It's
a light-weight, self-describing format, human readable, and
there are easy to use Java libraries available.  It's also
way more compact than xmi.  Just a thought.  The downside is,
there's no (freely available) UIMA integration that I know of,
you would have to do that yourself.

I don't know much about JSON. The above sounds good--I'll have to look into it. Could it handle arbitrary UIMA TypeSystems with cyclic graphs?
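To make the cyclic-graph question concrete, here is a minimal sketch of one common way to fit a graph with cycles into JSON: write every object once in a flat array and express references as integer ids. Node and FlatJson are hypothetical stand-ins, not UIMA classes, and the sketch assumes type names need no JSON escaping.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a feature structure; not a UIMA class.
class Node {
    final String type;
    final List<Node> refs = new ArrayList<>();
    Node(String type) { this.type = type; }
}

class FlatJson {
    // Writes every node reachable from root as one flat JSON array.
    // References are emitted as integer ids, so cycles terminate.
    static String toJson(Node root) {
        Map<Node, Integer> ids = new IdentityHashMap<>();
        List<Node> order = new ArrayList<>();
        assignIds(root, ids, order);
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < order.size(); i++) {
            Node n = order.get(i);
            if (i > 0) sb.append(',');
            // Assumption: type names contain nothing that needs JSON escaping.
            sb.append("{\"id\":").append(i)
              .append(",\"type\":\"").append(n.type).append("\",\"refs\":[");
            for (int j = 0; j < n.refs.size(); j++) {
                if (j > 0) sb.append(',');
                sb.append(ids.get(n.refs.get(j)));
            }
            sb.append("]}");
        }
        return sb.append(']').toString();
    }

    // Depth-first numbering; a node already numbered is not visited again.
    private static void assignIds(Node n, Map<Node, Integer> ids, List<Node> order) {
        if (ids.containsKey(n)) return;
        ids.put(n, order.size());
        order.add(n);
        for (Node r : n.refs) assignIds(r, ids, order);
    }
}
```

The recursive numbering is fine for a sketch, but a very deep graph would probably want an explicit stack or queue instead.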

But I'm always concerned about any text-based data format--both the data size and the CPU/memory usage during generating and parsing. Also, since I can't just plug a ContentHandler into XmiCasSerializer to produce JSON (or can I?), it would be a lot more work. I'd have to write code that does something similar to what XmiCasSerializer does--about 1400 lines. And it requires intimate knowledge of the low-level CAS implementation and use of some non-public UIMA APIs.

Since the code in XmiCasSerializer is so complicated, I was wondering if it could be generalized, to make it easier to write new CAS serializers that don't use SAX or XML? For example, could we write some sort of general CAS traverser to hide the low-level details, so that someone who wants to implement a serializer for some new data format could just plug in a callback? Similar to the concept of a ContentHandler, but not using SAX or XML.
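As a rough illustration of the kind of traverser-plus-callback split I mean, here is a minimal sketch over a toy object graph. Item, GraphHandler, and GraphTraverser are all hypothetical names invented for this example; none of them exist in UIMA, and a real version would have to walk the CAS rather than a Map of features.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a feature structure; not a UIMA class.
class Item {
    final String type;
    // Feature value is either a primitive (stored as String here) or another Item.
    final Map<String, Object> features = new LinkedHashMap<>();
    Item(String type) { this.type = type; }
}

// Hypothetical callback, playing the role that ContentHandler plays for SAX.
interface GraphHandler {
    void startItem(int id, String type);
    void primitive(String feature, String value);
    void reference(String feature, int targetId);
    void endItem(int id);
}

class GraphTraverser {
    // Visits each reachable item exactly once, in breadth-first order,
    // and reports it to the handler. Cycles show up only as reference ids.
    static void traverse(Item root, GraphHandler handler) {
        Map<Item, Integer> ids = new IdentityHashMap<>();
        List<Item> queue = new ArrayList<>();
        ids.put(root, 0);
        queue.add(root);
        for (int i = 0; i < queue.size(); i++) {   // queue grows while we iterate
            Item item = queue.get(i);
            handler.startItem(i, item.type);
            for (Map.Entry<String, Object> f : item.features.entrySet()) {
                if (f.getValue() instanceof Item) {
                    Item target = (Item) f.getValue();
                    Integer id = ids.get(target);
                    if (id == null) {              // first time seen: assign next id
                        id = queue.size();
                        ids.put(target, id);
                        queue.add(target);
                    }
                    handler.reference(f.getKey(), id);
                } else {
                    handler.primitive(f.getKey(), String.valueOf(f.getValue()));
                }
            }
            handler.endItem(i);
        }
    }
}
```

A serializer for a new format would then only implement GraphHandler and never touch the traversal or the id bookkeeping.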

Or am I throwing the baby out with the bath-water? Is ContentHandler already capable of that? For example, could a ContentHandler produce JSON or DataObject-style output?
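In principle a ContentHandler only receives events, so what it writes is up to the implementation. Here is a minimal sketch of a handler that turns element and attribute events into a flat JSON array. JsonEmittingHandler is a hypothetical class; to keep the example self-contained it is driven by the JDK's SAX parser over a literal XML string rather than by XmiCasSerializer, and it assumes names and values need no JSON escaping.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: receives SAX events, writes JSON instead of XML.
class JsonEmittingHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder("[");
    private boolean first = true;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!first) out.append(',');
        first = false;
        // Assumption: element names and attribute values need no JSON escaping.
        out.append("{\"element\":\"").append(qName).append('"');
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(",\"").append(atts.getQName(i)).append("\":\"")
               .append(atts.getValue(i)).append('"');
        }
        out.append('}');
    }

    String json() { return out + "]"; }

    // Feeds the handler from a plain SAX parse, standing in for a serializer.
    static String convert(String xml) {
        try {
            JsonEmittingHandler handler = new JsonEmittingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.json();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This flattens element nesting and ignores character data, so it only suits XML whose content lives in attributes; anything richer would need the handler to track depth.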

Here's another thought: what would it take to make FeatureStructure (and related classes) implement Serializable by writing writeObject() and readObject() methods? Then I could just use RMI to send the CAS over the network.
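For reference, here is a minimal sketch of what the writeObject()/readObject() pattern looks like on a toy class with a cyclic reference. Annot and RoundTrip are hypothetical stand-ins, not UIMA classes; whether the real FeatureStructure implementations could be made to work this way is exactly the open question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a feature structure; not a UIMA class.
class Annot implements Serializable {
    private static final long serialVersionUID = 1L;
    transient String type;   // transient, so it is written by hand below
    Annot other;             // may point back, forming a cycle

    Annot(String type) { this.type = type; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();   // writes non-transient fields, including 'other'
        out.writeUTF(type);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        type = in.readUTF();
    }
}

class RoundTrip {
    // Serializes to a byte array and back, as RMI would do across the wire.
    static Annot copy(Annot a) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(a);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Annot) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Java serialization tracks object identity within a stream, so the cycle survives the round trip. The catch is that the recipient needs the same classes on its classpath, and deeply nested graphs are serialized recursively.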

So far, EXI looks like the best solution--fast, network-efficient, recipient doesn't need UIMA, and a low cost of development. The GPL license is the only problem for me.


Thoughts?


Greg Holmberg
