On Thu, 28 Jan 2010 00:17:40 -0800, Thilo Goetz wrote:
> On 01/28/2010 03:38, Greg Holmberg wrote:
> [...]
>> Any thoughts on going in this direction (EXI)? Can you think of any
>> alternatives (where the recipient is Java, but not running UIMA)?
> I don't know what your requirements are exactly in terms of
> memory overhead, are your FSs full graphs or just trees etc.,
Well, I want this to work for any AggregateAnalyzer, so I'm not designing
to a particular application. I'm designing a general infrastructure to
deploy UIMA apps.
It happens that the application I'm initially targeting has a complex type
system (12 types with one-to-many and many-to-many relationships) that
results in a large network of objects with cycles. And we'll be adding
more types in the future. And it's a massive amount of data--every token
gets annotated, possibly more than once. Hence my concern for the cost of
communicating that data over the network in a massively scaled system to a
single repository.
> but one alternative I've been using lately is JSON. It's
> a light-weight, self-describing format, human readable, and
> there are easy to use Java libraries available. It's also
> way more compact than xmi. Just a thought. The downside is,
> there's no (freely available) UIMA integration that I know of,
> you would have to do that yourself.
I don't know much about JSON. The above sounds good--I'll have to look
into it. Could it handle arbitrary UIMA TypeSystems with cyclic graphs?
But I'm always concerned about any text-based data format--both the data
size and the CPU/memory usage during generating and parsing. Also, since
I can't just plug a ContentHandler into XmiCasSerializer to produce JSON
(or can I?), it would be a lot more work. I'd have to write code that
does something similar to what XmiCasSerializer does--about 1400 lines.
And it requires intimate knowledge of the low-level CAS implementation and
use of some non-public UIMA APIs.
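On the cyclic-graph question: JSON itself is a tree format, so cycles would have to be encoded by convention. Here is a minimal sketch of one such convention, assuming each object gets an integer id and references are written as ids instead of nested objects. The Node class is a hypothetical stand-in for a feature structure, not a UIMA type, and type names are assumed to need no JSON escaping.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class CyclicJsonSketch {
    /** Hypothetical stand-in for a feature structure: a type name plus references. */
    static final class Node {
        final String type;
        final List<Node> refs = new ArrayList<>();
        Node(String type) { this.type = type; }
    }

    /** Flattens the graph reachable from root into a JSON array; references become integer ids. */
    static String toJson(Node root) {
        Map<Node, Integer> ids = new IdentityHashMap<>();
        List<Node> order = new ArrayList<>();
        assignIds(root, ids, order);
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < order.size(); i++) {
            Node n = order.get(i);
            if (i > 0) sb.append(',');
            // Assumption: type names contain no characters that need JSON escaping.
            sb.append("{\"id\":").append(i)
              .append(",\"type\":\"").append(n.type).append("\",\"refs\":[");
            for (int j = 0; j < n.refs.size(); j++) {
                if (j > 0) sb.append(',');
                sb.append(ids.get(n.refs.get(j)));
            }
            sb.append("]}");
        }
        return sb.append(']').toString();
    }

    /** Depth-first numbering; the identity map stops the walk from looping on cycles. */
    private static void assignIds(Node n, Map<Node, Integer> ids, List<Node> order) {
        if (ids.containsKey(n)) return;
        ids.put(n, order.size());
        order.add(n);
        for (Node r : n.refs) assignIds(r, ids, order);
    }

    public static void main(String[] args) {
        Node token = new Node("Token");
        Node entity = new Node("Entity");
        token.refs.add(entity);
        entity.refs.add(token); // cycle back to the token
        System.out.println(toJson(token));
        // prints [{"id":0,"type":"Token","refs":[1]},{"id":1,"type":"Entity","refs":[0]}]
    }
}
```

The recipient would rebuild the graph in two passes: create all objects by id, then resolve the references. The recursive numbering is only for brevity; a very large graph would want an explicit work list.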
Since the code in XmiCasSerializer is so complicated, I was wondering if
it could be generalized, to make it easier to write new CAS serializers
that don't use SAX or XML? For example, could we write some sort of
general CAS traverser to hide the low-level details, so that someone who
wants to implement a serializer for some new data format could just plug
in a callback? Similar to the concept of a ContentHandler, but not using
SAX or XML.
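To make the callback idea concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: Fs stands in for a feature structure, and FsHandler and traverse() are names I made up, not existing UIMA APIs. The traverser owns the graph walk and the id assignment, so a new format only has to implement the handler.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TraverserSketch {
    /** Hypothetical stand-in for a feature structure; not a UIMA class. */
    static final class Fs {
        final String type;
        // Feature values are either primitives (String, Number) or other Fs objects.
        final Map<String, Object> features = new LinkedHashMap<>();
        Fs(String type) { this.type = type; }
    }

    /** Hypothetical callback, playing the role ContentHandler plays for SAX. */
    interface FsHandler {
        void startFs(int id, String type);
        void primitive(String feature, Object value);
        void reference(String feature, int targetId);
        void endFs(int id);
    }

    /** Visits every Fs reachable from root exactly once, in discovery order. */
    static void traverse(Fs root, FsHandler handler) {
        Map<Fs, Integer> ids = new IdentityHashMap<>();
        List<Fs> queue = new ArrayList<>();
        ids.put(root, 0);
        queue.add(root);
        for (int i = 0; i < queue.size(); i++) { // queue grows as new objects are found
            Fs fs = queue.get(i);
            handler.startFs(i, fs.type);
            for (Map.Entry<String, Object> e : fs.features.entrySet()) {
                if (e.getValue() instanceof Fs) {
                    Fs target = (Fs) e.getValue();
                    Integer id = ids.get(target);
                    if (id == null) {            // first sighting: assign the next id
                        id = queue.size();
                        ids.put(target, id);
                        queue.add(target);
                    }
                    handler.reference(e.getKey(), id);
                } else {
                    handler.primitive(e.getKey(), e.getValue());
                }
            }
            handler.endFs(i);
        }
    }

    /** Example handler that renders a simple line-oriented text format. */
    static String dump(Fs root) {
        StringBuilder sb = new StringBuilder();
        traverse(root, new FsHandler() {
            public void startFs(int id, String type) { sb.append('#').append(id).append(' ').append(type); }
            public void primitive(String f, Object v) { sb.append(' ').append(f).append('=').append(v); }
            public void reference(String f, int t) { sb.append(' ').append(f).append("->#").append(t); }
            public void endFs(int id) { sb.append('\n'); }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        Fs token = new Fs("Token");
        Fs entity = new Fs("Entity");
        token.features.put("begin", 0);
        token.features.put("entity", entity);
        entity.features.put("head", token); // cycle back to the token
        System.out.print(dump(token));
    }
}
```

A JSON, EXI, or binary writer would then just be another FsHandler implementation, with no knowledge of how the graph is stored.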
Or am I throwing the baby out with the bathwater? Is ContentHandler
already capable of that? For example, could a ContentHandler produce JSON
or DataObject-style output?
Here's another thought: what would it take to make FeatureStructure (and
related classes) implement Serializable by writing writeObject() and
readObject() methods? Then I could just use RMI to send the CAS over the
network.
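For reference, this is the general shape of what I mean, sketched on a toy class and not on the real FeatureStructure implementation. Fs here is a hypothetical stand-in, and the fields are marked transient only so that writeObject() and readObject() have to do the work explicitly. Java serialization already tracks object identity, so the cycle survives the round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class SerializableSketch {
    /** Hypothetical stand-in for a feature structure; not a UIMA class. */
    static final class Fs implements Serializable {
        private static final long serialVersionUID = 1L;
        // transient: excluded from default serialization, handled by hand below.
        transient String type;
        transient List<Fs> refs = new ArrayList<>();

        Fs(String type) { this.type = type; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeUTF(type);
            out.writeInt(refs.size());
            // An object already written to this stream is emitted as a back-reference.
            for (Fs r : refs) out.writeObject(r);
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            type = in.readUTF();
            int n = in.readInt();
            refs = new ArrayList<>(n); // field initializers do not run on deserialization
            for (int i = 0; i < n; i++) refs.add((Fs) in.readObject());
        }
    }

    /** Serializes to a byte array and back, standing in for the network hop. */
    static Fs roundTrip(Fs root) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(root);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Fs) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Fs token = new Fs("Token");
        Fs entity = new Fs("Entity");
        token.refs.add(entity);
        entity.refs.add(token); // cycle back to the token
        Fs copy = roundTrip(token);
        System.out.println(copy.refs.get(0).refs.get(0) == copy); // prints true
    }
}
```

Two caveats: the recipient would need the same classes on its classpath, and Java serialization recurses through references, so a very deeply linked graph can overflow the stack.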
So far, EXI looks like the best solution--fast, network-efficient,
recipient doesn't need UIMA, and a low cost of development. The GPL
license is the only problem for me.
Thoughts?
Greg Holmberg