Eddie Epstein <eaepst...@...> writes:
> There's a binary serialization format built into UIMA already:
>     public static void serializeCAS(CAS cas, OutputStream ostream)
> It is a method of the org.apache.uima.cas.impl.Serialization class.
>
> This method is typically several times faster than XMI serialization,
> depending on CAS content. The binary format eliminates XML parsing,
> but still has to extract the arbitrary object model contents of a CAS
> starting from the indexes, and reconstruct the CAS indexes on deserialization.
>
> One drawback of binary serialization is that the client and service sides
> must have exactly the same type system, as binary type and feature
> codes are used. Like XMI serialization, the binary format also supports
> delta CAS replies.
>
> Eddie
Thanks for the suggestion, Eddie. I looked at that code, and it's pretty
interesting. I learned that the CAS, at the lowest level, is an array of
bytes, an array of shorts, an array of longs, and an array of characters. It's
all in how you interpret those bytes into FeatureStructures, apparently.
So serializeCAS just writes those bytes to an OutputStream. It's like a memory
dump, basically. The recipient may have to do byte-swapping, depending on the
CPU. I see why this works so well with C++.
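To make the byte-swapping point concrete, here is a minimal sketch using only the JDK. It is not UIMA's code: the class name, the helper methods, and the sample values are made up for illustration, and it only assumes that a heap of 32-bit values was dumped in one byte order and read back in another.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class ByteOrderSketch {

    /** Encodes ints in the given byte order, standing in for a raw heap dump. */
    static byte[] writeInts(int[] values, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES).order(order);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    /** Decodes 32-bit ints from raw bytes, interpreting them in the given byte order. */
    static int[] readInts(byte[] raw, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(order);
        int[] out = new int[raw.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }

    public static void main(String[] args) {
        int[] heap = {1, 258, 65536};  // hypothetical heap contents
        byte[] dump = writeInts(heap, ByteOrder.LITTLE_ENDIAN);

        // Same byte order on both sides: values survive unchanged.
        System.out.println(Arrays.toString(readInts(dump, ByteOrder.LITTLE_ENDIAN)));
        // prints [1, 258, 65536]

        // Mismatched byte order: every value comes back byte-swapped.
        System.out.println(Arrays.toString(readInts(dump, ByteOrder.BIG_ENDIAN)));
        // prints [16777216, 33619968, 256]
    }
}
```

A recipient that knows the sender's byte order can set it on the buffer as above, or repair individual values with Integer.reverseBytes.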
In my case, I want to send the CAS data to another Java process that isn't
running UIMA. So I don't want to load the UIMA classes or re-constitute the
CAS. I need a self-describing data format so that the recipient can interpret
the data without UIMA.
XMI would work, but I worry about performance in a large cluster (both CPU
usage to generate/parse, and also network bandwidth).
I did try the binary XML standard, EXI. I tried the open-source
implementation, EXIficient, from Siemens. See http://exificient.sourceforge.net
This plugged into XmiCasSerializer pretty easily, and after fixing a few
null-pointer exceptions in EXIficient, I got some output. It came to about 30%
of the size of the XML file for my small test (1029 bytes vs. 3396 bytes). I
haven't measured performance yet.
Here's what I did to plug EXIficient in:
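    // Configure EXIficient to use EXI compression mode.
    EXIFactory exiFactory = DefaultEXIFactory.newInstance();
    exiFactory.setCodingMode(CodingMode.COMPRESSION);
    // EXIResult exposes a SAX ContentHandler that encodes events to outputStream.
    EXIResult exiResult = new EXIResult(outputStream, exiFactory);
    ContentHandler handler = exiResult.getHandler();
    // Have the XMI serializer emit its SAX events into the EXI encoder.
    XmiCasSerializer serializer = new XmiCasSerializer(jcas.getTypeSystem());
    serializer.serialize(jcas.getCas(), handler);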
I haven't tried to read the output back yet, so I don't know whether the data
is correct.
I've submitted a patch to the EXIficient project for the null pointer
exceptions.
More testing is required, but it looks pretty good so far. If it doesn't work,
I would have to try to do something similar with java.io.DataOutputStream,
which seems like a lot of work--basically implementing something similar to EXI.
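For a sense of what the DataOutputStream route might involve, here is a rough sketch using only the JDK. Everything in it is hypothetical: the class name, the one-byte type tags, the "@type" key, and the flat record layout (a type name followed by named int or String features) are illustrative assumptions, not anything defined by UIMA or EXI. The point is only that each record carries its own type and feature names, so the reader needs no type system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class SelfDescribingRecord {
    // Hypothetical one-byte tags telling the reader how to decode each value.
    private static final byte TAG_INT = 0;
    private static final byte TAG_STRING = 1;

    /** Writes the type name, the feature count, then (name, tag, value) per feature. */
    static byte[] write(String typeName, Map<String, Object> features) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(typeName);  // writeUTF is limited to 65535 encoded bytes
            out.writeInt(features.size());
            for (Map.Entry<String, Object> e : features.entrySet()) {
                out.writeUTF(e.getKey());
                Object v = e.getValue();
                if (v instanceof Integer) {
                    out.writeByte(TAG_INT);
                    out.writeInt((Integer) v);
                } else if (v instanceof String) {
                    out.writeByte(TAG_STRING);
                    out.writeUTF((String) v);
                } else {
                    throw new IllegalArgumentException("unsupported value: " + v);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads a record back using only the names and tags embedded in the data. */
    static Map<String, Object> read(byte[] data) {
        Map<String, Object> record = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            record.put("@type", in.readUTF());
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte tag = in.readByte();
                switch (tag) {
                    case TAG_INT:    record.put(name, in.readInt()); break;
                    case TAG_STRING: record.put(name, in.readUTF()); break;
                    default: throw new IOException("unknown tag: " + tag);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> features = new LinkedHashMap<>();
        features.put("begin", 0);
        features.put("end", 5);
        features.put("coveredText", "Hello");
        System.out.println(read(write("Token", features)));
    }
}
```

Repeating every feature name in every record is wasteful. A real format would want a string table or schema header, plus support for references between feature structures, which is roughly the work EXI already does.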
Any thoughts on going in this direction (EXI)? Can you think of any
alternatives (where the recipient is Java, but not running UIMA)?
Thanks,
Greg Holmberg