If the repository side is Java, one option is to create Java objects of your choice on the UIMA side and send them over RMI.
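
A minimal sketch of what that could look like, using only the JDK. The names AnnotationRecord, RepositoryIngest and InMemoryIngest are hypothetical and not part of UIMA or of any existing repository API; the in-memory class only stands in for the real index insertion. The roundTrip helper is there to check that the value object survives Java serialization, which is what RMI uses to marshal arguments.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical value object: one annotation copied out of the CAS.
final class AnnotationRecord implements Serializable {
    private static final long serialVersionUID = 1L;
    final String type;
    final int begin;
    final int end;
    final String coveredText;

    AnnotationRecord(String type, int begin, int end, String coveredText) {
        this.type = type;
        this.begin = begin;
        this.end = end;
        this.coveredText = coveredText;
    }
}

// Hypothetical remote interface that the repository server would implement.
interface RepositoryIngest extends Remote {
    int store(String documentId, List<AnnotationRecord> records) throws RemoteException;
}

// Stand-in implementation: keeps records in memory instead of indexing them.
final class InMemoryIngest implements RepositoryIngest {
    final List<AnnotationRecord> stored = new ArrayList<>();

    @Override
    public int store(String documentId, List<AnnotationRecord> records) {
        stored.addAll(records);
        return records.size();
    }
}

final class RmiSketch {
    // Serializes and deserializes a record in memory, as RMI would when marshalling it.
    static AnnotationRecord roundTrip(AnnotationRecord in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream oin =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (AnnotationRecord) oin.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }
}
```

On the server, an implementation like this would typically be exported with UnicastRemoteObject.exportObject and bound in an RMI registry; the UIMA side would look it up and call store with the records it built from the CAS.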

-Chris

Greg Holmberg wrote:
Hi UIMA users!


I'm looking for advice on how to transmit data from a CAS to a non-UIMA recipient.


I'd like to send data from a CAS over the network to a repository. I can write any Java code I want to run in the repository server to receive the data and insert it into the repository indexes. And no, the repository is not a SQL database, and there is no JDBC driver for it.


I'm thinking the easiest data format to transmit from the CAS would be XMI. I can just use the UIMA serialization methods to produce an XMI XML String, and then send that as a payload over whatever transport I want (RMI, HTTP, FTP, JSON, SOAP, whatever).


But then how would the repository server parse the XMI XML that it receives? Obviously, I could just use the UIMA de-serialization to re-constitute the CAS, but that's a lot of overhead (time and memory), considering I don't actually need to run UIMA in the repository. I just want to get the data values from the XMI and insert some records/objects in the repository index.


Can I parse the XMI XML from UIMA without using UIMA?


For example, is there an XSD file for XMI? Or at least, for the UIMA "flavor" of XMI? If so, I could feed the XSD file to JAXB to generate equivalent Java classes, then JAXB would parse and validate the XMI, producing Java objects.


I suppose I could also parse the XMI with the XML StAX parser built into Java 6, and just bypass the creation of Java objects (directly inserting into the repository). More work, but might use less memory and perform better.
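
A rough sketch of the StAX route, using only the JDK and no UIMA classes. It assumes the usual UIMA XMI layout in which each feature structure is one XML element and annotation offsets appear as begin and end attributes; the class name XmiStaxSketch and the "name:begin-end" output format are made up for illustration, and a real receiver would insert into the repository index instead of building strings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class XmiStaxSketch {

    // Streams through the XMI and returns "localName:begin-end" for every
    // element that carries both a begin and an end attribute.
    static List<String> readSpans(String xmi) {
        List<String> spans = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The payload arrives over the network, so DTD processing is turned off.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xmi));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // A null namespace means "do not check the attribute namespace".
                        String begin = reader.getAttributeValue(null, "begin");
                        String end = reader.getAttributeValue(null, "end");
                        if (begin != null && end != null) {
                            spans.add(reader.getLocalName() + ":" + begin + "-" + end);
                        }
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XMI", e);
        }
        return spans;
    }
}
```

Because the reader never builds a tree, memory use stays roughly constant per element. The element's namespace URI (available from getNamespaceURI) could be used to tell apart types that share a local name.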


Or, instead of XMI, I could walk the CAS myself, and invent some data format (JSON? SOAP? RMI?) to send to the repository. This could be binary to lessen the data on the network and ease the unmarshalling on the other end. Performance and network bandwidth are an issue for me, since this has to scale (there will be many clients sending CAS data to the repository).
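
To make the home-grown binary option concrete, here is a small sketch of one possible wire format: a record count followed by (type name, begin, end) triples, written with DataOutputStream. The class BinarySpanCodec, the Span type and the field layout are all invented for this example; walking the CAS to collect the spans is left out, since that part would use the UIMA API on the sending side only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class BinarySpanCodec {

    // Hypothetical record: annotation type name plus character offsets.
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Layout: int count, then per span: UTF string type, int begin, int end.
    static byte[] encode(List<Span> spans) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(spans.size());
            for (Span s : spans) {
                out.writeUTF(s.type);
                out.writeInt(s.begin);
                out.writeInt(s.end);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the same layout back on the repository side.
    static List<Span> decode(byte[] data) {
        List<Span> spans = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String type = in.readUTF();
                int begin = in.readInt();
                int end = in.readInt();
                spans.add(new Span(type, begin, end));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return spans;
    }
}
```

With this layout a single span of type "Person" costs 20 bytes including the count, and most of that is the type name. Sending each distinct type name once and referring to it by a small integer afterwards would shrink the payload further, at the cost of a slightly more complex decoder.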


I seem to remember that the serialization of the CAS between Java and C++ uses a fast binary format. Would that be a possibility here? Could I read that without re-constituting the CAS in the repository?


What are your thoughts on these options?


Thanks,




Greg Holmberg
