On 09/16/2011 10:43 AM, Alexander Klenner wrote:
I have a question concerning the merging of different UIMA pipelines. Say I have three different annotators that work on the same document (the CAS sofa data is identical for each of the pipelines). They run in parallel, and each of them produces different annotations, but in a sofa with the same name (_textView). In the end I have three serialized XCAS files in three different folders, coming from different nodes of a cluster.
We have the same problem sometimes, and I'd be very interested in a "clean" solution.
Is there a UIMA-conformant way to merge the corresponding XML files into one CAS object that has all the annotations of the three separate files? I could easily do this with a non-UIMA Java class that just adds all the annotation information into one file. Since the sofa data is the same, the offset information of the annotations will be correct, but I'd rather stay in the UIMA context.
We actually edit XMI files using Python scripts to add annotations that come from outside UIMA, etc. However, especially given the very unfortunate disappearance of Ed Loper's uimapy, our approach is a bit hacky, e.g. when dealing with the xmi:id attributes, the namespace prefixes for type systems, etc. Also, XMI allows many different representations of the same information, and our scripts really only deal with the most common one (features written as attributes).
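A minimal sketch of what such a script might look like, using only Python's standard library. It is an illustration under assumptions, not the scripts mentioned above: the function name merge_xmi is made up, each file is assumed to contain exactly one sofa and one view, and features are assumed to be plain attributes that do not point at other feature structures by xmi:id (such references would need remapping as well).

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/XMI"
CAS_NS = "http:///uima/cas.ecore"
TCAS_NS = "http:///uima/tcas.ecore"
XMI_ID = f"{{{XMI_NS}}}id"

# Elements that describe the CAS itself (or exist once per document)
# and therefore must not be copied from the second file.
SKIP = {
    f"{{{CAS_NS}}}NULL",
    f"{{{CAS_NS}}}Sofa",
    f"{{{CAS_NS}}}View",
    f"{{{TCAS_NS}}}DocumentAnnotation",
}

# Keep the usual prefixes on output; other namespaces get generated ones.
ET.register_namespace("xmi", XMI_NS)
ET.register_namespace("cas", CAS_NS)


def merge_xmi(base_xml, other_xml):
    """Copy the annotations of other_xml into base_xml (hypothetical helper).

    Assumes one sofa and one view per file, and no features that
    reference other feature structures by xmi:id.
    """
    base = ET.fromstring(base_xml)
    other = ET.fromstring(other_xml)

    base_sofa = base.find(f"{{{CAS_NS}}}Sofa")
    other_sofa = other.find(f"{{{CAS_NS}}}Sofa")
    if base_sofa.get("sofaString") != other_sofa.get("sofaString"):
        raise ValueError("sofa data differs; offsets would not line up")

    view = base.find(f"{{{CAS_NS}}}View")
    members = view.get("members", "").split()
    insert_at = list(base).index(view)  # new elements go before cas:View

    # Continue numbering after the highest xmi:id already used in base.
    next_id = max(int(e.get(XMI_ID)) for e in base
                  if e.get(XMI_ID) is not None) + 1

    for elem in other:
        if elem.tag in SKIP:
            continue
        elem.set(XMI_ID, str(next_id))           # avoid xmi:id collisions
        elem.set("sofa", base_sofa.get(XMI_ID))  # point at the base sofa
        base.insert(insert_at, elem)
        insert_at += 1
        members.append(str(next_id))             # index it in the view
        next_id += 1

    view.set("members", " ".join(members))
    return ET.tostring(base, encoding="unicode")
```

Calling merge_xmi once per additional file, each time feeding in the previous result, would fold three files into one. Note that ElementTree does not preserve the original prefixes of type-system namespaces, which is one of the details that makes this approach hacky.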
I guess in Java you can at least use org.apache.uima.cas.impl.XmiCasDeserializer and org.apache.uima.cas.impl.XmiCasSerializer to avoid the XMI-specific details.
Bye, Jens
