I'm having a problem with CAS serializeWithCompression. I am processing
a few million text documents on an IBM POWER8 with 16 physical SMT-8 CPUs, 200 GB
RAM, Ubuntu 14.04.3 LTS, and IBM Java 1.8.
I run 55 UIMA pipelines concurrently. I'm using UIMA 2.6.0.
I use serializeWithCompression to save the final state of the processing of
each document to a file for later processing.
However, the size of the serialized CAS just keeps growing. The in-memory size of
the CAS is stable, but the serialized CASes just keep getting bigger. I even
went as far as creating a new CAS for each document instead of reusing one via
cas.reset(). I have also tried writing the serialized CAS to a byte array output
stream first and then to a file, but it is serializeWithCompression that causes
the size problem, not writing the file.
Here's what the code looks like. Flushing or not flushing does not make a
difference. Closing or not closing the file output stream does not make a
difference (other than leaking memory). I've also tried doing
serializeWithCompression with type filtering. I wanted to try using a Marker,
but cannot see how to do that (my best guess is sketched after the code below).
The problem exists regardless of doing 1 or 55 pipelines concurrently.
File fout = new File(documentPath);
FileOutputStream fos = new FileOutputStream(fout);
org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
fos.flush();
fos.close();
logger.info("serializedCas size " + cas.size() + " ToFile " + documentPath);
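And here is my guess at the Marker variant, assuming cas.createMarker() and the
three-argument serializeWithCompression overload (delta serialization of feature
structures added after the mark) are the right pieces; I haven't verified this:

org.apache.uima.cas.Marker marker = cas.createMarker();  // mark before new FSs are added
// ... run the rest of the pipeline on this CAS ...
FileOutputStream fos = new FileOutputStream(documentPath);
org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos, marker);
fos.close();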
Suggestions will be appreciated.
Thanks / Dan