Great! Glad to see some use is being made of JSON :-).

-Marshall

On 1/13/2016 2:05 PM, D. Heinze wrote:
> Found the problem by serializing the CAS to JSON. The CAS sofaText was
> acting like a pushdown stack and accumulating the full text of each
> successive document due to an input stream and buffer not getting properly
> closed/cleared between iterations.
>
> Thanks / Dan
>
> -----Original Message-----
> From: D. Heinze [mailto:[email protected]]
> Sent: Tuesday, January 12, 2016 2:13 PM
> To: [email protected]
> Subject: RE: CAS serializationWithCompression
>
> Thanks Marshall. Will do. I just completed upgrading from UIMA 2.6.0 to
> 2.8.1 just to make sure there were no issues there. Will now get back to
> the CAS serialization issue. Yes, I've been trying to think of where there
> could be retained junk that is getting added back into the CAS with each
> iteration.
>
> -Dan
>
> -----Original Message-----
> From: Marshall Schor [mailto:[email protected]]
> Sent: Tuesday, January 12, 2016 11:56 AM
> To: [email protected]
> Subject: Re: CAS serializationWithCompression
>
> hmmm, seems like unusual behavior.
>
> It would help a lot to diagnose this if you could construct a small test
> case - one which perhaps creates a cas, fills it with a bit of data, does
> the compressed serialization, resets the cas, and loops, to see if that
> produces "expanding" serializations.
>
> -- if it does, please post the test case to a Jira and we'll diagnose /
> fix this :-)
>
> -- if it doesn't, then you have to get closer to your actual use case and
> iterate until you see what it is that you last added that starts making it
> serialize ever-expanding instances. That will be a big clue, I think.
>
> -Marshall
>
> On 1/12/2016 10:54 AM, D. Heinze wrote:
>> The CAS.size() starts as larger than the serializedWithCompression
>> version, but eventually the serializedWithCompression version grows to
>> be larger than the CAS.size().
>> The overall process is:
>> * Create a new CAS
>> * Read in an xml document and store the structure and content in the cas.
>> * Tokenize and parse the document and store that info in the cas.
>> * Run a number of lexical engines and ConceptMapper engines on the
>>   data and store that info in the cas
>> * Produce an xml document with the content of the original input
>>   document marked up with the analysis results and both write that out
>>   to a file and also store it in the cas
>> * serializeWithCompression to a FileOutputStream
>> * cas.reset()
>> * iterate on the next input document
>> All the work other than creating and cas.reset() is done using the JCas.
>> Even though the output CASes keep getting larger, they seem to
>> deserialize just fine and are usable.
>> Thanks/Dan
>>
>> -----Original Message-----
>> From: Richard Eckart de Castilho [mailto:[email protected]]
>> Sent: Tuesday, January 12, 2016 2:45 AM
>> To: [email protected]
>> Subject: Re: CAS serializationWithCompression
>>
>> Is the CAS.size() larger than the serialized version or smaller?
>> What are you actually doing to the CAS? Just serializing/deserializing
>> a couple of times in a row, or do you actually add feature structures?
>> The sample code you show doesn't give any hint about where the CAS
>> comes from and what is being done with it.
>>
>> -- Richard
>>
>>> On 12.01.2016, at 03:06, D. Heinze <[email protected]> wrote:
>>>
>>> I'm having a problem with CAS serializationWithCompression. I am
>>> processing a few million text documents on an IBM P8 with 16 physical
>>> SMTP 8 cpus, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>>>
>>> I run 55 UIMA pipelines concurrently. I'm using UIMA 2.6.0.
>>>
>>> I use serializeWithCompression to save the final state of the
>>> processing on each document to a file for later processing.
>>>
>>> However, the size of the serialized CAS just keeps growing. The size
>>> of the CAS is stable, but the serialized CASes just keep getting
>>> bigger. I even went to creating a new CAS for each process instead of
>>> using cas.reset(). I have also tried writing the serialized CAS to a
>>> byte array output stream first and then to a file, but it is the
>>> serializeWithCompression that caused the size problem, not writing the
>>> file.
>>> Here's what the code looks like. Flushing or not flushing does not
>>> make a difference. Closing or not closing the file output stream does
>>> not make a difference (other than leaking memory). I've also tried
>>> doing serializeWithCompression with type filtering. Wanted to try
>>> using a Marker, but cannot see how to do that. The problem exists
>>> regardless of doing 1 or 55 pipelines concurrently.
>>>
>>>     File fout = new File(documentPath);
>>>     fos = new FileOutputStream(fout);
>>>     org.apache.uima.cas.impl.Serialization.serializeWithCompression(
>>>         cas, fos);
>>>     fos.flush();
>>>     fos.close();
>>>     logger.info( "serializedCas size " + cas.size() + " ToFile " +
>>>         documentPath);
>>>
>>> Suggestions will be appreciated.
>>>
>>> Thanks / Dan
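
The root cause Dan reports at the top of the thread (a buffer that is not cleared between iterations, so each document's text carries all the earlier ones with it) can be reproduced without UIMA. The sketch below is a hypothetical stand-in, not code from the thread: the class name `SofaAccumulationDemo`, the method `run`, and the sample strings are made up for illustration, and the reused `StringBuilder` only plays the role of the uncleared buffer whose contents would end up as the CAS document text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not UIMA code and not the
// poster's pipeline. It mimics a per-document loop that reuses one buffer.
class SofaAccumulationDemo {

    // Returns the length of the "document text" produced on each iteration.
    // With clearBuffer == false the buffer keeps every earlier document,
    // which is the accumulation described in the thread.
    static List<Integer> run(List<String> docs, boolean clearBuffer) {
        StringBuilder buffer = new StringBuilder(); // reused across documents
        List<Integer> lengths = new ArrayList<>();
        for (String doc : docs) {
            if (clearBuffer) {
                buffer.setLength(0); // drop the previous document's text
            }
            buffer.append(doc);
            // Stand-in for the text that would be stored as the document text.
            String sofaText = buffer.toString();
            lengths.add(sofaText.length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("alpha", "beta", "gamma");
        System.out.println(run(docs, false)); // prints [5, 9, 14]
        System.out.println(run(docs, true));  // prints [5, 4, 5]
    }
}
```

Under these assumptions, logging the document text length on each iteration next to the serialized size would likely have shown the growth early, since the text keeps growing even when resetting the CAS itself works as intended.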
