JCas method like getSofaDataString that doesn't copy the chars from the
StringHeap
----------------------------------------------------------------------------------
Key: UIMA-483
URL: https://issues.apache.org/jira/browse/UIMA-483
Project: UIMA
Issue Type: Improvement
Affects Versions: 2.1
Reporter: Greg Holmberg
I process large documents: the String I pass to JCas.setSofaDataString may be
as large as 100 MB (50,000,000 chars). This causes the JVM to run out of
memory when we have many concurrent AnalysisEngines running.
I traced JCas.getSofaDataString(), and it eventually calls
StringHeap.getStringForCode(), which does a "new String" from its private
char[] (which makes a copy).
This happens for each annotator. We have five, so now the 100 MB has
become 600 MB. Multiply by 10 concurrent AnalysisEngines, and that's 6,000
MB.
Perhaps there could be a variation on getSofaDataString that returns one of the
other classes (besides String) that implement CharSequence. A CharBuffer
perhaps, or even a new class that implements the CharSequence interface but is
read-only (just four methods). Or even just return a char[], or a char[] with
begin/end offsets into the StringHeap.
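To make the read-only CharSequence option concrete, here is a minimal sketch.
The class name CharArrayView is hypothetical and is not part of UIMA; it
assumes the caller can obtain the backing char[] and the begin/end offsets,
which the current StringHeap does not expose.

```java
/**
 * Hypothetical read-only view over a slice of a shared char[].
 * Not a UIMA class; the backing array is referenced, never copied.
 */
final class CharArrayView implements CharSequence {
    private final char[] chars; // shared backing array (assumed never modified)
    private final int begin;    // inclusive start offset into chars
    private final int end;      // exclusive end offset into chars

    CharArrayView(char[] chars, int begin, int end) {
        if (begin < 0 || end > chars.length || begin > end) {
            throw new IndexOutOfBoundsException("begin=" + begin + ", end=" + end);
        }
        this.chars = chars;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int length() {
        return end - begin;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return chars[begin + index];
    }

    @Override
    public CharSequence subSequence(int start, int stop) {
        if (start < 0 || stop > length() || start > stop) {
            throw new IndexOutOfBoundsException("start=" + start + ", stop=" + stop);
        }
        // Another view over the same array; still no copy.
        return new CharArrayView(chars, begin + start, begin + stop);
    }

    @Override
    public String toString() {
        // The only method that copies; callers who need a String pay for it explicitly.
        return new String(chars, begin, end - begin);
    }
}
```

The standard library offers a similar effect without a new class:
java.nio.CharBuffer.wrap(chars, begin, length).asReadOnlyBuffer() also returns
a CharSequence backed by the original array.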
If nothing else, perhaps the document text should be treated specially from all
the little strings in the StringHeap, and be stored separately, so calls to
getSofaDataString() simply return a reference to an existing String object,
without copying.
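As a rough illustration of the "store it separately" idea, the sketch below
keeps the document text as a single String reference. DocumentTextHolder and
its method names are hypothetical and do not reflect how UIMA stores sofa
data; the point is only that Java Strings are immutable, so handing the same
reference to every annotator is safe.

```java
/**
 * Hypothetical holder for the document text, kept apart from the heap of
 * small strings. Not a UIMA class.
 */
final class DocumentTextHolder {
    private String documentText; // the one and only copy of the text

    void setDocumentText(String text) {
        // Keep the caller's reference; Strings are immutable, so no defensive copy is needed.
        this.documentText = text;
    }

    String getDocumentText() {
        // Every call returns the same object, so N annotators share one String.
        return documentText;
    }
}
```

With this arrangement, five annotators calling getDocumentText() would all see
the same object rather than five fresh copies.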
I'm open to possibilities; I just need the copying to end.