[
https://issues.apache.org/jira/browse/UIMA-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510758
]
Marshall Schor commented on UIMA-483:
-------------------------------------
Eddie remarked that the JNI interface to C++ analytics and the blob
serialization method both use the string heap for transfer. In those cases the
CAS is delivered [or delivered back] to a Java CAS with strings in the old
character array string heap. Also, there is a low-level CAS API that still
creates string features in the old string heap.
One possible improvement: the Java implementation could change the code where
it finds "internalStringCode == null" so that it not only creates a new Java
string but also stores it in the string list, updating the heap so that future
references find internalStringCode != null. Serialization that uses the
character array format would need to be updated so that it does not write these
strings to the output twice.
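A minimal sketch of that idea, in Java. This is not the actual UIMA StringHeap
code: the class name, its fields, and addCharHeapString are hypothetical and
stand in for the real heap structures. It only shows the caching step: the
first read of a char-array entry creates the Java string and remembers it, so
later reads return the same object without copying again.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a string heap; not the UIMA implementation.
final class CachingStringHeapSketch {
    private final char[] charHeap;                          // old character-array heap
    private final List<int[]> spans = new ArrayList<>();    // {offset, length} per string code
    private final List<String> cached = new ArrayList<>();  // null until the string is first read

    CachingStringHeapSketch(char[] charHeap) {
        this.charHeap = charHeap;
    }

    // Registers a string that lives only in the char array (e.g. after blob deserialization).
    int addCharHeapString(int offset, int length) {
        spans.add(new int[] {offset, length});
        cached.add(null);
        return spans.size() - 1;
    }

    // Copies the chars at most once per code; later calls return the cached String.
    String getStringForCode(int code) {
        String s = cached.get(code);
        if (s == null) {
            int[] span = spans.get(code);
            s = new String(charHeap, span[0], span[1]);
            cached.set(code, s);
        }
        return s;
    }
}
```

With something like this, five annotators reading the same code would share one
String rather than each getting a fresh copy. The sketch ignores serialization;
as noted above, a real change would have to avoid writing the cached strings
out a second time.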
Another improvement that might significantly reduce storage in many common
cases:
Add an "identity" hash map: key = strings being added to the string heap from
Java, value = <stringCode>. This would allow strings to be shared when they
are ==, and this sharing would be preserved across serialization. An
identity hash map would only need to hash 4 (or 8) byte "addresses", not the
whole string.
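A rough sketch of the identity-map idea, using java.util.IdentityHashMap from
the JDK. The class and method names are made up for illustration and are not
UIMA API. Adding the same String object twice returns the same code; an equal
but distinct String object gets its own code.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical string list that shares codes for identical (==) String objects.
final class IdentitySharingStringListSketch {
    private final List<String> strings = new ArrayList<>();
    // Keys are compared with ==, so lookup never hashes the string contents.
    private final Map<String, Integer> codeByIdentity = new IdentityHashMap<>();

    // Returns the existing code if this exact object was added before, else a new code.
    int addString(String s) {
        Integer existing = codeByIdentity.get(s);
        if (existing != null) {
            return existing;
        }
        int code = strings.size();
        strings.add(s);
        codeByIdentity.put(s, code);
        return code;
    }

    String getStringForCode(int code) {
        return strings.get(code);
    }

    int size() {
        return strings.size();
    }
}
```

One thing to watch: the map holds a strong reference to every string added, so
it would presumably need to be cleared whenever the string list itself is reset.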
Does anyone see any issues with this?
> JCas method like getSofaDataString that doesn't copy the chars from the
> StringHeap
> ----------------------------------------------------------------------------------
>
> Key: UIMA-483
> URL: https://issues.apache.org/jira/browse/UIMA-483
> Project: UIMA
> Issue Type: Improvement
> Affects Versions: 2.1
> Reporter: Greg Holmberg
>
> I process large documents--the String I pass to JCas.setSofaDataString may be
> as large as 100 MB (50,000,000 chars). This is causing the JVM to run out of
> memory when we have many concurrent AnalysisEngines running.
> I traced JCas.getSofaDataString(), and it eventually calls
> StringHeap.getStringForCode(), which does a "new String" from its private
> char[] (which does a copy).
> This would happen for each annotator. We have five, so now the 100 MB has
> become 600 MB. Multiply by 10 concurrent AnalysisEngines, and that's 6,000
> MB.
> Perhaps there could be a variation on getSofaDataString that returns one of
> the other classes (besides String) that implements CharSequence. A
> CharBuffer perhaps, or even a new class that implements the CharSequence
> interface but is read-only (just four methods). Or even just return a char[],
> or a char[] and begin/end offsets into the StringHeap.
> If nothing else, perhaps the document text should be treated specially from
> all the little strings in the StringHeap, and be stored separately, so calls
> to getSofaDataString() simply return a reference to an existing String
> object, without copying.
> I'm open to possibilities, I just need the copying to end.