[ https://issues.apache.org/jira/browse/UIMA-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510618 ]
Thilo Goetz commented on UIMA-483:
----------------------------------
That's not really what should happen. There are two ways strings are kept in
the CAS: either as String objects, or as character data. The regular APIs all
use the version where the CAS simply keeps a reference to the original String
object, and that's what the sofa data APIs also do (or at least I think so from
eyeballing them). So there should be no copying going on, the relevant piece
of code being this from StringHeap.getStringForCode():
    if (internalStringCode != NULL) {
      return (String) this.stringList.get(internalStringCode);
    }
If you have traced your code in the debugger and found that this is not used,
and instead the String constructor is called as you describe, it would be
helpful if you could provide a test case.
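To make the difference between the two storage modes concrete, here is a minimal sketch. It is not the actual StringHeap code: the class and method names (StringStoreSketch, addByReference, toCharData, fromCharData) are invented for illustration, and only the idea of a list of String references versus raw character data comes from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the UIMA StringHeap.
final class StringStoreSketch {
    // Reference mode: the store keeps the caller's String object itself.
    private final List<String> stringList = new ArrayList<>();

    // Remembers the String and returns a code for looking it up later.
    int addByReference(String s) {
        stringList.add(s);
        return stringList.size() - 1;
    }

    // Returns the very same String object that was added; nothing is copied.
    String getByReference(int code) {
        return stringList.get(code);
    }

    // Character-data mode: the text lives in a char[] (this call copies once).
    static char[] toCharData(String s) {
        return s.toCharArray();
    }

    // Rebuilding a String from character data allocates a new String and
    // copies the characters on every call.
    static String fromCharData(char[] data, int start, int length) {
        return new String(data, start, length);
    }
}
```

In reference mode, every lookup hands back the object that was stored, so a 100 MB document stays a single 100 MB object no matter how many annotators read it. In character-data mode, each lookup produces a fresh copy, which is the behavior the reporter describes.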
The character data method of keeping string data in the CAS is obsolete. I'll
see if there are any real dependencies on it, or if we can completely remove
that code.
--Thilo
> JCas method like getSofaDataString that doesn't copy the chars from the
> StringHeap
> ----------------------------------------------------------------------------------
>
> Key: UIMA-483
> URL: https://issues.apache.org/jira/browse/UIMA-483
> Project: UIMA
> Issue Type: Improvement
> Affects Versions: 2.1
> Reporter: Greg Holmberg
>
> I process large documents--the String I pass to JCas.setSofaDataString may be
> as large as 100 MB (50,000,000 chars). This is causing the JVM to run out of
> memory when we have many concurrent AnalysisEngines running.
> I traced JCas.getSofaDataString(), and it eventually calls
> StringHeap.getStringForCode(), which does a "new String" from its private
> char[] (which does a copy).
> This would happen for each annotator. We have five, so now the 100 MB has
> become 600 MB. Multiply by 10 concurrent AnalysisEngines, and that's 6,000
> MB.
> Perhaps there could be a variation on getSofaDataString that returns one of
> the other classes (besides String) that implements CharSequence. A
> CharBuffer perhaps, or even a new class that implements the CharSequence
> interface but is read-only (just four methods). Or even just return a char[],
> or a char[] with begin/end offsets into the StringHeap.
> If nothing else, perhaps the document text should be treated specially from
> all the little strings in the StringHeap, and be stored separately, so calls
> to getSofaDataString() simply return a reference to an existing String
> object, without copying.
> I'm open to possibilities, I just need the copying to end.
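The read-only CharSequence idea from the description above could look roughly like the following sketch. The class name CharArrayView and its constructor are hypothetical, and nothing like this is claimed to exist in UIMA. It only shows that the four CharSequence methods can be served from an existing char[] plus begin/end offsets without copying the characters.

```java
// Hypothetical sketch; CharArrayView is not part of UIMA or the JDK.
final class CharArrayView implements CharSequence {
    private final char[] data;  // shared backing array, never copied here
    private final int begin;    // inclusive start offset into data
    private final int end;      // exclusive end offset into data

    CharArrayView(char[] data, int begin, int end) {
        if (begin < 0 || end > data.length || begin > end) {
            throw new IndexOutOfBoundsException("bad range " + begin + ".." + end);
        }
        this.data = data;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int length() {
        return end - begin;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[begin + index];
    }

    // Returns another view on the same array; no characters are copied.
    @Override
    public CharSequence subSequence(int start, int stop) {
        if (start < 0 || stop > length() || start > stop) {
            throw new IndexOutOfBoundsException("bad range " + start + ".." + stop);
        }
        return new CharArrayView(data, begin + start, begin + stop);
    }

    // The only method that copies: a String has to own its characters.
    @Override
    public String toString() {
        return new String(data, begin, end - begin);
    }
}
```

The view has no mutating methods, but it is read-only only in that sense: code that holds the underlying char[] can still change what the view shows. The JDK's CharBuffer.wrap(array, offset, length).asReadOnlyBuffer() gives a similar non-copying CharSequence without a new class.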