[
https://issues.apache.org/jira/browse/UIMA-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marshall Schor updated UIMA-483:
--------------------------------
Affects Version/s: 2.1
> JCas method like getSofaDataString that doesn't copy the chars from the
> StringHeap
> ----------------------------------------------------------------------------------
>
> Key: UIMA-483
> URL: https://issues.apache.org/jira/browse/UIMA-483
> Project: UIMA
> Issue Type: Improvement
> Components: Core Java Framework
> Affects Versions: 2.1, 2.2
> Reporter: Greg Holmberg
>
> I process large documents--the String I pass to JCas.setSofaDataString may be
> as large as 100 MB (50,000,000 chars). This is causing the JVM to run out of
> memory when we have many concurrent AnalysisEngines running.
> I traced JCas.getSofaDataString(), and it eventually calls
> StringHeap.getStringForCode(), which does a "new String" from its private
> char[] (which does a copy).
> This would happen for each annotator. We have five, so now the 100 MB has
> become 600 MB (the original plus five copies). Multiply by 10 concurrent
> AnalysisEngines, and that's 6,000 MB.
> Perhaps there could be a variation on getSofaDataString that returns one of
> the other classes (besides String) that implement CharSequence. A
> CharBuffer perhaps, or even a new class that implements the CharSequence
> interface but is read-only (just four methods). Or even just return a char[],
> or a char[] plus begin/end offsets into the StringHeap.
> If nothing else, perhaps the document text should be treated specially from
> all the little strings in the StringHeap, and be stored separately, so calls
> to getSofaDataString() simply return a reference to an existing String
> object, without copying.
> I'm open to possibilities; I just need the copying to end.
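
The two CharSequence options suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions, not UIMA code: the class names SofaTextViewSketch and CharArrayView are hypothetical, and the local char[] merely stands in for the StringHeap's private array. It shows a hand-written read-only view (the four CharSequence methods) next to a read-only CharBuffer wrapped around the same array; neither copies the characters until toString() is called.

```java
import java.nio.CharBuffer;

public class SofaTextViewSketch {

    /** Hypothetical read-only view over a shared char[]; not part of UIMA. */
    static final class CharArrayView implements CharSequence {
        private final char[] chars; // shared backing array, never copied
        private final int start;    // inclusive offset into chars
        private final int end;      // exclusive offset into chars

        CharArrayView(char[] chars, int start, int end) {
            if (start < 0 || end < start || end > chars.length) {
                throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
            }
            this.chars = chars;
            this.start = start;
            this.end = end;
        }

        @Override
        public int length() {
            return end - start;
        }

        @Override
        public char charAt(int index) {
            if (index < 0 || index >= length()) {
                throw new IndexOutOfBoundsException("index=" + index);
            }
            return chars[start + index];
        }

        @Override
        public CharSequence subSequence(int from, int to) {
            if (from < 0 || to < from || to > length()) {
                throw new IndexOutOfBoundsException("from=" + from + ", to=" + to);
            }
            // A sub-view shares the same array; only the offsets change.
            return new CharArrayView(chars, start + from, start + to);
        }

        @Override
        public String toString() {
            // The only place that copies, and only when a String is really needed.
            return new String(chars, start, end - start);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the heap: the document text sits between two other entries.
        char[] heap = "xxThe quick brown foxyy".toCharArray();

        // Option 1: hand-written read-only view over [2, 21).
        CharSequence view = new CharArrayView(heap, 2, 21);

        // Option 2: a read-only CharBuffer over the same region (offset 2, length 19).
        CharSequence buffer = CharBuffer.wrap(heap, 2, 19).asReadOnlyBuffer();

        System.out.println(view.length());
        System.out.println(view.subSequence(4, 9));
        System.out.println(buffer.charAt(0));
    }
}
```

With either option, each annotator would hold a small view object rather than its own copy of the document text. The trade-off is that the view is only as immutable as the backing array: if the framework later overwrote that region, every outstanding view would see the change.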