[ 
https://issues.apache.org/jira/browse/UIMA-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511125
 ] 

Thilo Goetz commented on UIMA-483:
----------------------------------

I really think completely removing the character heap would be the better 
solution, particularly when we'd need to adapt the serialization anyway.  The 
low-level API we could deprecate, and internally make it use regular Strings.  
I don't think anybody's using it; it's not even documented.  You'd need to look 
at the source code to know what it does.

I'm not too keen on the hash map either.  We would make everybody pay for a 
case that doesn't concern everybody.  You can use String.intern() as a 
programmer, if you think that's useful in your case.  One might consider 
intern()ing strings on deserialization, but even there, I'm not sure everybody 
wants that (I sure hope nobody relies on Strings being equals() but not ==, but 
you never know).  And wrt space requirements: we would need to create an 
Integer object for every string code in that hash map, as primitive values 
can't be stored in Maps.
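
To make the two points above concrete, here is a minimal Java sketch. It is not UIMA code: the class InternSketch, the helper fresh(), and the code value 7 are made up for illustration. It shows that two Strings can be equals() without being ==, that intern() maps them to one canonical instance, and that an int stored as a Map value ends up as an Integer object.

```java
import java.util.HashMap;
import java.util.Map;

public class InternSketch {
    // Hypothetical helper: builds a new String object at run time,
    // so it is never the same instance as a literal or another call's result.
    static String fresh(String s) {
        return new String(s.toCharArray());
    }

    public static void main(String[] args) {
        String a = fresh("annotation");
        String b = fresh("annotation");

        System.out.println(a.equals(b));              // true: same characters
        System.out.println(a == b);                   // false: two distinct objects
        System.out.println(a.intern() == b.intern()); // true: one canonical instance

        // A Map cannot hold a primitive int; the value 7 is boxed to an Integer.
        Map<String, Integer> codes = new HashMap<>();
        codes.put(a.intern(), 7);
        Object stored = codes.get(b.intern());
        System.out.println(stored instanceof Integer); // true
    }
}
```

Code that compares such strings with == would behave differently depending on whether they were interned, which is the reliance hoped against above.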


> JCas method like getSofaDataString that doesn't copy the chars from the 
> StringHeap
> ----------------------------------------------------------------------------------
>
>                 Key: UIMA-483
>                 URL: https://issues.apache.org/jira/browse/UIMA-483
>             Project: UIMA
>          Issue Type: Improvement
>    Affects Versions: 2.1
>            Reporter: Greg Holmberg
>
> I process large documents--the String I pass to JCas.setSofaDataString may be 
> as large as 100 MB (50,000,000 chars).  This is causing the JVM to run out of 
> memory when we have many concurrent AnalysisEngines running.
> I traced JCas.getSofaDataString(), and it eventually calls 
> StringHeap.getStringForCode(), which does a "new String" from its private 
> char[] (which does a copy).
> This would happen for each annotator.  We have five, so the 100 MB has now 
> become 600 MB (the original plus five copies).  Multiply by 10 concurrent 
> AnalysisEngines, and that's 6,000 MB.
> Perhaps there could be a variation on getSofaDataString that returns one of 
> the other classes (besides String) that implements CharSequence.  A 
> CharBuffer perhaps, or even a new class that implements the CharSequence 
> interface but is read-only (just four methods).  Or even just return a char[] 
> or char[] and begin/end offset into the StringHeap.
> If nothing else, perhaps the document text should be treated specially from 
> all the little strings in the StringHeap, and be stored separately, so calls 
> to getSofaDataString() simply return a reference to an existing String 
> object, without copying.
> I'm open to possibilities, I just need the copying to end.
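
As an illustration of the CharBuffer idea in the description, here is a minimal Java sketch. It is not UIMA's API: the class ReadOnlyTextView, its view() method, and the sample char[] are hypothetical stand-ins for a heap's private array. It wraps a region of an existing char[] as a read-only CharSequence without copying the characters.

```java
import java.nio.CharBuffer;

public class ReadOnlyTextView {
    // Returns a read-only CharSequence over heap[begin, end).
    // CharBuffer.wrap shares the array; no characters are copied.
    public static CharSequence view(char[] heap, int begin, int end) {
        return CharBuffer.wrap(heap, begin, end - begin).asReadOnlyBuffer();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a string heap's backing array.
        char[] heap = "Hello, UIMA".toCharArray();
        CharSequence text = view(heap, 7, 11);

        System.out.println(text.length()); // 4
        System.out.println(text);          // UIMA

        // The view shares storage with the array, so a change shows through.
        heap[7] = 'u';
        System.out.println(text.charAt(0)); // u
    }
}
```

One caveat of this approach: calling toString() on the view still creates a new String with its own copy of the characters, so callers would have to work with the CharSequence methods directly to avoid the copy.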

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
