[ https://issues.apache.org/jira/browse/UIMA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thilo Goetz reopened UIMA-1067:
-------------------------------
Fix in 2.2.2 hotfix 1.
> Remove char heap/ref heap in StringHeap of the CAS
> --------------------------------------------------
>
> Key: UIMA-1067
> URL: https://issues.apache.org/jira/browse/UIMA-1067
> Project: UIMA
> Issue Type: Improvement
> Components: Core Java Framework
> Affects Versions: 2.2.2
> Reporter: Thilo Goetz
> Assignee: Thilo Goetz
> Fix For: 2.3
>
>
> The StringHeap class provides two ways to store strings: either as Java
> strings, or by copying characters onto a character heap. The second option
> is only used for deserialization from a binary CAS. However, even if not
> used, this capability means a very significant memory overhead. To
> demonstrate this, I ran the following experiment. As analysis engine, I used
> our sandbox POS tagger. It sets just one string feature on each token. As
> text, I used a 2.4MB input file (2x moby.txt). To run this in IBM Java
> 1.5.0_7 (which happens to be the JVM I'm interested in) you need to specify
> at least -Xmx135M (I checked in 5MB increments). Then I patched the
> StringHeap implementation to work without the additional bookkeeping
> overhead and ran the experiment again. I was then able to run with
> -Xmx115M. This 20MB reduction (roughly 15%) is a very significant gain,
> particularly given that I ran so little analysis (only tokens and sentences
> are produced, and only a single string-valued feature is set). The new code
> also ran a tiny bit faster, but not much. One might see more improvement
> for analysis that is not as compute-intensive as the tagger.
> The challenge is to make sure that the serialization code still works after
> this change.
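
For readers unfamiliar with the two storage modes described in the issue, here is a minimal sketch of the general idea. It is not the actual UIMA StringHeap code: the class names (StringStoreSketch, JavaStringStore, CharHeapStore) and their add/get methods are hypothetical, made up for illustration. It only shows the two representations side by side (strings kept as Java objects versus characters copied onto a char heap with a separate reference heap of offsets and lengths); it does not try to reproduce where the real implementation spends its extra memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the UIMA StringHeap implementation.
public class StringStoreSketch {

    /** Option 1: keep each value as an ordinary Java String; the code is its list index. */
    static final class JavaStringStore {
        private final List<String> strings = new ArrayList<>();

        int add(String s) {
            strings.add(s);
            return strings.size() - 1;
        }

        String get(int code) {
            return strings.get(code);
        }
    }

    /** Option 2: copy characters onto one shared char[] and record (offset, length) pairs. */
    static final class CharHeapStore {
        private char[] charHeap = new char[16]; // all characters, back to back
        private int charTop = 0;                // next free position on the char heap
        private int[] refHeap = new int[8];     // pairs of (offset, length), one pair per string
        private int refTop = 0;                 // next free position on the ref heap

        int add(String s) {
            int len = s.length();
            // Grow the char heap if the new string does not fit.
            if (charTop + len > charHeap.length) {
                charHeap = Arrays.copyOf(charHeap, Math.max(charHeap.length * 2, charTop + len));
            }
            s.getChars(0, len, charHeap, charTop);
            // Grow the ref heap if there is no room for another pair.
            if (refTop + 2 > refHeap.length) {
                refHeap = Arrays.copyOf(refHeap, refHeap.length * 2);
            }
            refHeap[refTop] = charTop;
            refHeap[refTop + 1] = len;
            int code = refTop / 2;
            refTop += 2;
            charTop += len;
            return code;
        }

        String get(int code) {
            // Materialize a new String from the recorded offset and length.
            return new String(charHeap, refHeap[2 * code], refHeap[2 * code + 1]);
        }
    }

    public static void main(String[] args) {
        JavaStringStore asStrings = new JavaStringStore();
        CharHeapStore asChars = new CharHeapStore();
        for (String tag : new String[] {"NN", "VBZ", "DT"}) {
            int a = asStrings.add(tag);
            int b = asChars.add(tag);
            System.out.println(asStrings.get(a) + " / " + asChars.get(b));
        }
    }
}
```

Both stores hand back an integer code that can later be resolved to the same string, so callers can stay unaware of which representation is in use. That is the property any change along the lines proposed here would need to preserve for the serialization code.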