Remove char heap/ref heap in StringHeap of the CAS
--------------------------------------------------

                 Key: UIMA-1067
                 URL: https://issues.apache.org/jira/browse/UIMA-1067
             Project: UIMA
          Issue Type: Improvement
          Components: Core Java Framework
    Affects Versions: 2.2.2
            Reporter: Thilo Goetz
            Assignee: Thilo Goetz
             Fix For: 2.3


The StringHeap class provides two ways to store strings: either as Java 
strings, or by copying characters onto a character heap.  The second option is 
only used for deserialization from a binary CAS.  However, even if not used, 
this capability means a very significant memory overhead.  To demonstrate this, 
I ran the following experiment.  As analysis engine, I used our sandbox POS 
tagger.  It sets just one string feature on each token.  As text, I used a 
2.4MB input file (2x moby.txt).  To run this in IBM Java 1.5.0_7 (which happens 
to be the JVM I'm interested in), you need to specify at least -Xmx135M; I 
tested heap sizes in 5MB increments.  Then I patched the StringHeap 
implementation to work without the additional bookkeeping overhead and ran the 
experiment again.  I was then able to run with -Xmx115M, a saving of about 
20MB.  This represents a very significant gain, particularly 
given the fact that I ran so little analysis (only tokens and sentences are 
produced, and only a single string-valued feature is set).  The new code also 
ran slightly faster, though not by much.  One might see a larger speedup for 
analysis that is less compute-intensive than the tagger.
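
To illustrate the idea, here is a minimal sketch of a string heap that keeps 
only Java strings and hands out integer references to them, with no character 
heap or reference heap alongside.  The class name SimpleStringHeap and its 
methods are hypothetical and chosen for illustration; this is not the actual 
StringHeap code or the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual UIMA StringHeap class.
final class SimpleStringHeap {
    // Slot 0 is reserved so that reference 0 can stand for a null string.
    private final List<String> strings = new ArrayList<String>();

    SimpleStringHeap() {
        strings.add(null);
    }

    // Stores the string and returns its reference (a positive int).
    int add(String s) {
        strings.add(s);
        return strings.size() - 1;
    }

    // Returns the string for a reference, or null for reference 0.
    String get(int ref) {
        return strings.get(ref);
    }

    // Number of stored strings, not counting the reserved null slot.
    int size() {
        return strings.size() - 1;
    }
}
```

With this layout, the only per-string bookkeeping is one list slot in addition 
to the string object itself.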

The challenge is to make sure that the serialization code still works after 
this change.
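
One possible way to keep serialization working, sketched below, is to build a 
flat character array with offset/length pairs on demand from the stored Java 
strings, only when a serializer asks for it.  The class CharHeapView and its 
layout are assumptions made for illustration; the actual binary CAS format is 
not described in this issue and may differ.

```java
import java.util.List;

// Hypothetical helper; the layout is an assumption, not the UIMA binary format.
final class CharHeapView {
    final char[] chars;   // contents of all strings, concatenated
    final int[] offsets;  // start of string i within chars
    final int[] lengths;  // length of string i

    // Assumes the list contains no null entries.
    CharHeapView(List<String> strings) {
        int total = 0;
        for (String s : strings) {
            total += s.length();
        }
        chars = new char[total];
        offsets = new int[strings.size()];
        lengths = new int[strings.size()];
        int pos = 0;
        for (int i = 0; i < strings.size(); i++) {
            String s = strings.get(i);
            // Copy the characters of s into the flat array at position pos.
            s.getChars(0, s.length(), chars, pos);
            offsets[i] = pos;
            lengths[i] = s.length();
            pos += s.length();
        }
    }

    // Rebuilds string i from the flat array, as a deserializer would.
    String get(int i) {
        return new String(chars, offsets[i], lengths[i]);
    }
}
```

The flat array exists only for the duration of the serialization call, so its 
cost is not paid during normal analysis.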


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.