Thilo,

Just tested this change with the JNI interface to uimacpp and it works fine.
Eddie

---------- Forwarded message ----------
From: Thilo Goetz (JIRA) <[email protected]>
Date: Fri, Jun 6, 2008 at 10:21 AM
Subject: [jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS
To: [email protected]

[ https://issues.apache.org/jira/browse/UIMA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thilo Goetz closed UIMA-1067.
-----------------------------

Resolution: Fixed

Fixed, all unit tests pass. Please test this change if you use (binary) serialization. It should work the same as before; I haven't changed the serialization format in any way.

> Remove char heap/ref heap in StringHeap of the CAS
> --------------------------------------------------
>
> Key: UIMA-1067
> URL: https://issues.apache.org/jira/browse/UIMA-1067
> Project: UIMA
> Issue Type: Improvement
> Components: Core Java Framework
> Affects Versions: 2.2.2
> Reporter: Thilo Goetz
> Assignee: Thilo Goetz
> Fix For: 2.3
>
>
> The StringHeap class provides two ways to store strings: either as Java strings, or by copying characters onto a character heap. The second option is only used for deserialization from a binary CAS. However, even if not used, this capability means a very significant memory overhead. To demonstrate this, I ran the following experiment. As analysis engine, I used our sandbox POS tagger. It sets just one string feature on each token. As text, I used a 2.4MB input file (2x moby.txt). To run this in IBM Java 1.5.0_7 (which happens to be the JVM I'm interested in) you need to specify -Xmx135M. I checked 5MB increments. Then I patched the StringHeap implementation to work without the additional bookkeeping overhead and ran the experiment again. I was then able to run with -Xmx115M. This represents a very significant gain, particularly given the fact that I ran so little analysis (only tokens and sentences are produced, and only a single string-valued feature set). The new code also ran a tiny bit faster, but not much. One might see more improvement for analysis that is not as compute intensive as the Tagger.
> The challenge is to make sure that the serialization code still works after this change.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
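
For readers who want a concrete picture of the idea in the issue above, here is a minimal sketch of a string heap that keeps only one representation: every entry is a plain Java string, and a character range arriving from a deserialized buffer is converted to a string on the way in rather than kept on a separate char heap. The class and method names (SimpleStringHeap, addString, addCharBuffer) are hypothetical and chosen for illustration; this is not the actual UIMA StringHeap code or the patch that was committed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified string heap; NOT the real UIMA StringHeap. */
final class SimpleStringHeap {
    // Index 0 is reserved so that a reference of 0 can stand for "no string".
    private final List<String> strings = new ArrayList<String>();

    SimpleStringHeap() {
        strings.add(null);
    }

    /** Stores a Java string and returns an int reference to it. */
    int addString(String s) {
        strings.add(s);
        return strings.size() - 1;
    }

    /**
     * Deserialization path (assumed shape): copies a slice of a char buffer
     * into a Java string instead of tracking offsets into a char heap.
     */
    int addCharBuffer(char[] buffer, int start, int length) {
        return addString(new String(buffer, start, length));
    }

    /** Returns the string for a reference, or null for the reserved reference 0. */
    String getString(int ref) {
        return strings.get(ref);
    }

    /** Number of strings stored, not counting the reserved slot. */
    int size() {
        return strings.size() - 1;
    }
}
```

With a single list of strings there is no per-string offset/length bookkeeping to maintain, which is the kind of overhead the issue describes removing; how much memory that saves in practice depends on the real implementation and the JVM.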
