Thanks, Thilo, good points! One more fine point, inline below.

Thilo Goetz wrote:
> Just a few more points on this fascinating topic.
>
> * The JVM internally represents characters as UTF-16.
>   This means that any ASCII text will use twice as much
>   memory in the JVM as on disk.
>
> * While reading in the file, you will likely do some
>   copying.  Even if you allocate a char[] of the right
>   size ahead of time and use that as a buffer to read
>   in your file, you'll copy that data when you create
>   a string out of it.  So you'll need double the
>   amount of the final String memory while reading it
>   in.  To the best of my knowledge, there is no way
>   around this issue, at least if you want to end up
>   with a regular Java string.
>
> * Strings in the JVM use a char[] internally.  So you
>   are not only constrained by the maximum heap size, but
>   also by the maximum array size on the particular JVM
>   implementation you're using.  This detail is buried
>   deep down in your JVM documentation.  I don't know
>   what the numbers are nowadays, but they used to be
>   quite low in the Java 1.4 days.  This may have changed.
>
> * On 32-bit Windows, a process may use up to 2GB of
>   memory, not 4GB.  Subtract from that the memory that
>   the JVM needs, and you get to some number around 1.4GB
>   as the maximum JVM heap space you can allocate.
>

Actually, there seems to be a way to get Windows XP and Server to let
users have 3GB, not 2GB, but you have to change a setting.  See
http://msdn.microsoft.com/en-us/library/ms791558.aspx

-Marshall

> So the upshot is that on 32-bit Windows, you can't
> read in ASCII files into a String that are larger
> than 350MB or so.  The number may be a lot smaller,
> depending on your JVM and how clever your implementation
> is.
>
> In addition, you want to do some UIMA analysis.
> Consider that this needs space, too.  Depending on
> your analysis, the size of the CAS may easily be
> 10 times the size of your text, or more.
>
> So read in your large files in chunks no larger than
> 5 MB, is my recommendation.  If you have files that
> big, you're probably not concerned with the fact that
> you may be cutting up a word here and there.  Still,
> you can try to place splits at end-of-sentence
> characters or whitespace.
>
> --Thilo
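
For what it's worth, here is a minimal sketch of the chunked reading Thilo
describes, splitting at whitespace where possible. The class and method
names (ChunkedTextReader, forEachChunk) are made up for illustration and
are not part of UIMA or the JDK. It assumes the caller supplies a Reader
that already does the charset decoding, and that maxChars counts Java
chars, not bytes on disk. Each chunk is handed to a callback so that the
whole file is never held in memory at once.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

// Hypothetical helper, not a UIMA or JDK class.
final class ChunkedTextReader {

    /**
     * Reads text from 'in' and passes it to 'sink' in chunks of at most
     * maxChars chars. When more input follows, a chunk is cut just after
     * the last whitespace char in the buffer; if the buffer holds no
     * usable whitespace, the chunk is cut at maxChars (possibly mid-word).
     */
    static void forEachChunk(Reader in, int maxChars, Consumer<String> sink)
            throws IOException {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        char[] buf = new char[maxChars];
        int filled = 0;      // number of valid chars currently in buf
        boolean eof = false;

        while (!eof || filled > 0) {
            // Top up the buffer until it is full or the input ends.
            while (!eof && filled < maxChars) {
                int n = in.read(buf, filled, maxChars - filled);
                if (n == -1) {
                    eof = true;
                } else {
                    filled += n;
                }
            }
            if (filled == 0) {
                break;       // empty input
            }

            int cut = filled;
            if (!eof) {
                // More input may follow: back up to the last whitespace.
                int i = filled - 1;
                while (i > 0 && !Character.isWhitespace(buf[i])) {
                    i--;
                }
                if (i > 0) {
                    cut = i + 1;   // whitespace stays with this chunk
                }
            }

            // new String(...) copies the chars, so buf can be reused.
            sink.accept(new String(buf, 0, cut));

            // Carry the unfinished word over to the next chunk.
            System.arraycopy(buf, cut, buf, 0, filled - cut);
            filled -= cut;
        }
    }
}
```

With Thilo's 5 MB figure you would pass something like 5 * 1024 * 1024 as
maxChars and run the analysis on each chunk inside the callback. Splitting
at end-of-sentence characters instead of whitespace would only change the
test in the backward scan.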