Re: Holy COW! It worked (I think)

2005-04-04 Thread andy
hehe, Mike, if you want to go at it, take a look at the performance branch 
before we reverted it, where I ditched all NumericRecords, LabelSST and 
String based records for parallel object arrays. I just never could finish 
it because at the time the other features were too much of a moving target. 
At that time I decreased memory requirements of a mostly numeric file by 
about 1/3.
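Roughly, the idea was one set of growable parallel arrays per sheet instead of one record object per cell. A minimal sketch of that idea (class and method names here are hypothetical, not the actual branch code):

```java
import java.util.Arrays;

// Sketch of the parallel-array idea: instead of one record object per
// numeric cell, keep row/column/value in parallel arrays grown in lock step.
final class NumericCellStore {
    private int size;
    private int[] rows = new int[16];
    private short[] cols = new short[16];
    private double[] values = new double[16];

    void add(int row, short col, double value) {
        if (size == rows.length) {
            // grow all three arrays together so indexes stay aligned
            rows = Arrays.copyOf(rows, size * 2);
            cols = Arrays.copyOf(cols, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        rows[size] = row;
        cols[size] = col;
        values[size] = value;
        size++;
    }

    int size() { return size; }
    int rowAt(int i) { return rows[i]; }
    double valueAt(int i) { return values[i]; }
}
```

Three arrays then serve any number of cells, instead of one object per cell, which is where the memory saving comes from.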

-Andy
Glen Stampoultzis wrote:
Nice one.  You clearly know the Excel file format very well.  There are 
lots of places where we create too many objects.  Looks like that's one 
of them.

There are some tests around string handling and SST but it would be nice 
to have some for the new code you've written.

Some of our existing tests against head are unfortunately still broken.
Regards,
Glen
Michael Zalewski wrote:
When I wrote this comment, I thought after I posted -- this is ridiculous. 
No sane person is going to try to rewrite BinaryTree like I suggested. 
Such a change must surely destroy every other dependent structure, which 
includes the file structure itself. Only a raving lunatic would actually 
spend time on this.

So I tried :)
Result was memory requirements decreased by about 1/3. (My file consists 
of only Strings, and the strings are short. I suspect most use cases 
would not see such an improvement in memory requirements.)

However, time to load my 65,000 unique string workbook decreased by a 
factor of almost 10 (from over 5 minutes to about 30 sec). The strange 
phenomenon with the CPU going idle happened briefly, for less than 3 sec, 
and only one time.

Here is part of the code (to make my idea more clear):

    private static final class Node
        implements Map.Entry
    {
        private Comparable _dataKey;   // instead of Comparable[] _data
        private Comparable _dataData;
        private Node       _leftKey;   // instead of Node[] _left
        private Node       _leftData;
        private Node       _rightKey;  // instead of Node[] _right
        private Node       _rightData;
        private Node       _parentKey; // instead of Node[] _parent
        private Node       _parentData;
        private boolean    _blackKey;  // instead of Boolean[] _black
        private boolean    _blackData;
        private int        _hashcode;
        private boolean    _calculated_hashcode;

        /**
         * Make a new cell with given key and value, and with null
         * links, and black (true) colors.
         *
         * @param key
         * @param value
         */
        Node(final Comparable key, final Comparable value)
        {
            _dataKey  = key;   // much shorter ctor
            _dataData = value; // does not create any arrays
        }
I'll put this into Bugzilla soon to start the discussion. But I should 
point out that I have run practically no tests. Gotta find the tests first.

Are there any tests?
-Original Message-
From: Michael Zalewski [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 25, 2005 12:45 PM
To: 'POI Users List'
Subject: RE: HSSF cannot open files that contain many strings

My own thought is that there are just too gosh darn many objects. (Gosh 
darn many objects = gosh darn long time to process.)

The SST table gets deserialized into a humongous double binary tree 
structure (org.apache.poi.util.BinaryTree), which is actually indexed by 
both the index of the string and the value of the string. So this means 
that there are at least 10 objects created per String:

1) The String structure (type org.apache.poi.hssf.record.UnicodeString)
2) The String value itself (contained as a field in type UnicodeString)
3) The Integer value (which indexes the String). It's an Integer object 
instead of a primitive, so it can implement Comparable and be one of the 
keys in the double indexed tree structure
4) The Node object (of the tree, which has a reference to both the String 
value and the Integer value)
5) One or more LabelSST records which contain an index into the tree.

If you look inside org.apache.poi.util.BinaryTree, you can see that each 
node of the binary tree (there is one node for each string) contains five 
array objects in addition to the ones I listed above.

This means that my file of 65,000 unique strings will end up creating 
650,000 objects to represent those strings when deserialized. I'm probably 
missing some objects in this analysis, so my guess is that my 65,000 
string spreadsheet required over a million java objects.

You can get rid of 5 of these objects with a simple refactoring of
BinaryTree -- replace each of the 5 arrays with 2 fields (replace the 5
arrays with 10 primitive fields).
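To make that refactoring concrete, here is a simplified before/after sketch (toy stand-ins, not the actual BinaryTree source): the array-based node allocates five extra array objects on every construction, while the field-based node allocates none.

```java
// Before: each node carries five two-element arrays, i.e. five extra
// heap objects per node (simplified stand-in, not the actual POI source).
final class ArrayNode {
    final Comparable[] data   = new Comparable[2]; // [key, value]
    final ArrayNode[]  left   = new ArrayNode[2];
    final ArrayNode[]  right  = new ArrayNode[2];
    final ArrayNode[]  parent = new ArrayNode[2];
    final Boolean[]    black  = new Boolean[2];
}

// After: the same ten slots as plain fields -- zero extra objects per node.
final class FieldNode {
    Comparable dataKey, dataData;
    FieldNode  leftKey, leftData;
    FieldNode  rightKey, rightData;
    FieldNode  parentKey, parentData;
    boolean    blackKey, blackData;
}
```

For a 65,000-string file, dropping the five arrays removes 5 x 65,000 = 325,000 small objects outright, before counting the Boolean-to-boolean change.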

-Original Message-
From: Danny Mui [mailto:[EMAIL PROTECTED] Sent: Friday, March 25, 
2005 11:50 AM
To: POI Users List
Subject: Re: HSSF cannot open files that contain many strings

I'm curious about the CPU utilization issues and why it takes so gosh 
darn long!  Wonder what a profiler will say about loading a file as 
you've described.

It shouldn't be too difficult to adjust the way the SST's are 
written/loaded to validate/invalidate this problem/fix.

Re: Holy COW! It worked (I think)

2005-04-04 Thread andy
I'd like to see a patch version of this that ALSO handles rich string 
format!  We still don't do that properly. The BinaryTree was actually a 
stop gap to de-duplicate strings both when they weren't really duplicates 
(thus losing rich string formatting) and when they were added (and thus 
quite possibly were dupes).
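That losing-formatting failure mode can be shown with a toy sketch (hypothetical classes, not POI's actual SST code): if the string table interns on the plain text alone, a later rich variant of the same text collapses into the earlier plain entry and its formatting runs are dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only. "runs" is a simplified stand-in for
// rich-text formatting runs attached to a shared string.
final class RichText {
    final String text;
    final String runs; // e.g. "bold:0-5"; empty string means unformatted

    RichText(String text, String runs) { this.text = text; this.runs = runs; }
}

final class NaiveSst {
    private final Map<String, RichText> byText = new HashMap<>();

    // De-duplicates on text alone: the first entry for a given text wins,
    // so a later variant with different formatting runs is silently lost.
    RichText intern(RichText s) {
        return byText.computeIfAbsent(s.text, k -> s);
    }
}
```

A de-duplication key would need to include the formatting runs (not just the text) to keep both variants distinct.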

This looks like good work.
Michael Zalewski wrote:
[snip -- quoted text identical to the message above]

[snip -- Danny Mui's message quoted in full above]

Michael Zalewski wrote:
Ummm...

Yes I think I might have identified an issue with POI and a large number 
of strings. And I was looking at it partly in response to Mike's problem.

But I don't think the issue I found is the root problem. It might explain 
why large files generated in POI HSSF would not open correctly in Excel. 
In fact, I

Re: Holy COW! It worked (I think)

2005-03-26 Thread Glen Stampoultzis
Nice one.  You clearly know the Excel file format very well.  There are 
lots of places where we create too many objects.  Looks like that's one 
of them.

There are some tests around string handling and SST but it would be nice 
to have some for the new code you've written.

Some of our existing tests against head are unfortunately still broken.
Regards,
Glen
Michael Zalewski wrote:
[snip -- quoted text identical to the message above]

[snip -- Danny Mui's message quoted in full above]

Michael Zalewski wrote:
[snip -- quoted text identical to the message above]

Re: Holy COW! It worked (I think)

2005-03-25 Thread Mike E. Serra

Michael Zalewski, you are a holy cleric.
(I hope I'm not jumping the gun too much there :)

Michael Zalewski wrote:


[snip -- quoted text identical to the message above]



[snip -- Danny Mui's message quoted in full above]

Michael Zalewski wrote:
 Ummm...
 
 Yes I think I might have identified an issue with POI and a large number 
of strings. And I was looking at it partly in response to Mike's problem.
 
 But I don't think the issue I found is the root problem. It might explain 
why large files generated in POI HSSF would not open correctly in Excel. 
In fact, I couldn't find any problem with the way POI handles things. At 
this point, I would say that what I have identified is just a difference 
in the way Excel writes a file with more than 1024 strings, and