Dear POI community,

I am trying to use SXSSFWorkbook (streaming version of workbook) to generate 
XLSX spreadsheet with many rows and hyperlinks in the cells. It turned out that 
effective streaming of such document is not possible, as hyperlinks are 
accumulated in XSSFSheet#hyperlinks (see [1]). It's clear why this is 
necessary: <hyperlinks> XML element goes after <sheetData> in <worksheet>, so 
they need to be somehow preserved until all data rows are written. However it 
turned out that each hyperlink is 2KB size, while the "useful" data (the string 
link itself) is about 300 bytes (75% memory overhead). For document with 60000 
hyperlinks, the memory used is 132MB (see attached poi-hyperlink-dump.png). The 
most of memory is consumed by Cur$Locations which has 7 arrays, 32 elements 
each, consuming 1KB in total (see attached poi-hyperlink-dump-detailed.png). 
The stack trace is the following:

Thread [main] (Suspended (breakpoint at line 493 in 
org.apache.xmlbeans.impl.store.Cur$Locations))      
        
org.apache.xmlbeans.impl.store.Cur$Locations.<init>(org.apache.xmlbeans.impl.store.Locale)
 line: 493    
        
org.apache.xmlbeans.impl.store.Locale.<init>(org.apache.xmlbeans.SchemaTypeLoader,
 org.apache.xmlbeans.XmlOptions) line: 169    
        
org.apache.xmlbeans.impl.store.Locale.getLocale(org.apache.xmlbeans.SchemaTypeLoader,
 org.apache.xmlbeans.XmlOptions) line: 241 
        
org.apache.xmlbeans.impl.store.Locale.newInstance(org.apache.xmlbeans.SchemaTypeLoader,
 org.apache.xmlbeans.SchemaType, org.apache.xmlbeans.XmlOptions) line: 592      
 
        
org.apache.xmlbeans.impl.schema.SchemaTypeLoaderImpl(org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase).newInstance(org.apache.xmlbeans.SchemaType,
 org.apache.xmlbeans.XmlOptions) line: 198        
        
org.apache.poi.POIXMLTypeLoader.newInstance(org.apache.xmlbeans.SchemaType, 
org.apache.xmlbeans.XmlOptions) line: 84    
        
org.openxmlformats.schemas.spreadsheetml.x2006.main.CTHyperlink$Factory.newInstance()
 line: not available       
        
org.apache.poi.xssf.usermodel.XSSFHyperlink.<init>(org.apache.poi.common.usermodel.HyperlinkType)
 line: 60      
        
org.apache.poi.xssf.usermodel.XSSFCreationHelper.createHyperlink(org.apache.poi.common.usermodel.HyperlinkType)
 line: 78        
        
org.apache.poi.xssf.streaming.SXSSFCreationHelper.createHyperlink(org.apache.poi.common.usermodel.HyperlinkType)
 line: 83       

By looking into tacktrace one can see that there is chance to optimize memory 
in Locale#getLocale() by taking advantage of 
options.hasOption(USE_SAME_LOCALE). However this option is not documented – 
what could be the side effects?

Another opportunity is to refactor XSSFHyperlink so that it does not create 
CTHyperlink immediately but postpones that until that is really needed 
(SXSSFWorkbook#write() is called), more specifically, when 
XSSFHyperlink#getCTHyperlink() is called. However SXSSFCell#setHyperlink() 
misuses this method [3] in a sense that it should use API of XSSFHyperlink the 
same way as XSSFCell#setHyperlink() does (see SXSSFCell.java.patch).

Summarizing above I would say that the following could be improved right now:

* Document USE_SAME_LOCALE option. Provide a description of what could go wrong 
if this options is used, especially when used in conjunction with 
UNSYNCHRONIZED. Also the option assumes that Locale object could be somehow 
instantiated but it's constructor is private and one cannot do something like 
POIXMLTypeLoader.DEFAULT_XML_OPTIONS.put(Locale.USE_SAME_LOCALE, 
Locale.getLocale(XmlBeans.getContextTypeLoader(), 
POIXMLTypeLoader.DEFAULT_XML_OPTIONS));
* Apply minor correction to documentation [4]:
   there are still things that still may consume a large amount of memory based 
on which features you are using, e.g. merged regions, comments, ...
   there are still things that still may consume a large amount of memory based 
on which features you are using, e.g. merged regions, hyperlinks, comments, ...
* Apply the attached SXSSFCell.java.patch.

More advanced improvement of SXSSF would be to dump all problematic objects 
(regions, hyperlinks, comments, ...) into temporary files and then merge them 
into destination XSLX spreadsheet – then it would be a real streaming.

POI v3.15. Links are to POI 3.12, but there are no conceptual differences.

[1] 
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.12/org/apache/poi/xssf/usermodel/XSSFSheet.java#137
[2] 
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.xmlbeans/xmlbeans/2.4.0/org/apache/xmlbeans/impl/store/Locale.java#213
[3] 
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.12/org/apache/poi/xssf/streaming/SXSSFCell.java#603
[4] https://poi.apache.org/spreadsheet/how-to.html#sxssf
--- SXSSFCell.java.orig	2016-09-17 11:35:28.000000000 +0200
+++ SXSSFCell.java	2016-11-24 12:12:59.419912600 +0100
@@ -712,8 +712,7 @@
 
         XSSFHyperlink xssfobj = (XSSFHyperlink)link;
         // Assign to us
-        CellReference ref = new CellReference(getRowIndex(), getColumnIndex());
-        xssfobj.getCTHyperlink().setRef( ref.formatAsString()  );
+        xssfobj.setCellReference( new CellReference(getRowIndex(), getColumnIndex()).toString() );
 
         // Add to the lists
         getSheet()._sh.addHyperlink(xssfobj);
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to