Dear POI community,
I am trying to use SXSSFWorkbook (streaming version of workbook) to generate
XLSX spreadsheet with many rows and hyperlinks in the cells. It turned out that
effective streaming of such document is not possible, as hyperlinks are
accumulated in XSSFSheet#hyperlinks (see [1]). It's clear why this is
necessary: <hyperlinks> XML element goes after <sheetData> in <worksheet>, so
they need to be somehow preserved until all data rows are written. However it
turned out that each hyperlink is 2KB size, while the "useful" data (the string
link itself) is about 300 bytes (75% memory overhead). For document with 60000
hyperlinks, the memory used is 132MB (see attached poi-hyperlink-dump.png). The
most of memory is consumed by Cur$Locations which has 7 arrays, 32 elements
each, consuming 1KB in total (see attached poi-hyperlink-dump-detailed.png).
The stack trace is the following:
Thread [main] (Suspended (breakpoint at line 493 in
org.apache.xmlbeans.impl.store.Cur$Locations))
org.apache.xmlbeans.impl.store.Cur$Locations.<init>(org.apache.xmlbeans.impl.store.Locale)
line: 493
org.apache.xmlbeans.impl.store.Locale.<init>(org.apache.xmlbeans.SchemaTypeLoader,
org.apache.xmlbeans.XmlOptions) line: 169
org.apache.xmlbeans.impl.store.Locale.getLocale(org.apache.xmlbeans.SchemaTypeLoader,
org.apache.xmlbeans.XmlOptions) line: 241
org.apache.xmlbeans.impl.store.Locale.newInstance(org.apache.xmlbeans.SchemaTypeLoader,
org.apache.xmlbeans.SchemaType, org.apache.xmlbeans.XmlOptions) line: 592
org.apache.xmlbeans.impl.schema.SchemaTypeLoaderImpl(org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase).newInstance(org.apache.xmlbeans.SchemaType,
org.apache.xmlbeans.XmlOptions) line: 198
org.apache.poi.POIXMLTypeLoader.newInstance(org.apache.xmlbeans.SchemaType,
org.apache.xmlbeans.XmlOptions) line: 84
org.openxmlformats.schemas.spreadsheetml.x2006.main.CTHyperlink$Factory.newInstance()
line: not available
org.apache.poi.xssf.usermodel.XSSFHyperlink.<init>(org.apache.poi.common.usermodel.HyperlinkType)
line: 60
org.apache.poi.xssf.usermodel.XSSFCreationHelper.createHyperlink(org.apache.poi.common.usermodel.HyperlinkType)
line: 78
org.apache.poi.xssf.streaming.SXSSFCreationHelper.createHyperlink(org.apache.poi.common.usermodel.HyperlinkType)
line: 83
By looking into tacktrace one can see that there is chance to optimize memory
in Locale#getLocale() by taking advantage of
options.hasOption(USE_SAME_LOCALE). However this option is not documented –
what could be the side effects?
Another opportunity is to refactor XSSFHyperlink so that it does not create
CTHyperlink immediately but postpones that until that is really needed
(SXSSFWorkbook#write() is called), more specifically, when
XSSFHyperlink#getCTHyperlink() is called. However SXSSFCell#setHyperlink()
misuses this method [3] in a sense that it should use API of XSSFHyperlink the
same way as XSSFCell#setHyperlink() does (see SXSSFCell.java.patch).
Summarizing above I would say that the following could be improved right now:
* Document USE_SAME_LOCALE option. Provide a description of what could go wrong
if this options is used, especially when used in conjunction with
UNSYNCHRONIZED. Also the option assumes that Locale object could be somehow
instantiated but it's constructor is private and one cannot do something like
POIXMLTypeLoader.DEFAULT_XML_OPTIONS.put(Locale.USE_SAME_LOCALE,
Locale.getLocale(XmlBeans.getContextTypeLoader(),
POIXMLTypeLoader.DEFAULT_XML_OPTIONS));
* Apply minor correction to documentation [4]:
there are still things that still may consume a large amount of memory based
on which features you are using, e.g. merged regions, comments, ...
there are still things that still may consume a large amount of memory based
on which features you are using, e.g. merged regions, hyperlinks, comments, ...
* Apply the attached SXSSFCell.java.patch.
More advanced improvement of SXSSF would be to dump all problematic objects
(regions, hyperlinks, comments, ...) into temporary files and then merge them
into destination XSLX spreadsheet – then it would be a real streaming.
POI v3.15. Links are to POI 3.12, but there are no conceptual differences.
[1]
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.12/org/apache/poi/xssf/usermodel/XSSFSheet.java#137
[2]
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.xmlbeans/xmlbeans/2.4.0/org/apache/xmlbeans/impl/store/Locale.java#213
[3]
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.12/org/apache/poi/xssf/streaming/SXSSFCell.java#603
[4] https://poi.apache.org/spreadsheet/how-to.html#sxssf
--- SXSSFCell.java.orig 2016-09-17 11:35:28.000000000 +0200
+++ SXSSFCell.java 2016-11-24 12:12:59.419912600 +0100
@@ -712,8 +712,7 @@
XSSFHyperlink xssfobj = (XSSFHyperlink)link;
// Assign to us
- CellReference ref = new CellReference(getRowIndex(), getColumnIndex());
- xssfobj.getCTHyperlink().setRef( ref.formatAsString() );
+ xssfobj.setCellReference( new CellReference(getRowIndex(), getColumnIndex()).toString() );
// Add to the lists
getSheet()._sh.addHyperlink(xssfobj);
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]