Hi Nick,

The DataFormatter works well. However, the XSSFExcelExtractor runs out of memory when I try to process a reasonably large XLSX file. Is there a more memory-efficient way of processing such files?
Thanks,

- Chris

On 4 Nov 2012, at 12:44, Chris Bamford wrote:

Hi Nick,

Thanks for the steer for XLSX files. I have tried this ReadMsOfficeFiles <http://codezrule.wordpress.com/2012/01/05/extract-text-from-ms-office-2007-files-docx-pptx-xlsx/> program and I think I may have found the cause of my particular issue, i.e. text extraction of doubles gives very large, scary-looking numbers ("9.2999999999999999E-2" instead of "0.093").

In XSSFExcelExtractor.getText():

    ...
    // Rows and cells
    for (Object rawR : sheet) {
        Row row = (Row)rawR;
        for(Iterator<Cell> ri = row.cellIterator(); ri.hasNext();) {
            Cell cell = ri.next();

            // Is it a formula one?
            if(cell.getCellType() == Cell.CELL_TYPE_FORMULA && formulasNotResults) {
                text.append(cell.getCellFormula());
            } else if(cell.getCellType() == Cell.CELL_TYPE_STRING) {
                text.append(cell.getRichStringCellValue().getString());
            } else {
                XSSFCell xc = (XSSFCell)cell;
                text.append(xc.getRawValue());   // shouldn't this just be text.append(cell.toString()); ?
            }

            // Output the comment, if requested and exists
            Comment comment = cell.getCellComment();
            if(includeCellComments && comment != null) {
                // Replace any newlines with spaces, otherwise it
                // breaks the output
                String commentText = comment.getString().getString().replace('\n', ' ');
                text.append(" Comment by ").append(comment.getAuthor()).append(": ").append(commentText);
            }

            if(ri.hasNext())
                text.append("\t");
        }
        text.append("\n");
    }
    ...

The highlighted line (the getRawValue() call) spits out the raw double in all its glory rather than just the text equivalent. As this class is designed to produce text, it seems reasonable to me that toString() would be sufficient. What do you think?

I have a spreadsheet which exhibits the problem. Would you like me to send it? If so, how?

Thanks,

- Chris

On 2 Nov 2012, at 15:04, Nick Burch wrote:

On Fri, 2 Nov 2012, Chris Bamford wrote:

The XLS extraction is going great. For XLSX can I use the same mechanism?

Similar.
The low level file formats are very different, but there's an analogous extractor that uses SAX XML events rather than record events.

Nick
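
To illustrate what "SAX XML events" means for memory use, here is a minimal sketch using only the JDK's SAX parser. It is not POI code: the class name StreamingCellReader, the method readValues and the tiny worksheet fragment are made up for illustration, and a real XLSX file would also need the zip package opened and shared strings resolved (which, as far as I know, POI's XSSFEventBasedExcelExtractor takes care of). The point is only that the parser hands over each cell value as it streams past, so the whole sheet is never built up as an object model.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI.
class StreamingCellReader {

    /** Collects the text of every <v> (cell value) element as the parser streams past it. */
    static List<String> readValues(String sheetXml) throws Exception {
        List<String> values = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder current = new StringBuilder();
            private boolean inValue;

            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                if ("v".equals(qName)) {      // start of a cell value
                    inValue = true;
                    current.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inValue) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("v".equals(qName)) {      // end of a cell value
                    // A real extractor would write the value out here
                    // instead of keeping every value in a list.
                    values.add(current.toString());
                    inValue = false;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for a worksheet part (assumption, not a full sheet).
        String xml = "<worksheet><sheetData>"
                + "<row r=\"1\"><c r=\"A1\"><v>9.2999999999999999E-2</v></c>"
                + "<c r=\"B1\"><v>42</v></c></row>"
                + "</sheetData></worksheet>";
        System.out.println(readValues(xml));
    }
}
```

Note that the values come out exactly as stored in the XML, so any number formatting still has to be applied afterwards.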
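
For anyone puzzled by the "9.2999999999999999E-2" versus "0.093" discrepancy discussed above, the following JDK-only sketch may help. It is not POI code: the class name RawVersusDisplay and the "0.000" pattern are assumptions chosen for illustration (the actual cell format in the spreadsheet is not known). It shows that the raw text and 0.093 are the same double, and that the friendlier text only appears once a number format is applied to the value.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical helper, not part of POI.
class RawVersusDisplay {

    /** Parses the raw stored text and renders it with the given number pattern. */
    static String display(String raw, String pattern) {
        double value = Double.parseDouble(raw);
        // Locale.ROOT keeps the decimal separator as '.' regardless of the machine's locale.
        DecimalFormat format =
                new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.ROOT));
        return format.format(value);
    }

    public static void main(String[] args) {
        String raw = "9.2999999999999999E-2";   // raw value quoted in the thread

        // The 17-digit raw text and 0.093 parse to the same double.
        System.out.println(Double.parseDouble(raw) == 0.093);

        // Applying a number pattern (assumed here to be "0.000") gives the display text.
        System.out.println(display(raw, "0.000"));
    }
}
```

This is roughly the job that the DataFormatter does for a cell, using the cell's own format rather than a hard-coded pattern.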