Re: Advice on parsing Spreadsheets and preserving cell positions

Nick Burch Wed, 04 Feb 2015 15:24:21 -0800

On Wed, 4 Feb 2015, Matt Bachmann wrote:

When I play with the TIKA jar file with a simple excel file I get something like what I have below. Code I write to do the parsing pulls out something similar. The data is generally correct. But, in the parsing the position of cells is completely lost.

That's to be expected - Tika will only return you text for cells which are really defined in the file. It won't generate dummy entries for "missing" cells which Excel optimised out of the file for being blank. This avoids bloating the Tika output, and keeps the Tika code much simpler.

Is this possible with TIKA? I have googled around and have not found much. Do I have to drop down to POI to do this?

You'll need to use POI if you want full control over missing rows or missing cells.

For working with .xls files, you'd probably want something like the example "missing records aware" streaming xls to csv converter:

https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/hssf/eventusermodel/examples/XLS2CSVmra.java
For .xlsx you'll need some similar logic in a SAX-based parser.
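
As a rough illustration of that "missing cells aware" logic, here is a minimal sketch in plain Java. It assumes your parser hands you each defined cell together with its A1-style reference (in .xlsx sheet XML this is normally the "r" attribute on a cell element, e.g. "D2"), and it pads the gaps so every value stays in its original column. The class SparseRowPadder and its methods are hypothetical names made up for this example, not part of POI or Tika, and the output does no CSV quoting or escaping.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a POI or Tika class.
final class SparseRowPadder {

    // Convert an A1-style reference such as "C5" to a zero-based
    // column index (A -> 0, Z -> 25, AA -> 26). Returns -1 if the
    // reference has no column letters.
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char ch = cellRef.charAt(i);
            if (ch == '$') {
                continue; // skip absolute-reference markers
            }
            if (!Character.isLetter(ch)) {
                break; // reached the row number
            }
            col = col * 26 + (Character.toUpperCase(ch) - 'A' + 1);
        }
        return col - 1;
    }

    // Build one comma-separated line from the cells that were actually
    // present in the row, leaving empty slots for the missing ones.
    // Cells at or beyond 'width' are dropped in this sketch.
    static String toCsvLine(Map<String, String> cellsByRef, int width) {
        String[] slots = new String[width];
        Arrays.fill(slots, "");
        for (Map.Entry<String, String> e : cellsByRef.entrySet()) {
            int col = columnIndex(e.getKey());
            if (col >= 0 && col < width) {
                slots[col] = e.getValue();
            }
        }
        return String.join(",", slots);
    }

    public static void main(String[] args) {
        // Only A2 and D2 were defined in the file; B2, C2 and E2 were not.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("A2", "apple");
        row.put("D2", "42");
        System.out.println(toCsvLine(row, 5)); // prints "apple,,,42,"
    }
}
```

Missing rows can be handled the same way, by comparing each row's number with the previous one and emitting blank lines for the gap.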

Or, if you have the memory, it's all very easy, as detailed on the site:
http://poi.apache.org/spreadsheet/quick-guide.html#Iterator

Nick
