On Wed, 4 Feb 2015, Matt Bachmann wrote:
When I run the Tika jar against a simple Excel file I get something like what I have below. The code I wrote to do the parsing pulls out something similar. The data is generally correct, but the position of the cells is completely lost in the parsing.

That's to be expected - Tika will only return you text for cells which are really defined in the file. It won't generate dummy entries for "missing" cells which Excel optimised out of the file for being blank. This avoids bloating the Tika output, and keeps the Tika code much simpler.

Is this possible with Tika? I have googled around and have not found much. Do I have to drop down to POI to do this?

You'll need to use POI if you want full control over missing rows or missing cells.

For working with .xls files, you'd probably want something like the example "missing records aware" streaming xls to csv converter:
https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/hssf/eventusermodel/examples/XLS2CSVmra.java
For .xlsx you'll need some similar logic in a SAX-based parser.
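
As a rough illustration of that "missing cell" logic, here is a minimal plain-Java sketch with no POI dependency. It assumes your parser hands you each defined cell as an A1-style reference (such as "C3") plus its text, which is roughly what the .xlsx sheet XML carries. The class name SparseSheetCsv and its methods are hypothetical, not part of POI or Tika. For simplicity it buffers the cells instead of streaming them, does no CSV quoting, and expects upper-case references without "$".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: pads out the gaps between the cells a parser actually reported.
final class SparseSheetCsv {

    /** Zero-based column from an A1-style reference: "A1" -> 0, "C3" -> 2, "AA1" -> 26. */
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (c < 'A' || c > 'Z') {
                break; // the row digits start here
            }
            col = col * 26 + (c - 'A' + 1);
        }
        return col - 1;
    }

    /** Zero-based row from an A1-style reference: "C3" -> 2. */
    static int rowIndex(String cellRef) {
        int i = 0;
        while (i < cellRef.length() && !Character.isDigit(cellRef.charAt(i))) {
            i++;
        }
        return Integer.parseInt(cellRef.substring(i)) - 1;
    }

    /** Builds one CSV line per row, emitting empty fields for rows and cells that were never defined. */
    static List<String> toCsvLines(Map<String, String> cells) {
        TreeMap<Integer, TreeMap<Integer, String>> rows = new TreeMap<>();
        int width = 0;
        for (Map.Entry<String, String> e : cells.entrySet()) {
            int r = rowIndex(e.getKey());
            int c = columnIndex(e.getKey());
            rows.computeIfAbsent(r, k -> new TreeMap<>()).put(c, e.getValue());
            width = Math.max(width, c + 1);
        }
        List<String> lines = new ArrayList<>();
        if (rows.isEmpty()) {
            return lines;
        }
        for (int r = 0; r <= rows.lastKey(); r++) {
            TreeMap<Integer, String> row = rows.get(r); // null for a missing row
            StringBuilder sb = new StringBuilder();
            for (int c = 0; c < width; c++) {
                if (c > 0) {
                    sb.append(',');
                }
                if (row != null && row.containsKey(c)) {
                    sb.append(row.get(c));
                }
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> cells = new HashMap<>();
        cells.put("A1", "x");
        cells.put("C1", "y");
        cells.put("B3", "z");
        for (String line : toCsvLines(cells)) {
            System.out.println(line);
        }
    }
}
```

In a real streaming parser you would do the same padding on the fly, tracking the last row and column seen, which is what the XLS2CSVmra example does for .xls.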

Or, if you have the memory, it's all very easy, as detailed on the site:
http://poi.apache.org/spreadsheet/quick-guide.html#Iterator

Nick
