Re: Advice on parsing Spreadsheets and preserving cell positions

Matt Bachmann Wed, 04 Feb 2015 15:36:57 -0800

Alrighty! Thanks!

-Matt


On Wed Feb 04 2015 at 6:23:58 PM Nick Burch <[email protected]> wrote:

> On Wed, 4 Feb 2015, Matt Bachmann wrote:
> > When I play with the Tika jar file with a simple Excel file I get
> > something like what I have below. Code I write to do the parsing pulls
> > out something similar. The data is generally correct, but in the
> > parsing the position of cells is completely lost.
>
> That's to be expected - Tika will only return you text for cells which are
> really defined in the file. It won't generate dummy entries for "missing"
> cells which Excel optimised out of the file for being blank. This avoids
> bloating the Tika output, and keeps the Tika code much simpler.
>
> > Is this possible with Tika? I have googled around and have not found
> > much. Do I have to drop down to POI to do this?
>
> You'll need to use POI if you want full control over missing rows or
> missing cells.
>
> For working with .xls files, you'd probably want something like the
> example "missing records aware" streaming xls to csv converter:
> https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/hssf/eventusermodel/examples/XLS2CSVmra.java
> For .xlsx you'll need some similar logic in a SAX-based parser.
>
> Or, if you have the memory, it's all very easy, as detailed on the site:
> http://poi.apache.org/spreadsheet/quick-guide.html#Iterator
>
> Nick
>
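
To illustrate the "missing records aware" idea Nick describes, here is a minimal sketch of the gap-filling step on its own, using only the Java standard library. It assumes the defined cells have already been read (for example through POI) into a map of row index to column index to text; the class SparseSheetCsv and its methods are hypothetical names for this illustration, not part of Tika or POI. It also skips CSV quoting, so it only suits values without commas or line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not a Tika or POI class.
final class SparseSheetCsv {

    // Renders one row. 'cells' maps 0-based column index -> text and holds
    // only the cells that were really stored in the file.
    static String rowToCsv(SortedMap<Integer, String> cells, int width) {
        StringBuilder line = new StringBuilder();
        for (int col = 0; col < width; col++) {
            if (col > 0) {
                line.append(',');
            }
            String text = cells.get(col);
            if (text != null) {
                line.append(text); // a missing cell becomes an empty field
            }
        }
        return line.toString();
    }

    // Renders the whole sheet. 'rows' maps 0-based row index -> cells.
    // Rows absent from the map are emitted as lines of empty fields so that
    // every value keeps its original row and column position.
    static List<String> sheetToCsv(SortedMap<Integer, SortedMap<Integer, String>> rows) {
        List<String> out = new ArrayList<>();
        if (rows.isEmpty()) {
            return out;
        }
        // Use the widest defined row as the column count for every line.
        int width = 0;
        for (SortedMap<Integer, String> cells : rows.values()) {
            if (!cells.isEmpty()) {
                width = Math.max(width, cells.lastKey() + 1);
            }
        }
        SortedMap<Integer, String> empty = new TreeMap<>();
        for (int r = 0; r <= rows.lastKey(); r++) {
            SortedMap<Integer, String> cells = rows.get(r);
            out.add(rowToCsv(cells != null ? cells : empty, width));
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<Integer, SortedMap<Integer, String>> rows = new TreeMap<>();
        SortedMap<Integer, String> first = new TreeMap<>();
        first.put(0, "a");
        first.put(2, "c");          // column 1 is missing
        rows.put(0, first);
        SortedMap<Integer, String> third = new TreeMap<>();
        third.put(1, "x");          // row 1 is missing entirely
        rows.put(2, third);
        for (String line : sheetToCsv(rows)) {
            System.out.println(line);
        }
    }
}
```

Running main prints three lines: "a,,c", then ",," for the missing row, then ",x,". In a streaming parser the same padding would be done incrementally by remembering the last row and column seen, rather than by collecting the whole sheet into a map first.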
