On Fri, 27 Jan 2012, Gangwal, Adish (IS Consultant) wrote:
We want to use Tika because it supports many document formats, not just xls or doc like POI. I think streamed parsing also makes Tika a lot faster and more efficient than POI, even for large docs of 15 MB or more.
The streamed parsing of Excel files in Tika is powered by POI!
I understand that Tika uses POI under the covers to parse Excel. So is there some way to tell Tika (and in turn POI) to follow a Missing Cell Policy?
A missing cell policy won't help here. It only applies when you fetch cells through the usermodel, and you're doing streaming event parsing.
It sounds like you have some very specific business requirements around the minimum number of cells per row, missing and blank cell handling, etc. Tika is never going to be able to do everything for everyone, so for your specific case you may be best off writing your own custom parser and dropping that into Tika. XLS2CSVmra is a good basis for doing XLS -> CSV with full control over missing cells and missing rows (you can set a minimum number of columns to output, for example), and XLSX2CSV does much the same for XLSX -> CSV.
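To give a rough idea of the bookkeeping involved, here is a minimal sketch in plain Java. It does not use POI or Tika at all: CsvRowPadder and its methods are made-up names for illustration, and it assumes your event handler can hand it (row, column, value) for each cell that is actually present, in ascending row and column order. It also skips CSV quoting and escaping to keep things short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of POI or Tika.
// Feed it the cells a streaming parser reports, and it fills in the gaps.
final class CsvRowPadder {
    private final int minColumns;                     // pad every row to at least this width
    private final StringBuilder out = new StringBuilder();
    private final List<String> cells = new ArrayList<>();
    private int lastRow = -1;                         // -1 means no row seen yet

    CsvRowPadder(int minColumns) {
        this.minColumns = minColumns;
    }

    // Call once per cell that exists, in ascending row/column order (assumed).
    void cell(int row, int col, String value) {
        if (row != lastRow) {
            if (lastRow >= 0) {
                emitRow();                            // close the previous row
            }
            for (int r = lastRow + 1; r < row; r++) {
                emitRow();                            // rows skipped entirely come out blank
            }
            lastRow = row;
        }
        while (cells.size() < col) {
            cells.add("");                            // columns skipped within the row
        }
        cells.add(value);
    }

    // Call at the end of the sheet to flush the last row and get the CSV text.
    String finish() {
        if (lastRow >= 0) {
            emitRow();
        }
        return out.toString();
    }

    private void emitRow() {
        while (cells.size() < minColumns) {
            cells.add("");                            // pad short rows up to the minimum
        }
        out.append(String.join(",", cells)).append('\n');
        cells.clear();
    }
}
```

In a real custom parser you would call something like cell(...) from your record or SAX handler and finish() when the sheet ends. XLS2CSVmra handles the same problem inside POI's event model, so treat the above as an outline of the logic rather than a replacement for it.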
Nick
