TIKA-214 has now been filed, along with the sample XLS file. https://issues.apache.org/jira/browse/TIKA-214
Should I separately bother the POI folks about this issue?

Incidentally, although sad and hacky it may be worth noting that catting the output of strings and "strings -el" does a decent job of pulling unique strings out. (Although does include font names, etc.)

-David

2009/3/30 Jukka Zitting <jukka.zitt...@gmail.com>:
> Hi,
>
> On Sat, Mar 28, 2009 at 6:18 AM, David Weekly <da...@pbwiki.com> wrote:
>> So this is part "bug report" (the columns of the first sheet should
>> definitely be included!)
>
> Agreed. Can you please file a Jira bug report for this? It looks
> similar to some of the zero- vs. one-based index issues we faced when
> upgrading to POI 3.5.
>
>> and part query as to whether or not there is a plan
>> w/Tika to extract more than sheet & cell data from documents.
>
> Doing so would be very nice. You may want to file a Jira improvement
> request for that.
>
> And if you're familiar with Apache POI (or willing to learn it),
> patches would of course also be welcome. :-) Otherwise I don't know
> when one of us will encounter a similar need.
>
> You may also want to contact the POI project to see if they've already
> implemented text extraction improvements that would cover these
> features. Last week at the ApacheCon I noticed that they've recently
> been improving the out-of-the-box text extraction features in POI.
>
> BR,
>
> Jukka Zitting
>

--
Follow me on Twitter! http://twitter.com/dweekly
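
For reference, the strings workaround mentioned above can be approximated in a few lines of Python. This is only a rough sketch of the idea, not what Tika or POI does: it scans the raw bytes for runs of printable ASCII (roughly what plain strings reports) and for the same characters stored as 16-bit little-endian (roughly what "strings -el" reports), then drops duplicates. The function name extract_strings, the minimum run length of 4, and the file name in the usage comment are assumptions made for illustration.

```python
import re

# Runs of at least 4 printable ASCII bytes (tab, or space through tilde).
ASCII_RUN = re.compile(rb"[\t\x20-\x7e]{4,}")
# The same characters stored as 16-bit little-endian: each byte is followed by NUL.
UTF16LE_RUN = re.compile(rb"(?:[\t\x20-\x7e]\x00){4,}")


def extract_strings(data):
    """Return unique printable strings found in data, in first-seen order."""
    candidates = [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]
    candidates += [m.group().decode("utf-16-le") for m in UTF16LE_RUN.finditer(data)]

    seen = set()
    unique = []
    for text in candidates:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# Usage (hypothetical file name):
#     with open("sample.xls", "rb") as f:
#         for text in extract_strings(f.read()):
#             print(text)
```

As with the command-line version, this picks up everything that looks like text, so font names and other non-cell strings will appear in the output as well.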