Hi Georger, Tika uses the ASF's JIRA bug filing system, available here:
http://issues.apache.org/jira/browse/TIKA

You can use that system to file your bug.

HTH,
Chris

On 1/17/09 8:49 AM, "Georger Araujo" <georger...@yahoo.com.br> wrote:

> Hi,
> I have been trying Tika for a few days now and I have to say I am very
> impressed. Great work!
>
> I apologize in advance for attaching a .zip file to this message. But as Tika
> still does not appear at the ASF Bugzilla so that I can file a proper bug and
> attach the file where it really should belong, I thought it would be
> acceptable if 1) I kept the size small, and 2) I checked it to be virus-free.
> It's just 10K; once again, if someone feels offended, I apologize.
>
> I would like to report a bug. Tika revision is svn-20090116, platform is
> Windows XP Pro SP3, JDK version is 1.6.0_06.
>
> I plan on using Tika to extract text from Excel (both .xls and .xlsx) files
> for indexing. But I found that Tika juxtaposes cells. The example worksheets
> are in the attached .zip file.
>
> When I run
>
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text
> no_cell_separators_when_extracted.xls
> --end--
>
> I get the following output:
>
> --begin--
> Plan1
> NameEmailSanta claussa...@claus.org
> Tooth fairyto...@fairy.org
> --end--
>
> Same thing with a .xlsx file:
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text
> no_cell_separators_when_extracted.xlsx
> --end--
>
> The output is:
>
> --begin--
> [Content_Types].xml
>
>
>
> _rels/.rels
>
>
>
> xl/_rels/workbook.xml.rels
>
>
>
> xl/workbook.xml
>
>
>
> xl/theme/theme1.xml
>
>
>
> xl/worksheets/_rels/sheet1.xml.rels
>
>
>
> xl/worksheets/sheet2.xml
>
>
>
> xl/worksheets/sheet3.xml
>
>
>
> xl/sharedStrings.xml
> NameEmailSanta claussa...@claus.orgtooth fairyto...@fairy.org
>
>
> xl/styles.xml
>
>
>
> xl/worksheets/sheet1.xml
> 012345
>
>
> docProps/core.xml
> GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z
>
>
> docProps/app.xml
> Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
> --end--
>
> Also note that the values from docProps/app.xml have been juxtaposed as well.
>
> This way, after indexing these files using the output from Tika, a search
> engine will only find "Fairy" when substring matching is used, because "Tooth
> Fairy" becomes "Tooth fairyto...@fairy.org". This is suboptimal and wrong.
>
> Thanks for your attention. Best regards,
>
> Georger

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
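
For anyone reading the archive, here is a minimal sketch of why the juxtaposed cells in the quoted report hurt indexing. It does not use Tika at all: the class name, the method names, and the placeholder address are all made up for illustration, and the whitespace tokenizer is only an assumption about how a simple indexer might split text.

```java
import java.util.Arrays;
import java.util.List;

public class CellSeparatorDemo {

    /** Joins the cells of one spreadsheet row using the given separator. */
    static String joinRow(List<String> cells, String separator) {
        return String.join(separator, cells);
    }

    /**
     * Returns true if the text, split on whitespace, contains the term
     * as a whole token (case-insensitive). This stands in for a very
     * simple indexer; real search engines tokenize differently.
     */
    static boolean hasToken(String text, String term) {
        return Arrays.stream(text.trim().split("\\s+"))
                .anyMatch(token -> token.equalsIgnoreCase(term));
    }

    public static void main(String[] args) {
        // Hypothetical row; the address is a placeholder, not the one from the report.
        List<String> row = Arrays.asList("Tooth fairy", "tooth@example.org");

        // Cells run together, as in the extracted output quoted above.
        String juxtaposed = joinRow(row, "");
        // Cells separated by a tab, one possible way to keep them apart.
        String separated = joinRow(row, "\t");

        System.out.println("juxtaposed matches 'fairy': " + hasToken(juxtaposed, "fairy"));
        System.out.println("separated matches 'fairy':  " + hasToken(separated, "fairy"));
    }
}
```

With the cells run together, "fairy" is fused with the start of the address and no longer exists as a token of its own, so only substring matching would find it. Any whitespace separator between cells avoids that; the tab here is just one choice, not a statement about what Tika does or should emit.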