[ https://issues.apache.org/jira/browse/TIKA-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666449#action_12666449 ]
Uwe Schindler commented on TIKA-189:
------------------------------------

Hi Georger,

sorry, this was my fault (I only tested with a local OpenOffice file). The problem in this issue is not whitespace. Even the XHTML output looks wrong:

<table>
<tbody>
<tr>
<td>NameEmailSanta claussa...@claus.org</td>
</tr>
<tr>
<td>Tooth fairyto...@fairy.org</td>
</tr>
</tbody>

As you can see, it is not a whitespace problem: it seems that the ExcelExtractor forgets to insert a new TD element for each cell. I will investigate this a little, but I am not sure whether it is a POI bug or an error in the table cell loop.

The other small bug I found, in whitespace handling, is now a separate issue: TIKA-190

> Text extraction from Excel files juxtaposes cells
> -------------------------------------------------
>
>                 Key: TIKA-189
>                 URL: https://issues.apache.org/jira/browse/TIKA-189
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.3
>         Environment: Tika revision is svn-20090116, platform is Windows XP Pro SP3, JDK version is 1.6.0_06.
>            Reporter: Georger Araújo
>            Priority: Minor
>         Attachments: no_cell_separators_when_extracted.zip, TIKA-189.patch
>
>
> I plan on using Tika to extract text from Excel (both .xls and .xlsx) files for indexing. However, I found that Tika juxtaposes cells on output. The example worksheets are in the attached .zip file.
> I took the time to run Apache POI and it does not have this bug, i.e. cells are properly separated.
> When I run
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xls
> --end--
> I get the following output:
> --begin--
> Plan1
> NameEmailSanta claussa...@claus.org
> Tooth fairyto...@fairy.org
> --end--
> Same thing with a .xlsx file:
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xlsx
> --end--
> The output is:
> --begin--
> [Content_Types].xml
> _rels/.rels
> xl/_rels/workbook.xml.rels
> xl/workbook.xml
> xl/theme/theme1.xml
> xl/worksheets/_rels/sheet1.xml.rels
> xl/worksheets/sheet2.xml
> xl/worksheets/sheet3.xml
> xl/sharedStrings.xml
> NameEmailSanta claussa...@claus.orgtooth fairyto...@fairy.org
> xl/styles.xml
> xl/worksheets/sheet1.xml
> 012345
> docProps/core.xml
> GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z
> docProps/app.xml
> Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
> --end--
> Also note that the values from docProps/app.xml have been juxtaposed as well.
> This way, after indexing these files using the output from Tika, a search engine will only find "Fairy" when substring matching is used, because "Tooth Fairy" becomes "Tooth fairyto...@fairy.org". This is suboptimal and wrong.
> Thanks for your attention. Best regards,
> Georger

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
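
To illustrate the diagnosis in the comment (one TD per row instead of one TD per cell), here is a minimal sketch in plain Java with no Tika or POI dependency. The class CellMarkupSketch and its methods are hypothetical names used only for illustration; this is not Tika's ExcelExtractor and not a POI API. It assumes that a text consumer treats every element boundary as whitespace, and it uses the cell values exactly as they appear (elided) in the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not Tika's ExcelExtractor and not a POI API.
public class CellMarkupSketch {

    // Reported behaviour: all cells of a row end up inside a single <td>.
    static String rowInOneCell(List<String> cells) {
        return "<tr><td>" + String.join("", cells) + "</td></tr>";
    }

    // Expected behaviour: every cell gets its own <td> element.
    // XML escaping is omitted because the sample values contain no markup characters.
    static String rowWithCellPerTd(List<String> cells) {
        StringBuilder sb = new StringBuilder("<tr>");
        for (String cell : cells) {
            sb.append("<td>").append(cell).append("</td>");
        }
        return sb.append("</tr>").toString();
    }

    // Assumption: the consumer replaces each tag with a space, then splits on whitespace.
    static List<String> tokens(String xhtml) {
        String text = xhtml.replaceAll("<[^>]+>", " ").trim();
        return text.isEmpty() ? new ArrayList<>() : Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        // Second data row from the report; the address is elided there as well.
        List<String> row = Arrays.asList("Tooth fairy", "to...@fairy.org");
        System.out.println(tokens(rowInOneCell(row)));
        System.out.println(tokens(rowWithCellPerTd(row)));
    }
}
```

With a single TD the word "fairy" never appears as its own token (it is fused into "fairyto...@fairy.org"), which matches the search problem described in the report. With one TD per cell it does.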