[ https://issues.apache.org/jira/browse/TIKA-214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Weekly updated TIKA-214:
------------------------------

    Attachment: xls.xls

Attached is a sample Excel 2003 file with several unique keywords, useful for testing completeness of textual extraction.

> Excel Parsing Issues
> --------------------
>
>                 Key: TIKA-214
>                 URL: https://issues.apache.org/jira/browse/TIKA-214
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>         Environment: Debian Etch / Debian Sid
>            Reporter: David Weekly
>         Attachments: xls.xls
>
>
> I ran a sample Excel 2003 file (which I will attempt to attach) that I made
> through Tika 0.3 and the output didn't correctly identify the sheets, did not
> include text from the first column of the first sheet, and did not include
> any supplementary text (e.g. titles for charts, legends, etc.).
> Specific issues with parsing xls.xls: (pardon the deliberately random names)
> - "charttabyodawg" (a chart sheet) improperly labeled as the sheet for data
> actually on Sheet1.
> - "Sheet1" data is actually the data on Sheet2
> - Sheet2 is not mentioned.
> - Chart title for chart on "charttabyodawg" is "WhamPuff" and is not
> included in the output.
> - Chart title for inline chart on Sheet1 is "fizzlepuff" and is not included
> in output.
> - Y-axis for inline chart on Sheet1 is "whyaxis" and is not included in
> output.
> - X-axis for inline chart on Sheet1 is "eksaxis" and is not included in
> output.
> - Label for data in inline chart on Sheet1 is "YottaPuff" and is not
> included in output.
> Below is the output from Tika v0.3 when run on the attached XLS:
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <div class="page">
> <h1>charttabyodawg</h1>
> <table>
> <tbody>
> <tr> <td>1</td>
> </tr>
> <tr> <td>2</td>
> </tr>
> <tr> <td>300</td> <td/> <td/> <td>1</td>
> </tr>
> <tr> <td>baz</td> <td/> <td/> <td>2</td> <td/> <td>9</td>
> </tr>
> <tr> <td>yadda yam</td> <td/> <td/> <td>300</td> <td/> <td>5</td>
> </tr>
> <tr> <td/> <td/> <td/> <td/> <td/> <td>16</td>
> </tr>
> </tbody>
> </table>
> </div>
> <div class="page">
> <h1>Sheet1</h1>
> <table>
> <tbody>
> <tr> <td/>
> </tr>
> <tr> <td/>
> </tr>
> <tr> <td/>
> </tr>
> <tr> <td/>
> </tr>
> <tr> <td/> <td/> <td>dingdong</td>
> </tr>
> </tbody>
> </table>
> </div>
> </body>
> </html>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
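
The completeness check described in the report can be sketched as a small script. This is only an illustration, not part of Tika: it assumes the XHTML output has already been captured as a string, uses only the Python standard library, and the names `summarize`, `EXPECTED_KEYWORDS`, `EXPECTED_SHEETS` and `SAMPLE` are hypothetical. `SAMPLE` is an abbreviated stand-in for the output quoted above, not the full listing.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XHTML output quoted in the report.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Keywords and sheet names taken from the issue description.
EXPECTED_KEYWORDS = ["WhamPuff", "fizzlepuff", "whyaxis", "eksaxis", "YottaPuff"]
EXPECTED_SHEETS = ["charttabyodawg", "Sheet1", "Sheet2"]


def summarize(xhtml):
    """Return (sheet headings found, missing keywords, missing sheets)."""
    # Encode to bytes so an XML declaration with an encoding is handled uniformly.
    root = ET.fromstring(xhtml.encode("utf-8"))
    # Each sheet is reported as an <h1> heading in the output.
    sheets = [h1.text or "" for h1 in root.iter(XHTML_NS + "h1")]
    # All character data in the document, joined for substring checks.
    text = " ".join(root.itertext())
    missing_keywords = [k for k in EXPECTED_KEYWORDS if k not in text]
    missing_sheets = [s for s in EXPECTED_SHEETS if s not in sheets]
    return sheets, missing_keywords, missing_sheets


# Abbreviated, hypothetical stand-in for the quoted Tika 0.3 output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title/></head>
<body>
<div class="page"><h1>charttabyodawg</h1>
<table><tbody><tr><td>baz</td><td/><td>9</td></tr></tbody></table></div>
<div class="page"><h1>Sheet1</h1>
<table><tbody><tr><td/><td/><td>dingdong</td></tr></tbody></table></div>
</body></html>"""

if __name__ == "__main__":
    sheets, missing_keywords, missing_sheets = summarize(SAMPLE)
    print("sheets found:", sheets)
    print("missing keywords:", missing_keywords)
    print("missing sheets:", missing_sheets)
```

A check like this only tells whether a keyword appears somewhere in the output; it does not verify that data is attributed to the correct sheet, which is the other half of the problem reported here.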