[jira] Commented: (TIKA-189) Text extraction from Excel files juxtaposes cells

kumar raja jana (JIRA) Thu, 22 Jan 2009 06:49:05 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666150#action_12666150
 ]


kumar raja jana commented on TIKA-189:
--------------------------------------

Hi,

I see the same issue with xls and xlsx files. I have modified 
ExcelExtractor.processSheet() to add a space character to the handler after 
each cell value. Most search engines use a whitespace tokenizer, so it should 
not be a problem even if this modification produces two consecutive space 
characters. This is the modified code:

    while (currentColumn < entry.getKey().x)
    {
        handler.endElement("td");
        handler.startElement("td");
        currentColumn++;
    }

    entry.getValue().render(handler);
    handler.characters(" ");  // this is the added line

I would love to know if there is any other workaround. 
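
To illustrate why the extra space should be harmless, here is a minimal 
stand-alone sketch. It does not use Tika or POI; the class CellJoinSketch, its 
methods, and the example.org address are all made up for illustration. It 
joins cell values with and without a trailing separator and then splits the 
result on runs of whitespace, roughly as a whitespace tokenizer would.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone sketch; this is not Tika or POI code.
class CellJoinSketch {

    // Concatenates cell values, appending the separator after each one,
    // mirroring the handler.characters(" ") call added after render().
    static String joinCells(List<String> cells, String separator) {
        StringBuilder sb = new StringBuilder();
        for (String cell : cells) {
            sb.append(cell).append(separator);
        }
        return sb.toString();
    }

    // Splits on runs of whitespace, so doubled or trailing spaces
    // do not produce empty tokens.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // Placeholder cell values; the address is invented.
        List<String> row = Arrays.asList("Tooth fairy", "fairy@example.org");
        System.out.println(tokenize(joinCells(row, "")));   // cells juxtaposed
        System.out.println(tokenize(joinCells(row, " ")));  // cells separated
    }
}
```

Without the separator, the last word of one cell and the first word of the 
next end up in the same token; with it, each word is its own token, and an 
extra space between cells makes no difference to the split.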

xlsx files are being parsed as zip files. Since the latest version of Apache 
POI parses Office 2007 documents, can we add the Office 2007 document types to 
Tika's mime-types.xml file?
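
For reference, these are the registered media types for the three main Office 
2007 formats, written as a small lookup sketch. The class OoxmlMimeSketch and 
its method are hypothetical and are not part of Tika; only the type strings 
are meant as fact. How they would actually be declared in mime-types.xml 
depends on that file's own format.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika.
class OoxmlMimeSketch {

    private static final Map<String, String> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put(".docx",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        TYPES.put(".xlsx",
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        TYPES.put(".pptx",
            "application/vnd.openxmlformats-officedocument.presentationml.presentation");
    }

    // Returns the media type for a known Office 2007 extension,
    // or null if the file name does not end in one of them.
    static String mimeTypeFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : TYPES.entrySet()) {
            if (lower.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

Matching on the extension alone is a simplification: these files are zip 
containers, so content-based detection would see a zip file unless it also 
looks inside the archive.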


 

> Text extraction from Excel files juxtaposes cells
> -------------------------------------------------
>
>                 Key: TIKA-189
>                 URL: https://issues.apache.org/jira/browse/TIKA-189
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.3
>         Environment: Tika revision is svn-20090116, platform is Windows XP 
> Pro SP3, JDK version is 1.6.0_06.
>            Reporter: Georger Rommel Ferreira de Araújo
>            Priority: Minor
>         Attachments: no_cell_separators_when_extracted.zip
>
>
> I plan on using Tika to extract text from Excel (both .xls and .xlsx) files 
> for indexing. But, I found that Tika juxtaposes cells on output. The example 
> worksheets are in the attached .zip file.
> I took the time to run Apache POI and it does not have this bug i.e. cells 
> are properly separated.
> When I run
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text 
> no_cell_separators_when_extracted.xls
> --end--
> I get the following output:
> --begin--
> Plan1
>     NameEmailSanta claussa...@claus.org
>     Tooth fairyto...@fairy.org
> --end--
> Same thing with a .xlsx file:
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text 
> no_cell_separators_when_extracted.xlsx
> --end--
> The output is:
> --begin--
> [Content_Types].xml
> _rels/.rels
> xl/_rels/workbook.xml.rels
> xl/workbook.xml
> xl/theme/theme1.xml
> xl/worksheets/_rels/sheet1.xml.rels
> xl/worksheets/sheet2.xml
> xl/worksheets/sheet3.xml
> xl/sharedStrings.xml
> NameEmailSanta claussa...@claus.orgtooth fairyto...@fairy.org
> xl/styles.xml
> xl/worksheets/sheet1.xml
> 012345
> docProps/core.xml
> GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z
> docProps/app.xml
> Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
> --end--
> Also note that the values from docProps/app.xml have been juxtaposed as well.
> This way, after indexing these files using the output from Tika, a search 
> engine will only find "Fairy" when substring matching is used, because "Tooth 
> Fairy" becomes "Tooth fairyto...@fairy.org". This is suboptimal and wrong.
> Thanks for your attention. Best regards,
> Georger

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
