Hi Georger,

Tika uses the ASF's JIRA bug filing system, available here:

http://issues.apache.org/jira/browse/TIKA

You can use that system to file your bug.

HTH,
Chris



On 1/17/09 8:49 AM, "Georger Araujo" <georger...@yahoo.com.br> wrote:

> Hi,
> I have been trying Tika for a few days now and I have to say I am very
> impressed. Great work!
>
> I apologize in advance for attaching a .zip file to this message. Tika does
> not yet appear in the ASF Bugzilla, so I cannot file a proper bug and attach
> the file where it really belongs. I thought it would be acceptable if 1) I
> kept the size small, and 2) I checked it to be virus-free. It's just 10K;
> once again, if someone feels offended, I apologize.
>
> I would like to report a bug. Tika revision is svn-20090116, platform is
> Windows XP Pro SP3, JDK version is 1.6.0_06.
>
> I plan on using Tika to extract text from Excel (both .xls and .xlsx) files
> for indexing. However, I found that Tika runs the contents of adjacent cells
> together with no separator between them. The example worksheets are in the
> attached .zip file.
>
> When I run
>
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xls
> --end--
>
> I get the following output:
>
> --begin--
> Plan1
>         NameEmailSanta claussa...@claus.org
>         Tooth fairyto...@fairy.org
> --end--
>
> The same thing happens with a .xlsx file:
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xlsx
> --end--
>
> The output is:
>
> --begin--
> [Content_Types].xml
>
>
>
> _rels/.rels
>
>
>
> xl/_rels/workbook.xml.rels
>
>
>
> xl/workbook.xml
>
>
>
> xl/theme/theme1.xml
>
>
>
> xl/worksheets/_rels/sheet1.xml.rels
>
>
>
> xl/worksheets/sheet2.xml
>
>
>
> xl/worksheets/sheet3.xml
>
>
>
> xl/sharedStrings.xml
> NameEmailSanta claussa...@claus.orgtooth fairyto...@fairy.org
>
>
> xl/styles.xml
>
>
>
> xl/worksheets/sheet1.xml
> 012345
>
>
> docProps/core.xml
> GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z
>
>
> docProps/app.xml
> Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
> --end--
>
> Also note that the values from docProps/app.xml have been juxtaposed as well.
>
> This way, after indexing these files using the output from Tika, a search
> engine will only find "Fairy" when substring matching is used, because "Tooth
> Fairy" becomes "Tooth fairyto...@fairy.org". This is suboptimal and wrong.
>
> Thanks for your attention. Best regards,
>
> Georger

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

