Tika Server - don't extract embedded images?

Hanjan, Harinder Thu, 11 Oct 2018 15:00:42 -0700

Hello!

We are using Tika Server to extract text from rich files (HTML, PDF, DOCX, XLS, 
etc.). By default, Tika extracts the alt text of images present in HTML files 
and returns it as [image: this is the alt text of the image], which becomes part 
of the document's extracted text. That text ends up in Solr and shows up in the 
results when we generate document summaries at query time (via Solr's highlight 
functionality). You can see this at https://imgur.com/a/zTc9X6m
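
To make the symptom concrete, here is a minimal sketch of a stopgap filter that could be applied to the extracted text before it is indexed. It is not something Tika provides: the function name strip_image_markers and the regular expression are assumptions based only on the marker shape shown above, and it would also remove any genuine text that happens to look like [image: ...].

```python
import re

# Assumed marker shape, taken from the example above: "[image: <alt text>]".
# This pattern is a guess at the format, not a documented Tika contract.
IMAGE_MARKER = re.compile(r"\[image:[^\]]*\]")


def strip_image_markers(text: str) -> str:
    """Remove "[image: ...]" markers and collapse the spaces left behind."""
    without_markers = IMAGE_MARKER.sub(" ", text)
    # Collapse runs of spaces/tabs created by the removal; newlines are kept.
    return re.sub(r"[ \t]{2,}", " ", without_markers).strip()


if __name__ == "__main__":
    sample = "Annual report [image: this is the alt text of the image] for 2018."
    print(strip_image_markers(sample))
```

This only hides the markers after the fact; what I am really after is a way to stop Tika Server from emitting them in the first place.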


Based on the docs, I have tried the following Tika config, but I continue to see 
[image: ] tags in the extracted text:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.image.ImageParser"/>
      <parser-exclude class="org.apache.tika.parser.jpeg.JpegParser"/>
    </parser>
  </parsers>
</properties>

> java -Dtika.config=tikaconfig.xml -jar tika-server-1.17.jar

What am I doing wrong? How can I tell Tika Server to ignore embedded images?

Thanks!
Harinder

