Thanks Tim. I have created TIKA-2755.

Cheers!
Harinder

-----Original Message-----
From: Tim Allison <[email protected]> 
Sent: Friday, October 12, 2018 10:59 AM
To: [email protected]
Subject: [EXT] Re: Tika Server - don't extract embedded images?

Please open an issue on our JIRA with a short example file. We can look into
parameterizing this behavior, maybe?

On Thu, Oct 11, 2018 at 6:00 PM Hanjan, Harinder <[email protected]> wrote:
>
> Hello!
>
>
>
> We are using Tika Server to extract text from rich file formats (HTML, PDF,
> DOCX, XLS, etc.). By default, Tika extracts the alt text of images in
> HTML files and returns it as [image: this is the alt text of the image],
> which becomes part of the document’s extracted text. This ends up in Solr
> and shows up in the results when we generate document summaries at query
> time (via Solr’s highlight functionality).
> You can see this at https://imgur.com/a/zTc9X6m
>
>
> Based on the docs, I have tried the following Tika config, but I continue to
> see [image: ] tags in the extracted text.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <mime-exclude>image/jpeg</mime-exclude>
>       <parser-exclude class="org.apache.tika.parser.image.ImageParser"/>
>       <parser-exclude class="org.apache.tika.parser.jpeg.JpegParser"/>
>     </parser>
>   </parsers>
> </properties>
>
>
>
> > java -Dtika.config=tikaconfig.xml -jar tika-server-1.17.jar
>
>
>
> What am I doing wrong? How can I tell Tika Server to ignore embedded images?
>
>
>
> Thanks!
>
> Harinder
>
>

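Until the behavior discussed in this thread is configurable, one possible stopgap is to strip the markers from the extracted text before it is indexed. The following is only a minimal sketch of that idea, not anything Tika or Solr provides: the function name strip_image_markers and the sample string are hypothetical, and the pattern assumes the markers always have the "[image: ...]" shape quoted above, with alt text that does not itself contain a "]".

```python
import re

# Matches the "[image: alt text]" markers described in the thread, including
# the empty "[image: ]" form. Alt text that itself contains "]" is not handled.
IMAGE_MARKER = re.compile(r"\[image:[^\]]*\]")


def strip_image_markers(text: str) -> str:
    """Remove [image: ...] markers and tidy the spaces left behind."""
    without_markers = IMAGE_MARKER.sub("", text)
    # Collapse runs of spaces or tabs created by the removal; keep newlines.
    return re.sub(r"[ \t]{2,}", " ", without_markers).strip()


if __name__ == "__main__":
    # Hypothetical sample of extracted text, for illustration only.
    sample = "Annual report [image: City logo] for 2018 [image: ] is attached."
    print(strip_image_markers(sample))
```

This would run on the client side, between the call to Tika Server and the update sent to Solr. It also removes any legitimate body text that happens to look like a marker, so it is a workaround rather than a fix.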