Please open an issue on our JIRA and attach a short example file. We may
be able to parameterize this behavior.
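
Until something like that exists, one possible workaround is to strip the markers from the extracted text before it is sent to Solr. The following is only a minimal sketch, not a Tika feature: it assumes the markers always look like "[image: some alt text]" as in the message below, and the function name strip_image_markers is hypothetical.

```python
import re

# Assumption: markers have the form "[image: some alt text]" and the alt text
# itself never contains a closing bracket "]".
IMAGE_MARKER = re.compile(r"\[image:[^\]]*\]")


def strip_image_markers(text: str) -> str:
    """Remove "[image: ...]" markers from already-extracted text.

    Each marker is replaced by a single space, runs of spaces/tabs are then
    collapsed to one space, and leading/trailing whitespace is trimmed.
    """
    without_markers = IMAGE_MARKER.sub(" ", text)
    return re.sub(r"[ \t]{2,}", " ", without_markers).strip()


if __name__ == "__main__":
    sample = "Before [image: this is the alt text of the image] after"
    print(strip_image_markers(sample))
```

This would run in whatever client code passes Tika Server's output on to Solr. It only hides the markers after extraction; it does not change what Tika emits.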
On Thu, Oct 11, 2018 at 6:00 PM Hanjan, Harinder
<[email protected]> wrote:
>
> Hello!
>
>
>
> We are using Tika Server to extract text from rich files (HTML, PDF, DOCX,
> XLS, etc.). By default, Tika extracts the alt text of images present in
> HTML files and returns it as [image: this is the alt text of the image], which
> becomes part of the document’s extracted text. This ends up in Solr and shows
> up in the results when we generate document summaries at query time (via
> Solr’s highlight functionality). You can see this at
> https://imgur.com/a/zTc9X6m
>
>
> Based on the docs, I have tried the following Tika config, but I continue to
> see [image: ] tags in the extracted text.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <mime-exclude>image/jpeg</mime-exclude>
>       <parser-exclude class="org.apache.tika.parser.image.ImageParser"/>
>       <parser-exclude class="org.apache.tika.parser.jpeg.JpegParser"/>
>     </parser>
>   </parsers>
> </properties>
>
>
>
> > java -Dtika.config=tikaconfig.xml -jar tika-server-1.17.jar
>
>
>
> What am I doing wrong? How can I tell Tika Server to ignore embedded images?
>
>
>
> Thanks!
>
> Harinder
