Please open an issue on our JIRA with a short example file. We can
perhaps look into making this behavior configurable.
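
In the meantime, one possible stopgap is to strip the markers from the
extracted text before it is indexed into Solr. This is only a sketch and not
a Tika feature: it assumes the markers always look like [image: ...] as in
your message, and that the alt text never contains a closing bracket. The
names strip_image_markers and IMAGE_MARKER are hypothetical.

```python
import re

# Matches markers such as "[image: this is the alt text]" or "[image: ]".
# Assumption: the alt text never contains a closing bracket.
IMAGE_MARKER = re.compile(r"\[image:[^\]]*\]")


def strip_image_markers(text: str) -> str:
    """Remove [image: ...] markers and collapse the spaces they leave behind."""
    without_markers = IMAGE_MARKER.sub(" ", text)
    # Collapse runs of spaces and tabs, but leave line breaks untouched.
    return re.sub(r"[ \t]{2,}", " ", without_markers).strip()


if __name__ == "__main__":
    sample = "Annual report [image: city logo] for 2018 [image: ] is attached."
    print(strip_image_markers(sample))
```

Note that this would also remove any genuine text that happens to match the
same pattern, so it is worth checking against a few real documents first.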
On Thu, Oct 11, 2018 at 6:00 PM Hanjan, Harinder
<[email protected]> wrote:
>
> Hello!
>
> We are using Tika Server to extract text from rich files (HTML, PDF, DOCX,
> XLS, etc.). By default, Tika extracts the alt text of images present in
> HTML files and returns it as [image: this is the alt text of the image], which
> becomes part of the document’s extracted text. This ends up in Solr and shows
> up in the results when we generate document summaries at query time (via
> Solr’s highlight functionality). You can see this at
> https://imgur.com/a/zTc9X6m
>
>
> Based on the docs, I have tried the following Tika config, but I continue to
> see [image: ] tags in the extracted text.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <mime-exclude>image/jpeg</mime-exclude>
>       <parser-exclude class="org.apache.tika.parser.image.ImageParser"/>
>       <parser-exclude class="org.apache.tika.parser.jpeg.JpegParser"/>
>     </parser>
>   </parsers>
> </properties>
>
> > java -Dtika.config=tikaconfig.xml -jar tika-server-1.17.jar
>
> What am I doing wrong? How can I tell Tika Server to ignore embedded images?
>
>
>
> Thanks!
>
> Harinder