Thanks Tim. I have created TIKA-2755.

Cheers!
Harinder

-----Original Message-----
From: Tim Allison <[email protected]>
Sent: Friday, October 12, 2018 10:59 AM
To: [email protected]
Subject: [EXT] Re: Tika Server - don't extract embedded images?

Please open an issue on our jira with a short example file. We can look into parameterizing this behavior, maybe?

On Thu, Oct 11, 2018 at 6:00 PM Hanjan, Harinder <[email protected]> wrote:
>
> Hello!
>
> We are using Tika Server to extract text from rich files, HTML, PDF,
> DOCX, XLS, etc. By default Tika is extracting the alt text of images
> present in HTML files and returns it as [image: this is the alt text
> of the image] which becomes part of the document’s extracted text.
> This ends up in Solr and shows up in the results when we generate
> document summaries at query time (via Solr’s highlight functionality).
> You can see this at
> https://urldefense.proofpoint.com/v2/url?u=https-3A__imgur.com_a_zTc9X6m&d=DwIFaQ&c=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M&r=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U&m=dwMbP_-LdUIFVK-ul8K2AqWKJRFWTkM1KfeDYQyxOec&s=l3fyPFfLoWdKkRdnB1y1h4dd8vnoHFBZ8Ii8dNNaZy0&e=
>
> Based on the docs, I have tried the following tika config but I continue to
> see [image: ] tags in extract text.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <mime-exclude>image/jpeg</mime-exclude>
>       <parser-exclude class="org.apache.tika.parser.image.ImageParser"/>
>       <parser-exclude class="org.apache.tika.parser.jpeg.JpegParser"/>
>     </parser>
>   </parsers>
> </properties>
>
> java -Dtika.config=tikaconfig.xml -jar tika-server-1.17.jar
>
> What am I doing wrong, how can I tell Tika Server to ignore embedded images?
>
> Thanks!
>
> Harinder
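
Until the behavior is configurable, one possible stopgap is to remove the [image: ...] markers from the extracted text before it is sent to Solr. The sketch below is only an illustration of that idea in Python: the function name strip_image_markers is hypothetical, the pattern is assumed from the marker format quoted in the message above, and none of it is a Tika or Solr feature.

```python
import re

# Assumed marker format, taken from the message above: "[image: alt text]".
# This is a post-processing workaround, not a Tika API.
IMAGE_MARKER = re.compile(r"\[image:[^\]]*\]")


def strip_image_markers(text: str) -> str:
    """Remove "[image: ...]" markers and collapse the spaces they leave behind."""
    without_markers = IMAGE_MARKER.sub("", text)
    # Collapse runs of spaces or tabs left where a marker used to be.
    return re.sub(r"[ \t]{2,}", " ", without_markers).strip()


if __name__ == "__main__":
    sample = "Intro [image: this is the alt text of the image] body"
    print(strip_image_markers(sample))
```

Note that this would also remove any literal "[image: ...]" text an author wrote on purpose, so it may be too aggressive for some document sets.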
