Still seeing the same issue with your test. Are you saying that you are seeing the regular text from the Word doc, not the text from the images? I’m sure it must be some Maven issue. Which jar/package is the OOXML parser in? Will Tika warn me if it can’t find an appropriate parser and is instead falling back to the default (empty) parser?
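One quick way to narrow down the "which jar" question is to ask the JVM directly whether the parser class is visible, and from where. The sketch below uses only the JDK; the class name `ClasspathCheck` and its helper methods are hypothetical names made up for illustration. It assumes the parser's fully qualified name is `org.apache.tika.parser.microsoft.ooxml.OOXMLParser`. Finding the class on the classpath is necessary but not sufficient: a config exclude or a service-loading problem (such as the one described in TIKA-3268) could still keep it out of the default parser.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of Tika: checks whether a class is visible
// on the classpath and reports which jar or directory it would load from.
final class ClasspathCheck {

    // Assumed fully qualified name of Tika's OOXML parser class.
    static final String OOXML_PARSER =
            "org.apache.tika.parser.microsoft.ooxml.OOXMLParser";

    /** Returns where the class is loaded from, or null if it cannot be found. */
    static String locate(String className) {
        try {
            // initialize=false: look the class up without running static initializers
            Class<?> c = Class.forName(className, false,
                    ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK classes have no code source
            return src == null ? "(bootstrap / JDK)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    static boolean isOnClasspath(String className) {
        return locate(className) != null;
    }

    public static void main(String[] args) {
        String where = locate(OOXML_PARSER);
        System.out.println(where == null
                ? "OOXMLParser is NOT on the classpath"
                : "OOXMLParser would load from " + where);
    }
}
```

Run it with the same classpath as the failing application (for example from the same Maven module). If it reports that the class is missing, the dependency that ships the Microsoft parsers is most likely not being pulled in; if it prints a jar location, the problem is more likely in the Tika config or parser loading.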
From: Tim Allison <[email protected]>
Sent: Friday, January 8, 2021 12:50 PM
To: [email protected]
Subject: Re: Problem parsing DOCX

Y. That means that somehow the OOXMLParser didn't make it to your path.

I just added that docx and a unit (not really) test, and it seems to work for me: https://github.com/tballison/tika-2_0-client-examples/blob/master/src/test/java/TestRotation.java#L82

What does your config look like?

I'd recommend merging from {{main}} and rebuilding -- TIKA-3268 fixed a bug that can silently prevent parsers from loading if there's a typo in the exclude-parser's class.

On Fri, Jan 8, 2021 at 11:38 AM Peter Kronenberg <[email protected]> wrote:

Trying to parse the attached Word file. No matter what it does with the images, I would expect to at least see the extracted text. I realize that the PDF options have no bearing here. But here is all I’m getting. Also note that it does not even identify it as a Word document, only as Office. And there is hardly any other metadata, only X-Parsed-By and Content-Type.

I’m sure the EmptyParser is a clue. Am I not including the correct parser?

checking: [c:\Program Files (x86)\Tesseract-OCR-4.0.0\tesseract.exe]
[main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
[main] INFO org.torchai.TikaOCRParser - Tesseract path: c:\Program Files (x86)\Tesseract-OCR-4.0.0\, exists: true
[main] INFO org.torchai.TikaOCRParser - Tessdata path: c:\Program Files (x86)\Tesseract-OCR-4.0.0\tessdata\, exists: true
[main] INFO org.torchai.TikaOCRParser - Image Magick path: c:\Program Files\ImageMagick-7.0.10-Q16-HDRI\, exists: true
[main] INFO org.torchai.TikaOCRParser - Python path: c:\python39\, exists: true
[main] INFO org.torchai.TikaOCRParser - enableImageProcessing: true
[main] INFO org.torchai.TikaOCRParser - apply rotation: false
[main] INFO org.torchai.TikaOCRParser - PDF Extract inline images: true
[main] INFO org.torchai.TikaOCRParser - PDF OCR Strategy: AUTO
[main] INFO org.torchai.TikaOCRParser - PDF OCR DPI: 100
[main] INFO org.torchai.TikaOCRParser - PDF Detect angles: true
[main] INFO org.torchai.TikaOCRParser - calling parse on c:\testFiles\Skewed Dickens.docx
[main] INFO org.torchai.TikaOCRParser - mimeType = application/x-tika-ooxml
[main] INFO org.torchai.TikaOCRParser - X-Parsed-By: org.apache.tika.parser.EmptyParser
[main] INFO org.torchai.TikaOCRParser - Content-Type: application/x-tika-ooxml

Text:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser" />
<meta name="Content-Type" content="application/x-tika-ooxml" />
<title></title>
</head>
<body /></html>
