RE: Problem parsing DOCX

Peter Kronenberg Fri, 08 Jan 2021 10:58:11 -0800

Still seeing the same issue with your test.  Are you saying that you are seeing 
the regular text from the Word doc, not the text from the images?
I’m sure it must be some Maven issue.  Which jar/package is the OOXML parser in?
Will Tika warn me if it can’t find an appropriate parser and is instead 
falling back to the default (empty) parser?
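
Whatever Tika itself logs, the fallback can be caught in the calling code by 
looking at the X-Parsed-By values after the parse. The sketch below is only an 
illustration: the class and method names are made up for this example, and it 
takes a plain String[] so that it runs without Tika on the classpath. In real 
code the array would presumably come from metadata.getValues("X-Parsed-By").

```java
import java.util.Arrays;

public class ParserFallbackCheck {
    // Class name that appears in X-Parsed-By in the log output in this thread
    // when no real parser handled the file.
    public static final String EMPTY_PARSER = "org.apache.tika.parser.EmptyParser";

    /** Returns true if any X-Parsed-By value names the EmptyParser. */
    public static boolean fellBackToEmptyParser(String[] parsedBy) {
        if (parsedBy == null) {
            return false; // nothing recorded, so nothing to report
        }
        return Arrays.asList(parsedBy).contains(EMPTY_PARSER);
    }

    public static void main(String[] args) {
        // Hypothetical values: the first mirrors the failing run in this thread,
        // the second is what a working OOXML parse might report.
        String[] failing = {"org.apache.tika.parser.EmptyParser"};
        String[] working = {
            "org.apache.tika.parser.DefaultParser",
            "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"
        };
        System.out.println("failing run fell back: " + fellBackToEmptyParser(failing));
        System.out.println("working run fell back: " + fellBackToEmptyParser(working));
    }
}
```

A check like this could log a warning or throw, so that an empty body is never 
mistaken for an empty document.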

From: Tim Allison <[email protected]>
Sent: Friday, January 8, 2021 12:50 PM
To: [email protected]
Subject: Re: Problem parsing DOCX

Y. That means that somehow the OOXMLParser didn't make it to your path.  I just 
added that docx and a unit (not really) test, and it seems to work for me: 
https://github.com/tballison/tika-2_0-client-examples/blob/master/src/test/java/TestRotation.java#L82
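
A quick way to confirm whether the parser class is visible at all is to ask 
the class loader for it directly. This is a minimal sketch, not part of the 
linked test: the helper class name is invented for the example, and it only 
shows that the class can be found, not that Tika's service loading has 
registered it.

```java
public class OoxmlParserClasspathCheck {
    // Fully qualified name of the OOXML parser class in Tika.
    public static final String OOXML_PARSER =
            "org.apache.tika.parser.microsoft.ooxml.OOXMLParser";

    /** Returns true if the named class can be located without initializing it. */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false,
                    OoxmlParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a dependency of it that could not be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(OOXML_PARSER + " on classpath: "
                + isOnClasspath(OOXML_PARSER));
    }
}
```

If this prints false when run with the same classpath as the application, the 
Maven dependency that provides the parser is probably missing or excluded.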

What does your config look like?

I'd recommend merging from main and rebuilding -- TIKA-3268 fixed a bug 
that can silently prevent parsers from loading if there's a typo in the 
exclude-parser's class.

On Fri, Jan 8, 2021 at 11:38 AM Peter Kronenberg 
<[email protected]> wrote:
Trying to parse the attached Word file.  No matter what it does with the 
images, I would expect to at least see the extracted text.  I realize that the 
PDF options have no bearing here, but here is all I’m getting.  Also note 
that it does not even identify it as a Word document, only as Office, and 
there is hardly any other metadata: only X-Parsed-By and Content-Type.
I’m sure the EmptyParser is a clue.  Am I not including the correct parser?

checking: [c:\Program Files (x86)\Tesseract-OCR-4.0.0\tesseract.exe]
[main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract OCR is 
installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on 
via TikaConfig.
[main] INFO org.torchai.TikaOCRParser - Tesseract path: c:\Program Files 
(x86)\Tesseract-OCR-4.0.0\, exists: true
[main] INFO org.torchai.TikaOCRParser - Tessdata path:  c:\Program Files 
(x86)\Tesseract-OCR-4.0.0\tessdata\, exists: true
[main] INFO org.torchai.TikaOCRParser - Image Magick path: c:\Program 
Files\ImageMagick-7.0.10-Q16-HDRI\, exists: true
[main] INFO org.torchai.TikaOCRParser - Python path: c:\python39\, exists: true
[main] INFO org.torchai.TikaOCRParser - enableImageProcessing: true
[main] INFO org.torchai.TikaOCRParser - apply rotation: false
[main] INFO org.torchai.TikaOCRParser - PDF Extract inline images: true
[main] INFO org.torchai.TikaOCRParser - PDF OCR Strategy: AUTO
[main] INFO org.torchai.TikaOCRParser - PDF OCR DPI: 100
[main] INFO org.torchai.TikaOCRParser - PDF Detect angles: true
[main] INFO org.torchai.TikaOCRParser - calling parse on c:\testFiles\Skewed 
Dickens.docx
[main] INFO org.torchai.TikaOCRParser - mimeType = application/x-tika-ooxml
[main] INFO org.torchai.TikaOCRParser - X-Parsed-By: 
org.apache.tika.parser.EmptyParser
[main] INFO org.torchai.TikaOCRParser - Content-Type: application/x-tika-ooxml
Text: <html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser" />
<meta name="Content-Type" content="application/x-tika-ooxml" />
<title></title>
</head>
<body /></html>
