Hi Tim,
and when is "extractInlineImages" actually effective?
Thanks, Sergey
On 19/05/17 16:16, Allison, Timothy B. wrote:
Yes, well, sorry. I’m thrilled someone is using it!
I tried to document that here:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29
See the OCR section.
And there’s a link to that page from
https://wiki.apache.org/tika/TikaOCR (See OCR on PDFs)
How can we improve the documentation so that you don’t waste an hour?
From: David Pilato [mailto:[email protected]]
Sent: Friday, May 19, 2017 5:55 AM
To: [email protected]
Subject: Re: Extracting Text from embedded images in PDF docs
Got it working. In case someone else hits the same issue, here is my
config file... Well... That was obvious :D
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>
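In case it helps anyone debugging a similar config: below is a rough sketch, using only the JDK's XML parser and no Tika calls, that lists which params a tika-config.xml declares for a given parser class. The class name TikaConfigParams and the method paramsFor are made up for illustration. It only reads the XML, so it shows what the file says, not what Tika ends up doing with the values.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: reads a tika-config.xml and returns
// the <param> name/value pairs declared under the <parser> with the given class.
class TikaConfigParams {

    static Map<String, String> paramsFor(InputStream xml, String parserClass) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse config XML", e);
        }
        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue; // some other parser entry, skip it
            }
            NodeList declared = parser.getElementsByTagName("param");
            for (int j = 0; j < declared.getLength(); j++) {
                Element param = (Element) declared.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // Same structure as the config file posted above.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
                + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
                + "<params><param name=\"ocrStrategy\" type=\"string\">ocr_and_text</param></params>"
                + "</parser>"
                + "</parsers></properties>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(paramsFor(in, "org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

To check a real file, pass the same stream you hand to TikaConfig (for example the classpath resource) instead of the inline string.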
David
On 19 May 2017, at 10:59, David Pilato <[email protected]
<mailto:[email protected]>> wrote:
So I saw in debug mode that indeed config.getExtractInlineImages()
is false so I'm going to check my config.
:D
David
On 18 May 2017, at 22:18, David Pilato <[email protected]
<mailto:[email protected]>> wrote:
Hey guys
First post here ;)
I'm trying to play with OCR with Tika. I installed Tesseract and
I can extract text from a PNG image.
I created a PDF document with this image embedded and I'm trying
now to extract the text out of it.
I added this configuration but I guess I'm doing it wrong:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
I'm creating my Tika instance with something like:
TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
detector = config.getDetector();
parser = new AutoDetectParser(config);
tika = new Tika(detector, parser);
Any idea? I have a feeling that my XML config is wrong, but I can't find
what the right syntax should be.
Thanks for your help guys!
David