Sorry for not responding sooner.  The file that you attached helps me
understand this question quite a bit.

The basic answer is: no, not yet, not in general.  The correct way to do
OCR on PDFs like this might be to render each page without its stored text
and then run OCR on that rendering (minus text).  We're not yet doing this.

As mentioned, see our wiki for the two main options for running OCR on PDFs:
https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)

On your specific test file (see code and output below), you can use option 1
for PDFs (i.e. extract inline images) and you get what you want.  This will
not generalize, though, because some PDFs use thousands of images per
page.

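import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

// Option 1: extract inline images so that each one is handed to the OCR parser.
PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, config);

Parser p = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
Path path = Paths.get("/..../Simple-text-image.pdf");
try (InputStream tis = TikaInputStream.get(path, metadata)) {
    p.parse(tis, handler, metadata, parseContext);
}
System.out.println(handler.toString());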


<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:PDFVersion" content="1.7" />
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:hasXFA" content="false" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
<meta name="dc:creator" content="Peter Kronenberg" />
<meta name="language" content="en-US" />
<meta name="dcterms:created" content="2020-12-31T20:08:09Z" />
<meta name="Last-Modified" content="2020-12-31T20:08:09Z" />
<meta name="dcterms:modified" content="2020-12-31T20:08:09Z" />
<meta name="dc:format" content="application/pdf; version=1.7" />
<meta name="xmpMM:DocumentID" content="uuid:2C3CDC87-5F9C-49B7-9E4F-0E16A7AE27BC" />
<meta name="Last-Save-Date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:fill_in_form" content="true" />
<meta name="pdf:docinfo:modified" content="2020-12-31T20:08:09Z" />
<meta name="meta:save-date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:encrypted" content="false" />
<meta name="xmp:CreateDate" content="2020-12-31T15:08:09Z" />
<meta name="modified" content="2020-12-31T20:08:09Z" />
<meta name="Content-Length" content="47113" />
<meta name="pdf:hasMarkedContent" content="true" />
<meta name="Content-Type" content="application/pdf" />
<meta name="xmp:ModifyDate" content="2020-12-31T15:08:09Z" />
<meta name="pdf:docinfo:creator" content="Peter Kronenberg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
<meta name="creator" content="Peter Kronenberg" />
<meta name="dc:language" content="en-US" />
<meta name="meta:author" content="Peter Kronenberg" />
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="meta:creation-date" content="2020-12-31T20:08:09Z" />
<meta name="created" content="2020-12-31T20:08:09Z" />
<meta name="access_permission:extract_for_accessibility" content="true" />
<meta name="access_permission:assemble_document" content="true" />
<meta name="xmpTPg:NPages" content="2" />
<meta name="Creation-Date" content="2020-12-31T20:08:09Z" />
<meta name="resourceName" content="Simple-text-image.pdf" />
<meta name="pdf:hasXMP" content="true" />
<meta name="access_permission:extract_content" content="true" />
<meta name="access_permission:can_print" content="true" />
<meta name="Author" content="Peter Kronenberg" />
<meta name="producer" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:can_modify" content="true" />
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:docinfo:created" content="2020-12-31T20:08:09Z" />
<title></title>
</head>
<body><div class="page"><p />
<p>Start of text
</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut
</p>
<p>labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis
nunc sed augue lacus. Et netus
</p>
<p>et malesuada fames. Nunc sed id semper risus in hendrerit gravida.
Scelerisque fermentum dui
</p>
<p>faucibus in ornare quam viverra orci. Dolor morbi non arcu risus.
Pharetra massa massa ultricies
</p>
<p>mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl
suscipit adipiscing. Auctor
</p>
<p>augue mauris augue neque gravida in fermentum et sollicitudin. Sed
vulputate mi sit amet mauris
</p>
<p>commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget.
Rutrum quisque non tellus
</p>
<p>orci ac auctor augue.
</p>
<p>Phasellus faucibus scelerisque eleifend donec pretium vulputate
sapien nec sagittis. Vestibulum
</p>
<p>rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt
lobortis. Diam ut venenatis
</p>
<p>tellus in metus vulputate eu scelerisque felis. Nulla malesuada
pellentesque elit eget gravida cum
</p>
<p>sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt
lobortis. Dictum varius duis at
</p>
<p>consectetur lorem donec massa sapien faucibus. Integer malesuada
nunc vel risus. Sit amet
</p>
<p>consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc
non blandit massa enim nec dui
</p>
<p>nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis
at tellus at urna
</p>
<p>condimentum mattis pellentesque id. Egestas tellus rutrum tellus
pellentesque eu tincidunt tortor
</p>
<p>aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna.
</p>
<p>Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae
ultricies leo. Gravida neque
</p>
<p>convallis a cras. Enim nec dui nunc mattis. Non odio euismod
lacinia at quis risus sed.
</p>
<p>Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id
semper risus in hendrerit
</p>
<p>gravida rutrum. Mi bibendum neque egestas congue quisque egestas
diam in. Pharetra sit amet
</p>
<p>aliquam id diam maecenas ultricies mi. Semper risus in hendrerit
gravida rutrum quisque non
</p>
<p>tellus orci.
</p>
<p>Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non
odio euismod. Mollis
</p>
<p>aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat
nibh. Tristique senectus et
</p>
<p>netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet
consectetur adipiscing elit ut
</p>
<p>aliquam purus. Sollicitudin ac orci phasellus egestas tellus.
Gravida in fermentum et sollicitudin
</p>
<p>ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat
maecenas volutpat blandit
</p>
<p>aliquam etiam erat. Sed blandit libero volutpat sed cras ornare
arcu dui. Interdum posuere lorem
</p>
<p>ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus
dolor purus non. Ac turpis egestas
</p>
<p>sed tempus urna. Nam aliquam sem et tortor consequat id porta.
</p>
<p>End of text
</p>
<p>
</p>
<p>  </p>
<p />
</div>
<div class="page"><p />
<p>Start of image
</p>
<p>End of image
</p>
<p> </p>
<p />
<img src="embedded:image0.png" alt="image0.png" /><div
class="ocr">Pellentesque adipiscing commodo elit at imperdiet dui.
Consectetur purus ut faucibus pulvinar. Tincidunt
praesent semper feugiat nibh sed pulvinar. Sagittis aliquam malesuada
bibendum arcu vitae elementum
curabitur vitae. Velit euismod in pellentesque massa placerat duis.
Fermentum et sollicitudin ac orci
phasellus egestas tellus. Ante in nibh mauris cursus mattis molestie.
Commodo quis imperdiet massa
fincidunt nunc pulvinar sapien et ligula. Lorem mollis aliquam ut
porttitor leo a diam sollicitudin

tempor. Amet aliquam id diam maecenas ultricies mi eget mauris
pharetra. Ullamcorper dignissim cras
tincidunt lobortis feugiat vivamus at augue eget. Orci eu lobortis
elementum nibh tellus. Id aliquet

lectus proin nibh nis! condimentum. Vitae elementum curabitur vitae
nunc sed velit. Rhoncus dolor purus
non enim praesent elementum facilisis leo vel. Velit egestas dui id
ornare arcu odio ut sem nulla. Purus
sit amet luctus venenatis lectus magna fringilla. Maecenas sed enim ut sem.
</div>

</div>
</body></html>
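
In the output above, the OCR'd text is wrapped in <div class="ocr">, while
the stored text is in plain <p> elements.  If you need to keep the two apart
downstream, something like the following rough sketch might help.  It uses
only the JDK's XML parser; OcrTextSplitter and ocrBlocks are made-up names
for illustration, not part of Tika, and it assumes the handler output is
well-formed XHTML with no DOCTYPE, as in the output above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper (not part of Tika): pulls the OCR'd text out of the
// XHTML string produced by ToXMLContentHandler.
class OcrTextSplitter {

    // Returns the trimmed text of every <div class="ocr">, in document order.
    static List<String> ocrBlocks(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: the output carries no DOCTYPE, so rejecting one is harmless.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            List<String> blocks = new ArrayList<>();
            NodeList divs = doc.getElementsByTagName("div");
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if ("ocr".equals(div.getAttribute("class"))) {
                    blocks.add(div.getTextContent().trim());
                }
            }
            return blocks;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>Start of image</p>"
                + "<div class=\"ocr\">Pellentesque adipiscing</div></div>"
                + "</body></html>";
        for (String block : ocrBlocks(sample)) {
            System.out.println(block);
        }
    }
}
```

In the snippet further up you would call OcrTextSplitter.ocrBlocks(handler.toString()).
Note that this only separates the two kinds of text after the fact; it does
not stop text from being OCR'd twice when whole pages are rendered.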


On Thu, Dec 31, 2020 at 9:58 AM Peter Kronenberg <[email protected]>
wrote:

> I’ve got Tika working with Tesseract on PDF files, but it seems that if I
> give it a PDF file that has both searchable text and images, the text is
> OCRed twice.  Is there a way to avoid this?  Even if it has to make two
> passes, one for the straight text and then another for just the images
>
