[tesseract-ocr] Re: Suggestions on running PDFs through Tesseract without losing vector graphics?

Tom Morris Thu, 10 Sep 2015 10:12:15 -0700

On Thursday, September 10, 2015 at 2:31:03 AM UTC-4, [email protected] 
wrote:
>
> On Friday, September 4, 2015 at 9:38:20 PM UTC-7, Jeff Breidenbach wrote:
>>
>> But I would like to see an example PDF - one of the simpler ones - just 
>> to see how the vector graphics were done. Please do not get your hopes up.
>>
>
> I would upload a page, but unfortunately I'd be worried about running 
> afoul of any copyright restrictions upon the book.
>

I suspect a single representative page used in this educational context
would qualify for "fair use" under U.S. copyright law, but it's your call.
Even if you don't publish a page, I'd be curious who the publisher/imprint
is and whether this format is standard practice for them.

> As far as I can tell, the text is implemented with each letter (or, in the
> case of dotted letters, contiguous portions of letters) being a single
> closed vector shape.
>

Dotted letters?!?! I hope you're not hoping to recognize those too.

I agree with Jeff that this sounds like a difficult task and it seems like
a lot of work for a one-off, but I think it's doable. A searchable PDF is
basically an image layer with an invisible text layer registered on top of
it. I suspect that, instead of a base image layer, you could have a base
vector graphics layer with a registered invisible text layer over it.
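
To make the "invisible text layer" idea concrete, here is a minimal sketch of
the content-stream operators for one invisible word. PDF text rendering mode 3
(`3 Tr`) draws text with neither fill nor stroke, which is what keeps the layer
invisible but searchable. The helper names and the font resource name `F1` are
hypothetical; the sketch assumes a simple font is already registered under that
name in the page resources and that the text fits that font's encoding.

```python
def pdf_escape(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_word(word: str, x: float, y: float,
                   font_size: float, font: str = "F1") -> str:
    """Return content-stream operators placing `word` invisibly at (x, y).

    Coordinates are in PDF points with the origin at the bottom-left
    of the page. `font` is an assumed resource name, not a real default.
    """
    return (
        "BT\n"                                 # begin text object
        "3 Tr\n"                               # rendering mode 3: invisible
        f"/{font} {font_size:g} Tf\n"          # select font and size
        f"1 0 0 1 {x:.2f} {y:.2f} Tm\n"        # text matrix: translate only
        f"({pdf_escape(word)}) Tj\n"           # show the string
        "ET\n"                                 # end text object
    )


if __name__ == "__main__":
    print(invisible_word("f(x)", 72, 720, 10))
```

A real overlay would also need to stretch each word horizontally to match the
width of the underlying glyphs; that is left out here.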

My imagined pipeline would be something like:

- page segmentation - using either the PDF (depending on what info is
available there) or a rasterized version of the page. This will give you a
page layout breakdown by block type (text, image, drawings).
- rasterize - either just the text blocks or the entire page at a good
resolution for OCR work
- OCR - get text along with coordinates for each word/line
- PDF assembly - crack open the original PDF, copy its contents, and insert
the invisible text with the coordinates registered to the correct place on
the underlying vector graphic text (see the Tesseract sources for one example
of how this is done)
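
The registration step in that pipeline comes down to a coordinate conversion.
OCR boxes are in raster pixels with the origin at the top-left, while PDF user
space is in points (1/72 inch) with the origin at the bottom-left. A minimal
sketch follows, assuming the page was rasterized unrotated at a known DPI and
that its page box starts at (0, 0). The `Box` type and function name are
hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def pixels_to_points(box: Box, dpi: float, page_height_pt: float) -> Box:
    """Convert a pixel box (top-left origin) to PDF points (bottom-left origin).

    Assumes no page rotation and a page box whose origin is (0, 0).
    """
    def to_pt(px: float) -> float:
        return px * 72.0 / dpi  # 72 points per inch

    return Box(
        left=to_pt(box.left),
        top=page_height_pt - to_pt(box.top),        # flip the y axis
        right=to_pt(box.right),
        bottom=page_height_pt - to_pt(box.bottom),
    )


if __name__ == "__main__":
    # A word box from a 300 DPI raster of a US Letter page (792 pt tall).
    print(pixels_to_points(Box(300, 300, 600, 350), dpi=300, page_height_pt=792))
```

The `bottom` value of the converted box is a reasonable first guess for the
text baseline, though descenders will throw it off slightly.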

Hopefully you are either going to be searching for a LOT of words in the
book to make this worthwhile or are willing to write off the time
investment as a science experiment.

Tom
