On Thursday, September 10, 2015 at 2:31:03 AM UTC-4, [email protected] wrote:
>
> On Friday, September 4, 2015 at 9:38:20 PM UTC-7, Jeff Breidenbach wrote:
>>
>> But I would like to see an example PDF - one of the simpler ones - just
>> to see how the vector graphics were done. Please do not get your hopes up.
>>
>
> I would upload a page, but unfortunately I'd be worried about running
> afoul of any copyright restrictions upon the book.
>

I suspect a single representative page used in this educational context would qualify for "fair use" under U.S. copyright law, but it's your call. Even if you don't publish a page, I'd be curious who the publisher/imprint is and whether this format is standard practice for them.

> As far as I can tell, the text is implemented with each letter (or, in the
> case of dotted letters, contiguous portions of letters) being a single
> closed vector shape.
>

Dotted letters?!?! I hope you're not hoping to recognize those too.

I agree with Jeff that this sounds like a difficult task, and it seems like a lot of work for a one-off, but I think it's doable. A searchable PDF is basically an image layer with an invisible text layer registered on top of it. I suspect that, instead of a base image layer, you could have a base vector graphics layer with a registered invisible text layer over it.

My imagined pipeline would be something like:

- page segmentation - using either the PDF (depending on what info is available there) or a rasterized version of the page. This will give you a page layout breakdown by block type (text, image, drawings).
- rasterize - either just the text blocks or the entire page, at a good resolution for OCR work
- OCR - get text along with coordinates for each word/line
- PDF assembly - crack open the original PDF, copy its contents, and insert the invisible text with the coordinates registered to the correct place on the underlying vector graphic text (see the Tesseract sources for one example of how this is done)

Hopefully you are either going to be searching for a LOT of words in the book to make this worthwhile or are willing to write off the time investment as a science experiment.

Tom
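
As a rough illustration of the coordinate registration in the PDF assembly step, here is a minimal sketch using only the Python standard library. It is not taken from the Tesseract sources. The `WordBox` type, the function names, and the `/F1` font resource name are all hypothetical, and it assumes an unrotated page whose origin is at the bottom-left corner, OCR boxes measured in pixels from the top-left of the raster, and text limited to characters the chosen font encoding can represent. It converts each OCR box to PDF points and emits content-stream operators that draw the word in text rendering mode 3 (neither filled nor stroked, so invisible).

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """Hypothetical OCR result: text plus its bounding box in raster pixels,
    measured from the top-left corner of the rasterized page."""
    text: str
    left: int
    top: int
    width: int
    height: int


def pixel_box_to_pdf(box, dpi, page_height_pt):
    """Map a top-left-origin pixel box to bottom-left-origin PDF points."""
    scale = 72.0 / dpi  # default PDF user space is 72 units per inch
    x = box.left * scale
    # PDF y grows upward, so flip around the page height and use the
    # bottom edge of the box as an approximate baseline.
    y = page_height_pt - (box.top + box.height) * scale
    return x, y, box.width * scale, box.height * scale


def escape_pdf_string(s):
    """Escape the characters that are special inside a PDF literal string."""
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words, dpi, page_height_pt, font_name="/F1"):
    """Build content-stream operators placing each word invisibly.

    Assumes font_name is already defined in the page's font resources.
    """
    lines = ["BT", "3 Tr"]  # begin text; rendering mode 3 = invisible
    for w in words:
        x, y, _width_pt, height_pt = pixel_box_to_pdf(w, dpi, page_height_pt)
        lines.append(f"{font_name} {height_pt:.2f} Tf")   # font size ~ box height
        lines.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")       # position the word
        lines.append(f"({escape_pdf_string(w.text)}) Tj")  # show the string
    lines.append("ET")  # end text
    return "\n".join(lines)
```

The resulting operators would still have to be appended to the page's content stream with a PDF library of your choice, and a real implementation would probably also need to adjust horizontal scaling so each invisible word spans the same width as the vector glyphs underneath it.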

