On Thursday, September 10, 2015 at 5:20:42 PM UTC-7, Jeff Breidenbach wrote:
>
> If the PDF embedded vector graphics similar to how it does rasters, then
> this becomes somewhat practical. For example, if the vector graphics were
> an embedded SVG then we'd pull that out (similar to how pdfimages from
> poppler can pull out embedded rasters.) Then we'd teach Leptonica to
> read the SVG into a pix, which is a rasterization operation. At PDF
> generation time we'd throw out the pix and instead use the original SVG.
> But I don't think it works that way at all. I think the vector graphics
> commands are likely to be direct PDF primitives and therefore way, way,
> way too hard to play with in this fashion.
>
> I think a more likely approach (but still very unlikely!) is using a
> modified Tesseract to create a new PDF containing an invisible text
> layer and nothing else. Then hope someone has written a general
> purpose "composite two PDF files on top of each other" program 
> and use it to merge. Such a merging program would be pretty difficult
> to write, and it is hard for me to imagine why it would exist.
>
>
>
I don't know a whole lot about how Tesseract/Leptonica/pdfimages/etc. work,
but would it be possible to dump the entire page into one large raster
image and then use the segmentation data to cut out just the parts that
"look like" text? I know that pdftk (well, jPDFTweak, but that's
essentially a front-end for pdftk) and other similar tools can shove a
whole page into a single raster image file, which might be helpful in this
situation.
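
To make the "cut out just the text" idea concrete, here is a rough sketch
in plain Python. It assumes the page has already been rasterized to a
grayscale grid (for example with poppler's pdftoppm) and that some layout
analysis step has produced bounding boxes for the text regions. The
function name keep_only_text and the (left, top, width, height) box format
are made up for illustration; they are not Tesseract's or Leptonica's API.

```python
from typing import List, Tuple

# Hypothetical box format: (left, top, width, height) in pixels.
Box = Tuple[int, int, int, int]
WHITE = 255


def keep_only_text(page: List[List[int]], text_boxes: List[Box]) -> List[List[int]]:
    """Return a copy of a grayscale page with everything outside the
    given text boxes painted white. The input page is not modified."""
    height = len(page)
    width = len(page[0]) if height else 0
    # Start from a blank (all-white) page of the same size.
    out = [[WHITE] * width for _ in range(height)]
    for left, top, w, h in text_boxes:
        # Clamp each box to the page so stray coordinates are ignored.
        x0, y0 = max(left, 0), max(top, 0)
        x1, y1 = min(left + w, width), min(top + h, height)
        if x1 <= x0 or y1 <= y0:
            continue
        # Copy the pixels inside the box from the original raster.
        for y in range(y0, y1):
            out[y][x0:x1] = page[y][x0:x1]
    return out


# Tiny demo: a 4x3 all-black page with one 2x2 "text" box.
demo_page = [[0, 0, 0, 0] for _ in range(3)]
demo_result = keep_only_text(demo_page, [(1, 0, 2, 2)])
```

The resulting text-only raster is what would be handed to the OCR engine,
so the vector graphics never pass through recognition at all. Whether the
segmentation is good enough to separate text from line art on these pages
is an open question.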

I'll see if I can either post one of the page PDFs, with enough of the
text and images removed to make it more or less useless for anything other
than researching how the text is stored, or make a different PDF with the
text converted to curves in a similar fashion. The issue is that it's from
a relatively big-name higher-education publisher, and I'd rather not get
smacked with some legal nastygram for doing something that could probably
be construed as piracy if someone felt like trying to ruin my day.
