Help identifying hair-lines in PDFs using PDFBox and tabula

Gilad Denneboom Mon, 22 May 2017 13:08:28 -0700

Hi all,

So I'm trying to identify hair-lines in my PDFs. I came across tabula,
which seems to be able to do it, but I can't quite get it to work with my
files in the way I need it to, so any help is greatly appreciated!


Here's what I've been doing so far: I used the Ruling object from tabula to
extract both the horizontal and vertical rules from a stripped version of
the PDF page (i.e., after removing all the text in it).
I'm getting results but now I want to relate them back to the original PDF
page, and that's proving difficult. If I add a text field using the
coordinates of a Ruling object, it ends up way off from where I would
expect it to be. I think it has to do with the DPI setting used to
convert the PDF page to an image, which is necessary for the rulings
extraction.
So my question is: How can I take these Ruling objects and convert them
back to the original coordinates of the PDF?
I would also like to be able to only identify lines of a certain width and
height, but if I get the rectangles to work correctly I think I can do that
in post-processing.
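
To make the question concrete, here's a rough sketch of the conversion I
think I need. It is only a guess and rests on assumptions I haven't
verified: that the Ruling coordinates are in image pixels with the origin
at the top-left, that the page is not rotated, and that its crop box starts
at (0, 0). The class and method names (RulingCoordinates, toPdfSpace,
isLongEnough) are made up for illustration and are not part of tabula or
PDFBox. It uses plain java.awt.geom.Line2D so it runs on its own.

```java
import java.awt.geom.Line2D;

public class RulingCoordinates {
    // Default PDF user space has 72 units per inch.
    static final double PDF_UNITS_PER_INCH = 72.0;

    // Convert a line measured in image pixels (origin top-left, y down)
    // to PDF user space (origin bottom-left, y up).
    // Assumes no page rotation and a crop box starting at (0, 0).
    static Line2D.Double toPdfSpace(Line2D pixelLine, double dpi, double pageHeightPts) {
        double scale = PDF_UNITS_PER_INCH / dpi;
        return new Line2D.Double(
                pixelLine.getX1() * scale,
                pageHeightPts - pixelLine.getY1() * scale,
                pixelLine.getX2() * scale,
                pageHeightPts - pixelLine.getY2() * scale);
    }

    // Post-processing filter: keep only lines at least minLengthPts long.
    static boolean isLongEnough(Line2D line, double minLengthPts) {
        return line.getP1().distance(line.getP2()) >= minLengthPts;
    }
}
```

For example, with a page rendered at 144 DPI and a page height of 792
points, the scale factor is 72 / 144 = 0.5, so every pixel coordinate is
halved and the y values are then subtracted from 792.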

Thanks in advance!
Gilad
