Re: Help identifying hair-lines in PDFs using PDFBox and tabula
I've found that if I set the DPI to 72, the locations of the Rulings match the original PDF page.

On Tue, May 23, 2017 at 12:02 PM, Gilad Denneboom wrote:
> PS. I'm also happy to hear any ideas on how to achieve it using PDFBox on
> its own, without tabula...
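The 72 DPI observation is consistent with PDF user space defaulting to 72 units per inch, so a page rendered at 72 DPI maps one pixel to one point. For other DPI values, a minimal Java sketch of the conversion might look like the following. It assumes the Ruling coordinates are image pixels with a top-left origin, that the page is unrotated, and that its crop box starts at (0, 0). If the coordinates already use a bottom-left origin, the vertical flip should be dropped. `RulingCoords` and `imageToPdf` are hypothetical names, not part of tabula or PDFBox.

```java
import java.awt.geom.Rectangle2D;

/** Sketch: map a rectangle found on a rendered page image back to PDF user space. */
final class RulingCoords {

    /** PDF user space uses 72 units per inch by default. */
    static final double POINTS_PER_INCH = 72.0;

    /**
     * Converts a rectangle in image pixels (origin top-left, y grows downward)
     * to PDF points (origin bottom-left, y grows upward).
     * Assumption: unrotated page whose crop box starts at (0, 0).
     */
    static Rectangle2D.Double imageToPdf(Rectangle2D px, double dpi, double pageHeightPt) {
        double scale = POINTS_PER_INCH / dpi;
        double x = px.getX() * scale;
        double w = px.getWidth() * scale;
        double h = px.getHeight() * scale;
        // Flip vertically: the bottom edge in image space becomes the lower-left y in PDF space.
        double y = pageHeightPt - (px.getY() + px.getHeight()) * scale;
        return new Rectangle2D.Double(x, y, w, h);
    }

    public static void main(String[] args) {
        // Hypothetical ruling detected on a 144 DPI rendering of a 612 x 792 pt page.
        Rectangle2D px = new Rectangle2D.Double(200, 100, 400, 2);
        System.out.println(imageToPdf(px, 144, 792));
    }
}
```

At 72 DPI the scale factor is 1, which would explain why the positions line up without any conversion; only the vertical flip (if needed) remains.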
Re: Help identifying hair-lines in PDFs using PDFBox and tabula
PS. I'm also happy to hear any ideas on how to achieve it using PDFBox on its own, without tabula...

On Tue, May 23, 2017 at 12:01 PM, Gilad Denneboom wrote:
> There doesn't seem to be one... I guess I can try StackOverflow.
Re: Help identifying hair-lines in PDFs using PDFBox and tabula
There doesn't seem to be one... I guess I can try StackOverflow.

On Tue, May 23, 2017 at 11:54 AM, Andreas Lehmkühler wrote:
> Sounds like a question for the tabulapdf community ...
Re: Help identifying hair-lines in PDFs using PDFBox and tabula
> Gilad Denneboom wrote on May 22, 2017 at 22:07:
>
> Hi all,
>
> So I'm trying to identify hair-lines in my PDFs. I came across tabula,
> which seems to be able to do it, but I can't quite get it to work with my
> files in the way I need it to, so any help is greatly appreciated!
>
> Here's what I've been doing so far: I used the Ruling object from tabula to
> extract both the horizontal and vertical rules from a stripped version of
> the PDF page (i.e., after removing all the text in it).
> I'm getting results but now I want to relate them back to the original PDF
> page, and that's proving difficult. If I add a text field using the
> coordinates of the Ruling objects, they are way off from where I would
> expect them to be. I think it has to do with the DPI setting used to
> convert the PDF page to an image, which is necessary for the rulings
> extraction.
> So my question is: How can I take these Ruling objects and convert them
> back to the original coordinates of the PDF?
> I would also like to be able to only identify lines of a certain width and
> height, but if I get the rectangles to work correctly I think I can do that
> in post-processing.

Sounds like a question for the tabulapdf community ...

Andreas

> Thanks in advance!
> Gilad
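For the post-processing step mentioned in the question (keeping only lines of a certain width and height), one possible approach is to filter the extracted rectangles by their thin and long sides. This is only a sketch: `HairlineFilter`, `filter`, and the threshold values are hypothetical, and it assumes each ruling has already been converted to a `Rectangle2D` in PDF points.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: keep only thin, sufficiently long rectangles (candidate hair-lines). */
final class HairlineFilter {

    /**
     * Returns the rectangles whose shorter side is at most maxThickness
     * and whose longer side is at least minLength (same units as the input).
     */
    static List<Rectangle2D> filter(List<? extends Rectangle2D> rects,
                                    double maxThickness, double minLength) {
        List<Rectangle2D> kept = new ArrayList<>();
        for (Rectangle2D r : rects) {
            double thin = Math.min(r.getWidth(), r.getHeight());
            double len = Math.max(r.getWidth(), r.getHeight());
            if (thin <= maxThickness && len >= minLength) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical rulings in PDF points.
        List<Rectangle2D> rulings = Arrays.asList(
                new Rectangle2D.Double(50, 700, 100, 0.25),  // thin horizontal line
                new Rectangle2D.Double(50, 600, 0.2, 50),    // thin vertical line
                new Rectangle2D.Double(50, 500, 100, 5),     // too thick
                new Rectangle2D.Double(50, 400, 3, 0.1));    // too short
        // Thresholds are illustrative only: at most 0.3 pt thick, at least 10 pt long.
        System.out.println(filter(rulings, 0.3, 10).size());
    }
}
```

Taking the minimum and maximum of width and height lets the same check handle horizontal and vertical rulings without treating them separately.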