Help identifying hair-lines in PDFs using PDFBox and tabula

2017-05-22 Thread Gilad Denneboom
Hi all,

So I'm trying to identify hair-lines in my PDFs. I came across tabula,
which seems to be able to do it, but I can't get it to quite work with my
files in the way I need it to, so any help is greatly appreciated!

Here's what I've been doing so far: I used the Ruling object from tabula to
extract both the horizontal and vertical rules from a stripped version of
the PDF page (i.e., after removing all the text in it).
I'm getting results, but now I want to relate them back to the original PDF
page, and that's proving difficult. If I add a text field using the
coordinates of the Ruling objects, it ends up way off from where I would
expect it to be. I think it has to do with the DPI setting used to
convert the PDF page to an image, which is necessary for the rulings
extraction.
So my question is: How can I take these Ruling objects and convert them
back to the original coordinates of the PDF?
I would also like to be able to only identify lines of a certain width and
height, but if I get the rectangles to work correctly I think I can do that
in post-processing.

Thanks in advance!
Gilad
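
A minimal sketch of the conversion asked about above, under several assumptions that are not confirmed in this thread: the Ruling coordinates are in pixels of an image rendered at a known DPI, the image origin is top-left with y pointing down, the page is not rotated, and the page box starts at (0, 0). The class and method names are hypothetical. The page height in points would come from the page itself (in PDFBox, for example, page.getMediaBox().getHeight()).

```java
/** Sketch: map a rectangle from rendered-image pixels back to PDF page space. */
final class PixelToPdf {
    /** Default PDF user space has 72 units per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /**
     * Converts a rectangle given in image pixels (origin top-left, y down)
     * to PDF user space (origin bottom-left, y up).
     * Returns {x, y, width, height}, where (x, y) is the lower-left corner.
     * Assumes no page rotation and a page box whose lower-left corner is (0, 0).
     */
    static double[] imageToPdf(double px, double py, double pw, double ph,
                               double dpi, double pageHeightPt) {
        double scale = POINTS_PER_INCH / dpi;        // pixels -> points
        double x = px * scale;
        double w = pw * scale;
        double h = ph * scale;
        double y = pageHeightPt - (py * scale + h);  // flip the y axis
        return new double[] {x, y, w, h};
    }

    /** Post-processing filter: true if the thinner side is at most maxThicknessPt. */
    static boolean isHairline(double widthPt, double heightPt, double maxThicknessPt) {
        return Math.min(widthPt, heightPt) <= maxThicknessPt;
    }

    public static void main(String[] args) {
        // Hypothetical ruling found at 144 DPI on a page that is 792 pt high.
        double[] r = imageToPdf(144, 288, 288, 2, 144, 792);
        System.out.printf("x=%.1f y=%.1f w=%.1f h=%.1f hairline=%b%n",
                r[0], r[1], r[2], r[3], isHairline(r[2], r[3], 1.0));
    }
}
```

If the rulings turn out to be in page units already (only with a top-left origin), the same method can be called with dpi set to 72, so that only the y flip is applied.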


Re: Linearized dictionary

2017-05-22 Thread Tilman Hausherr

On 22.05.2017 at 12:30, Andreas Lehmkühler wrote:

> > With 1.8.2 the Linearized check works properly. But with 2.0.5 I cannot
> > find the Linearized dictionary, because the key is not in any
> > dictionary's key set. Please let me know if you need more details.
>
> I can confirm the behaviour. The object is read but not dereferenced as it
> isn't needed. Consequently that dictionary isn't part of the object pool.
> I have no solution yet.



@karthick: as a "dumb" workaround, just read 1024 bytes (or whatever is
best) and search for "Linearized".


Tilman
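
A rough sketch of that workaround, using only the JDK. It is a heuristic rather than a real validation: it only checks whether the name /Linearized appears in the first 1024 bytes, which is where the PDF specification requires the linearization parameter dictionary to be located. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Sketch: guess whether a PDF is linearized by scanning its first bytes. */
final class LinearizedSniffer {
    /** Number of leading bytes to inspect. */
    static final int HEADER_BYTES = 1024;

    /** Returns true if the given leading bytes contain the /Linearized name. */
    static boolean looksLinearized(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so binary content is harmless.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        return text.contains("/Linearized");
    }

    /** Reads up to HEADER_BYTES from the file and applies the check above. */
    static boolean looksLinearized(Path pdf) throws IOException {
        byte[] buf = new byte[HEADER_BYTES];
        int total = 0;
        try (InputStream in = Files.newInputStream(pdf)) {
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < buf.length
                    && (n = in.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
        }
        return looksLinearized(Arrays.copyOf(buf, total));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(looksLinearized(Paths.get(args[0])));
        }
    }
}
```

A file that was linearized and later saved incrementally can still contain this dictionary, so a positive result only means that the marker is present.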




Re: Linearized dictionary

2017-05-22 Thread Andreas Lehmkühler
> karthick g wrote on 22 May 2017 at 06:17:
> 
> 
> Hi team,
> 
> Here is the code; I am using COSName.getPDFName("Linearized"). The problem
> is described after it:
> 
> PDDocument pdDoc = PDDocument.load(new File(""));
> COSDocument cosDoc = pdDoc.getDocument();
> List lObj = cosDoc.getObjects();
> for (Object object : lObj) {
>     COSBase curObj = ((COSObject) object).getObject();
>     if (curObj instanceof COSDictionary) {
>         COSDictionary cOSDictionary = (COSDictionary) curObj;
>         if (cOSDictionary.keySet().contains(COSName.getPDFName("Linearized"))) {
>             //System.out.println("Linearized");
>         }
>     }
> }
> 
> With 1.8.2 the Linearized check works properly. But with 2.0.5 I cannot
> find the Linearized dictionary, because the key is not in any
> dictionary's key set. Please let me know if you need more details.

I can confirm the behaviour. The object is read but not dereferenced as it
isn't needed. Consequently that dictionary isn't part of the object pool.
I have no solution yet.

Andreas
> 
> 
> 
> 
> Regards,
> Karthick G
> 
> On Fri, May 19, 2017 at 9:27 AM, karthick g  wrote:
> 
> > Hi,
> > I need to check whether my PDF file is linearized, i.e. optimized for fast
> > web view.
> > In the previous version (1.8.2) of PDFBox, Linearized was available as a
> > COSName. I would get each COSDictionary, check whether it contained the
> > Linearized key, and conclude that the PDF was suited for fast web view. Now
> > the Linearized keyword is not in the list of COSName keys. How can I get
> > the Linearized dictionary in PDFBox?
> > Please let me know if you need more details.
> >
> > Regards,
> > Karthick G
> >
> >
> >
> > On Thu, May 18, 2017 at 9:17 AM, karthick g 
> > wrote:
> >
> >> Hi team,
> >>
> >> I am a long-time user of PDFBox. We have started migrating PDFBox from
> >> 1.8.2 to 2.0.5.
> >> During the migration I found that the Linearized dictionary moved to the
> >> preflight jar.
> >> I created the PDDocument based on the preflight context, which returns
> >> null.
> >> Since the PDDocument is null, I can't proceed further. What is the right
> >> way to get the Linearized dictionary in the current version of PDFBox?
> >> Please guide me.
> >> Please let me know if you need more details.
> >>
> >> Regards,
> >> Karthick G
> >>
> >>
> >>
> >>
> >
