Karl,

Got it. I understand the point about XObjects, and that PDFBox might be skipping the XObject because XObjects are typically images. I am hoping someone here has had luck getting PDFBox to extract data from XObject elements that contain text.
Thanks,

Pulkit

On Thu, Feb 2, 2017 at 10:36 AM, Karl Heinz Kremer <[email protected]> wrote:

> Pulkit,
>
> I did not say that in your document the XObjects are images, I said that
> they usually are just images. When you analyze 100 random PDF documents,
> chances are that most of them only use the XObject construct for images
> and vector graphics, not for elements that contain text. Your documents
> are an exception.
>
> Karl Heinz Kremer
> PDF Acrobatics Without a Net
> PDF Software Development, Training and More...
>
> [email protected]
> http://www.khkonsulting.com
>
> On Thu, Feb 2, 2017 at 10:33 AM, Pulkit Kapur <[email protected]> wrote:
>
> > Thanks Karl for the reply.
> > That's helpful.
> >
> > What confuses me is this: "very likely because usually such an XObject
> > would just be an image"
> > -> I am able to select the underlying text in the XObject using Acrobat
> > and copy/paste it.
> > That's why I am confused why PDFBox cannot access the XObject.
> >
> > Perhaps it is more nuanced than how I am phrasing it.
> >
> > Thanks,
> >
> > Pulkit
> >
> > On Thu, Feb 2, 2017 at 10:27 AM, Karl Heinz Kremer <[email protected]> wrote:
> >
> > > The document does not contain layers (or optional content groups, as
> > > they are called in PDF). The problem seems to be that the actual text
> > > of the document is in an XObject - something that is completely legal
> > > in a PDF file. I suspect that the text was created in one application,
> > > and then a second application was used to create a new page, then
> > > placed the header on it as "normal" text, and in a second step placed
> > > the original content into this XObject and then placed it on the page.
> > > This is oftentimes what e.g. an imposition application would do.
> > > Without having checked in the sources, I would assume that when you
> > > extract text, PDFBox will just process the Contents structure on the
> > > page, but will not recurse into XObjects that are encountered - very
> > > likely because usually such an XObject would just be an image.
> > >
> > > Karl Heinz Kremer
> > >
> > > On Thu, Feb 2, 2017 at 10:10 AM, Pulkit Kapur <[email protected]> wrote:
> > >
> > > > Hi
> > > >
> > > > I have uploaded the pdf here:
> > > > https://www.scribd.com/document/338221804/0024-iros-2016
> > > >
> > > > I did some more diagnosis last night and it seems that there are two
> > > > layers in the pdf: one with the content and the other with headers
> > > > and footers. PDFBox is only reading the headers and footers.
> > > > I suspect this must be common with all conference proceedings.
> > > >
> > > > Thanks,
> > > >
> > > > Pulkit
> > > >
> > > > On Thu, Feb 2, 2017 at 1:21 AM, Tilman Hausherr <[email protected]> wrote:
> > > >
> > > > > On 02.02.2017 at 05:55, Pulkit Kapur wrote:
> > > > >
> > > > >> Hi
> > > > >>
> > > > >> I am trying to read some past years' IEEE conference proceedings
> > > > >> that I have.
> > > > >> I can read the pdf using Acrobat and select the text.
> > > > >>
> > > > >> But when I try to read the text using the readText function from
> > > > >> the pdfbox library, I only get the headers and footers in the pdf.
> > > > >>
> > > > >> I did check that the document is not encrypted.
> > > > >> Also, my code works on other pdf documents, but all IEEE
> > > > >> proceedings that are downloaded from IEEE fail to work.
> > > > >>
> > > > >> I have attached the pdf document with this message.
> > > > >
> > > > > Please upload the pdf somewhere, PDF attachments are not allowed
> > > > > here.
> > > > >
> > > > > Tilman
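To make the hypothesis discussed in this thread concrete, here is a minimal toy model in plain Java. It does not use PDFBox and says nothing about what PDFBox actually does; `XObjectTextSketch`, `Stream`, `text`, `invoke`, and `extract` are all hypothetical names invented for this illustration. It only shows why an extractor that walks the page's content stream without following XObject invocations would see the header and footer but not the body text, while one that recurses would see everything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these classes are hypothetical and are NOT part of PDFBox.
class XObjectTextSketch {

    /** A content stream: ordered operations plus the named XObjects it can invoke. */
    static class Stream {
        // Each op is {"text", content} or {"do", xobjectName}.
        final List<String[]> ops = new ArrayList<>();
        // XObjects by resource name; a null value stands in for an image (no text inside).
        final Map<String, Stream> xobjects = new HashMap<>();

        Stream text(String s) {
            ops.add(new String[] {"text", s});
            return this;
        }

        Stream invoke(String name, Stream form) {
            xobjects.put(name, form);
            ops.add(new String[] {"do", name});
            return this;
        }
    }

    /** Collects text in drawing order; descends into XObjects only when recurse is true. */
    static List<String> extract(Stream s, boolean recurse) {
        List<String> out = new ArrayList<>();
        for (String[] op : s.ops) {
            if (op[0].equals("text")) {
                out.add(op[1]);
            } else if (recurse) {
                Stream form = s.xobjects.get(op[1]);
                if (form != null) {            // skip image-like XObjects
                    out.addAll(extract(form, true));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The original paper content, wrapped in a form XObject by a second application.
        Stream body = new Stream().text("Abstract").text("I. Introduction");
        // The new page: header and footer as "normal" text, body placed via the XObject.
        Stream page = new Stream()
                .text("Conference header")
                .invoke("Fm0", body)
                .invoke("Im0", null)
                .text("Page footer");

        System.out.println(extract(page, false)); // [Conference header, Page footer]
        System.out.println(extract(page, true));  // [Conference header, Abstract, I. Introduction, Page footer]
    }
}
```

With `recurse` set to `false` the output matches the symptom reported above (headers and footers only). Whether the real cause in PDFBox is a missing recursion or something else would need to be checked against the PDFBox sources and the actual file.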

