Re: Problem reading a PDF file

2020-04-24 Thread Maruan Sahyoun
Dear Glen, PDFStreamParser is only for parsing PDF content streams (so specific parts of a PDF) and not the complete PDF. As a starting point, take a look at CustomGraphicsStreamEngine and/or CustomPageDrawer in the examples package. Also, PDFTextStripper will give you some ideas of how to process

Problem reading a PDF file

2020-04-24 Thread Glen Peterson
I'm trying to examine an existing PDF file. Initially to extract text and maybe images, but ultimately to apply some logic to the formatting of the text to turn it into valid HTML with H1, H2, ul, li, etc. I thought I would start like this: PDFStreamParser sParse = new

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Pulkit Kapur
Thank you all. You do a great service. I am up and running. Thanks, Pulkit On Thu, Feb 2, 2017 at 3:19 PM, Tilman Hausherr wrote: > On 02.02.2017 at 21:12, Pulkit Kapur wrote: > >> I am getting just the headers: >> "2016 IEEE/RSJ International Conference on Intelligent

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Tilman Hausherr
On 02.02.2017 at 21:12, Pulkit Kapur wrote: I am getting just the headers: "2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Daejeon Convention Center October 9-14, 2016, Daejeon, Korea 978-1-5090-3761-2/16/$31.00 ©2016 IEEE 5324 5325 5326 5327 5328 5329 5330 5331

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Pulkit Kapur
I am getting just the headers: "2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Daejeon Convention Center October 9-14, 2016, Daejeon, Korea 978-1-5090-3761-2/16/$31.00 ©2016 IEEE 5324 5325 5326 5327 5328 5329 5330 5331 " Did use the new file path:

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Tilman Hausherr
On 02.02.2017 at 20:26, Pulkit Kapur wrote: Thanks. That's what I would expect to read. Also thanks for pointing to the latest version. I pointed to the pdfbox-app-2.0.4.jar and the fontbox-2.0.4.jar files. Since I want to read over 1000 PDF documents programmatically in MATLAB, I am not using

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Pulkit Kapur
Thanks. That's what I would expect to read. Also thanks for pointing to the latest version. I pointed to the pdfbox-app-2.0.4.jar and the fontbox-2.0.4.jar files. Since I want to read over 1000 PDF documents programmatically in MATLAB, I am not using the command line, but using the Java library in

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Tilman Hausherr
On 02.02.2017 at 16:10, Pulkit Kapur wrote: Hi I have uploaded the pdf here: https://www.scribd.com/document/338221804/0024-iros-2016 Hello Pulkit, This site requires registration. This is a "don't" from the list: https://pdfbox.apache.org/support.html I don't want to register. Please

RE: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Allison, Timothy B.
Of Pulkit Kapur Sent: Thursday, February 2, 2017 10:34 AM To: users@pdfbox.apache.org Subject: Re: Fwd: Trouble reading IEEE pdf Thanks Karl for the reply. That's helpful. What confuses me is this: "very likely because usually such an XObject would just be an image" -> I am a

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Pulkit Kapur
Karl, Got it. I understand the point about XObjects and how PDFBox might be missing the XObject because typically they are images. I am hoping someone here might have had luck making PDFBox get data from XObject elements that contain text. Thanks, Pulkit On Thu, Feb 2, 2017 at 10:36 AM, Karl

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Karl Heinz Kremer
Pulkit, I did not say that in your document the XObjects are images, I said that they usually are just images. When you analyze 100 random PDF documents, chances are that most of them only use the XObject construct for images and vector graphics, not for elements that contain text. Your

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Pulkit Kapur
Thanks Karl for the reply. That's helpful. What confuses me is this: "very likely because usually such an XObject would just be an image" -> I am able to select the underlying text in the XObject using Acrobat and copy/paste it. That's why I am confused why PDFBox cannot access the XObject. Perhaps

Re: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Pulkit Kapur
Hi, I have uploaded the PDF here: https://www.scribd.com/document/338221804/0024-iros-2016 I did some more diagnosis last night and it seems that there are two layers in the PDF: one with the content and the other with headers and footers. PDFBox is only reading the headers and footers. I

Re: Fwd: Trouble reading IEEE pdf

2017-02-01 Thread Tilman Hausherr
On 02.02.2017 at 05:55, Pulkit Kapur wrote: Hi, I am trying to read some past years' IEEE conference proceedings I have. I can read the PDF using Acrobat and select the text. But when I try to read the text using the readText function from the PDFBox library, I only get the headers and footers in

Fwd: Trouble reading IEEE pdf

2017-02-01 Thread Pulkit Kapur
Hi, I am trying to read some past years' IEEE conference proceedings I have. I can read the PDF using Acrobat and select the text. But when I try to read the text using the readText function from the PDFBox library, I only get the headers and footers in the PDF. I did check the document is not

Re: Reading a PDF

2016-09-20 Thread John Hewson
> On 20 Sep 2016, at 06:13, Clark, Raymond C wrote: > > We have a need to read a PDF and create a PostScript file from it. This is > working pretty well but I have a question. > > I am currently extending PDFStreamEngine for image information, extending >

RE: Reading a PDF

2016-09-20 Thread Clark, Raymond C
: Reading a PDF On 20.09.2016 at 15:13, Clark, Raymond C wrote: > We have a need to read a PDF and create a PostScript file from it. This is > working pretty well but I have a question. > > I am currently extending PDFStreamEngine for image information, extending > PDFGraph

Re: Reading a PDF

2016-09-20 Thread Tilman Hausherr
On 20.09.2016 at 15:13, Clark, Raymond C wrote: We have a need to read a PDF and create a PostScript file from it. This is working pretty well but I have a question. I am currently extending PDFStreamEngine for image information, extending PDFGraphicsStreamEngine for information on lines,

Reading a PDF

2016-09-20 Thread Clark, Raymond C
We have a need to read a PDF and create a PostScript file from it. This is working pretty well but I have a question. I am currently extending PDFStreamEngine for image information, extending PDFGraphicsStreamEngine for information on lines, and extending PDFTextStripper for information on

Re: Spaces are ignored when reading a PDF file

2016-03-20 Thread John Hewson
The subject of this thread is "Spaces are ignored when reading a PDF file". Please post new questions to a new thread. John > On 18 Mar 2016, at 04:02, 风云天空 <1010800...@qq.com> wrote: > > who can help me > i get this error in multithreading > java

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread 风云天空
(Vector.java:1156) at java.util.Vector$Itr.next(Vector.java:1133) -- Original message -- From: "Hesham G.";<heshamgne...@gmail.com>; Sent: Friday, March 18, 2016, 4:44 PM To: "users"<users@pdfbox.apache.org>; Subject: Re: Spaces are ignored when read

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread clovis
Just an idea from someone who is not fluent in PDFBox or PDF: if you just want to know whether there is a space between the letters, and not the number of spaces, you can use your code to get character details and then use extractText to get the words. 2016-03-17 7:20 GMT-03:00 Hesham G.

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread Tilman Hausherr
On 17.03.2016 at 11:20, Hesham G. wrote: So to detect the space between the 2 words "With" & "due" should I make subtraction calculations between X of the last letter (h) and the X of the first letter (d) and if the number is larger than normal then this is a space? I think this way might be

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread Andreas Lehmkühler
Hi, > Frank van der Hulst wrote on 17 March 2016 at 08:34: > > > Spaces don't exist as characters in PDFs. To identify spaces, you have to > compare the X coordinates of adjacent characters against their widths. That's not correct; spaces exist, but in most

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread Hesham G.
Andreas, That is very helpful. I can get the x location of each character using TextPosition.getX(), ex: W: 102.88399 i: 114.18165 t: 117.660614 h: 121.55801 d: 133.09477 u: 140.3994 e: 147.60838 So to detect the space between the 2 words "With" & "due" should I make subtraction calculations
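
The subtraction idea discussed in this thread can be sketched in plain Java. This is only an illustration, not PDFBox code: the x values are the ones quoted in the message above, but the glyph widths and the 2.0 threshold are made-up assumptions (in PDFBox the width would presumably come from something like TextPosition.getWidth()), and the Glyph class and method names are hypothetical. The gap is measured from the right edge of the previous glyph, because comparing start positions alone would make a wide letter such as "W" look like it is followed by a space.

```java
final class SpaceGapSketch {

    /** Hypothetical glyph record: the character, its left x, and its advance width. */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the glyphs into a string, inserting one space wherever the gap
     * between the right edge of the previous glyph and the left edge of the
     * next glyph exceeds minGap.
     */
    static String joinWithInferredSpaces(Glyph[] glyphs, float minGap) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0) {
                Glyph prev = glyphs[i - 1];
                float gap = glyphs[i].x - (prev.x + prev.width);
                if (gap > minGap) {
                    out.append(' ');
                }
            }
            out.append(glyphs[i].ch);
        }
        return out.toString();
    }

    /** x values from the message above; the widths are assumed for illustration. */
    static Glyph[] sample() {
        return new Glyph[] {
            new Glyph('W', 102.88399f, 11.3f),
            new Glyph('i', 114.18165f, 3.48f),
            new Glyph('t', 117.660614f, 3.9f),
            new Glyph('h', 121.55801f, 6.5f),
            new Glyph('d', 133.09477f, 7.3f),
            new Glyph('u', 140.3994f, 7.2f),
            new Glyph('e', 147.60838f, 5.3f),
        };
    }

    public static void main(String[] args) {
        // The assumed threshold of 2.0 units is a guess; a real one would
        // likely depend on font size or the width of the space glyph.
        System.out.println(joinWithInferredSpaces(sample(), 2.0f));
    }
}
```

With these assumed widths only the gap between "h" and "d" (about 5 units) exceeds the threshold. As noted elsewhere in the thread, PDFTextStripper already infers spaces itself, so a hand-rolled check like this is mainly useful for understanding what is going on.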

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread Hesham G.
Andreas, You're absolutely right. I am testing it now, but it seems very complicated. I hope there might be another easier solution. Best regards , Hesham Included message : "Hesham G." hat

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread Hesham G.
John, I think I have got the idea ... Thumbs up Best regards , Hesham Included message : I’m rather confused by this thread; inferring spaces is one of the main features of PDFTextStripper. I’m not sure why anyone

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread Frank van der Hulst
Spaces don't exist as characters in PDFs. To identify spaces, you have to compare the X coordinates of adjacent characters against their widths. On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. wrote: > Hello , > > I have a PDF file created using LaTeX. I am trying to read and

Spaces are ignored when reading a PDF file

2016-03-19 Thread Hesham G.
Hello, I have a PDF file created using LaTeX. I am trying to read and print all letters in that file using PDFBox, but when doing this all spaces in that file are ignored. Here is the code I am using: PDPage page = (PDPage)allPages.get( 0 ); PDStream contents = page.getContents(); if (

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread Hesham G.
Clovis, Thanks a lot :) I will have to follow this solution if there is no alternative. The problem is that if I am extracting the text of a 500- or 600-page PDF, that will consume much additional memory and time. In addition, I guess it's a special case for LaTeX books only. Best regards ,

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread Tilman Hausherr
On 17.03.2016 at 07:12, Hesham G. wrote: Hello, I have a PDF file created using LaTeX. I am trying to read and print all letters in that file using PDFBox, but when doing this all spaces in that file are ignored. Here's what I get with ExtractText (your code is unusual), this looks

Re: Spaces are ignored when reading a PDF file

2016-03-18 Thread Hesham G.
Tilman, I am using this code to extract the text from the pdf because I need font information about the extracted characters like determining the font name used. Using the normal extraction code will not work in my case. Best regards , Hesham

Re: Spaces are ignored when reading a PDF file

2016-03-18 Thread Hesham G.
John , I have checked the PrintTextLocations.java example. I have tested using this code for the "With due" term in my book sample, using this code: System.out.println( "String[" + text.getCharacter() + ": " + text.getXDirAdj() + "," + text.getYDirAdj() + " fs="

Re: Spaces are ignored when reading a PDF file

2016-03-18 Thread Andreas Lehmkühler
> "Hesham G." wrote on 17 March 2016 at 11:20: > > > Andreas, > > That is very helpful. > > I can get the x location of each character using TextPosition.getX(), ex: > W: 102.88399 > i: 114.18165 > t: 117.660614 > h: 121.55801 > d: 133.09477 > u: 140.3994 >

Re: Spaces are ignored when reading a PDF file

2016-03-18 Thread John Hewson
I’m rather confused by this thread; inferring spaces is one of the main features of PDFTextStripper. I’m not sure why anyone is suggesting to process the text manually - there’s no need to do that. We do that already! Looking at the original code the problem is right here: > public class