Re: Spaces are ignored when reading a PDF file

2016-03-18 Thread Hesham G.

Tilman,

I am using this code to extract the text from the pdf because I need font 
information about the extracted characters like determining the font name 
used. Using the normal extraction code will not work in my case.



Best regards ,
Hesham


Included message :

On 17.03.2016 at 07:12, Hesham G. wrote:

Hello ,

I have a PDF file created using Latex. I am trying to read and print all 
letters in that file using PDFBox, but when doing this all spaces in that 
file are ignored.


Here's what I get with ExtractText (your code is unusual), this
looks excellent to me:

article titles c©by Michael O’Kane are not part of the law mu7ami.com
Article [220] Right to Regulate
With due regard to Article (219), the competent authority has the right
of monitoring the companies with regard to application of the provisions
set forth in the law and the company’s articles of association and bylaw
including the authority to inspect the company and check its account and
ask for data from the board of directors or the company managers through
a representative or more of its personnel or experts it chooses for this
pur-
pose.
Article [221] Access to Records
All the company officials shall acquaint the Ministry representatives and
the Authority, fi the company is listed in the financial market or
seeking to
be listed, with regard to the works stated in Article (220), all that
they ask
of company books and records and documents and provide them with all
related information or clarification.
94 version 0.2 provided by mu7ami.com


-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org





Re: Strange "Save As" issue with Adobe Reader 11 / DC with PDF being encrypted by PDFBox 2 snapshot

2016-03-18 Thread Tilman Hausherr

On 16.03.2016 at 20:44, Stahle, Patrick wrote:

Where would you suggest I upload a sample file to? I will try reopening it in
PDFBox in a moment.


I've used filedropper.com but always be careful that such sites don't 
ask you to register or to install the downloader.


Tilman



-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, March 16, 2016 3:38 PM
To: users@pdfbox.apache.org
Subject: Re: Strange "Save As" issue with Adobe Reader 11 / DC with PDF being 
encrypted by PDFBox 2 snapshot

Can you reopen the file you saved with PDFBox? If not, please open an issue in 
JIRA and attach your file.

If yes, just upload the file somewhere, I'd like to have a look at the 
encryption dictionaries.

Tilman

On 16.03.2016 at 20:19, Stahle, Patrick wrote:

Hi,

This is not a general problem; it only occurs with an original PDF generated with 3D content
using Anark. The file seems to be encrypted when loaded and loads just fine in Adobe
Reader, but when we try to do a "Save As" we get the following error:
"The document could not be saved. There was a problem reading this document 21."

If I do a control click on the "ok" button. I get the following message:
"This direct object already has a container."

Any ideas what might be causing this problem? We have tried the same thing with 
iText and it does not experience this problem.

Sample Code that reproduces the problem:
  PDDocument doc = null;

  try {
      doc = PDDocument.load(pdfIn);
      PDPage page = null;
      AccessPermission apermission = new AccessPermission();
      apermission.setCanAssembleDocument(false);
      apermission.setCanExtractContent(false);
      apermission.setCanExtractForAccessibility(true);
      apermission.setCanFillInForm(true);
      apermission.setCanModifyAnnotations(true);
      apermission.setCanPrint(true);
      apermission.setReadOnly();
      StandardProtectionPolicy spp = new StandardProtectionPolicy(
              UUID.randomUUID().toString(), "", apermission);
      doc.protect(spp);

      for (int i = 0; i < doc.getNumberOfPages(); i++) {
          page = doc.getPage(i);
          PDPageContentStream canvas = new PDPageContentStream(doc, page,
                  PDPageContentStream.AppendMode.APPEND, true, true);
          canvas.saveGraphicsState();
          canvas.restoreGraphicsState();
          canvas.close();
      }
      doc.save(pdfOut);
      bRet = true;
  }
  finally {
      if (doc != null) {
          doc.close();
      }
  }

Thanks,
Patrick







Re: pdfrenderer unicode characters in latest 2.1.0-SNAPSHOT ?

2016-03-18 Thread Jesse Kuhnert
I already un-subscribed so not sure this will get through but I resolved my
problem finally by first calling PDDocument.save() and then
PDDocument.load() on the saved pdf before attempting to render images.

Previously as last step before saving pdf we would render the images out.
I'm not sure if the image render would work if I just called .save() first
or not but I'm just happy it's working again.

On Wed, Mar 16, 2016 at 2:29 PM, Tilman Hausherr 
wrote:

> On 16.03.2016 at 22:23, Jesse Kuhnert wrote:
>
>> It appears as if the Java2D font rendering logic in the new 2.x versions
>> missed something with Unicode support. My PDFs are output wonderfully with
>> my sample Unicode text (Bengali in this case), but when we try to produce
>> images, any glyphs which aren't English appear to be rendered as "blank"
>> somehow (meaning the spacing and line heights look as if ghost text is
>> taking up space there, but maybe that's just our logic for laying out the
>> PDF).
>>
>> Has anyone else tried to produce unicode based pdf images yet ?
>>
>>
>
> I just tried to render the file created with the EmbeddedFonts example and
> it works fine.
>
> Tilman
>
>
>


Re: Spaces are ignored when reading a PDF file

2016-03-18 Thread Hesham G.
John ,

I have checked the PrintTextLocations.java example. I have tested it on the
"With due" term in my book sample, using this code:
System.out.println( "String[" + text.getCharacter() + ": " + text.getXDirAdj() + "," +
        text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
        text.getXScale() + " height=" + text.getHeightDir() + " space=" +
        text.getWidthOfSpace() + " width=" + text.getWidthDirAdj() + "]" );
And here are the results:
String[W: 102.88399,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=11.9552]
String[i: 114.18165,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=3.4789658]
String[t: 117.660614,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=3.8973923]
String[h: 121.55801,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=6.957924]
String[d: 133.09477,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=7.3046265]
String[u: 140.3994,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=7.2089844]
String[e: 147.60838,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=5.7265472]

So which method do you mean? .. The getXDirAdj() ?


Best regards ,
Hesham


Included message :

I’m rather confused by this thread: inferring spaces is one of the main
features of PDFTextStripper. I’m not sure why anyone is suggesting to process
the text manually; there’s no need to do that. We do that already!

Looking at the original code the problem is right here:

> public class PDFTextStripperProcessor extends PDFTextStripper {
>@Override
>public void processTextPosition( TextPosition text )  {
>System.out.println( text.getCharacter() );
>}
> }

The processTextPosition method is used to pass an unprocessed TextPosition *in* 
to PDFTextStripper, but this override prevents that from happening, and is just 
printing the unprocessed token before PDFTextStripper has had a chance to do 
its job, such as inferring the missing spaces.

You should follow our PrintTextLocations.java example which shows you how to 
get the processed TextPositions from PDFTextStripper. It’s really easy to do.

— John

> On 17 Mar 2016, at 04:44, Hesham G.  wrote:
> 
> Andreas,
> 
> You're absolutely right. I am testing it now, but it seems very complicated. 
> I hope there might be another easier solution.
> 
> 
> Best regards ,
> Hesham
> 
> 
> Included message :
> 
>> "Hesham G." wrote on 17 March 2016 at 11:20:
>> 
>> 
>> Andreas,
>> 
>> That is very helpful.
>> 
>> I can get the x location of each character using TextPosition.getX(), ex:
>> W: 102.88399
>> i: 114.18165
>> t: 117.660614
>> h: 121.55801
>> d: 133.09477
>> u: 140.3994
>> e: 147.60838
>> 
>> So to detect the space between the 2 words "With" & "due" should I
>> subtract the X of the last letter (h) from the X of the first letter (d),
>> and if the number is larger than normal then this is a space? I think this
>> way might be risky in the detection, or what?
> That's the short story. Deciding what is normal can be quite tricky. You
> have to take the following facts into account:
> 
> - different fonts have different widths (important if the font before the
> space isn't the same as the font after the space)
> - keep in mind that you have to take scaling and sometimes rotation into
> account
> - the "space" between characters may vary if the text is justified
> 
> There are certainly some other details which may be important as well, so
> that you end up with a more or less heuristic approach.
> 
> BR
> Andreas
> 
>> Best regards ,
>> Hesham
>> 
>> 
>> Included message :
>> 
>> Hi,
>> 
>> > Frank van der Hulst wrote on 17 March 2016 at 08:34:
>> >
>> >
>> > Spaces don't exist as characters in PDFs. To identify spaces, you have to
>> > compare the X coordinates of adjacent characters against their widths.
>> That's not correct: spaces exist, but in most cases PDF engines omit them
>> and replace them with split text and appropriate positioning.
>> 
>> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>> 
>>   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
>>   (Article) -384 (\(219\),) -416 (the) -384 (competent) -383 (authority)
>>   -383 (has) -384 (the) -383 (right) ] TJ
>> 
>> The text is between the parentheses and the numbers are used for horizontal
>> positioning.
>> 
>> BR
>> Andreas
>> 
>> >
>> > On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. wrote:
>> >
>> > > Hello ,
>> > >
>> > > I have a PDF file created using Latex. I am trying to read and print > > 
>> > >

Re: Spaces are ignored when reading a PDF file

2016-03-18 Thread Andreas Lehmkühler
> "Hesham G." wrote on 17 March 2016 at 11:20:
> 
> 
> Andreas,
> 
> That is very helpful.
> 
> I can get the x location of each character using TextPosition.getX(), ex:
> W: 102.88399
> i: 114.18165
> t: 117.660614
> h: 121.55801
> d: 133.09477
> u: 140.3994
> e: 147.60838
> 
> So to detect the space between the 2 words "With" & "due" should I
> subtract the X of the last letter (h) from the X of the first letter (d),
> and if the number is larger than normal then this is a space? I think this
> way might be risky in the detection, or what?
That's the short story. Deciding what is normal can be quite tricky. You have
to take the following facts into account:

- different fonts have different widths (important if the font before the space
isn't the same as the font after the space)
- keep in mind that you have to take scaling and sometimes rotation into
account
- the "space" between characters may vary if the text is justified

There are certainly some other details which may be important as well, so that
you end up with a more or less heuristic approach.
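
The gap comparison discussed above could be sketched roughly as follows. This is
only an illustration in plain Java, not the algorithm PDFTextStripper actually
uses: the Glyph class, the GAP_FACTOR of 0.5 and the method names are
assumptions made for this sketch, and it ignores rotation, font changes and
justified text. The sample values are the x position, width and space width
that Hesham posted for "With due".

```java
import java.util.Arrays;
import java.util.List;

final class GapSpaceSketch {

    /** Hypothetical holder for the values printed in the thread (not a PDFBox class). */
    static final class Glyph {
        final String text;
        final double x;          // left edge, as printed by getXDirAdj()
        final double width;      // as printed by getWidthDirAdj()
        final double spaceWidth; // as printed by getWidthOfSpace()

        Glyph(String text, double x, double width, double spaceWidth) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    /** Assumed threshold: a gap wider than half a space counts as a word break. */
    static final double GAP_FACTOR = 0.5;

    /** Concatenates glyphs left to right, inserting a space where the gap is large. */
    static String joinWithInferredSpaces(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                // Distance between the right edge of the previous glyph and the left edge of this one.
                double gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FACTOR * prev.spaceWidth) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    /** The seven glyphs of "With due" with the values posted in the thread. */
    static List<Glyph> sample() {
        double space = 2.9888;
        return Arrays.asList(
                new Glyph("W", 102.88399, 11.9552, space),
                new Glyph("i", 114.18165, 3.4789658, space),
                new Glyph("t", 117.660614, 3.8973923, space),
                new Glyph("h", 121.55801, 6.957924, space),
                new Glyph("d", 133.09477, 7.3046265, space),
                new Glyph("u", 140.3994, 7.2089844, space),
                new Glyph("e", 147.60838, 5.7265472, space));
    }

    public static void main(String[] args) {
        System.out.println(joinWithInferredSpaces(sample())); // prints "With due"
    }
}
```

With these values only the gap between "h" and "d" (about 4.58 units against a
space width of 2.9888) exceeds the threshold; all other gaps are zero or
slightly negative.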

BR
Andreas

> Best regards ,
> Hesham
> 
> 
> Included message :
> 
> Hi,
> 
> > Frank van der Hulst wrote on 17 March 2016 at 08:34:
> >
> >
> > Spaces don't exist as characters in PDFs. To identify spaces, you have to
> > compare the X coordinates of adjacent characters against their widths.
> That's not correct: spaces exist, but in most cases PDF engines omit them
> and replace them with split text and appropriate positioning.
> 
> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
> 
>   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
>   (Article) -384 (\(219\),) -416 (the) -384 (competent) -383 (authority)
>   -383 (has) -384 (the) -383 (right) ] TJ
> 
> The text is between the parentheses and the numbers are used for horizontal
> positioning.
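
To make the role of those numbers concrete, here is a small plain-Java sketch
that reads a TJ array like the one above and inserts a space wherever the
adjustment is strongly negative. In a TJ array the numbers are expressed in
thousandths of a text space unit and are subtracted from the current position,
so -383 moves the next string to the right by 0.383 times the font size (about
4.58 units at the font size of 11.9552 reported in this thread). The class name,
the -200 threshold and the simplified string parsing (backslash escapes only, no
octal codes or nested parentheses) are assumptions for illustration; this is not
how PDFBox parses content streams.

```java
final class TjSpaceSketch {

    /** Assumed threshold: adjustments below this value are treated as word gaps. */
    static final double WORD_GAP_THRESHOLD = -200.0;

    /** Extracts the text of a TJ array, inferring spaces from large negative adjustments. */
    static String decode(String tjArray) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = tjArray.length();
        while (i < n) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // String operand: copy characters up to the closing parenthesis.
                i++;
                while (i < n && tjArray.charAt(i) != ')') {
                    if (tjArray.charAt(i) == '\\' && i + 1 < n) {
                        i++; // drop the backslash, keep the escaped character
                    }
                    out.append(tjArray.charAt(i));
                    i++;
                }
                i++; // skip the closing parenthesis
            } else if (c == '-' || c == '.' || Character.isDigit(c)) {
                // Numeric operand: a horizontal adjustment in 1/1000 text space units.
                int start = i;
                while (i < n && !Character.isWhitespace(tjArray.charAt(i))
                        && tjArray.charAt(i) != '(' && tjArray.charAt(i) != ']') {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                if (adjustment < WORD_GAP_THRESHOLD) {
                    out.append(' ');
                }
            } else {
                i++; // brackets, whitespace and the TJ operator itself
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
                + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
                + "-383 (has) -384 (the) -383 (right) ] TJ";
        System.out.println(decode(sample));
    }
}
```

For the excerpt above this yields "With due regard to Article (219), the
competent authority has the right"; the small positive values (55, 18) are
kerning and do not produce spaces.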
> 
> BR
> Andreas
> 
> >
> > On Thu, Mar 17, 2016 at 7:12 PM, Hesham G.  wrote:
> >
> > > Hello ,
> > >
> > > I have a PDF file created using Latex. I am trying to read and print all
> > > letters in that file using PDFBox, but when doing this all spaces in 
> > > that
> > > file are ignored. Here is the code I am using:
> > > PDPage page = (PDPage)allPages.get( 0 );
> > > PDStream contents = page.getContents();
> > > if ( contents != null ) {
> > > PDFTextStripperProcessor pdfTextStripperProcessor = new
> > > PDFTextStripperProcessor();
> > > pdfTextStripperProcessor.processStream( page, page.findResources(),
> > > contents.getStream() );
> > > }
> > >
> > > public class PDFTextStripperProcessor extends PDFTextStripper {
> > > @Override
> > > public void processTextPosition( TextPosition text )  {
> > > System.out.println( text.getCharacter() );
> > > }
> > > }
> > >
> > > And you can check a one page file sample here to test it:
> > >
> > > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
> > >
> > > What is the cause of this issue please?
> > >
> > >
> > > Best regards ,
> > > Hesham
> 



RE: Strange "Save As" issue with Adobe Reader 11 / DC with PDF being encrypted by PDFBox 2 snapshot

2016-03-18 Thread Stahle, Patrick
Hi Tilman,

Oh, fun... the joy of working at a corporation with a proxy server: I can't
get to that site.

I guess from the corporate end of things I am supposed to use MoveIt. We'll get
to this tomorrow.

Thanks,
Patrick

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Wednesday, March 16, 2016 3:48 PM
To: users@pdfbox.apache.org
Subject: Re: Strange "Save As" issue with Adobe Reader 11 / DC with PDF being 
encrypted by PDFBox 2 snapshot

On 16.03.2016 at 20:44, Stahle, Patrick wrote:
> Where would you suggest I upload a sample file to? I will try reopening it
> in PDFBox in a moment.

I've used filedropper.com but always be careful that such sites don't ask you 
to register or to install the downloader.

Tilman

>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Wednesday, March 16, 2016 3:38 PM
> To: users@pdfbox.apache.org
> Subject: Re: Strange "Save As" issue with Adobe Reader 11 / DC with 
> PDF being encrypted by PDFBox 2 snapshot
>
> Can you reopen the file you saved with PDFBox? If not, please open an issue 
> in JIRA and attach your file.
>
> If yes, just upload the file somewhere, I'd like to have a look at the 
> encryption dictionaries.
>
> Tilman
>
> On 16.03.2016 at 20:19, Stahle, Patrick wrote:
>> Hi,
>>
>> This is not a general problem; it only occurs with an original PDF generated
>> with 3D content using Anark. The file seems to be encrypted when loaded
>> and loads just fine in Adobe Reader, but when we try to do a "Save As" we
>> get the following error:
>> "The document could not be saved. There was a problem reading this document
>> 21."
>>
>> If I do a control click on the "ok" button. I get the following message:
>> "This direct object already has a container."
>>
>> Any ideas what might be causing this problem? We have tried the same thing 
>> with iText and it does not experience this problem.
>>
>> Sample Code that reproduces the problem:
>>   PDDocument doc = null;
>>
>>   try {
>>       doc = PDDocument.load(pdfIn);
>>       PDPage page = null;
>>       AccessPermission apermission = new AccessPermission();
>>       apermission.setCanAssembleDocument(false);
>>       apermission.setCanExtractContent(false);
>>       apermission.setCanExtractForAccessibility(true);
>>       apermission.setCanFillInForm(true);
>>       apermission.setCanModifyAnnotations(true);
>>       apermission.setCanPrint(true);
>>       apermission.setReadOnly();
>>       StandardProtectionPolicy spp = new StandardProtectionPolicy(
>>               UUID.randomUUID().toString(), "", apermission);
>>       doc.protect(spp);
>>
>>       for (int i = 0; i < doc.getNumberOfPages(); i++) {
>>           page = doc.getPage(i);
>>           PDPageContentStream canvas = new PDPageContentStream(doc, page,
>>                   PDPageContentStream.AppendMode.APPEND, true, true);
>>           canvas.saveGraphicsState();
>>           canvas.restoreGraphicsState();
>>           canvas.close();
>>       }
>>       doc.save(pdfOut);
>>       bRet = true;
>>   }
>>   finally {
>>       if (doc != null) {
>>           doc.close();
>>       }
>>   }
>>
>> Thanks,
>> Patrick
>>
>>
>
>


---

Re: Data from PDF - > MS Access

2016-03-18 Thread Tres Finocchiaro
>
> There is a good outlook connector called moyosoft but it does cost about
> $200


I'm not sure why you'd use outlook at all.  If the data is going to be
available in email, use something that can fetch email silently and process
the inbox as a batch process.

To have a desktop application interact with Outlook is likely a bad
design.  Something like a silent IMAP processor would be much more scalable
, in my opinion. :)

- tres.finocchi...@gmail.com

On Thu, Mar 17, 2016 at 12:45 AM, Al Grant  wrote:

> Thanks Ken.
>
> Yes that's the only way I can see so far.
>
> There is a good outlook connector called moyosoft but it does cost about
> $200
>
> The field itself might be called CompanyXYZ and have the data type longstring.
>
> I will also need record number in the email to update the correct record.
>
> Cheers
>
> Al
> On 17/03/2016 5:28 pm, "Ken Bowen"  wrote:
>
> > What is the nature of the feedback into the database?
> >
> > If it amounts to more or less make entries in fields in the db,
> > and you are stuck with email as a medium, you might hack up a
> > convention like this:
> > 1) Select a recognizable boundary line (begin & end), say a line of
> > at least 10 + or *, or whatever.
> > 2) Between the boundary lines, have your compatriots make entries like:
> > [Field Name] = [Value to be input]
> > with the restriction that no ‘=‘ sign occurs on either side (or replace
> > the use of ‘=‘ by something else that would satisfy that restriction).
> >
> > You can knock out a script to process each email to a csv file, and
> > then import that to your Access db.
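
The convention described above could be processed with something like the
following plain-Java sketch. It is only an illustration under stated
assumptions: the class and method names are made up, the boundary is taken to be
a line of ten or more '+' or '*' characters, lines without exactly one '=' are
skipped, and reading the mail itself and the import into Access are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class FeedbackMailSketch {

    /** A boundary is a line made of at least ten '+' or at least ten '*' characters. */
    static boolean isBoundary(String line) {
        String t = line.trim();
        return t.matches("\\+{10,}") || t.matches("\\*{10,}");
    }

    /** Quotes a value for CSV, doubling any embedded double quotes. */
    static String csvQuote(String s) {
        return "\"" + s.replace("\"", "\"\"") + "\"";
    }

    /** Turns the "Field = Value" lines between the first pair of boundaries into CSV rows. */
    static List<String> toCsvRows(String mailBody) {
        List<String> rows = new ArrayList<>();
        boolean inside = false;
        for (String line : mailBody.split("\\R")) {
            if (isBoundary(line)) {
                if (inside) {
                    break; // closing boundary reached
                }
                inside = true;
                continue;
            }
            if (!inside) {
                continue; // ordinary mail text before the block
            }
            int eq = line.indexOf('=');
            if (eq < 0 || eq != line.lastIndexOf('=')) {
                continue; // the convention allows exactly one '=' per entry
            }
            String field = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            if (!field.isEmpty()) {
                rows.add(csvQuote(field) + "," + csvQuote(value));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // RecordNumber is a hypothetical field name; CompanyXYZ is the one mentioned in the thread.
        String body = "Hi Al,\n++++++++++\nRecordNumber = 42\nCompanyXYZ = looks fine\n++++++++++\nCheers";
        for (String row : toCsvRows(body)) {
            System.out.println(row);
        }
    }
}
```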
> >
> > Regards,
> > Ken Bowen
> >
> > On Mar 16, 2016, at 9:23 PM, Al Grant  wrote:
> >
> > > Hi All
> > >
> > > This might be slightly OT - but the list was so helpful in the past...
> > >
> > > I have a database in MS Access and a standalone Java applet that
> > > imports data from a PDF form and scrapes it into the corresponding
> > > fields in the Access database. (Thanks to the list for help on this!)
> > >
> > > A report from this Access database then goes out via email to a handful
> > > of people in other companies, and they need a way to provide feedback
> > > into the database.
> > >
> > > The question is how to achieve this feedback?
> > >
> > > It is difficult because I am working within a number of constraints:
> > >
> > > 1. We are all working behind large corporate firewalls;
> > > 2. I have Office 2007 installed;
> > > 3. Our shared mailboxes are accessed only via a web interface (not OWA -
> > > I think Lotus)
> > >
> > > Getting ports on the firewall, VPNs etc. is not an option, nor are cloud
> > > services like Dropbox or Amazon, so I think I am stuck with email as the
> > > transport medium.
> > >
> > > Solutions?
> > >
> > > Cheers
> > >
> > > -AL
> > >
> > > --
> > > "Beat it punk!"
> > > - Clint Eastwood
> >
> >
> >
> >
>


Re: Strange "Save As" issue with Adobe Reader 11 / DC with PDF being encrypted by PDFBox 2 snapshot

2016-03-18 Thread Tilman Hausherr
No need to, I have already found a smaller file that brings the same 
effect with the Encrypt command line tool. But it does not happen with 
every file. I'll open an issue and post it here.


Tilman

On 17.03.2016 at 20:05, Stahle, Patrick wrote:

Would you like the original PDF from before my encryption, for comparison, and
the simplest code sample I have? I can upload that one too.

Also, while debugging my code I noticed that if I didn't use the initializing
constructor, my set calls didn't seem to change the value of the actual
permission bytes.

Ex.
AccessPermission apermission = new AccessPermission();
apermission.setCanPrint(true);
apermission.setCanModifyAnnotations(true);
apermission.setCanAssembleDocument(true);
apermission.setCanFillInForm(true);
apermission.setCanExtractForAccessibility(true);
apermission.setReadOnly();

Eclipse debugger shows:
bytes= -4
readOnly= true

while:
AccessPermission apermission = new AccessPermission(0);
apermission.setCanPrint(true);
apermission.setCanModifyAnnotations(true);
apermission.setCanAssembleDocument(true);
apermission.setCanFillInForm(true);
apermission.setCanExtractForAccessibility(true);
apermission.setReadOnly();

Eclipse debugger shows:
bytes= 1828
readOnly= true

Which looks correct to me, or at least compares to what I see from iText (minus 
the readOnly bit).
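
The two debugger values can be reproduced with plain bit arithmetic. The sketch
below is not PDFBox code: the class and method names are made up, and the bit
positions are the ones from the user access permissions table of the PDF
standard security handler (bit 3 print, 5 extract, 6 modify annotations, 9 fill
in forms, 10 extract for accessibility, 11 assemble). If the no-argument
constructor starts with every permission granted, which is what the -4 suggests
(-4 is all bits set except the two lowest), then calling the setters with true
cannot change anything, whereas starting from 0 gives 1828.

```java
final class PermissionBitsSketch {

    // 1-based bit positions from the PDF permissions table.
    static final int PRINT = 3;
    static final int MODIFY = 4;
    static final int EXTRACT = 5;
    static final int MODIFY_ANNOTATIONS = 6;
    static final int FILL_IN_FORM = 9;
    static final int EXTRACT_FOR_ACCESSIBILITY = 10;
    static final int ASSEMBLE = 11;

    /** Returns flags with the given 1-based bit set or cleared. */
    static int set(int flags, int bit, boolean allowed) {
        int mask = 1 << (bit - 1);
        return allowed ? (flags | mask) : (flags & ~mask);
    }

    /** Tells whether the given 1-based bit is set in flags. */
    static boolean isSet(int flags, int bit) {
        return (flags & (1 << (bit - 1))) != 0;
    }

    /** Grants the same five permissions as the setter calls quoted in the thread. */
    static int applyThreadSetters(int flags) {
        flags = set(flags, PRINT, true);
        flags = set(flags, MODIFY_ANNOTATIONS, true);
        flags = set(flags, ASSEMBLE, true);
        flags = set(flags, FILL_IN_FORM, true);
        flags = set(flags, EXTRACT_FOR_ACCESSIBILITY, true);
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(applyThreadSetters(~3)); // prints -4: everything was already granted
        System.out.println(applyThreadSetters(0));  // prints 1828 = 4 + 32 + 256 + 512 + 1024
    }
}
```

The same helper can be pointed at a /P value from an encryption dictionary: for
-1044, isSet reports the extract and assemble bits as cleared and the print bit
as set.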

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, March 17, 2016 2:55 PM
To: users@pdfbox.apache.org
Subject: Re: Strange "Save As" issue with Adobe Reader 11 / DC with PDF being 
encrypted by PDFBox 2 snapshot

On 17.03.2016 at 14:21, Stahle, Patrick wrote:

Ok,  I think I have the  file uploaded to the following:
http://wikisend.com/download/381906/tmp_10435-Technical  Data Package
BOMAnarkStampedPDFBox102259703.pdf

The link is hard for me to test since I have to do this all from my phone. If 
it says not found or expired try putting in 381906 from the download page...

Thanks, it worked. I did find some weirdness: parts of the encryption object 
exist twice.

659 0 obj
<<
/Filter /Standard
/V 1
/R 3
/Length 40
/P -1044
/O <92B3A580FEDD525873E5DEA425E75E1B74858FD5C6F5FED7E4C6C39C2E23D2DB>
/U <50D7EE978EFC3D29DAF239DA746CCC2228BF4E5E4E758A4164004E56FFFA0108>
  >>
endobj
660 0 obj
<<
/ID [<3DADA7608D955343B3F967EB90F6801F> <1C65E39CFBD4D44099223D10A9D542B5>]
/Info 13 0 R
/Root 1 0 R
/Encrypt <<
/Filter /Standard
/V 1
/R 3
/Length 40
/P -1044
/O <92B3A580FEDD525873E5DEA425E75E1B74858FD5C6F5FED7E4C6C39C2E23D2DB>
/U <50D7EE978EFC3D29DAF239DA746CCC2228BF4E5E4E758A4164004E56FFFA0108>
  >>
/Type /XRef
/Size 661
/Index [1 659]
/W [1 3 0]
/Filter /FlateDecode
/Length 1881
  >>






Re: Spaces are ignored when reading a PDF file

2016-03-18 Thread John Hewson
I’m rather confused by this thread: inferring spaces is one of the main
features of PDFTextStripper. I’m not sure why anyone is suggesting to process
the text manually; there’s no need to do that. We do that already!

Looking at the original code the problem is right here:

> public class PDFTextStripperProcessor extends PDFTextStripper {
>@Override
>public void processTextPosition( TextPosition text )  {
>System.out.println( text.getCharacter() );
>}
> }

The processTextPosition method is used to pass an unprocessed TextPosition *in* 
to PDFTextStripper, but this override prevents that from happening, and is just 
printing the unprocessed token before PDFTextStripper has had a chance to do 
its job, such as inferring the missing spaces.

You should follow our PrintTextLocations.java example which shows you how to 
get the processed TextPositions from PDFTextStripper. It’s really easy to do.

— John

> On 17 Mar 2016, at 04:44, Hesham G.  wrote:
> 
> Andreas,
> 
> You're absolutely right. I am testing it now, but it seems very complicated. 
> I hope there might be another easier solution.
> 
> 
> Best regards ,
> Hesham
> 
> 
> Included message :
> 
>> "Hesham G." wrote on 17 March 2016 at 11:20:
>> 
>> 
>> Andreas,
>> 
>> That is very helpful.
>> 
>> I can get the x location of each character using TextPosition.getX(), ex:
>> W: 102.88399
>> i: 114.18165
>> t: 117.660614
>> h: 121.55801
>> d: 133.09477
>> u: 140.3994
>> e: 147.60838
>> 
>> So to detect the space between the 2 words "With" & "due" should I
>> subtract the X of the last letter (h) from the X of the first letter (d),
>> and if the number is larger than normal then this is a space? I think this
>> way might be risky in the detection, or what?
> That's the short story. Deciding what is normal can be quite tricky. You
> have to take the following facts into account:
> 
> - different fonts have different widths (important if the font before the
> space isn't the same as the font after the space)
> - keep in mind that you have to take scaling and sometimes rotation into
> account
> - the "space" between characters may vary if the text is justified
> 
> There are certainly some other details which may be important as well, so
> that you end up with a more or less heuristic approach.
> 
> BR
> Andreas
> 
>> Best regards ,
>> Hesham
>> 
>> 
>> Included message :
>> 
>> Hi,
>> 
>> > Frank van der Hulst wrote on 17 March 2016 at 08:34:
>> >
>> >
>> > Spaces don't exist as characters in PDFs. To identify spaces, you have to
>> > compare the X coordinates of adjacent characters against their widths.
>> That's not correct: spaces exist, but in most cases PDF engines omit them
>> and replace them with split text and appropriate positioning.
>> 
>> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>> 
>>   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
>>   (Article) -384 (\(219\),) -416 (the) -384 (competent) -383 (authority)
>>   -383 (has) -384 (the) -383 (right) ] TJ
>> 
>> The text is between the parentheses and the numbers are used for horizontal
>> positioning.
>> 
>> BR
>> Andreas
>> 
>> >
>> > On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. wrote:
>> >
>> > > Hello ,
>> > >
>> > > I have a PDF file created using Latex. I am trying to read and print
>> > > all letters in that file using PDFBox, but when doing this all spaces
>> > > in that file are ignored. Here is the code I am using:
>> > > PDPage page = (PDPage)allPages.get( 0 );
>> > > PDStream contents = page.getContents();
>> > > if ( contents != null ) {
>> > > PDFTextStripperProcessor pdfTextStripperProcessor = new
>> > > PDFTextStripperProcessor();
>> > > pdfTextStripperProcessor.processStream( page, page.findResources(),
>> > > contents.getStream() );
>> > > }
>> > >
>> > > public class PDFTextStripperProcessor extends PDFTextStripper {
>> > > @Override
>> > > public void processTextPosition( TextPosition text )  {
>> > > System.out.println( text.getCharacter() );
>> > > }
>> > > }
>> > >
>> > > And you can check a one page file sample here to test it:
>> > >
>> > > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>> > >
>> > > What is the cause of this issue please?
>> > >
>> > >
>> > > Best regards ,
>> > > Hesham
>> 
>> -
>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>> 
>> 
>> -
>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>> For additional commands, e-mail: users-h...@pdfb