Hi,
If all your files are like that, just drop the spaces and base your
extraction on positions only. In any case, there is no guarantee that a
PDF contains space characters between two words.
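
As a rough illustration of "extraction on positions only", here is a
minimal sketch in plain Java. It is not PDFBox code: the Glyph class, the
rebuild method and the gapFactor threshold are hypothetical names made up
for this example. In a PDFTextStripper subclass the same three values
would presumably come from TextPosition (getUnicode(), getXDirAdj(),
getWidthDirAdj()). The idea is to ignore every blank glyph and to insert a
space only where the horizontal gap between two visible glyphs is large.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionalText {

    /** Hypothetical stand-in for the few TextPosition values used here. */
    public static final class Glyph {
        final String text;
        final float x;
        final float width;

        public Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Rebuilds one line of text from glyph positions alone.
     * gapFactor is an assumed tuning value: a space is emitted when the gap
     * to the previous glyph exceeds gapFactor times that glyph's width.
     */
    public static String rebuild(List<Glyph> glyphs, float gapFactor) {
        // Drop every blank glyph; the file's own spaces are not trusted.
        List<Glyph> visible = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (!g.text.trim().isEmpty()) {
                visible.add(g);
            }
        }
        // Order the remaining glyphs from left to right.
        visible.sort(Comparator.comparingDouble(g -> g.x));

        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : visible) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: blanks sit on top of characters, as in the bad line.
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("0", 10.0f, 4.8f));
        line.add(new Glyph(" ", 10.0f, 7.2f));
        line.add(new Glyph("3", 14.8f, 4.8f));
        line.add(new Glyph(" ", 14.8f, 7.2f));
        line.add(new Glyph("-", 19.6f, 4.8f));
        line.add(new Glyph("0", 24.4f, 4.8f));
        line.add(new Glyph("9", 29.2f, 4.8f));
        line.add(new Glyph("3", 43.6f, 4.8f)); // a real gap before this glyph
        System.out.println(rebuild(line, 0.5f)); // prints "03-09 3"
    }
}
```

The threshold of half a glyph width is only a guess and would need tuning
against the real files.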
Tilman
On 29.03.2016 at 19:36, Joel Hirsh wrote:
I understand, but is there anything I can do in my code to get the string
as shown in ExtractText?
I am subclassing PDFTextStripper, similar to what is done
in PrintTextLocations, and the string coming into writeString(String
string, List<TextPosition> textPositions) is the one where all the spaces
occur.
Thanks
On Tue, Mar 29, 2016 at 10:03 AM, Tilman Hausherr <[email protected]>
wrote:
Here's what I got with ExtractText command line application:
______
______ 03-09 3,411.69
ELECTRONIC DEPOSIT FDMS-SETTLEMENT DEPOSIT 376249462999
03-10 1,645.22 ELECTRONIC DEPOSIT FDMS-SETTLEMENT DEPOSIT
376249462999
However I think I understand the cause of your problem, because there's
output like this:
String[461.20358,340.904 fs=1.0 xscale=1.0 height=4.44 space=4.7999997
width=4.799988]6
String[461.20428,340.904 fs=1.0 xscale=1.0 height=6.48 space=7.2
width=7.200012]
i.e. space and a character at the same place. See this content stream:
BT
0 0 0 rg
/F0 1 Tf
1 0 0 1 29.204 460.096 Tm
( ______ ) Tj
1 0 0 1 29.204 451.096 Tm
( ______ ) Tj
/F1 1 Tf
1 0 0 1 29.204 451.096 Tm
( 03-09 3,411.69 ELECTRONIC DEPOSIT FDMS-SETTLEMENT
DEPOSIT 376249462999 ) Tj
1 0 0 1 29.204 442.096 Tm
( 03-10 1,645.22 ELECTRONIC DEPOSIT FDMS-SETTLEMENT
DEPOSIT 376249462999 ) Tj
ET
There are two lines that start at the same position 29.204 451.096, one
with blanks and one with text. That is a bug in the software that created
the file.
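
To check other files for the same pattern, something like the following
sketch could be used. It is plain Java with a deliberately naive,
line-based scan that only understands the layout of the content stream
shown above (one "a b c d e f Tm" operator per line). The class and method
names are made up for this example, and a real content stream would need a
proper tokenizer instead of splitting on line breaks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateTextOrigins {

    /** Returns every "x y" text origin that is set by more than one Tm operator. */
    public static List<String> findDuplicates(String contentStream) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : contentStream.split("\\R")) {
            String[] tokens = line.trim().split("\\s+");
            // A text matrix line has six operands followed by the Tm operator;
            // operands five and six are the x and y translation.
            if (tokens.length == 7 && tokens[6].equals("Tm")) {
                counts.merge(tokens[4] + " " + tokens[5], 1, Integer::sum);
            }
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Shortened copy of the content stream quoted above.
        String stream = String.join("\n",
                "BT",
                "/F0 1 Tf",
                "1 0 0 1 29.204 460.096 Tm",
                "( ______ ) Tj",
                "1 0 0 1 29.204 451.096 Tm",
                "( ______ ) Tj",
                "/F1 1 Tf",
                "1 0 0 1 29.204 451.096 Tm",
                "( 03-09 3,411.69 ) Tj",
                "1 0 0 1 29.204 442.096 Tm",
                "( 03-10 1,645.22 ) Tj",
                "ET");
        System.out.println(findDuplicates(stream)); // prints [29.204 451.096]
    }
}
```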
Tilman
On 29.03.2016 at 18:48, Joel Hirsh wrote:
I thought it was attached to the first email, but it is also available at
https://www.dropbox.com/s/btqwaxfsubt3rwx/extra%20spaces.pdf?dl=0
On Tue, Mar 29, 2016 at 9:13 AM, Tilman Hausherr <[email protected]>
wrote:
Please upload that file somewhere.
Tilman
On 29.03.2016 at 17:24, Joel Hirsh wrote:
I have a couple of PDF files that have this problem. These are
multi-page PDF files, and on one page (the first) there are a few lines
that get extra spaces between almost every character as seen from
PrintTextLocations.
Attached is a snippet from one of those files, the first line has the
problem, the second line does not.
In this file, the first line gets a string that is
0 3- 09 3 ,4 1 1. 6 9 EL E CT R ON I C D EP O SI T
F DM S -S E TT L EM E NT D E PO S IT 37 6 24 9 46 2 99
9
While the second line gets the text without any extra spaces.
The two lines also have different spacing values as reported by
PrintTextLocations. In the full file, all the good lines have one value,
the bad lines a different value.
I cannot see any difference between the two lines in Acrobat, when
copying and pasting, or when editing in Nitro.
This problem shows up in 2.0.0 and the latest 2.0.1 snapshot, and in some
older versions I tried as well (i.e. I don't think it is any kind of
regression).
Thanks
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]