Fixed in https://issues.apache.org/jira/browse/PDFBOX-5390
I didn't fix the final CR LF; it is probably part of the paragraph
handling, and I don't see a problem with that.
Tilman
On 16.03.2022 at 19:38, Tilman Hausherr wrote:
Yeah, this is a (minor) bug in TextToPDF, so the extraction would have
to be postprocessed. But you already have the text anyway.
I'll fix this soon.
Tilman
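For readers who need the extracted text to match the input before the fix is released, the post-processing mentioned above could look roughly like the sketch below. It is not part of PDFBox; the function name strip_extraction_padding is made up for illustration, and it assumes the only differences are the trailing spaces and the final line break described in this thread. Note that it also normalizes line endings to LF.

```python
def strip_extraction_padding(text: str) -> str:
    """Drop trailing spaces on every line and the final line break.

    Hypothetical helper, not a PDFBox API. Assumes the extraction differs
    from the input only by per-line trailing spaces and a final CRLF/LF.
    """
    # splitlines() accepts CRLF and LF and yields no extra empty
    # element for a line break at the very end of the text.
    lines = [line.rstrip(" ") for line in text.splitlines()]
    # Joining without a trailing separator leaves the last line unterminated.
    return "\n".join(lines)


if __name__ == "__main__":
    extracted = (
        "def hello_world(): \r\n"
        '    print("Hello World!") \r\n'
        "hello_world() \r\n"
    )
    print(repr(strip_extraction_padding(extracted)))
```

Only spaces are stripped, so any other trailing whitespace such as tabs is left alone.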
On 16.03.2022 at 13:27, flywire wrote:
Can text be extracted without adding trailing space?
*Text.txt*
def hello_world():
    print("Hello World!")
hello_world()
*File ends line above with no CRLF*
java -jar pdfbox-app-2.0.25.jar TextToPDF -standardFont Courier test.pdf test.txt
java -jar pdfbox-app-2.0.25.jar ExtractText test.pdf test1.txt
The output file has a space appended to each line, and the last line has
a CRLF appended.
Using test1.txt as input gives matching output.
Using Win10.
java -jar pdfbox-app-2.0.25.jar WriteDecodedDoc test.pdf test-decoded.txt
%PDF-1.4
%צה
1 0 obj
<<
/Type /Catalog
/Version /1.4
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [3 0 R]
/Count 1
>>
endobj
3 0 obj
<<
/Type /Page
/MediaBox [0.0 0.0 612.0 792.0]
/Parent 2 0 R
/Contents 4 0 R
/Resources 5 0 R
>>
endobj
4 0 obj
<<
/Length 178
>>
stream
/F1 10 Tf
BT
40 763.07751 Td
0 -11.0775 Td
(def hello_world\(\): ) Tj
0 -11.0775 Td
( print\("Hello World!"\) ) Tj
0 -11.0775 Td
( ) Tj
0 -11.0775 Td
(hello_world\(\) ) Tj
ET
endstream
endobj
5 0 obj
<<
/Font 6 0 R
>>
endobj
6 0 obj
<<
/F1 7 0 R
>>
endobj
7 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Courier
/Encoding /WinAnsiEncoding
>>
endobj
xref
0 8
0000000000 65535 f
0000000015 00000 n
0000000078 00000 n
0000000135 00000 n
0000000247 00000 n
0000000478 00000 n
0000000511 00000 n
0000000542 00000 n
trailer
<<
/Root 1 0 R
/ID [<2B2F22A234DF5483D5614CAB282ED31B>
<2B2F22A234DF5483D5614CAB282ED31B>]
/Size 8
>>
startxref
637
%%EOF
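
The decoded content stream above shows where the extra spaces come from: each string passed to the Tj operator already ends with a space before the closing parenthesis. A rough way to check this on a decoded stream is sketched below. It is a simplified illustration, not a PDF parser: the name shown_strings is made up, only literal strings followed by Tj are matched, and only the escapes \(, \) and \\ are undone.

```python
import re

# A literal PDF string followed by the Tj operator. Backslash escapes such
# as \( and \) are allowed inside; nested unescaped parentheses are not.
TJ_PATTERN = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_strings(content_stream: str) -> list[str]:
    """Return the text of every '(...) Tj' operand in a decoded stream.

    Hypothetical helper for illustration only.
    """
    strings = []
    for match in TJ_PATTERN.finditer(content_stream):
        # Undo the escapes \( \) and \\ ; other escapes are left untouched.
        strings.append(re.sub(r"\\([()\\])", r"\1", match.group(1)))
    return strings


if __name__ == "__main__":
    stream = r"""
    BT
    (def hello_world\(\): ) Tj
    0 -11.0775 Td
    (hello_world\(\) ) Tj
    ET
    """
    for text in shown_strings(stream):
        # The second column is True when the string ends with a space.
        print(repr(text), text.endswith(" "))
```

Because the space is part of the string that TextToPDF writes, ExtractText is only reporting what is in the PDF.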
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org