I worked on the issue I opened some time ago, 'PDF file created with LaTeX is bad
parsed'. I think the cause is bad font encoding detection, or an unsupported encoding.
https://issues.apache.org/jira/browse/PDFBOX-534
The problem description:
-------------------------------------------------------
I'm getting unexpected behavior when parsing a PDF file.
I'm trying to get the clean body text of a file, and I get a lot of
aXX strings, where each X is a digit. They appear to be the character codes of
the real characters, but I don't really know.
My code is very simple:
String[] args = {"/home/ernesto/tesis/documento/kvfs.pdf"};
ExtractText.main(args);
I used PDFBox version 0.8.0-incubator, built on 12/12/2009.
The output I get is:
a73a109a112a108a101a109a101a110a116a97a110a100a111 a97a99a99a101a115a111 a97
a115a105a115a116a101a109a97a115 a100a101
a97a114a99a104a105a118a111a115 a118a105a114a116a117a97a108a101a115
a112a97a114a97 a108a97 a104a101a114a114a97a109a105a101a110a116a97
a100a101 a98a250a115a113a117a101a100a97 a75a110a101a111a98a97a115a101
and more ......
-----------------------------------------------------------------
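To check the guess that each aXX run is the decimal character code of the real character, a small decoder is enough. This is only a minimal sketch for inspecting the output, not a fix inside PDFBox; the class name GlyphCodeDecoder and the method decode are hypothetical, and the pattern assumes codes of one to three digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GlyphCodeDecoder {
    // Matches "a" followed by 1-3 decimal digits, e.g. "a73" or "a250".
    private static final Pattern CODE = Pattern.compile("a(\\d{1,3})");

    // Replaces every aNNN run with the character whose code is NNN.
    static String decode(String text) {
        Matcher m = CODE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char ch = (char) Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // First line of the output quoted above.
        System.out.println(decode(
            "a73a109a112a108a101a109a101a110a116a97a110a100a111 a97a99a99a101a115a111 a97"));
    }
}
```

Run on the first output line this prints "Implementando acceso a", and a98a250a115a113a117a101a100a97 decodes to "búsqueda", so the codes do carry the real characters, accents included. Note that the sketch would also rewrite any genuine text that happens to contain an "a" followed by digits.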
Now I have debugged it and tested some alternatives.
I found the cause of the problem, but not the solution.
It's bad font encoding detection, or an unsupported encoding.
Debugging the PDFBox classes, I found the lines that encode the characters,
where the character is read wrongly. Look at these lines:
Class PDFont, method String encode( byte[] c, int offset, int length ), line
438.
438 Encoding encoding = getEncoding();
439 if( encoding != null)
440 {
441 retval = encoding.getCharacter( getCodeFromArray( c, offset,
length ) );
442 }
443 if( retval == null )
444 {
445 retval = getStringFromArray( c, offset, length );
446 }
On the first line, getEncoding() returns an org.apache.pdfbox.encoding.DictionaryEncoding, so execution enters the if on line 439, and getCharacter returns an aXX string. Because retval is then non-null, the second if (line 443) is skipped. But when I evaluated getStringFromArray by hand, it returned a perfectly normal character like 'i'.
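The control flow described above can be reproduced outside PDFBox. The following is only a sketch under assumptions: the map ENCODING is a hypothetical stand-in for the DictionaryEncoding, both method names are made up, and each character is assumed to be a single byte. It shows why the fallback never runs, and one possible workaround that treats a name of the form aNNN as if no mapping had been found.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class EncodeFallbackSketch {
    // Hypothetical stand-in for the font's DictionaryEncoding (code -> name).
    static final Map<Integer, String> ENCODING = new HashMap<Integer, String>();
    static {
        ENCODING.put(105, "a105"); // the kind of name returned for the byte of 'i'
    }

    // Same shape as the quoted lines: lookup first, raw bytes only on null.
    static String encodeAsQuoted(byte[] c, int offset, int length) {
        String retval = ENCODING.get(c[offset] & 0xFF);
        if (retval == null) {
            retval = new String(c, offset, length, StandardCharsets.ISO_8859_1);
        }
        return retval;
    }

    // Assumed workaround: also fall back when the name looks like aNNN.
    static String encodeWithFallback(byte[] c, int offset, int length) {
        String retval = ENCODING.get(c[offset] & 0xFF);
        if (retval == null || retval.matches("a\\d+")) {
            retval = new String(c, offset, length, StandardCharsets.ISO_8859_1);
        }
        return retval;
    }

    public static void main(String[] args) {
        byte[] data = { 105 };
        System.out.println(encodeAsQuoted(data, 0, 1));     // a105
        System.out.println(encodeWithFallback(data, 0, 1)); // i
    }
}
```

Decoding the raw bytes as ISO-8859-1 is itself an assumption made for the sketch; the real font may use a different byte-to-character mapping, so this is not a general fix.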
Then I tried two approaches. The first was to understand what is wrong with my font encoding and what is generating it. My PDF is generated by LaTeX, and I found that the package \usepackage[T1]{fontenc} is used for European accented characters; I am using it. I removed this line from my LaTeX source file and generated the PDF again. When I ran the PDFBox text extraction again, I got a better result:
Implementando acceso a sistemas de
archivos virtuales para la herramienta
de b usqueda Kneobase
Alumno: Ernesto De Santis
Director: Pablo Ernesto Mart nez L opez
But WITHOUT the accented characters.
Then I restored the original LaTeX source and tried using getStringFromArray instead of
encoding.getCharacter in the PDFBox source. The result was similar, with bad
accented characters:
Implementando acceso a sistemas de
archivos virtuales para la herramienta
de b?squeda Kneobase
Alumno: Ernesto De Santis
Director: Pablo Ernesto Mart?nez L?pez
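I have not verified where the '?' comes from. One possible explanation, and it is only an assumption, is that the accented character is decoded correctly but then passes through a charset that cannot represent it, for example when the text is written to the console. The sketch below shows that effect in isolation; the class AccentLossDemo and the method roundTrip are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AccentLossDemo {
    // Encodes the text with the given charset and decodes it back again.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String word = "b\u00fasqueda"; // "búsqueda"
        // US-ASCII has no 'ú', so getBytes substitutes the byte for '?'.
        System.out.println(roundTrip(word, StandardCharsets.US_ASCII));
        // UTF-8 can represent every character, so the word survives.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8));
    }
}
```

If this is what happens here, the first call reproduces "b?squeda" exactly as in the output above, and the place to look would be the output encoding rather than the font.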
--
Blog de nuestras vidas en Rio de Janeiro (Fernanda y Ernesto):
http://www.fernandayernesto.blogspot.com/