I worked on the issue I created some time ago, 'PDF file created with LaTeX is bad
parsed'. I suspect it's bad font encoding detection, or an unsupported encoding.
https://issues.apache.org/jira/browse/PDFBOX-534

The problem description:
-------------------------------------------------------
I'm getting unexpected behavior when parsing a PDF file.

I'm trying to get the clean body text of a file, and I get a lot of
aXX strings, where each X is a digit. The number appears to be the char code of
the real character, but I don't really know.

My code is very simple:

  String[] args = {"/home/ernesto/tesis/documento/kvfs.pdf"};
  ExtractText.main(args);


I used PDFBox version 0.8.0-incubator, built on 12/12/2009.
The output I get is:
a73a109a112a108a101a109a101a110a116a97a110a100a111 a97a99a99a101a115a111 a97 
a115a105a115a116a101a109a97a115 a100a101
a97a114a99a104a105a118a111a115 a118a105a114a116a117a97a108a101a115
a112a97a114a97 a108a97 a104a101a114a114a97a109a105a101a110a116a97
a100a101 a98a250a115a113a117a101a100a97 a75a110a101a111a98a97a115a101
and more ......
-----------------------------------------------------------------
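The guess that each aXX token carries a decimal character code can be checked with a few lines of Java. This is only a sketch for inspecting the output above, not a fix and not part of PDFBox; the class name GlyphNameDecoder and its decode method are hypothetical, and it assumes every "a" followed by up to three digits is such a token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GlyphNameDecoder {
    // Assumption: a token is the letter 'a' followed by 1 to 3 decimal digits.
    private static final Pattern TOKEN = Pattern.compile("a(\\d{1,3})");

    // Replaces every aNN token with the character whose code is NN.
    static String decode(String text) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char ch = (char) Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // First two words of the extracted output quoted above.
        String sample = "a73a109a112a108a101a109a101a110a116a97a110a100a111 "
                + "a97a99a99a101a115a111";
        System.out.println(decode(sample)); // prints "Implementando acceso"
    }
}
```

Decoded this way, the first words match the title of the document, and code 250 corresponds to the accented u in "busqueda", which supports the char-code reading.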

Now I have debugged it and tested some alternatives:


I found the cause of the problem, but not the solution. It's either bad font encoding detection or an unsupported encoding.
Debugging the PDFBox classes, I found the lines that encode the characters,
where the character is read wrongly. Look at these lines:
Class PDFont, method String encode( byte[] c, int offset, int length ), line
438.

438            Encoding encoding = getEncoding();
439            if( encoding != null)
440            {
441                retval = encoding.getCharacter( getCodeFromArray( c, offset, length ) );
442            }
443            if( retval == null )
444            {
445                retval = getStringFromArray( c, offset, length );
446            }
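
In the excerpt above, the fallback on line 445 only runs when the encoding lookup returns null. The following is a minimal, self-contained sketch of that control flow. It is not PDFBox code: the class EncodeFlowSketch, the ENCODING map, and the char cast standing in for getStringFromArray are all hypothetical and only mimic the shape of the excerpt.

```java
import java.util.HashMap;
import java.util.Map;

class EncodeFlowSketch {
    // Hypothetical stand-in for the font's encoding table: code -> name.
    static final Map<Integer, String> ENCODING = new HashMap<Integer, String>();
    static {
        ENCODING.put(105, "a105"); // the kind of value seen in the extracted text
    }

    // Same shape as the excerpt: lookup first, raw fallback only on null.
    static String encode(byte[] c, int offset) {
        int code = c[offset] & 0xFF;
        String retval = ENCODING.get(code);
        if (retval == null) {
            // Stand-in for getStringFromArray: use the byte value directly.
            retval = String.valueOf((char) code);
        }
        return retval;
    }

    public static void main(String[] args) {
        System.out.println(encode(new byte[] { 105 }, 0)); // mapped, fallback skipped
        System.out.println(encode(new byte[] { 106 }, 0)); // unmapped, fallback used
    }
}
```

Because the lookup returns a non-null string for code 105, the sketch returns "a105" and never reaches the fallback that would have produced 'i'.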

On the first line, getEncoding() returns an org.apache.pdfbox.encoding.DictionaryEncoding, so execution enters the if on line 439, and the getCharacter method returns an aXX string. The second if (line 443) is therefore skipped, but I evaluated the getStringFromArray method by hand and it returns a normal character such as 'i'.

I then tried two approaches. The first was to understand what is wrong with my font encoding and what is generating it. My PDF is generated by LaTeX, and I found that the package \usepackage[T1]{fontenc} is used for European accented characters; I am using it. I removed this line from my LaTeX source file and generated the PDF again. When I ran the PDFBox test again, I got a better result:
Implementando acceso a sistemas de
archivos virtuales para la herramienta
de b usqueda Kneobase
Alumno: Ernesto De Santis
Director: Pablo Ernesto Mart  nez L opez

But WITHOUT the accented characters.
The second approach was to use getStringFromArray instead of encoding.getCharacter in
the PDFBox source, after restoring the LaTeX source to the original. I did that, but
the result was similar, with broken accented characters:

Implementando acceso a sistemas de
archivos virtuales para la herramienta
de b?squeda Kneobase
Alumno: Ernesto De Santis
Director: Pablo Ernesto Mart?nez L?pez
--
Blog about our lives in Rio de Janeiro (Fernanda and Ernesto):
http://www.fernandayernesto.blogspot.com/

