Hello Andreas,

>> Either way, these TeX-created documents seem to present specific challenges
>> for PDFBox. Since we need to make these files available for full-text
>> search, we would be very happy if their text extraction could be improved.
>> I'm ready to help with tests and examples; I am afraid my lack of experience
>> in Java limits my direct help in the development of the code.

> I guess I've fixed most of those issues. There are only a few mappings
> missing, but I'm sure we will find and add them by and by.

Yes, it already looks much better. I reopened issue 728 (as a minor bug)
because of additional characters that were not recognised. I don't know
whether TeX uses some standard encoding for those characters or whether this
depends in some way on the TeX environment, so it is possible that not much
can be done.

On issue 729, you write:

"If Type3 fonts are used within a document we should skip the extraction of
those text parts to avoid a scrambled output."

and I tend to agree; it would definitely be more helpful to have an error
message than some gobbledegook. What puzzles me is the relation with my file
in issue 534: amapn19_03.pdf uses Type 3 fonts and the conversion results in
something of the form

0a2a1a4a3a6…

while text extracted from wias_preprints_1427.pdf from issue 729 looks like

CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8

The former looks like indices into some hidden table; if one could somehow
find this table, or invest time and patience, it looks like it could be
decoded. Or is this an error, and those indices are just arbitrary numbers
without any relation to the actual text? The latter, on the other hand, does
not look like it could be decoded in a meaningful way.

In principle, TeX authors shouldn't use Type 3 fonts nowadays, see e.g.
http://embs.papercept.net/conferences/support/tex.php
(similarly for Springer journals), but they obviously are still around.

Also, I am wondering if some expertise from the TeX community might help,
see e.g.
http://www.tex.ac.uk/cgi-bin/texfaq2html?label=pkfix

Thanks a lot,
Thomas

