Hi, is there a way to work around this problem with text being corrupted when it is extracted from a PDF file using Tika 0.9?

$ java -jar pdfbox-app-1.4.0.jar ExtractText -console 1471-2105-9-379.pdf | grep doi
BMC Bioinformatics 2008, 9:379 doi:10.1186/1471-2105-9-379

$ java -jar tika-app-0.9.jar 1471-2105-9-379.pdf | grep doi
BMC Bioinformatics 2008, 9:379 doi:10 Accepted: 18 Se.1186/1471-2105-9-379 ptember 2008

The PDF is from http://www.biomedcentral.com/content/pdf/1471-2105-9-379.pdf

When using Tika the DOI is jumbled up with other text. I can use PDFBox directly but am happily using Tika for everything else!

Any help greatly appreciated,
Mark.
