Tika and PDFBox text extraction

Woodbridge, Mark R Fri, 25 Feb 2011 16:30:34 -0800

Hi,

Is there a way to work around this problem with text being corrupted when 
extracted from a PDF file using Tika 0.9?


$ java -jar pdfbox-app-1.4.0.jar ExtractText -console 1471-2105-9-379.pdf | grep doi
BMC Bioinformatics 2008, 9:379 doi:10.1186/1471-2105-9-379

$ java -jar tika-app-0.9.jar 1471-2105-9-379.pdf | grep doi
BMC Bioinformatics 2008, 9:379 doi:10 Accepted: 18 Se.1186/1471-2105-9-379 ptember 2008

The PDF is from http://www.biomedcentral.com/content/pdf/1471-2105-9-379.pdf

When using Tika, the DOI is jumbled up with other text: the neighbouring 
"Accepted: 18 September 2008" line is interleaved with it. I can use PDFBox 
directly, but I am happily using Tika for everything else!
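
To make the difference between the two outputs easy to test for, here is a 
minimal sketch that checks whether an extracted line still contains an intact 
DOI. The class name DoiCheck, the method findDoi and the regular expression 
are my own assumptions for illustration, not part of Tika or PDFBox, and the 
Tika output is treated as a single line that was only wrapped by the mail 
client.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoiCheck {
    // Assumed DOI shape for this check: "10.", a numeric registrant code,
    // "/", then a suffix with no whitespace. Real DOIs can be looser.
    private static final Pattern DOI =
            Pattern.compile("doi:(10\\.\\d{4,9}/\\S+)");

    /** Returns the DOI found in the line, or null if no intact DOI is present. */
    public static String findDoi(String line) {
        Matcher m = DOI.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The two lines reported above, copied from the grep output.
        String fromPdfBox =
                "BMC Bioinformatics 2008, 9:379 doi:10.1186/1471-2105-9-379";
        String fromTika =
                "BMC Bioinformatics 2008, 9:379 doi:10 Accepted: 18 Se"
                + ".1186/1471-2105-9-379 ptember 2008";

        System.out.println("PDFBox: " + findDoi(fromPdfBox));
        System.out.println("Tika:   " + findDoi(fromTika));
    }
}
```

With the PDFBox line the DOI is found; with the Tika line it is not, because 
"doi:10" is followed by the interleaved text rather than ".1186/...".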

Any help greatly appreciated,

Mark.
