You should check with the Apache PDFBox project, since all of these errors are logged by PDFBox classes. A similar question was raised here: https://issues.apache.org/jira/browse/PDFBOX-940

2013/11/15 Marcello Lorenzi <mlore...@sorint.it>

> Hi,
> during our testing of Apache SOLR 4.3, we have noticed some errors
> occurring during PDF indexing:
>
> ERROR - 2013-11-15 15:14:26.248; org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse predefined CMAP file for 'PDFXC30-Indentity0-UCS2'
> ERROR - 2013-11-15 15:14:36.108; org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse predefined CMAP file for '--UCS2'
>
> and
>
> ERROR - 2013-11-15 15:12:18.928; org.apache.pdfbox.filter.FlateFilter; FlateFilter: stop reading corrupt stream due to a DataFormatException
>
> Could these errors be related to the format of the PDF files?
>
> Thanks,
> Marcello
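
If it helps when reporting this to PDFBox, here is a minimal Python sketch that groups the quoted log entries by the class that logged them. It assumes each entry has the layout `LEVEL - timestamp; logger class; message`, as in the three lines above; the function name `count_by_logger` is just an illustrative helper, not part of Solr or PDFBox.

```python
import re
from collections import Counter

# Log entries copied from the quoted message, one entry per line.
LOG = """\
ERROR - 2013-11-15 15:14:26.248; org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse predefined CMAP file for 'PDFXC30-Indentity0-UCS2'
ERROR - 2013-11-15 15:14:36.108; org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse predefined CMAP file for '--UCS2'
ERROR - 2013-11-15 15:12:18.928; org.apache.pdfbox.filter.FlateFilter; FlateFilter: stop reading corrupt stream due to a DataFormatException
"""

# Assumed layout: LEVEL - timestamp; logger class; message
LINE = re.compile(
    r"^(?P<level>[A-Z]+) - (?P<ts>[^;]+); (?P<logger>[^;]+); (?P<msg>.*)$"
)

def count_by_logger(text):
    """Count log entries per logger class (hypothetical helper)."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[match.group("logger")] += 1
    return counts

counts = count_by_logger(LOG)
for logger, n in counts.most_common():
    print(n, logger)
```

Every logger in the output is under `org.apache.pdfbox`, which is why the PDFBox list or issue tracker is the right place to ask.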