PDF Content Type seen as application/rdf+xml not application/pdf
---------------------------------------------------------------

                 Key: TIKA-363
                 URL: https://issues.apache.org/jira/browse/TIKA-363
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.5
         Environment: JDK 1.5, Windows XP, Adobe Acrobat Pro 8, Luke 0.9.9, tika-app-0.5.jar, Eclipse 4.2, Lucene In Action, Second Edition source code (TikaIndexer.java)
            Reporter: Tim Reynolds
            Priority: Minor

I am using TikaIndexer.java from the source code of Lucene In Action, Second Edition to index PDF files. Most PDF files work fine, as verified with Luke (0.9.9), but some files show a content type of application/rdf+xml instead of application/pdf, and therefore show no metadata in Luke.

The PDF files that show application/rdf+xml had been opened in Adobe Acrobat Pro 8. Highlights, bookmarks, and notes were added to them over several sessions, with many saves. Acrobat can read these files without problems. The original PDFs show application/pdf; the modified files show application/rdf+xml.

If I open the PDF files in my editor (Vim), I do see some CR+LF strangeness. Both the good and the "bad" files have

    0000000: 2550 4446 2d31 2e36 0d25 e2e3 cfd3 0d0a  %PDF-1.6.%......

as the first line, but the "bad" file does not have another 0d0a (CR+LF) until

    0001210: 6574 2065 6e64 3d22 7722 3f3e 0d0a 656e  et end="w"?>..en

Up to that point I see some 0d (CR) bytes, but no CR+LF. Maybe something is getting confused by this very long line. I don't know why the file stops using CR+LF. I assume this confusion then leads Tika to guess that this is an rdf+xml file.

I see the following related bug in Tika: "Mime type application/rdf+xml not correctly detected" [#TIKA-309]. It is marked as fixed in 0.5, which is the version I am using.
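
For anyone who wants to repeat the hex-dump comparison without an editor, here is a minimal sketch that checks whether a file starts with the "%PDF-" signature and lists where CR+LF pairs occur near the start. The class name PdfHeaderCheck and the 8 KiB scan limit are made up for illustration; this is not part of Tika or of the Lucene In Action sources, it does not reproduce Tika's detection logic, and it assumes a Java 7 or later runtime rather than the JDK 1.5 listed in the environment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for inspecting the start of a PDF file; not a Tika API. */
public class PdfHeaderCheck {

    // The five ASCII bytes every PDF header begins with: 25 50 44 46 2d.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with "%PDF-" at offset 0. */
    public static boolean startsWithPdfMagic(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the offset of every CR+LF (0d 0a) pair within the first `limit` bytes. */
    public static List<Integer> crlfOffsets(byte[] data, int limit) {
        List<Integer> offsets = new ArrayList<>();
        int end = Math.min(limit, data.length);
        for (int i = 0; i + 1 < end; i++) {
            if (data[i] == 0x0d && data[i + 1] == 0x0a) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfHeaderCheck <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("starts with %PDF-: " + startsWithPdfMagic(data));
        // 8192 is an arbitrary illustrative limit, not a value taken from Tika.
        System.out.println("CR+LF offsets in first 8 KiB: " + crlfOffsets(data, 8192));
    }
}
```

Running it on an original file and on its Acrobat-modified copy (for example `java PdfHeaderCheck modified.pdf`, where the file name is a placeholder) should show whether both still begin with the PDF signature and how far apart the CR+LF pairs are in each.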