[ https://issues.apache.org/jira/browse/TIKA-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-363.
--------------------------------

       Resolution: Duplicate
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

I believe the RDF pattern is being triggered by the embedded XMP metadata included in the PDF file.

I tested this with Tika 0.6, where the file is correctly detected as application/pdf, so I believe the problem has already been solved as part of another issue.

> PDF Content Type seen as application/rdf+xml not appliction/pdf
> ---------------------------------------------------------------
>
>                 Key: TIKA-363
>                 URL: https://issues.apache.org/jira/browse/TIKA-363
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.5
>         Environment: JDK 1.5, Windows XP, Adobe Acrobat Pro 8, Luke 0.9.9,
>                      tika-app-0.5.jar, Eclipse 4.2, Lucene In Action, Second
>                      Edition source code (TikaIndexer.java)
>            Reporter: Tim Reynolds
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: TikaData.zip
>
>
> I am using TikaIndexer.java from the source code of Lucene In Action, Second
> Edition, to index PDF files. Most PDF files work fine, as verified with Luke
> (0.9.9), but some files show a content type of application/rdf+xml instead of
> application/pdf, and therefore show no metadata in Luke.
>
> The PDF files that show application/rdf+xml were opened in Adobe Acrobat
> Pro 8. Highlights, bookmarks and notes were added to the files; this was done
> several times, with many saves. Acrobat can read these files without problems.
> The original PDFs show application/pdf; the modified files show
> application/rdf+xml.
>
> If I open the PDF files in my editor (Vim), I do see some CR+LF strangeness.
> Both the good and the "bad" files have
>
> 0000000: 2550 4446 2d31 2e36 0d25 e2e3 cfd3 0d0a  %PDF-1.6.%......
>
> as the first line, but the "bad" file doesn't have another 0d0a until
>
> 0001210: 6574 2065 6e64 3d22 7722 3f3e 0d0a 656e  et end="w"?>..en
>
> Up to that point I do see some 0d (CR) bytes, but no CR+LF. It may be that
> something is getting confused because it sees this very long line. Why the
> file stops using CR+LF I don't know. I assume this confusion then leads Tika
> to guess that this is an rdf+xml file.
>
> I see the following bug in Tika: "Mime type application/rdf+xml not correctly
> detected" [#TIKA-309], but it says it is fixed in 0.5, which I am using.

-- 
This message is automatically generated by JIRA.
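
To illustrate the explanation in the resolution comment, here is a minimal sketch of how a PDF with an XMP packet near the start of the file could be reported as application/rdf+xml. It is not Tika's actual detection code. The class `MagicSketch`, its methods, the 8192-byte scan window and the sample bytes are all made up for illustration; only the first 16 bytes of the sample are taken from the hex dump in the report. The `et end="w"?>` fragment at offset 0x1210 in that dump looks like the tail of an XMP packet trailer (`<?xpacket end="w"?>`), which would be consistent with the XMP explanation.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical names, not Tika's API or implementation.
final class MagicSketch {
    // Assumed scan window for the RDF pattern; not taken from Tika's configuration.
    static final int RDF_WINDOW = 8192;

    // ISO-8859-1 maps each char below U+0100 to the byte with the same value.
    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // True if `pattern` occurs in `data` starting at any offset in [from, to].
    static boolean matchesInRange(byte[] data, byte[] pattern, int from, int to) {
        int last = Math.min(to, data.length - pattern.length);
        for (int start = Math.max(from, 0); start <= last; start++) {
            int i = 0;
            while (i < pattern.length && data[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return true;
            }
        }
        return false;
    }

    // The PDF header is anchored at offset 0.
    static boolean isPdf(byte[] data) {
        return matchesInRange(data, bytes("%PDF-"), 0, 0);
    }

    // The RDF root element may appear anywhere in the assumed window.
    static boolean hasRdf(byte[] data) {
        return matchesInRange(data, bytes("<rdf:RDF"), 0, RDF_WINDOW);
    }

    // Checks the ranged RDF pattern first: embedded XMP wins over the PDF header.
    static String detectRangeFirst(byte[] data) {
        if (hasRdf(data)) return "application/rdf+xml";
        if (isPdf(data)) return "application/pdf";
        return "application/octet-stream";
    }

    // Checks the anchored PDF header first: embedded XMP no longer matters.
    static String detectAnchoredFirst(byte[] data) {
        if (isPdf(data)) return "application/pdf";
        if (hasRdf(data)) return "application/rdf+xml";
        return "application/octet-stream";
    }

    // Made-up fragment (not a complete PDF): the header bytes from the report,
    // followed by a small XMP packet that uses bare CR until its trailer.
    static byte[] samplePdfWithXmp() {
        return bytes("%PDF-1.6\r%\u00e2\u00e3\u00cf\u00d3\r\n"
            + "1 0 obj\r<</Type/Metadata/Subtype/XML>>stream\r"
            + "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\r"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\r"
            + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\r"
            + "</rdf:RDF>\r</x:xmpmeta>\r"
            + "<?xpacket end=\"w\"?>\r\nendstream\rendobj\r");
    }

    public static void main(String[] args) {
        byte[] pdf = samplePdfWithXmp();
        System.out.println(detectRangeFirst(pdf));    // application/rdf+xml
        System.out.println(detectAnchoredFirst(pdf)); // application/pdf
    }
}
```

In this sketch the line endings play no role: the result depends only on which pattern is tried first, so the long CR-only run described in the report would be a side effect of how Acrobat wrote the metadata stream rather than the cause. Whether Tika 0.5 and 0.6 differ in exactly this way is an assumption and is not confirmed by this thread.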