PDF Content Type seen as application/rdf+xml not application/pdf
----------------------------------------------------------------

                 Key: TIKA-363
                 URL: https://issues.apache.org/jira/browse/TIKA-363
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.5
         Environment: JDK 1.5, Windows XP, Adobe Acrobat Pro 8, Luke 0.9.9,
                      tika-app-0.5.jar, Eclipse 4.2, Lucene in Action, Second
                      Edition source code (TikaIndexer.java)
            Reporter: Tim Reynolds
            Priority: Minor


I am using TikaIndexer.java from the source code of Lucene in Action, Second
Edition, to index PDF files. Most PDF files work fine, as verified with Luke
(0.9.9), but some files show a content type of application/rdf+xml instead of
application/pdf, and therefore show no metadata in Luke.

The PDF files that show application/rdf+xml had been opened in Adobe Acrobat
Pro 8. Highlights, bookmarks, and notes were added to them over several
sessions, with many saves. Acrobat can read these files without problems.

The original PDFs show application/pdf; the modified files show
application/rdf+xml.

If I open the PDF files in my editor (Vim), I do see some CR+LF strangeness.
Both the good and the "bad" files have

0000000: 2550 4446 2d31 2e36 0d25 e2e3 cfd3 0d0a %PDF-1.6.%......

for the first line, but the "bad" file doesn't have another 0d 0a (CR+LF) until

0001210: 6574 2065 6e64 3d22 7722 3f3e 0d0a 656e et end="w"?>..en

Up to that point I do see some 0d (CR) bytes, but no CR+LF pairs. It may be
that something gets confused because it sees this as one very long line. I
don't know why the file stops using CR+LF. I assume this confusion then leads
Tika to guess that this is an rdf+xml file.
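
The line-ending observation above can be checked without a hex editor. The
following is only a minimal sketch: the class name LineEndingScan and its
methods are hypothetical and not part of Tika or of the Lucene in Action code,
and it uses java.nio.file, so it needs Java 7 or later rather than the JDK 1.5
listed in the environment. It reports the offsets of the first two CR+LF pairs
and counts the lone CR bytes between them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper (not part of Tika): inspects CR and CR+LF usage in a file. */
final class LineEndingScan {

    /** Returns the offset of the first CR+LF pair at or after 'from', or -1 if there is none. */
    static int firstCrLf(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + 1 < data.length; i++) {
            if (data[i] == 0x0d && data[i + 1] == 0x0a) {
                return i;
            }
        }
        return -1;
    }

    /** Counts CR bytes in [from, to) that are not immediately followed by LF. */
    static int countLoneCr(byte[] data, int from, int to) {
        int end = Math.min(to, data.length);
        int count = 0;
        for (int i = Math.max(from, 0); i < end; i++) {
            boolean followedByLf = i + 1 < data.length && data[i + 1] == 0x0a;
            if (data[i] == 0x0d && !followedByLf) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LineEndingScan <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int first = firstCrLf(data, 0);
        int second = first < 0 ? -1 : firstCrLf(data, first + 2);
        // If there is no second pair, count lone CRs up to the end of the file.
        int limit = second < 0 ? data.length : second;
        int loneCr = first < 0 ? countLoneCr(data, 0, limit) : countLoneCr(data, first + 2, limit);
        System.out.println("first CR+LF offset:  " + first);
        System.out.println("second CR+LF offset: " + second);
        System.out.println("lone CR bytes in between: " + loneCr);
    }
}
```

Running it on a good file and on a "bad" file should show whether the gap
between the first and second CR+LF is really what distinguishes the two.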

I see the following bug in Tika: "Mime type application/rdf+xml not correctly
detected" [TIKA-309], but it says it is fixed in 0.5, which I am using.
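
Another check that might help narrow this down is to look at what a content
sniffer would find near the start of each file. The `et end="w"?>` bytes in
the dump look like the tail of an XMP `<?xpacket end="w"?>` trailer, and XMP
metadata is written as RDF/XML, so an `<rdf:RDF` element may sit close to the
top of the "bad" files. That is an assumption, not something confirmed here.
The sketch below uses a hypothetical class PrefixProbe (not a Tika API) and an
arbitrary 8192-byte window, which is not claimed to match Tika's own limit. It
reports whether the prefix starts with the PDF header and where the RDF marker
first appears.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): looks for PDF and RDF markers in a file prefix. */
final class PrefixProbe {

    /** Arbitrary window size for illustration; not Tika's actual detection limit. */
    static final int PREFIX_LENGTH = 8192;

    /** True if the data begins with the "%PDF-" header bytes. */
    static boolean startsWithPdfHeader(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** Returns the byte offset of the first occurrence of 'marker', or -1 if absent. */
    static int indexOf(byte[] data, String marker) {
        // ISO-8859-1 maps each byte to exactly one char, so char offsets equal byte offsets.
        return new String(data, StandardCharsets.ISO_8859_1).indexOf(marker);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PrefixProbe <file.pdf>");
            return;
        }
        byte[] all = Files.readAllBytes(Paths.get(args[0]));
        byte[] prefix = Arrays.copyOf(all, Math.min(all.length, PREFIX_LENGTH));
        System.out.println("starts with %PDF-: " + startsWithPdfHeader(prefix));
        System.out.println("offset of <rdf:RDF in prefix: " + indexOf(prefix, "<rdf:RDF"));
    }
}
```

If the "bad" files report both a PDF header and an RDF marker in the prefix
while the originals report only the header, that would point at the embedded
metadata rather than the line endings as the thing that differs.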


