[ https://issues.apache.org/jira/browse/TIKA-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-363.
--------------------------------

       Resolution: Duplicate
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

I believe the RDF pattern is being triggered by the embedded XMP metadata 
included in the PDF file. I tested this with Tika 0.6 where the file is 
correctly detected as application/pdf, so I believe the problem has already 
been solved as a part of another issue.
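
To illustrate why the order of checks matters here, below is a minimal sketch 
in Java. It is not Tika's detector code: the class MagicSketch, its methods and 
the two byte patterns are hypothetical simplifications chosen for this example. 
It only shows that testing for the %PDF- magic at offset 0 before scanning for 
an RDF marker keeps a PDF with embedded XMP metadata from being reported as 
application/rdf+xml.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika.
final class MagicSketch {
    // Assumed patterns for this sketch: PDF header magic and an RDF root element.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] RDF_MARKER = "<rdf:RDF".getBytes(StandardCharsets.US_ASCII);

    // True if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Index of the first occurrence of needle in data, or -1 if absent.
    static int indexOf(byte[] data, byte[] needle) {
        for (int i = 0; i + needle.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    // The offset-0 magic is checked first, so an RDF marker found later
    // in the stream (for example inside XMP metadata) cannot override it.
    static String detect(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (indexOf(data, RDF_MARKER) >= 0) {
            return "application/rdf+xml";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfWithXmp = "%PDF-1.6\r<x:xmpmeta><rdf:RDF></rdf:RDF></x:xmpmeta>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfWithXmp));
    }
}
```

With the two checks in the opposite order, the same input would be reported 
as application/rdf+xml, which matches the symptom described in this issue.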

> PDF Content Type seen as application/rdf+xml not appliction/pdf
> ---------------------------------------------------------------
>
>                 Key: TIKA-363
>                 URL: https://issues.apache.org/jira/browse/TIKA-363
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.5
>         Environment: JDK 1.5, Windows XP, Adobe Acrobat Pro 8, Luke 0.9.9,
> tika-app-0.5.jar, Eclipse 4.2, Lucene In Action, Second Edition source code
> TikaIndexer.java
>            Reporter: Tim Reynolds
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: TikaData.zip
>
>
> I am using TikaIndexer.java from the source code of Lucene In Action, Second
> Edition, to index PDF files. Most PDF files work fine, as verified by Luke
> (0.9.9), but some files show a content type of application/rdf+xml instead
> of application/pdf, and thus show no metadata in Luke.
> The PDF files that show application/rdf+xml were opened in Adobe Acrobat
> Pro 8. Highlights, bookmarks and notes were added to the files; this was
> done several times, with many saves. Acrobat can read these files without
> problems.
> The original PDFs show application/pdf; the modified files show
> application/rdf+xml.
> If I open the PDF files in my editor (Vim), I do see some CR+LF strangeness.
> Both the good and "bad" files have
> 0000000: 2550 4446 2d31 2e36 0d25 e2e3 cfd3 0d0a %PDF-1.6.%......
> for the first line, but the "bad" file doesn't have another $0d0a until
> 0001210: 6574 2065 6e64 3d22 7722 3f3e 0d0a 656e et end="w"?>..en
> Up until that point I do see some 0d (CR) but no CR+LF. It may be that
> something is getting confused because it sees this very long line. Why the
> file stops using CR+LF I don't know. I assume this confusion then leads Tika
> to guess that this is an rdf+xml file.
> I see the following bug in Tika: "Mime type application/rdf+xml not
> correctly detected" [#TIKA-309], but it says it is fixed in 0.5, which I am
> using.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
