[ 
https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777564#action_12777564
 ] 

Chris A. Mattmann edited comment on TIKA-309 at 11/13/09 5:16 PM:
------------------------------------------------------------------

Yuan-Fang, this turned out to be a tricky one:

#. the first file, http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl
  a. uses XML entities to obfuscate the rdf, rdfs, owl, etc. namespace URIs and 
local names, which defeats rootXML detection.
  b. includes bytes that match the application/xml magic detection, so the 
magic bytes that I put in to detect RDF and OWL never take effect.
#. the second file, http://www.w3.org/2002/07/owl#
  a. is easier to detect, since the URIs and local names are not obfuscated, 
but magic detection still doesn't work out of the box.

I'm going to add in some better magic detection for RDF/OWL files that start 
with <rdf:RDF or <owl:Ontology, as well as better detection based on glob 
patterns and so forth, but in the end, my suggestion for this particular 
problem is:

# use the o.a.tika.detect.NameDetector and set the:

{code}
Metadata.RESOURCE_NAME_KEY
{code}

value before calling (pseudo-code):

{code}
AutoDetectParser parser = new AutoDetectParser();
parser.setDetector(new NameDetector());
Metadata met = new Metadata();
met.set(Metadata.RESOURCE_NAME_KEY, "name or url of your file");
// stream is your InputStream, handler is your ContentHandler
parser.parse(stream, handler, met);
{code}
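
For readers who want to see why setting the resource name helps, here is a minimal stand-alone sketch of name-based detection. It is not Tika's NameDetector and does not use the Tika API; the class name, the method name and the suffix table are all hypothetical, and the mapping of .owl and .rdf to application/rdf+xml is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Tika.
class NameDetectionSketch {
    static final String FALLBACK = "application/octet-stream";

    // Assumed suffix table. Tika keeps its real glob patterns in tika-mimetypes.xml.
    private static final Map<String, String> SUFFIX_TO_TYPE = new LinkedHashMap<>();
    static {
        SUFFIX_TO_TYPE.put(".owl", "application/rdf+xml");
        SUFFIX_TO_TYPE.put(".rdf", "application/rdf+xml");
        SUFFIX_TO_TYPE.put(".xml", "application/xml");
    }

    // Guess a media type from the resource name alone, ignoring the content.
    static String detectByName(String resourceName) {
        if (resourceName == null) {
            return FALLBACK;
        }
        String name = resourceName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : SUFFIX_TO_TYPE.entrySet()) {
            if (name.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(detectByName(
                "http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl"));
    }
}
```

Because the decision depends only on the name, it is not affected by whatever bytes the file starts with. The trade-off is that a name without a recognisable suffix gives the detector nothing to work with.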
 
A commit with my updated tika-mimetypes.xml, plus unit tests showing that your 
examples are correctly detected when the RESOURCE_NAME_KEY is set, is 
forthcoming.
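
As a rough illustration of the magic detection idea mentioned above (content that starts with <rdf:RDF or <owl:Ontology), the following stand-alone sketch checks the first characters of a document. It is not the pattern set from tika-mimetypes.xml; the class and method names are hypothetical, and skipping an optional XML declaration before the check is my own assumption.

```java
// Hypothetical illustration only; not Tika's magic matcher.
class RdfMagicSketch {
    static final String RDF = "application/rdf+xml";
    static final String XML = "application/xml";
    static final String FALLBACK = "application/octet-stream";

    // True when the text begins with one of the two root elements named above.
    private static boolean startsWithRdfRoot(String text) {
        return text.startsWith("<rdf:RDF") || text.startsWith("<owl:Ontology");
    }

    // Classify a document by the first characters of its content.
    static String detect(String header) {
        String text = header.trim();
        if (startsWithRdfRoot(text)) {
            return RDF;
        }
        if (text.startsWith("<?xml")) {
            int end = text.indexOf("?>");
            if (end < 0) {
                return XML; // declaration not closed within the header
            }
            // Assumption: look at whatever follows the XML declaration.
            String rest = text.substring(end + 2).trim();
            return startsWithRdfRoot(rest) ? RDF : XML;
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(detect("<?xml version=\"1.0\"?>\n<rdf:RDF>"));
    }
}
```

A document that places a DOCTYPE with entity declarations between the XML declaration and the root element falls through to application/xml in this sketch, which is one reason a prefix check alone is not enough and the resource name is still worth setting.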

> Mime type application/rdf+xml not correctly detected
> ----------------------------------------------------
>
>                 Key: TIKA-309
>                 URL: https://issues.apache.org/jira/browse/TIKA-309
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.5
>            Reporter: Yuan-Fang Li
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.5
>
>
> Mime type detector using AutoDetectParser and Metadata returns 
> "application/xml" for the URL http://www.w3.org/2002/07/owl#, where it should 
> be "application/rdf+xml". The correct mime type is also suggested here: 
> http://www.w3.org/TR/owl-ref/#MIMEType.
> P.S., Tika was downloaded from svn and built with Maven last week.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
