Hi Iain,

That means MIME type detection is then based exclusively on the content,
without the URL and the server content type. There are cases where
both are definitely needed as additional hints, cf. NUTCH-1605.

Maybe it's best to let Tika improve the MIME detectors. There
is still some work ongoing, cf. TIKA-1517.

Instead of the binary property mime.type.magic, one option
could be to configure a (weighted) hierarchy of heuristics, e.g.
 magic > URL pattern > HTTP content type
or just a list of hints to be used.
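
A rough sketch of what such an ordered list of hints could look like.
This is only an illustration: the class HintHierarchy, the enum Hint and
the method detect() are made-up names, nothing like this exists in Nutch
or Tika, and it uses a strict order rather than real weights.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: consult detection hints in a configured order. */
class HintHierarchy {

    /** The three sources of evidence discussed in this thread. */
    enum Hint { MAGIC, URL_PATTERN, HTTP_CONTENT_TYPE }

    /**
     * Returns the guess of the first hint in the given order that has one.
     * A missing or empty entry means that hint has no opinion.
     */
    static String detect(List<Hint> order, Map<Hint, String> guesses) {
        for (Hint hint : order) {
            String type = guesses.get(hint);
            if (type != null && !type.isEmpty()) {
                return type;
            }
        }
        // Assumed fallback when no hint produced a guess.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        Map<Hint, String> guesses = new EnumMap<>(Hint.class);
        guesses.put(Hint.HTTP_CONTENT_TYPE, "text/html");
        guesses.put(Hint.MAGIC, "application/json");

        // magic > URL pattern > HTTP content type
        List<Hint> order = Arrays.asList(
            Hint.MAGIC, Hint.URL_PATTERN, Hint.HTTP_CONTENT_TYPE);
        System.out.println(detect(order, guesses));
    }
}
```

With the order above, the magic guess wins over the server's text/html.
Reordering the list changes which hint is trusted first.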

But it's not that easy, because these hints are often used in combination:
a file with a zip signature and the extension .xlsx is likely to be an Excel
Office Open XML spreadsheet. JSON is similar or even worse:
a '{' (0x7B) at position 0 is only a weak hint:
- it could also be '[' (but less likely)
- RTF also has a '{' at position 0
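
To illustrate the combination problem, here is a minimal sketch that only
trusts a leading '{' or '[' when the file name agrees, and checks the
longer RTF and zip signatures first. The class CombinedHints and its
methods are hypothetical names for illustration, not Nutch or Tika code,
and the plain endsWith() test on the URL is a simplification (it ignores
query strings, for instance).

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical sketch: combine a byte signature with the file extension. */
class CombinedHints {

    /** True if data begins with the given prefix bytes. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns a MIME type guess, or null if the evidence is too weak. */
    static String guess(byte[] data, String url) {
        String name = (url == null) ? "" : url.toLowerCase(Locale.ROOT);
        byte[] zip = {0x50, 0x4B, 0x03, 0x04};                  // "PK" 0x03 0x04
        byte[] rtf = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

        if (startsWith(data, zip)) {
            // The signature only says "zip container"; the extension refines it.
            return name.endsWith(".xlsx")
                ? "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                : "application/zip";
        }
        if (startsWith(data, rtf)) {
            // RTF also starts with '{', so test the longer signature first.
            return "application/rtf";
        }
        boolean jsonStart = data.length > 0 && (data[0] == '{' || data[0] == '[');
        if (jsonStart && name.endsWith(".json")) {
            return "application/json";
        }
        return null; // a lone '{' or '[' is not enough on its own
    }
}
```

In Iain's case the URL presumably does not end in .json, so a check like
this would return null and some other hint would still be needed.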

Sebastian


On 04/15/2015 02:05 PM, Iain Lopata wrote:
> The following change to MimeUtil.java seems to solve my problem:
> 
> //      magicType = tika.detect(data);
>             try {
>                     InputStream in = new ByteArrayInputStream(data);
>                     Metadata meta = new Metadata();
>                     magicType = this.mimeTypes.detect(in, meta).toString();
>                     LOG.debug("Magic Type for " + url + " is " + magicType);
>             } catch (Exception e) {
>                     // Can't complete magic detection; log it instead of failing silently
>                     LOG.debug("Magic detection failed for " + url, e);
>             }
> 
> However, my confidence that I haven’t broken something else is modest at best.
> 
> If this looks like a bug I am happy to create the JIRA entry and submit this 
> as a patch, but before I do so can you tell me if this looks sensible?
> 
> -----Original Message-----
> From: Iain Lopata [mailto:[email protected]] 
> Sent: Tuesday, April 14, 2015 8:43 PM
> To: [email protected]
> Subject: RE: Mimetype detection for JSON
> 
> It seems to me that setting tika-mimetypes.xml in the Nutch configuration 
> causes MimeUtil.java to use the specified file for initial lookup and for URL 
> resolution.  However, when it comes to magic detection, the 
> tika-mimetypes.xml file in the Tika jar file seems to be used instead.  
> 
> If I update the Tika jar with my match rule it works perfectly. If I only 
> place the updated tika-mimetypes.xml file in my Nutch configuration 
> directory, the magic detection does not use my match rule.
> 
> Can anyone familiar with the Tika implementation tell me if there is a way to 
> update Nutch's MimeUtil.java to instantiate Tika to use the configuration 
> file from Nutch?  Or would it be better just to update the configuration file 
> in the Tika jar?
> 
> -----Original Message-----
> From: Iain Lopata [mailto:[email protected]]
> Sent: Tuesday, April 14, 2015 5:32 PM
> To: [email protected]
> Subject: RE: Mimetype detection for JSON
> 
> Thanks Sebastian.
> 
> mime.type.magic is true.
> 
> I don’t have control over the web server, so cannot test with 
> application/javascript
> 
> Time for some deeper debugging it seems.  Will update the list with findings.
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Tuesday, April 14, 2015 4:09 PM
> To: [email protected]
> Subject: Re: Mimetype detection for JSON
> 
> Hi Iain,
> 
>> I have copied tika-mimetypes.xml from the tika jar file and installed 
>> a copy in my configuration directory.  I have updated nutch-site.xml 
>> to point to this file and the log entries indicate that this is being found.
> 
> ... and the property mime.type.magic is true (default)?
> 
> 
>> <mime-type type="application/json">
>>           <sub-class-of type="application/javascript"/>
> 
> Just as a trial: What happens if you make the web server return 
> "application/javascript"
> as content type?
> 
> 
>> I am still getting the content type detected as text/html and the json 
>> parser is not being invoked.  Any suggestions as to what to look at next?
> 
> The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the 
> following resources to Tika:
> - byte stream for magic detection
> - URL for additional file name patterns
> - content type sent by server
> URL and server content type are required as additional hints, e.g., for zip 
> containers such as .xlsx, etc.
> 
> I fear that you have to run a debugger to find out what is going wrong.
> I would also run first Tika alone with the modified tika-mimetypes.xml, just 
> to make sure that the mime magic works as expected.
> 
> Cheers,
> Sebastian
> 
> On 04/13/2015 04:26 PM, Iain Lopata wrote:
>> I have a page that I am fetching that contains JSON and I have a 
>> plugin for parsing JSON.
>>
>>  
>>
>> The server sets a mimetype of "text/html" and consequently my json 
>> parser does not get invoked.
>>
>>  
>>
>> If I run parsechecker from the command line and specify -forceAs 
>> "application/json" the json parser is invoked and works successfully.
>>
>>  
>>
>> So, I believe that if I can get tika to give me "application/json" as 
>> the detected content type for this page, it should work during a crawl.
>>
>>  
>>
>> I have copied tika-mimetypes.xml from the tika jar file and installed 
>> a copy in my configuration directory.  I have updated nutch-site.xml 
>> to point to this file and the log entries indicate that this is being found.
>>
>>  
>>
>> In my copy of tika-mimetypes.xml I have added the match rule shown 
>> below
>>
>>  
>>
>> <mime-type type="application/json">
>>           <sub-class-of type="application/javascript"/>
>>           <magic priority="100">
>>                   <match value="{" type="string" offset="0"/>
>>           </magic>
>>           <glob pattern="*.json"/>
>> </mime-type>
>>
>>  
>>
>> I know that my match is much too broad, but I am using this just while 
>> trying to resolve this problem.
>>
>>  
>>
>> I have also set lang.extraction.policy to identify in nutch-site.xml 
>> (again primarily for testing purposes).
>>
>>  
>>
>> I am still getting the content type detected as text/html and the json 
>> parser is not being invoked.  Any suggestions as to what to look at next?
>>
>>  
>>
>> Thanks!
>>
>>  
>>
>> Iain
>>
>>
> 
> 
> 
> 
