Re: Mimetype detection for JSON

Sebastian Nagel Wed, 15 Apr 2015 15:27:06 -0700

Hi Iain,

that means MIME type detection is done exclusively on the content,
without the URL and the server content type. There are cases where
both provide necessary additional hints, cf. NUTCH-1605.


Maybe it's best to let Tika improve the MIME detectors; there
is still some work ongoing, cf. TIKA-1517.

An option could be to replace the binary mime.type.magic
with a (weighted) hierarchy of heuristics
 magic > URL pattern > HTTP content type
or just a list of hints to be used.
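
As a rough illustration of the "list of hints" idea, here is a minimal
Java sketch. It is not existing Nutch or Tika code: the class
MimeHintOrder, the enum Hint and the method resolve are hypothetical
names, and the sketch only covers a strict ordering, not weights or
combined evidence. The first hint in the configured order that yields
a type wins.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: choose a MIME type from an ordered list of hints. */
public class MimeHintOrder {

    /** The three hint sources discussed in this thread (illustrative names). */
    public enum Hint { MAGIC, URL_PATTERN, HTTP_CONTENT_TYPE }

    /**
     * Returns the type suggested by the first hint in the given order that
     * produced a non-empty result, or empty if no hint produced one.
     */
    public static Optional<String> resolve(List<Hint> order, Map<Hint, String> detected) {
        for (Hint hint : order) {
            String type = detected.get(hint);
            if (type != null && !type.isEmpty()) {
                return Optional.of(type);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed configuration: magic > URL pattern > HTTP content type.
        List<Hint> order = Arrays.asList(
                Hint.MAGIC, Hint.URL_PATTERN, Hint.HTTP_CONTENT_TYPE);

        // Assumed detection results for a JSON page served as text/html.
        Map<Hint, String> detected = new EnumMap<>(Hint.class);
        detected.put(Hint.MAGIC, "application/json");
        detected.put(Hint.HTTP_CONTENT_TYPE, "text/html");

        System.out.println(resolve(order, detected).orElse("application/octet-stream"));
    }
}
```

With this order the magic result overrides the server's text/html.
Putting HTTP_CONTENT_TYPE first in the list would let the server
header win instead.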

But it's not that easy because these are often used in combination:
a zip file (by signature) with the extension .xlsx is likely to be an Excel
Office Open XML spreadsheet. JSON is similar or even worse:
a '{' (0x7B) in position 0 is only a weak hint:
- it could also be '[' (but less likely)
- RTF also has a '{' in position 0
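
To make the ambiguity concrete, here is a small self-contained Java
sketch. It does not use Tika; the class FirstByteHint and its methods
are hypothetical, and the RTF check assumes the usual "{\rtf" prefix
of RTF documents. A first-byte test accepts both a JSON object and an
RTF document, so it cannot separate the two on its own.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: why a first-byte check is only a weak JSON hint. */
public class FirstByteHint {

    /** True if the first byte is '{' (0x7B) or '[' (0x5B). */
    public static boolean looksLikeJsonStart(byte[] data) {
        return data.length > 0 && (data[0] == 0x7B || data[0] == 0x5B);
    }

    /** True if the data begins with the assumed RTF prefix: brace, backslash, "rtf". */
    public static boolean startsWithRtfSignature(byte[] data) {
        byte[] signature = "{\\rtf".getBytes(StandardCharsets.US_ASCII);
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] json = "{\"name\": \"value\"}".getBytes(StandardCharsets.US_ASCII);
        byte[] rtf = "{\\rtf1\\ansi Hello}".getBytes(StandardCharsets.US_ASCII);

        // Both inputs pass the first-byte check.
        System.out.println("json first-byte hint: " + looksLikeJsonStart(json));
        System.out.println("rtf  first-byte hint: " + looksLikeJsonStart(rtf));

        // Only a longer, more specific signature tells them apart.
        System.out.println("json has RTF signature: " + startsWithRtfSignature(json));
        System.out.println("rtf  has RTF signature: " + startsWithRtfSignature(rtf));
    }
}
```

A more specific rule would have to win over the one-byte match, or the
URL and server content type would have to break the tie.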

Sebastian


On 04/15/2015 02:05 PM, Iain Lopata wrote:
> The following change to MimeUtil.java seems to solve my problem:
> 
> //      magicType = tika.detect(data);
>             try {
>                     InputStream in = new ByteArrayInputStream(data);
>                     Metadata meta = new Metadata();
>                     magicType = this.mimeTypes.detect(in, meta).toString();
>                     LOG.debug("Magic Type for " + url + " is " + magicType);
>             } catch (Exception e) {
>                     //Can't complete magic detection
>             }
> 
> However, my confidence that I haven’t broken something else is modest at best.
> 
> If this looks like a bug I am happy to create the JIRA entry and submit this 
> as a patch, but before I do so can you tell me if this looks sensible?
> 
> -----Original Message-----
> From: Iain Lopata [mailto:[email protected]] 
> Sent: Tuesday, April 14, 2015 8:43 PM
> To: [email protected]
> Subject: RE: Mimetype detection for JSON
> 
> It seems to me that setting tika-mimetypes.xml in the Nutch configuration 
> causes MimeUtil.java to use the specified file for initial lookup and for URL 
> resolution.  However, when it comes to magic detection, the 
> tika-mimetypes.xml file in the Tika jar file seems to be used instead.  
> 
> If I update the Tika jar with my match rule it works perfectly. If I only 
> place the updated tika-mimetypes.xml file in my Nutch configuration 
> directory, the magic detection does not use my match rule.
> 
> Can anyone familiar with the Tika implementation tell me if there is a way to 
> update Nutch's MimeUtil.java to instantiate Tika to use the configuration 
> file from Nutch?  Or would it be better just to update the configuration file 
> in the Tika jar?
> 
> -----Original Message-----
> From: Iain Lopata [mailto:[email protected]]
> Sent: Tuesday, April 14, 2015 5:32 PM
> To: [email protected]
> Subject: RE: Mimetype detection for JSON
> 
> Thanks Sebastian.
> 
> mime.type.magic is true.
> 
> I don’t have control over the web server, so cannot test with 
> application/javascript
> 
> Time for some deeper debugging it seems.  Will update the list with findings.
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Tuesday, April 14, 2015 4:09 PM
> To: [email protected]
> Subject: Re: Mimetype detection for JSON
> 
> Hi Iain,
> 
>> I have copied tika-mimetypes.xml from the tika jar file and installed 
>> a copy in my configuration directory.  I have updated nutch-site.xml 
>> to point to this file and the log entries indicate that this is being found.
> 
> ... and the property mime.type.magic is true (default)?
> 
> 
>> <mime-type type="application/json">
>>           <sub-class-of type="application/javascript"/>
> 
> Just as a trial: What happens if you make the web server return 
> "application/javascript"
> as content type?
> 
> 
>> I am still getting the content type detected as text/html and the json 
>> parser is not being invoked.  Any suggestions as to what to look at next?
> 
> The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the 
> following resources to Tika:
> - byte stream for magic detection
> - URL for additional file name patterns
> - content type sent by server
> URL and server content type are required as additional hints, e.g., for zip 
> containers such as .xlsx, etc.
> 
> I fear that you have to run a debugger to find out what is going wrong.
> I would also run first Tika alone with the modified tika-mimetypes.xml, just 
> to make sure that the mime magic works as expected.
> 
> Cheers,
> Sebastian
> 
> On 04/13/2015 04:26 PM, Iain Lopata wrote:
>> I have a page that I am fetching that contains JSON and I have a 
>> plugin for parsing JSON.
>>
>>  
>>
>> The server sets a mimetype of "text/html" and consequently my json 
>> parser does not get invoked.
>>
>>  
>>
>> If I run parsechecker from the command line and specify -forceAs 
>> "application/json" the json parser is invoked and works successfully.
>>
>>  
>>
>> So, I believe that if I can get tika to give me "application/json" as 
>> the detected content type for this page, it should work during a crawl.
>>
>>  
>>
>> I have copied tika-mimetypes.xml from the tika jar file and installed 
>> a copy in my configuration directory.  I have updated nutch-site.xml 
>> to point to this file and the log entries indicate that this is being found.
>>
>>  
>>
>> In my copy of tika-mimetypes.xml I have added the match rule shown 
>> below
>>
>>  
>>
>> <mime-type type="application/json">
>>           <sub-class-of type="application/javascript"/>
>>           <magic priority="100">
>>                   <match value="{" type="string" offset="0"/>
>>           </magic>
>>           <glob pattern="*.json"/>
>>   </mime-type>
>>
>>  
>>
>> I know that my match is much too broad, but I am using this just while 
>> trying to resolve this problem.
>>
>>  
>>
>> I have also set lang.extraction.policy to identify in nutch-site.xml 
>> (again primarily for testing purposes).
>>
>>  
>>
>> I am still getting the content type detected as text/html and the json 
>> parser is not being invoked.  Any suggestions as to what to look at next?
>>
>>  
>>
>> Thanks!
>>
>>  
>>
>> Iain
>>
>>
> 
> 
> 
>
