Re: Tika content detection and crawled "remote" content

Chris Mattmann Wed, 05 Jul 2017 08:27:36 -0700

Totally agree, thank you Common Crawl for running Tika!



On 7/5/17, 5:09 AM, "Allison, Timothy B." <[email protected]> wrote:

    This is FANTASTIC!!!  Thank you, Sebastian!
    
    I suspect that we should try to fix these at the Tika level.  We'll never 
be 100%, but most of the problems you describe _should_ be fixable.
    
     > If anyone is interested in using the detected MIME types or anything 
else from Common Crawl - I'm happy to help!  The URL index [4] contains now a 
new field "mime-detected" which makes it easy to search or grep for confusion 
pairs.
    
    This is an amazing step forward for our regression corpus.  We used to rely 
on the http headers and/or file suffix to oversample non-html.  This will allow 
far cleaner pulls.
    
    -----Original Message-----
    From: Sebastian Nagel [mailto:[email protected]] 
    Sent: Tuesday, July 4, 2017 6:18 AM
    To: [email protected]
    Subject: Tika content detection and crawled "remote" content
    
    Hi,
    
    Recently I plugged Tika's content detection into Common Crawl's 
crawler (a modified Nutch) with the goal of getting clean and correct MIME 
types - the HTTP Content-Type may contain garbage and isn't always correct [1].
    
    For the June 2017 crawl I've prepared a comparison of content types sent by 
the server in the HTTP header and as detected by Tika 1.15 [2].  It shows that 
content types by Tika are definitely clean (1,400 different content types vs. 
more than 6,000 content type "strings" from HTTP headers).
    
    A look at the "confusions", where Content-Type and Tika differ, shows a 
mixed picture: some pairs are plausible, e.g., if Tika changes the type to a 
more precise subtype or detects a MIME type where the header gave none:
    
                Tika-1.15                HTTP-Content-Type
    1001968023  application/xhtml+xml    text/html
       2298146  application/rss+xml      text/xml
        617435  application/rss+xml      application/xml
        613525  text/html                unk
        361525  application/xhtml+xml    unk
        297707  application/rdf+xml      application/xml
    
    
    However, there are a few dubious decisions, esp. the group of web 
server-side scripting languages (ASP, JSP, PHP, ColdFusion, etc.):
    
             Tika-1.15         HTTP-Content-Type
    2047739  text/x-php        text/html
     681629  text/asp          text/html
     193095  text/x-coldfusion text/html
     172318  text/aspdotnet    text/html
     139033  text/x-jsp        text/html
      38415  text/x-cgi        text/html
      32092  text/x-php        text/xml
      18021  text/x-perl       text/html
    
    Of course, due to misconfigurations some servers may deliver the script 
files unmodified but in general I wouldn't expect that this happens for 
millions of pages.  I've checked some of the affected URLs:
    
    - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
        https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
        http://www.privi.com/product-details.asp?cno=C10910011
        http://mental-ray.de/Root_alt/Default.asp
        http://ekyrs.org/support/index.php?action=profile
        http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
    
    - (overlong) comment block at start of HTML which "masks" the HTML 
declaration
        http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
        http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
        https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
        https://de.e-stories.org/categories.php?&lan=nl&art=p
    
    - HTML with some scripting fragments ("<?php?>") present:
        http://www.eco-ani-yao.org/shien/
    
    - others are clearly HTML (looks more like a bug, at least, there is no 
simple explanation)
        http://www.proedinc.com/customer/content.aspx?redid=9
        http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
        http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
        http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
    
    
    Obviously certain file suffixes (.php, .aspx) should get less weight 
compared to Content-Type sent from the responding server.
    Now my question: where's the best place to fix this: in the crawler [3] or 
in Tika?
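
To make the idea of giving the suffix-driven result less weight concrete, 
here is a minimal sketch in Python.  It is not how Nutch's MimeUtil or Tika 
work; the function name resolve_mime and both type sets are hypothetical, 
with the script types taken from the table above.  The assumed rule: if the 
detector reports a server-side scripting type but the server declared a 
markup type, trust the server.

```python
# Hypothetical post-detection rule, not part of Nutch or Tika.
# Script types are the ones listed in the confusion table above.
SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}
# Assumption: these are the declared types we are willing to fall back to.
MARKUP_TYPES = {
    "text/html", "application/xhtml+xml", "text/xml", "application/xml",
}


def resolve_mime(detected, http_content_type):
    """Return the MIME type to keep for a fetched page."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    http_type = (http_content_type or "").split(";", 1)[0].strip().lower()
    if detected in SCRIPT_TYPES and http_type in MARKUP_TYPES:
        # Script output is almost always markup, so prefer the header.
        return http_type
    # In every other case keep the detected type.
    return detected
```

With this rule, resolve_mime("text/x-php", "text/html; charset=UTF-8") gives 
"text/html", while a refinement such as application/rss+xml over text/xml is 
left untouched.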
    
    If anyone is interested in using the detected MIME types or anything else 
from Common Crawl - I'm happy to help!  The URL index [4] contains now a new 
field "mime-detected" which makes it easy to search or grep for confusion pairs.
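
As a rough illustration of searching the index for confusion pairs, the 
sketch below counts (detected, HTTP) pairs in Python.  It assumes each index 
line has the layout "<urlkey> <timestamp> <JSON record>" and that the JSON 
carries the fields "mime" and "mime-detected"; the function name 
confusion_pairs and the sample lines are made up for the example.

```python
import json
from collections import Counter


def confusion_pairs(index_lines):
    """Count (mime-detected, mime) pairs where the two values differ."""
    counts = Counter()
    for line in index_lines:
        # Assumed layout: "<urlkey> <timestamp> <JSON record>"
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed layout
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # skip lines whose JSON part is broken
        http_type = record.get("mime")
        detected = record.get("mime-detected")
        if http_type and detected and http_type != detected:
            counts[(detected, http_type)] += 1
    return counts


# Made-up sample lines, only to show the expected input shape.
sample = [
    'org,example)/a.php 20170623000000 {"url": "http://example.org/a.php", '
    '"mime": "text/html", "mime-detected": "text/x-php"}',
    'org,example)/b.php 20170623000001 {"url": "http://example.org/b.php", '
    '"mime": "text/html", "mime-detected": "text/x-php"}',
    'org,example)/c.html 20170623000002 {"url": "http://example.org/c.html", '
    '"mime": "text/html", "mime-detected": "text/html"}',
    'not a valid index line',
]

for (detected, http_type), n in confusion_pairs(sample).most_common():
    print(n, detected, http_type)
```

Sorting by count, as most_common() does here, gives the same kind of listing 
as the tables above.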
    
    
    Thanks and best,
    Sebastian
    
    
    [1] https://github.com/commoncrawl/nutch/issues/3
    [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
        https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
    [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
    [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
