Re: Tika content detection and crawled "remote" content

Nick Burch Wed, 05 Jul 2017 07:08:04 -0700

Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a rabbit hole... I think at least some of them are as described in TIKA-2419, where our change to the HTML magic priority, made to fix detection of HTML-containing formats like email, had broken some things.

I've done a quick fix for 1.16, but it'd be good to try the impact of other things, e.g. dropping the XML priority to match the HTML one, to see if that helps / breaks other things.

Otherwise, for anything else (e.g. that Word / Graphviz one), please do open up JIRAs!


Thanks
Nick

On 05/07/17 14:10, Allison, Timothy B. wrote:

Why, yes, please!  JIRA with small samples would be fantastic.  I think working 
in descending order, from most common to least, would be best: php, asp, coldfusion.

I'm about to cut 1.16, but I look forward to improving Tika with this 
tremendously useful data.

Again, many thanks!

Cheers,

            Tim

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Wednesday, July 5, 2017 9:03 AM
To: [email protected]
Subject: Re: Tika content detection and crawled "remote" content

Hi Tim,

thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) 
or whether I can help by compiling smaller test sets.

Best,
Sebastian

On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:

This is FANTASTIC!!!  Thank you, Sebastian!

I suspect that we should try to fix these at the Tika level.  We'll never be 
100%, but most of the problems you describe _should_ be fixable.

  > If anyone is interested in using the detected MIME types or anything else from Common 
Crawl - I'm happy to help!  The URL index [4] contains now a new field 
"mime-detected" which makes it easy to search or grep for confusion pairs.

This is an amazing step forward for our regression corpus.  We used to rely on 
the http headers and/or file suffix to oversample non-html.  This will allow 
far cleaner pulls.

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Tuesday, July 4, 2017 6:18 AM
To: [email protected]
Subject: Tika content detection and crawled "remote" content

Hi,

recently I've plugged Tika's content detection into Common Crawl's crawler 
(a modified Nutch) with the goal of getting clean and correct MIME types - the HTTP 
Content-Type may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of content types
sent by the server in the HTTP header and as detected by Tika 1.15
[2].  It shows that content types by Tika are definitely clean
(1,400 different content types vs. more than 6,000 content type "strings" from 
HTTP headers).

A look at the "confusions", where Content-Type and Tika differ, shows a mixed 
picture: some pairs are plausible, e.g., if Tika changes the type to a more precise 
subtype or detects a MIME type where the server sent none:

     Count  Tika-1.15                HTTP-Content-Type
1001968023  application/xhtml+xml    text/html
   2298146  application/rss+xml      text/xml
    617435  application/rss+xml      application/xml
    613525  text/html                unk
    361525  application/xhtml+xml    unk
    297707  application/rdf+xml      application/xml

However, there are a few dubious decisions, esp. the group of web server-side 
scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

  Count  Tika-1.15          HTTP-Content-Type
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html

Of course, due to misconfigurations, some servers may deliver the script files 
unmodified, but in general I wouldn't expect this to happen for millions of 
pages.  I've checked some of the affected URLs:

- HTML fragment (no <!DOCTYPE...> declaration or opening <html> tag)
     https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
     http://www.privi.com/product-details.asp?cno=C10910011
     http://mental-ray.de/Root_alt/Default.asp
     http://ekyrs.org/support/index.php?action=profile
     http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at start of HTML which "masks" the HTML declaration
     http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
     http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
     https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
     https://de.e-stories.org/categories.php?&lan=nl&art=p

- HTML with some scripting fragments ("<?php?>") present:
     http://www.eco-ani-yao.org/shien/

- others are clearly HTML (looks more like a bug; at least, there is no simple 
explanation)
     http://www.proedinc.com/customer/content.aspx?redid=9
     http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
     http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
     http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79


Obviously certain file suffixes (.php, .aspx) should get less weight than the 
Content-Type sent by the responding server.
Now my question: where's the best place to fix this: in the crawler [3] or in 
Tika?
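
To make the crawler-side option concrete, here is a minimal sketch in Python of what such a rule might look like. It is not what Nutch's MimeUtil does; the function name `resolve_mime` and the two type sets are hypothetical, with the script types taken from the second table above.

```python
# Types Tika 1.15 reported for pages that the server declared as markup
# (copied from the table of dubious pairs in this message).
SERVER_SIDE_SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}

# Assumption: server-declared types we are willing to trust over a
# suffix-driven script detection.
MARKUP_TYPES = {
    "text/html", "application/xhtml+xml", "text/xml", "application/xml",
}


def resolve_mime(detected, http_type):
    """Pick the MIME type to record for a fetched page.

    If detection yields a server-side script type but the server declared
    a markup type, fall back to the server's type; otherwise keep the
    detected type.
    """
    if detected in SERVER_SIDE_SCRIPT_TYPES and http_type in MARKUP_TYPES:
        return http_type
    return detected


print(resolve_mime("text/x-php", "text/html"))
print(resolve_mime("application/rss+xml", "text/xml"))
```

A rule like this only hides the symptom in the crawler; refinements such as text/xml to application/rss+xml pass through untouched, and the underlying detection is unchanged.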

If anyone is interested in using the detected MIME types or anything else from Common 
Crawl - I'm happy to help!  The URL index [4] now contains a new field 
"mime-detected", which makes it easy to search or grep for confusion pairs.

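As a rough illustration of searching for confusion pairs, the Python sketch below reads one index line and compares the two fields. The line layout (URL key, timestamp, then a JSON object) and the name "mime" for the server-sent type are assumptions here; only "mime-detected" is named above, and the sample line is made up.

```python
import json


def confusion_pair(index_line):
    """Return (mime-detected, mime) if both are present and differ, else None."""
    # Assumed layout: "<urlkey> <timestamp> <json object>"
    parts = index_line.rstrip("\n").split(" ", 2)
    if len(parts) != 3:
        return None
    try:
        fields = json.loads(parts[2])
    except json.JSONDecodeError:
        return None
    if not isinstance(fields, dict):
        return None
    detected = fields.get("mime-detected")
    declared = fields.get("mime")  # assumed field name for the HTTP type
    if detected is None or declared is None or detected == declared:
        return None
    return detected, declared


# Illustrative line, not copied from the real index.
line = (
    'org,example)/index.php 20170624000000 '
    '{"url": "http://example.org/index.php", "mime": "text/html", '
    '"mime-detected": "text/x-php", "status": "200"}'
)
print(confusion_pair(line))
```

Feeding every line of an index file through `confusion_pair` and counting the results would give a per-pair tally similar to the tables above.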

Thanks and best,
Sebastian


[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
    https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
