Re: Tika content detection and crawled "remote" content

Chris Mattmann Fri, 07 Jul 2017 07:26:53 -0700

Yep!

From: "Allison, Timothy B." <talli...@mitre.org>
Reply-To: "user@tika.apache.org" <user@tika.apache.org>
Date: Friday, July 7, 2017 at 3:52 AM
To: "user@tika.apache.org" <user@tika.apache.org>
Subject: RE: Tika content detection and crawled "remote" content

 

Should we add a WARC parser? :)

 

From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] 
Sent: Friday, July 7, 2017 3:43 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content

 

> Is anyone aware of a tool to run Tika on a WARC file? Everything required for
> detection and parsing is contained (URL, HTTP metadata, binary content).

 

You could do that with my good old Behemoth in two steps: convert the WARC to
Behemoth format, then run Tika on that.
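
For the WARC-reading half of such a tool, a minimal sketch in Python (standard
library only) might look like the following. It assumes well-formed records
that each carry a Content-Length header; the names read_headers and
iter_responses are made up for illustration, and handing each payload to Tika
is left out.

```python
import gzip
import io
import sys


def read_headers(stream):
    """Read 'Name: value' lines up to the first blank line; keys are lowercased."""
    headers = {}
    for raw in iter(stream.readline, b""):
        line = raw.decode("iso-8859-1").strip()
        if not line:
            break
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def iter_responses(stream):
    """Yield (url, http_headers, body) for each 'response' record in a WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        warc = read_headers(stream)
        block = stream.read(int(warc.get("content-length", "0")))
        if warc.get("warc-type") != "response":
            continue  # ignore warcinfo, request and metadata records
        http = io.BytesIO(block)
        http.readline()  # HTTP status line
        http_headers = read_headers(http)
        yield warc.get("warc-target-uri"), http_headers, http.read()


if __name__ == "__main__" and len(sys.argv) > 1:
    # gzip.open reads the concatenated gzip members of a .warc.gz transparently
    with gzip.open(sys.argv[1], "rb") as f:
        for url, headers, body in iter_responses(f):
            print(url, headers.get("content-type"), len(body))
```

Each tuple carries the three inputs mentioned above: the URL, the HTTP
metadata, and the binary content.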

On 6 July 2017 at 13:27, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

Hi,

> Otherwise, for anything else (eg that word / graphviz one), please do open up 
> JIRAs!
Done, see TIKA-2242.

>> Why, yes, please!  JIRA with small samples would be fantastic.

1000 randomly chosen examples per content-type are ready:

https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
  tika_html_server_side_scripting_lang_php.warc.gz
  tika_html_server_side_scripting_lang_asp.warc.gz
  tika_html_server_side_scripting_lang_coldfusion.warc.gz
  tika_html_server_side_scripting_lang_jsp.warc.gz
  tika_html_server_side_scripting_lang_cgi.warc.gz
  tika_html_server_side_scripting_lang_perl.warc.gz

Note: there are few real PHP/JSP/Perl/... documents among them.

If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.
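
If it had to be handled in the crawler instead, one possible stopgap is to fall
back to the HTTP Content-Type whenever detection returns one of the server-side
scripting types listed in the original message while the server declares HTML
or XML. A hypothetical sketch (resolve_mime and SCRIPT_TYPES are illustrative
names, not Nutch or Tika API):

```python
# Server-side scripting types taken from the confusion table in the original message
SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}

# Declared types for which the server is trusted over a script-type detection
TRUSTED_DECLARED = {"text/html", "text/xml"}


def resolve_mime(detected, declared):
    """Prefer the declared type only for the known script-vs-HTML confusions."""
    # Drop parameters such as "; charset=UTF-8" and normalise case
    base = (declared or "").split(";", 1)[0].strip().lower()
    if detected in SCRIPT_TYPES and base in TRUSTED_DECLARED:
        return base
    return detected
```

This only hides the symptom for the listed pairs; documents that really are
PHP/JSP/Perl source served as text/html would be mislabelled.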

Is anyone aware of a tool to run Tika on a WARC file? Everything required for
detection and parsing is contained (URL, HTTP metadata, binary content).

Thanks,
Sebastian


On 07/05/2017 04:07 PM, Nick Burch wrote:
> Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a
> rabbit hole... I think at least some of them are as described in TIKA-2419, where our change to the
> HTML magic priority, made to fix HTML-containing formats like email, had broken some things.
>
> I've done a quick fix for 1.16, but it'd be good to test the impact of other things, e.g. dropping the
> XML priority to match the HTML one, to see if that helps or breaks other things.
>
>
> Otherwise, for anything else (eg that word / graphviz one), please do open up 
> JIRAs!
>
> Thanks
> Nick
>
> On 05/07/17 14:10, Allison, Timothy B. wrote:
>> Why, yes, please!  JIRA with small samples would be fantastic.  I think working in descending order,
>> from most common to least, would be best: php, asp, coldfusion.
>>
>> I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data.
>>
>> Again, many thanks!
>>
>> Cheers,
>>
>>             Tim
>>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: Wednesday, July 5, 2017 9:03 AM
>> To: user@tika.apache.org
>> Subject: Re: Tika content detection and crawled "remote" content
>>
>> Hi Tim,
>>
>> thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) or whether I can
>> help by compiling smaller test sets.
>>
>> Best,
>> Sebastian
>>
>> On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
>>> This is FANTASTIC!!!  Thank you, Sebastian!
>>>
>>> I suspect that we should try to fix these at the Tika level.  We'll never be 100%, but most of
>>> the problems you describe _should_ be fixable.
>>>
>>> > If anyone is interested in using the detected MIME types or anything else from Common Crawl -
>>> > I'm happy to help!  The URL index [4] contains now a new field "mime-detected" which makes it
>>> > easy to search or grep for confusion pairs.
>>>
>>> This is an amazing step forward for our regression corpus.  We used to rely on the http headers
>>> and/or file suffix to oversample non-html.  This will allow far cleaner pulls.
>>>
>>> -----Original Message-----
>>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>>> Sent: Tuesday, July 4, 2017 6:18 AM
>>> To: user@tika.apache.org
>>> Subject: Tika content detection and crawled "remote" content
>>>
>>> Hi,
>>>
>>> recently I've plugged Tika's content detection into Common Crawl's crawler (modified Nutch)
>>> with the goal of getting clean and correct MIME types - the HTTP Content-Type may contain garbage
>>> and isn't always correct [1].
>>>
>>> For the June 2017 crawl I've prepared a comparison of content types
>>> sent by the server in the HTTP header and as detected by Tika 1.15
>>> [2].  It shows that content types by Tika are definitely clean
>>> (1,400 different content types vs. more than 6,000 content type "strings" 
>>> from HTTP headers).
>>>
>>> A look at the "confusions", where Content-Type and Tika differ, shows a mixed picture: some pairs
>>> are plausible, e.g., if Tika changes the type to a more precise subtype or detects a MIME type at all:
>>>
>>>             Tika-1.15                HTTP-Content-Type
>>> 1001968023  application/xhtml+xml    text/html
>>>    2298146  application/rss+xml      text/xml
>>>     617435  application/rss+xml      application/xml
>>>     613525  text/html                unk
>>>     361525  application/xhtml+xml    unk
>>>     297707  application/rdf+xml      application/xml
>>>
>>>
>>> However, there are a few dubious decisions, esp. the group of web 
>>> server-side scripting languages
>>> (ASP, JSP, PHP, ColdFusion, etc.):
>>>
>>>          Tika-1.15          HTTP-Content-Type
>>> 2047739  text/x-php         text/html
>>>  681629  text/asp           text/html
>>>  193095  text/x-coldfusion  text/html
>>>  172318  text/aspdotnet     text/html
>>>  139033  text/x-jsp         text/html
>>>   38415  text/x-cgi         text/html
>>>   32092  text/x-php         text/xml
>>>   18021  text/x-perl        text/html
>>>
>>> Of course, due to misconfigurations some servers may deliver the script 
>>> files unmodified but in
>>> general I wouldn't expect that this happens for millions of pages.  I've 
>>> checked some of the
>>> affected URLs:
>>>
>>> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
>>>      https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
>>>      http://www.privi.com/product-details.asp?cno=C10910011
>>>      http://mental-ray.de/Root_alt/Default.asp
>>>      http://ekyrs.org/support/index.php?action=profile
>>>      http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
>>>
>>> - (overlong) comment block at start of HTML which "masks" the HTML declaration
>>>      http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>>>      http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
>>>      https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
>>>      https://de.e-stories.org/categories.php?&lan=nl&art=p
>>>
>>> - HTML with some scripting fragments ("<?php?>") present:
>>>      http://www.eco-ani-yao.org/shien/
>>>
>>> - others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
>>>      http://www.proedinc.com/customer/content.aspx?redid=9
>>>      http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
>>>      http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
>>>      http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
>>>
>>>
>>> Obviously certain file suffixes (.php, .aspx) should get less weight 
>>> compared to Content-Type
>>> sent from the responding server.
>>> Now my question: where's the best place to fix this: in the crawler [3] or 
>>> in Tika?
>>>
>>> If anyone is interested in using the detected MIME types or anything else 
>>> from Common Crawl - I'm
>>> happy to help!  The URL index [4] contains now a new field "mime-detected" 
>>> which makes it easy to
>>> search or grep for confusion pairs.
>>>
>>>
>>> Thanks and best,
>>> Sebastian
>>>
>>>
>>> [1] https://github.com/commoncrawl/nutch/issues/3
>>> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>>>     https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>>> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
>>> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
>>>
>>
>
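
The search for confusion pairs via the "mime-detected" field mentioned in the
original message could be sketched roughly as below. This assumes index lines
of the form "urlkey timestamp {json}" whose JSON part holds "mime" (from the
HTTP header) and "mime-detected" (from Tika); confusion_pairs is a made-up
name, and the exact line layout should be checked against the real index files.

```python
import json
from collections import Counter


def confusion_pairs(lines):
    """Count (detected, declared) MIME pairs that differ across index lines."""
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 2)  # assumed layout: urlkey, timestamp, JSON
        if len(parts) != 3:
            continue
        record = json.loads(parts[2])
        declared = record.get("mime")
        detected = record.get("mime-detected")
        if declared and detected and declared != detected:
            counts[(detected, declared)] += 1
    return counts
```

Calling confusion_pairs(open_file).most_common(20) on an opened index file
would then list the most frequent pairs in the same two-column order as the
tables in the original message.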



 

-- 


Open Source Solutions for Text Engineering


http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble
