Re: Tika content detection and crawled "remote" content

Jackson, Andy Fri, 07 Jul 2017 04:19:42 -0700

In case it helps, I wrote some prototype modules to add ARC and WARC support to 
Tika:


https://github.com/ukwa/webarchive-discovery/tree/master/digipres-tika/src/main/java/uk/bl/wa/tika/parser/warc

…and extended Tika to use them:

https://github.com/ukwa/webarchive-discovery/blob/master/digipres-tika/src/main/java/uk/bl/wa/tika/PreservationParser.java#L62-L63

However, they are based on the Internet Archive’s (W)ARC parsers, which have a 
pretty heavy/messy dependency tree. It would probably be better to build them 
on JWAT, which has few dependencies (but may not be quite as robust to edge 
cases as the IA ones).

https://sbforge.org/display/JWAT/JWAT

(see e.g. https://sbforge.org/display/JWAT/Reading+a+WARC+file)

Hope that helps,
Andy Jackson (UK Web Archive)

From: Timothy Allison <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, 7 July 2017 at 11:52
To: "[email protected]" <[email protected]>
Subject: RE: Tika content detection and crawled "remote" content

Should we add a WARC parser? :)

From: Julien Nioche [mailto:[email protected]]
Sent: Friday, July 7, 2017 3:43 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Tika content detection and crawled "remote" content

> Is anyone aware of a tool to run Tika on a WARC file? Everything required for
> detection and parsing is contained (URL, HTTP metadata, binary content).

You could do that with my good old Behemoth
<https://github.com/DigitalPebble/behemoth> in two steps: convert the WARC to
Behemoth format, then run Tika on that.





On 6 July 2017 at 13:27, Sebastian Nagel <[email protected]> wrote:
Hi,

> Otherwise, for anything else (eg that word / graphviz one), please do open up 
> JIRAs!
Done, see TIKA-2242.

>> Why, yes, please!  JIRA with small samples would be fantastic.

1000 randomly chosen examples per content-type are ready:

https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
  tika_html_server_side_scripting_lang_php.warc.gz
  tika_html_server_side_scripting_lang_asp.warc.gz
  tika_html_server_side_scripting_lang_coldfusion.warc.gz
  tika_html_server_side_scripting_lang_jsp.warc.gz
  tika_html_server_side_scripting_lang_cgi.warc.gz
  tika_html_server_side_scripting_lang_perl.warc.gz

Note: there are few real PHP/JSP/Perl/... documents among them.

If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.

Is anyone aware of a tool to run Tika on a WARC file? Everything required for
detection and parsing is contained (URL, HTTP metadata, binary content).
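
To make the "everything is contained" point concrete, here is a minimal Python
sketch that pulls the three pieces (target URL, HTTP Content-Type, payload) out
of one uncompressed WARC response record. It is only an illustration built on
assumptions: the record is made up, the helper names are hypothetical, and real
.warc.gz files are gzip-compressed and have edge cases that a proper WARC
library (such as the IA parsers or JWAT mentioned in this thread) handles.

```python
def parse_headers(raw: bytes) -> dict:
    """Parse 'Name: value' lines into a dict keyed by lower-cased name."""
    headers = {}
    for line in raw.decode("iso-8859-1").split("\r\n"):
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def split_response_record(record: bytes):
    """Return (target URL, HTTP Content-Type, payload) for one response record."""
    warc_head, _, block = record.partition(b"\r\n\r\n")
    version, _, warc_fields = warc_head.partition(b"\r\n")
    if not version.startswith(b"WARC/"):
        raise ValueError("not a WARC record")
    warc = parse_headers(warc_fields)
    # The record block is exactly Content-Length bytes long.
    block = block[: int(warc["content-length"])]
    # For a response record the block is the raw HTTP response.
    http_head, _, payload = block.partition(b"\r\n\r\n")
    _status_line, _, http_fields = http_head.partition(b"\r\n")
    http = parse_headers(http_fields)
    return warc.get("warc-target-uri"), http.get("content-type"), payload


# Made-up record, for illustration only.
body = b"<div>fragment without doctype</div>"
http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" + body
record = b"\r\n".join([
    b"WARC/1.0",
    b"WARC-Type: response",
    b"WARC-Target-URI: http://example.com/index.php",
    b"Content-Length: %d" % len(http_block),
    b"",
    http_block,
]) + b"\r\n\r\n"

url, http_type, payload = split_response_record(record)
print(url, http_type, len(payload))
```

The URL, the server-sent type and the payload bytes are the inputs a content
detector would need; how they are handed to Tika is left open here.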

Thanks,
Sebastian

On 07/05/2017 04:07 PM, Nick Burch wrote:
> Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a
> rabbit hole... I think at least some of them are as described in TIKA-2419, where our change to the
> HTML magic priority, made to fix HTML-containing formats like email, had broken some things.
>
> I've done a quick fix for 1.16, but it'd be good to try the impact of other things, e.g. dropping the
> xml priority to match the html one, to see if that helps / breaks other things.
>
>
> Otherwise, for anything else (eg that word / graphviz one), please do open up 
> JIRAs!
>
> Thanks
> Nick
>
> On 05/07/17 14:10, Allison, Timothy B. wrote:
>> Why, yes, please!  JIRA with small samples would be fantastic.  I think 
>> working in desc order of
>> most common to least would be best...php, asp, coldfusion.
>>
>> I'm about to cut 1.16, but I look forward to improving Tika with this 
>> tremendously useful data.
>>
>> Again, many thanks!
>>
>> Cheers,
>>
>>             Tim
>>
>> -----Original Message-----
>> From: Sebastian Nagel 
>> [mailto:[email protected]<mailto:[email protected]>]
>> Sent: Wednesday, July 5, 2017 9:03 AM
>> To: [email protected]<mailto:[email protected]>
>> Subject: Re: Tika content detection and crawled "remote" content
>>
>> Hi Tim,
>>
>> thanks! Let me know if I should take any actions (e.g., open issue(s) on 
>> Jira) or whether I can
>> help by compiling smaller test sets.
>>
>> Best,
>> Sebastian
>>
>> On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
>>> This is FANTASTIC!!!  Thank you, Sebastian!
>>>
>>> I suspect that we should try to fix these at the Tika level.  We'll never 
>>> be 100%, but most of
>>> the problems you describe _should_ be fixable.
>>>
>>>   > If anyone is interested in using the detected MIME types or anything 
>>> else from Common Crawl -
>>> I'm happy to help!  The URL index [4] contains now a new field 
>>> "mime-detected" which makes it
>>> easy to search or grep for confusion pairs.
>>>
>>> This is an amazing step forward for our regression corpus.  We used to rely 
>>> on the http headers
>>> and/or file suffix to oversample non-html.  This will allow far cleaner 
>>> pulls.
>>>
>>> -----Original Message-----
>>> From: Sebastian Nagel 
>>> [mailto:[email protected]<mailto:[email protected]>]
>>> Sent: Tuesday, July 4, 2017 6:18 AM
>>> To: [email protected]<mailto:[email protected]>
>>> Subject: Tika content detection and crawled "remote" content
>>>
>>> Hi,
>>>
>>> recently I've plugged Tika's content detection into Common Crawl's crawler (modified Nutch)
>>> with the goal of getting clean and correct MIME types - the HTTP Content-Type may contain garbage
>>> and isn't always correct [1].
>>>
>>> For the June 2017 crawl I've prepared a comparison of content types
>>> sent by the server in the HTTP header and as detected by Tika 1.15
>>> [2].  It shows that content types by Tika are definitely clean
>>> (1,400 different content types vs. more than 6,000 content type "strings" 
>>> from HTTP headers).
>>>
>>> A look at the "confusions" where Content-Type and Tika differ shows a mixed picture: some pairs
>>> are plausible, e.g., if Tika changes the type to a more precise subtype or detects a MIME type
>>> at all:
>>>
>>>      count  Tika-1.15                HTTP-Content-Type
>>> 1001968023  application/xhtml+xml    text/html
>>>    2298146  application/rss+xml      text/xml
>>>     617435  application/rss+xml      application/xml
>>>     613525  text/html                unk
>>>     361525  application/xhtml+xml    unk
>>>     297707  application/rdf+xml      application/xml
>>>
>>> However, there are a few dubious decisions, esp. the group of web 
>>> server-side scripting languages
>>> (ASP, JSP, PHP, ColdFusion, etc.):
>>>
>>>   count  Tika-1.15          HTTP-Content-Type
>>> 2047739  text/x-php         text/html
>>>  681629  text/asp           text/html
>>>  193095  text/x-coldfusion  text/html
>>>  172318  text/aspdotnet     text/html
>>>  139033  text/x-jsp         text/html
>>>   38415  text/x-cgi         text/html
>>>   32092  text/x-php         text/xml
>>>   18021  text/x-perl        text/html
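
Rows like these are easy to aggregate. The Python sketch below sums the counts
per Tika-detected type for the rows quoted above. It assumes the layout is
whitespace-separated "count, Tika type, HTTP type"; whether the full diff file
[2] uses exactly this layout is an assumption, not something checked here.

```python
from collections import Counter

# The rows quoted in the table above.
ROWS = """\
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html
"""


def parse_rows(text):
    """Yield (count, tika_type, http_type) from whitespace-separated rows."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].isdigit():
            yield int(parts[0]), parts[1], parts[2]


# Sum the page counts per Tika-detected type, keeping only real confusions.
by_tika = Counter()
for count, tika_type, http_type in parse_rows(ROWS):
    if tika_type != http_type:
        by_tika[tika_type] += count

for tika_type, total in by_tika.most_common():
    print(f"{total:>9}  {tika_type}")
```

Grouping by the detected type puts the largest suspect group (here text/x-php)
first, which matches the "most common to least" order suggested later in the
thread.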
>>>
>>> Of course, due to misconfigurations some servers may deliver the script files unmodified, but in
>>> general I wouldn't expect this to happen for millions of pages.  I've checked some of the
>>> affected URLs:
>>>
>>> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
>>>      https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
>>>      http://www.privi.com/product-details.asp?cno=C10910011
>>>      http://mental-ray.de/Root_alt/Default.asp
>>>      http://ekyrs.org/support/index.php?action=profile
>>>      http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
>>>
>>> - (overlong) comment block at start of HTML which "masks" the HTML declaration
>>>      http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>>>      http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
>>>      https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
>>>      https://de.e-stories.org/categories.php?&lan=nl&art=p
>>>
>>> - HTML with some scripting fragments ("<?php?>") present:
>>>      http://www.eco-ani-yao.org/shien/
>>>
>>> - others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
>>>      http://www.proedinc.com/customer/content.aspx?redid=9
>>>      http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
>>>      http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
>>>      http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
>>>
>>>
>>> Obviously certain file suffixes (.php, .aspx) should get less weight than the Content-Type
>>> sent by the responding server.
>>> Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
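
Two of the failure modes listed above (a comment block masking the declaration,
and a bare HTML fragment on a .php URL) can be reproduced with a toy detector.
The Python sketch below is not how Tika or Nutch detect types; the 512-byte
window, the marker list and the function names are all assumptions chosen for
illustration. It only shows why a fixed-size sniffing window misses a masked
declaration, and what "Content-Type outranks suffix" would mean as a fallback
order.

```python
import re

# Markers that suggest HTML; a deliberately small, hypothetical list.
HTML_HINT = re.compile(
    rb"<!doctype\s+html|<html[\s>]|<head[\s>]|<body[\s>]", re.IGNORECASE
)
LEADING_COMMENT = re.compile(rb"\s*<!--.*?-->", re.DOTALL)


def strip_leading_comments(data: bytes) -> bytes:
    """Drop comment blocks and whitespace that precede the first real markup."""
    while True:
        match = LEADING_COMMENT.match(data)
        if not match:
            return data.lstrip()
        data = data[match.end():]


def sniff_html(data: bytes, window: int = 512, skip_comments: bool = False) -> bool:
    """Look for an HTML marker within the first `window` bytes."""
    if skip_comments:
        data = strip_leading_comments(data)
    return HTML_HINT.search(data[:window]) is not None


def guess_type(data: bytes, http_type, suffix_type):
    """Prefer content evidence, then the server's header, then the URL suffix."""
    if sniff_html(data, skip_comments=True):
        return "text/html"
    return http_type or suffix_type


# A long leading comment pushes the declaration out of the sniffing window.
masked = b"<!--" + b" licence text " * 100 + b"-->\n<!DOCTYPE html><html></html>"
# A fragment carries no declaration at all.
fragment = b"<div class='profile'>no doctype, no html tag</div>"

print(sniff_html(masked), sniff_html(masked, skip_comments=True))
print(guess_type(fragment, "text/html", "text/x-php"))
```

With this ordering the fragment served from a .php URL stays text/html because
the server said so, and the suffix is consulted only when no header is present.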
>>>
>>> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm
>>> happy to help!  The URL index [4] now contains a new field "mime-detected" which makes it easy to
>>> search or grep for confusion pairs.
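
As a rough illustration of such a search, the Python sketch below counts
(detected, served) pairs over index lines. Only the field name "mime-detected"
comes from this message. The line layout (key, timestamp, then a JSON object),
the "mime" field for the server-sent type, and the sample lines themselves are
assumptions made up for the example.

```python
import json
from collections import Counter

# Invented sample lines in an assumed "key timestamp {json}" layout.
SAMPLE = """\
org,example)/a.php 20170623000000 {"url": "http://example.org/a.php", "mime": "text/html", "mime-detected": "text/x-php"}
org,example)/b 20170623000001 {"url": "http://example.org/b", "mime": "text/html", "mime-detected": "text/html"}
org,example)/c.asp 20170623000002 {"url": "http://example.org/c.asp", "mime": "text/html", "mime-detected": "text/asp"}
"""


def confusion_pairs(lines):
    """Count (detected, served) pairs where the two MIME types differ."""
    pairs = Counter()
    for line in lines:
        start = line.find("{")
        if start < 0:
            continue  # no JSON part on this line
        fields = json.loads(line[start:])
        served = fields.get("mime")
        detected = fields.get("mime-detected")
        if served and detected and served != detected:
            pairs[(detected, served)] += 1
    return pairs


pairs = confusion_pairs(SAMPLE.splitlines())
for (detected, served), n in pairs.most_common():
    print(n, detected, served)
```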
>>>
>>>
>>> Thanks and best,
>>> Sebastian
>>>
>>>
>>> [1] https://github.com/commoncrawl/nutch/issues/3
>>> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>>>     https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>>> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
>>> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
>>>
>>
>



--

Open Source Solutions for Text Engineering

http://www.digitalpebble.com<http://www.digitalpebble.com/>
http://digitalpebble.blogspot.com/
#digitalpebble<http://twitter.com/digitalpebble>


