Re: Tika content detection and crawled "remote" content

Julien Nioche Fri, 07 Jul 2017 00:44:32 -0700

>
> Is anyone aware of a tool to run Tika on a WARC file? Everything required
> for detection
> and parsing is contained (URL, HTTP metadata, binary content).



You could do that with my good old Behemoth
<https://github.com/DigitalPebble/behemoth> in two steps: convert the WARC to
Behemoth format, then run Tika on that.
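
As a rough illustration of why a WARC file is self-contained for this purpose, here is a minimal Python sketch (standard library only) that walks the response records of a WARC stream and yields the URL, the HTTP headers and the body. It is not Behemoth's or Tika's code: the names `iter_warc_responses` and `_read_headers` are made up for this example, and the parsing is deliberately simplistic (no chunked transfer decoding, no header continuation lines). A detector such as Tika would then be called on each yielded triple.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def _read_headers(stream: BinaryIO) -> Dict[str, str]:
    """Read 'Name: value' lines up to the first blank line (hypothetical helper)."""
    headers: Dict[str, str] = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8", "replace").partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def iter_warc_responses(stream: BinaryIO) -> Iterator[Tuple[str, Dict[str, str], bytes]]:
    """Yield (url, http_headers, body) for every WARC record of type 'response'."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        warc_headers = _read_headers(stream)
        length = int(warc_headers.get("content-length", "0"))
        block = stream.read(length)  # the full record block
        if warc_headers.get("warc-type") != "response":
            continue  # skip warcinfo, request, metadata records
        http = io.BytesIO(block)
        http.readline()  # HTTP status line, ignored in this sketch
        http_headers = _read_headers(http)
        yield warc_headers.get("warc-target-uri", ""), http_headers, http.read()
```

For a `.warc.gz` file, opening it with `gzip.open(path, "rb")` and passing the file object in should work, since Python's gzip module reads concatenated gzip members as one stream.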





On 6 July 2017 at 13:27, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi,
>
> > Otherwise, for anything else (eg that word / graphviz one), please do
> open up JIRAs!
> Done, see TIKA-2242.
>
> >> Why, yes, please!  JIRA with small samples would be fantastic.
>
> 1000 randomly chosen examples per content-type are ready:
>
> https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
>   tika_html_server_side_scripting_lang_php.warc.gz
>   tika_html_server_side_scripting_lang_asp.warc.gz
>   tika_html_server_side_scripting_lang_coldfusion.warc.gz
>   tika_html_server_side_scripting_lang_jsp.warc.gz
>   tika_html_server_side_scripting_lang_cgi.warc.gz
>   tika_html_server_side_scripting_lang_perl.warc.gz
>
> Note: there are few real PHP/JSP/Perl/... documents among them.
>
> If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.
>
> Is anyone aware of a tool to run Tika on a WARC file? Everything required
> for detection
> and parsing is contained (URL, HTTP metadata, binary content).
>
> Thanks,
> Sebastian
>
> On 07/05/2017 04:07 PM, Nick Burch wrote:
> > Having taken a "quick" look over lunch at some of the "programming
> language" ones, and gone down a
> > rabbit hole... I think at least some of them are as described in
> TIKA-2419, where our change to the
> > HTML magic priority to fix for HTML-containing formats like email had
> broken some things.
> >
> > I've done a quick fix for 1.16, but it'd be good to try the impact of
> other things, eg dropping the
> > xml priority to match the html one to see if that helps / breaks other
> things
> >
> >
> > Otherwise, for anything else (eg that word / graphviz one), please do
> open up JIRAs!
> >
> > Thanks
> > Nick
> >
> > On 05/07/17 14:10, Allison, Timothy B. wrote:
> >> Why, yes, please!  JIRA with small samples would be fantastic.  I think
> working in desc order of
> >> most common to least would be best...php, asp, coldfusion.
> >>
> >> I'm about to cut 1.16, but I look forward to improving Tika with this
> tremendously useful data.
> >>
> >> Again, many thanks!
> >>
> >> Cheers,
> >>
> >>             Tim
> >>
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >> Sent: Wednesday, July 5, 2017 9:03 AM
> >> To: user@tika.apache.org
> >> Subject: Re: Tika content detection and crawled "remote" content
> >>
> >> Hi Tim,
> >>
> >> thanks! Let me know if I should take any actions (e.g., open issue(s)
> on Jira) or whether I can
> >> help by compiling smaller test sets.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
> >>> This is FANTASTIC!!!  Thank you, Sebastian!
> >>>
> >>> I suspect that we should try to fix these at the Tika level.  We'll
> never be 100%, but most of
> >>> the problems you describe _should_ be fixable.
> >>>
> >>>   > If anyone is interested in using the detected MIME types or anything else from Common Crawl -
> >>>   > I'm happy to help!  The URL index [4] now contains a new field "mime-detected" which makes it
> >>>   > easy to search or grep for confusion pairs.
> >>>
> >>> This is an amazing step forward for our regression corpus.  We used to
> rely on the http headers
> >>> and/or file suffix to oversample non-html.  This will allow far
> cleaner pulls.
> >>>
> >>> -----Original Message-----
> >>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >>> Sent: Tuesday, July 4, 2017 6:18 AM
> >>> To: user@tika.apache.org
> >>> Subject: Tika content detection and crawled "remote" content
> >>>
> >>> Hi,
> >>>
> >>> recently I've plugged Tika's content detection into Common Crawl's crawler (modified Nutch)
> >>> with the goal of getting a clean and correct MIME type - the HTTP Content-Type may contain garbage
> >>> and isn't always correct [1].
> >>>
> >>> For the June 2017 crawl I've prepared a comparison of content types
> >>> sent by the server in the HTTP header and as detected by Tika 1.15
> >>> [2].  It shows that content types by Tika are definitely clean
> >>> (1,400 different content types vs. more than 6,000 content type
> "strings" from HTTP headers).
> >>>
> >>> A look at the "confusions" where Content-Type and Tika differ shows a mixed picture: some pairs
> >>> are plausible, e.g., if Tika changes the type to a more precise subtype or detects a MIME type at all:
> >>>
> >>>             Tika-1.15                HTTP-Content-Type
> >>> 1001968023  application/xhtml+xml    text/html
> >>>    2298146  application/rss+xml      text/xml
> >>>     617435  application/rss+xml      application/xml
> >>>     613525  text/html                unk
> >>>     361525  application/xhtml+xml    unk
> >>>     297707  application/rdf+xml      application/xml
> >>>
> >>> However, there are a few dubious decisions, esp. the group of web
> server-side scripting languages
> >>> (ASP, JSP, PHP, ColdFusion, etc.):
> >>>
> >>>          Tika-1.15          HTTP-Content-Type
> >>> 2047739  text/x-php         text/html
> >>>  681629  text/asp           text/html
> >>>  193095  text/x-coldfusion  text/html
> >>>  172318  text/aspdotnet     text/html
> >>>  139033  text/x-jsp         text/html
> >>>   38415  text/x-cgi         text/html
> >>>   32092  text/x-php         text/xml
> >>>   18021  text/x-perl        text/html
> >>> Of course, due to misconfigurations some servers may deliver the
> script files unmodified but in
> >>> general I wouldn't expect that this happens for millions of pages.
> I've checked some of the
> >>> affected URLs:
> >>>
> >>> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
> >>>      https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
> >>>
> >>>      http://www.privi.com/product-details.asp?cno=C10910011
> >>>      http://mental-ray.de/Root_alt/Default.asp
> >>>      http://ekyrs.org/support/index.php?action=profile
> >>>      http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
> >>>
> >>> - (overlong) comment block at start of HTML which "masks" the HTML declaration
> >>>      http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
> >>>      http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
> >>>      https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
> >>>      https://de.e-stories.org/categories.php?&lan=nl&art=p
> >>>
> >>> - HTML with some scripting fragments ("<?php?>") present:
> >>>      http://www.eco-ani-yao.org/shien/
> >>>
> >>> - others are clearly HTML (looks more like a bug; at least, there is no simple explanation)
> >>>      http://www.proedinc.com/customer/content.aspx?redid=9
> >>>      http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
> >>>      http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
> >>>      http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
> >>>
> >>>
> >>> Obviously certain file suffixes (.php, .aspx) should get less weight
> compared to Content-Type
> >>> sent from the responding server.
> >>> Now my question: where's the best place to fix this: in the crawler
> [3] or in Tika?
> >>>
> >>> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm
> >>> happy to help!  The URL index [4] now contains a new field "mime-detected" which makes it easy to
> >>> search or grep for confusion pairs.
> >>>
> >>>
> >>> Thanks and best,
> >>> Sebastian
> >>>
> >>>
> >>> [1] https://github.com/commoncrawl/nutch/issues/3
> >>> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
> >>>     https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
> >>> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
> >>> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
> >>>
> >>
> >
>
>
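
On the "mime-detected" field mentioned in the quoted thread: a small Python sketch of how confusion pairs could be tallied from URL index lines. It assumes each line has the layout `<url key> <timestamp> <JSON>` with the JSON carrying `mime` (from the HTTP header) and `mime-detected` (from Tika); that layout and the function name `count_confusion_pairs` are assumptions made for this example, not something stated in the thread.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def count_confusion_pairs(lines: Iterable[str]) -> "Counter[Tuple[str, str]]":
    """Count (detected, declared) MIME pairs where the two values differ."""
    pairs: "Counter[Tuple[str, str]]" = Counter()
    for line in lines:
        parts = line.split(" ", 2)  # assumed layout: key, timestamp, JSON
        if len(parts) < 3:
            continue  # not an index line
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # malformed JSON, skip
        declared = record.get("mime", "unk")
        detected = record.get("mime-detected")
        if detected and detected != declared:
            pairs[(detected, declared)] += 1
    return pairs
```

Calling `count_confusion_pairs(open_index_file).most_common(20)` on an iterable of index lines would give a ranking comparable in spirit to the tables in the quoted message.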


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
