> > Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> > and parsing is contained (URL, HTTP metadata, binary content).
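For what it's worth, pulling those three pieces (URL, HTTP metadata, binary content) out of a WARC file does not need much code. The following is only a rough Python sketch using the standard library, not a tested tool. It assumes well-formed records, and it does not undo chunked transfer encoding or compressed bodies. The names `open_warc`, `iter_warc_records` and `detection_inputs` are made up for this illustration, and how the resulting triples are handed to Tika is left open.

```python
import gzip


def open_warc(path):
    """Open a .warc.gz file as a byte stream (gzip handles multi-member files)."""
    return gzip.open(path, "rb")


def iter_warc_records(stream):
    """Yield (warc_headers, block) for every record in a WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return                          # end of file
        if not line.strip():
            continue                        # blank lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % line[:20])
        headers = {}
        while True:
            raw = stream.readline()
            if raw in (b"\r\n", b"\n", b""):
                break                       # empty line ends the WARC header
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers.get("content-length", "0")))
        yield headers, block


def detection_inputs(stream):
    """Yield (url, http_content_type, body) for each HTTP response record."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue                        # skip warcinfo, request, metadata
        head, _, body = block.partition(b"\r\n\r\n")
        content_type = None
        for raw in head.split(b"\r\n")[1:]:  # [0] is the HTTP status line
            name, _, value = raw.decode("iso-8859-1").partition(":")
            if name.strip().lower() == "content-type":
                content_type = value.strip()
        yield headers.get("warc-target-uri"), content_type, body


# Hypothetical usage; the file name is one of the samples from this thread:
# with open_warc("tika_html_server_side_scripting_lang_php.warc.gz") as f:
#     for url, content_type, body in detection_inputs(f):
#         ...  # pass url, content_type and body to Tika here
```

Each triple could then be passed to whatever Tika entry point is at hand; that part is deliberately omitted here.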
You could do that with my good old Behemoth <https://github.com/DigitalPebble/behemoth> in two steps: convert the WARC to Behemoth format, then run Tika on that.

On 6 July 2017 at 13:27, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi,
>
> > Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
>
> Done, see TIKA-2242.
>
> >> Why, yes, please! JIRA with small samples would be fantastic.
>
> 1000 randomly chosen examples per content-type are ready:
>
> https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
>     tika_html_server_side_scripting_lang_php.warc.gz
>     tika_html_server_side_scripting_lang_asp.warc.gz
>     tika_html_server_side_scripting_lang_coldfusion.warc.gz
>     tika_html_server_side_scripting_lang_jsp.warc.gz
>     tika_html_server_side_scripting_lang_cgi.warc.gz
>     tika_html_server_side_scripting_lang_perl.warc.gz
>
> Note: there are few real PHP/JSP/Perl/... documents among them.
>
> If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.
>
> Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection
> and parsing is contained (URL, HTTP metadata, binary content).
>
> Thanks,
> Sebastian
>
> On 07/05/2017 04:07 PM, Nick Burch wrote:
> > Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a
> > rabbit hole... I think at least some of them are as described in TIKA-2419, where our change to the
> > HTML magic priority to fix for HTML-containing formats like email had broken some things.
> >
> > I've done a quick fix for 1.16, but it'd be good to try the impact of other things, eg dropping the
> > xml priority to match the html one to see if that helps / breaks other things
> >
> > Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
> >
> > Thanks
> > Nick
> >
> > On 05/07/17 14:10, Allison, Timothy B. wrote:
> >> Why, yes, please! JIRA with small samples would be fantastic.
> >> I think working in desc order of most common to least would be best...php, asp, coldfusion.
> >>
> >> I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data.
> >>
> >> Again, many thanks!
> >>
> >> Cheers,
> >>
> >> Tim
> >>
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >> Sent: Wednesday, July 5, 2017 9:03 AM
> >> To: user@tika.apache.org
> >> Subject: Re: Tika content detection and crawled "remote" content
> >>
> >> Hi Tim,
> >>
> >> thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) or whether I can
> >> help by compiling smaller test sets.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
> >>> This is FANTASTIC!!! Thank you, Sebastian!
> >>>
> >>> I suspect that we should try to fix these at the Tika level. We'll never be 100%, but most of
> >>> the problems you describe _should_ be fixable.
> >>>
> >>> > If anyone is interested in using the detected MIME types or anything else from Common Crawl -
> >>> > I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it
> >>> > easy to search or grep for confusion pairs.
> >>>
> >>> This is an amazing step forward for our regression corpus. We used to rely on the http headers
> >>> and/or file suffix to oversample non-html. This will allow far cleaner pulls.
> >>>
> >>> -----Original Message-----
> >>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >>> Sent: Tuesday, July 4, 2017 6:18 AM
> >>> To: user@tika.apache.org
> >>> Subject: Tika content detection and crawled "remote" content
> >>>
> >>> Hi,
> >>>
> >>> recently I've plugged in Tika's content detection into Common Crawl's crawler (modified Nutch)
> >>> with the target to get clean and correct MIME type - the HTTP Content-Type may contain garbage
> >>> and isn't always correct [1].
> >>>
> >>> For the June 2017 crawl I've prepared a comparison of content types
> >>> sent by the server in the HTTP header and as detected by Tika 1.15
> >>> [2]. It shows that content types by Tika are definitely clean
> >>> (1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).
> >>>
> >>> A look at the "confusions", where Content-Type and Tika differ, shows a mixed picture: some pairs
> >>> are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME at all:
> >>>
> >>>                 Tika-1.15               HTTP-Content-Type
> >>>    1001968023   application/xhtml+xml   text/html
> >>>       2298146   application/rss+xml     text/xml
> >>>        617435   application/rss+xml     application/xml
> >>>        613525   text/html               unk
> >>>        361525   application/xhtml+xml   unk
> >>>        297707   application/rdf+xml     application/xml
> >>>
> >>> However, there are a few dubious decisions, esp. the group of web server-side scripting languages
> >>> (ASP, JSP, PHP, ColdFusion, etc.):
> >>>
> >>>                 Tika-1.15               HTTP-Content-Type
> >>>       2047739   text/x-php              text/html
> >>>        681629   text/asp                text/html
> >>>        193095   text/x-coldfusion       text/html
> >>>        172318   text/aspdotnet          text/html
> >>>        139033   text/x-jsp              text/html
> >>>         38415   text/x-cgi              text/html
> >>>         32092   text/x-php              text/xml
> >>>         18021   text/x-perl             text/html
> >>>
> >>> Of course, due to misconfigurations some servers may deliver the script files unmodified but in
> >>> general I wouldn't expect that this happens for millions of pages.
> >>> I've checked some of the affected URLs:
> >>>
> >>> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
> >>>   https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
> >>>   http://www.privi.com/product-details.asp?cno=C10910011
> >>>   http://mental-ray.de/Root_alt/Default.asp
> >>>   http://ekyrs.org/support/index.php?action=profile
> >>>   http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
> >>>
> >>> - (overlong) comment block at start of HTML which "masks" the HTML declaration
> >>>   http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
> >>>   http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
> >>>   https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
> >>>   https://de.e-stories.org/categories.php?&lan=nl&art=p
> >>>
> >>> - HTML with some scripting fragments ("<?php?>") present:
> >>>   http://www.eco-ani-yao.org/shien/
> >>>
> >>> - others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
> >>>   http://www.proedinc.com/customer/content.aspx?redid=9
> >>>   http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
> >>>   http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
> >>>   http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
> >>>
> >>> Obviously certain file suffixes (.php, .aspx) should get less weight compared to Content-Type
> >>> sent from the responding server.
> >>> Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
> >>>
> >>> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm
> >>> happy to help!
> >>> The URL index [4] contains now a new field "mime-detected" which makes it easy to
> >>> search or grep for confusion pairs.
> >>>
> >>> Thanks and best,
> >>> Sebastian
> >>>
> >>> [1] https://github.com/commoncrawl/nutch/issues/3
> >>> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
> >>>     https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
> >>> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
> >>> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/

--
*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>