In case it helps, I wrote some prototype modules to add ARC and WARC support to Tika:
https://github.com/ukwa/webarchive-discovery/tree/master/digipres-tika/src/main/java/uk/bl/wa/tika/parser/warc

…and extended Tika to use them:
https://github.com/ukwa/webarchive-discovery/blob/master/digipres-tika/src/main/java/uk/bl/wa/tika/PreservationParser.java#L62-L63

However, they are based on the Internet Archive’s (W)ARC parsers, which have a pretty heavy/messy dependency tree. It would probably be better to build them on JWAT, which has few dependencies (but may not be quite as robust to edge cases as the IA ones).

https://sbforge.org/display/JWAT/JWAT
(see e.g. https://sbforge.org/display/JWAT/Reading+a+WARC+file)

Hope that helps,
Andy Jackson (UK Web Archive)

From: Timothy Allison <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, 7 July 2017 at 11:52
To: "[email protected]" <[email protected]>
Subject: RE: Tika content detection and crawled "remote" content

Should we add a WARC parser? :)

From: Julien Nioche [mailto:[email protected]]
Sent: Friday, July 7, 2017 3:43 AM
To: [email protected]
Subject: Re: Tika content detection and crawled "remote" content

Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection and parsing is contained (URL, HTTP metadata, binary content).

you could do that with my good old Behemoth (https://github.com/DigitalPebble/behemoth) in 2 steps: WARC to Behemoth format then run Tika on that

On 6 July 2017 at 13:27, Sebastian Nagel <[email protected]> wrote:

Hi,

> Otherwise, for anything else (eg that word / graphviz one), please do open up
> JIRAs!

Done, see TIKA-2242.

>> Why, yes, please! JIRA with small samples would be fantastic.

1000 randomly chosen examples per content-type are ready:

https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/
  tika_html_server_side_scripting_lang_php.warc.gz
  tika_html_server_side_scripting_lang_asp.warc.gz
  tika_html_server_side_scripting_lang_coldfusion.warc.gz
  tika_html_server_side_scripting_lang_jsp.warc.gz
  tika_html_server_side_scripting_lang_cgi.warc.gz
  tika_html_server_side_scripting_lang_perl.warc.gz

Note: there are few real PHP/JSP/Perl/... documents among them.

If there is no "global" solution (TIKA-2419), I'll open "smaller" Jiras.

Is anyone aware of a tool to run Tika on a WARC file? Everything required for detection and parsing is contained (URL, HTTP metadata, binary content).

Thanks,
Sebastian

On 07/05/2017 04:07 PM, Nick Burch wrote:
> Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a
> rabbit hole... I think at least some of them are as described in TIKA-2419, where our change to the
> HTML magic priority to fix for HTML-containing formats like email had broken some things.
>
> I've done a quick fix for 1.16, but it'd be good to try the impact of other things, eg dropping the
> xml priority to match the html one to see if that helps / breaks other things
>
> Otherwise, for anything else (eg that word / graphviz one), please do open up JIRAs!
>
> Thanks
> Nick
>
> On 05/07/17 14:10, Allison, Timothy B. wrote:
>> Why, yes, please! JIRA with small samples would be fantastic. I think working in desc order of
>> most common to least would be best...php, asp, coldfusion.
>>
>> I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data.
>>
>> Again, many thanks!
>>
>> Cheers,
>>
>> Tim
>>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:[email protected]]
>> Sent: Wednesday, July 5, 2017 9:03 AM
>> To: [email protected]
>> Subject: Re: Tika content detection and crawled "remote" content
>>
>> Hi Tim,
>>
>> thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) or whether I can
>> help by compiling smaller test sets.
>>
>> Best,
>> Sebastian
>>
>> On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
>>> This is FANTASTIC!!! Thank you, Sebastian!
>>>
>>> I suspect that we should try to fix these at the Tika level. We'll never be 100%, but most of
>>> the problems you describe _should_ be fixable.
>>>
>>> > If anyone is interested in using the detected MIME types or anything else from Common Crawl -
>>> > I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it
>>> > easy to search or grep for confusion pairs.
>>>
>>> This is an amazing step forward for our regression corpus. We used to rely on the http headers
>>> and/or file suffix to oversample non-html. This will allow far cleaner pulls.
>>>
>>> -----Original Message-----
>>> From: Sebastian Nagel [mailto:[email protected]]
>>> Sent: Tuesday, July 4, 2017 6:18 AM
>>> To: [email protected]
>>> Subject: Tika content detection and crawled "remote" content
>>>
>>> Hi,
>>>
>>> recently I've plugged Tika's content detection into Common Crawl's crawler (modified Nutch)
>>> with the goal of getting clean and correct MIME types - the HTTP Content-Type may contain garbage
>>> and isn't always correct [1].
>>>
>>> For the June 2017 crawl I've prepared a comparison of content types
>>> sent by the server in the HTTP header and as detected by Tika 1.15
>>> [2]. It shows that content types by Tika are definitely clean
>>> (1,400 different content types vs. more than 6,000 content type "strings"
>>> from HTTP headers).
>>>
>>> A look at the "confusions" where Content-Type and Tika differ shows a mixed picture: some pairs
>>> are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME at all:
>>>
>>>             Tika-1.15                HTTP-Content-Type
>>>  1001968023 application/xhtml+xml    text/html
>>>     2298146 application/rss+xml      text/xml
>>>      617435 application/rss+xml      application/xml
>>>      613525 text/html                unk
>>>      361525 application/xhtml+xml    unk
>>>      297707 application/rdf+xml      application/xml
>>>
>>> However, there are a few dubious decisions, esp. the group of web server-side scripting languages
>>> (ASP, JSP, PHP, ColdFusion, etc.):
>>>
>>>             Tika-1.15                HTTP-Content-Type
>>>     2047739 text/x-php               text/html
>>>      681629 text/asp                 text/html
>>>      193095 text/x-coldfusion        text/html
>>>      172318 text/aspdotnet           text/html
>>>      139033 text/x-jsp               text/html
>>>       38415 text/x-cgi               text/html
>>>       32092 text/x-php               text/xml
>>>       18021 text/x-perl              text/html
>>>
>>> Of course, due to misconfigurations some servers may deliver the script files unmodified, but in
>>> general I wouldn't expect that this happens for millions of pages. I've checked some of the
>>> affected URLs:
>>>
>>> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
>>>   https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
>>>   http://www.privi.com/product-details.asp?cno=C10910011
>>>   http://mental-ray.de/Root_alt/Default.asp
>>>   http://ekyrs.org/support/index.php?action=profile
>>>   http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
>>>
>>> - (overlong) comment block at start of HTML which "masks" the HTML declaration
>>>   http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>>>   http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
>>>   https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
>>>   https://de.e-stories.org/categories.php?&lan=nl&art=p
>>>
>>> - HTML with some scripting fragments ("<?php?>") present:
>>>   http://www.eco-ani-yao.org/shien/
>>>
>>> - others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
>>>   http://www.proedinc.com/customer/content.aspx?redid=9
>>>   http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
>>>   http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
>>>   http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
>>>
>>> Obviously certain file suffixes (.php, .aspx) should get less weight compared to the Content-Type
>>> sent by the responding server.
>>> Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
>>>
>>> If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm
>>> happy to help! The URL index [4] now contains a new field "mime-detected" which makes it easy to
>>> search or grep for confusion pairs.
>>>
>>> Thanks and best,
>>> Sebastian
>>>
>>> [1] https://github.com/commoncrawl/nutch/issues/3
>>> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>>>     https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>>> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
>>> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
>>>
>>
>

--
Open Source Solutions for Text Engineering

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble (http://twitter.com/digitalpebble)
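
For anyone who wants to pull the suspicious pairs out of a count table like the one Sebastian quotes, here is a minimal Python sketch. It is not part of Tika, Nutch, or Common Crawl. It assumes each data row has the shape "count, Tika type, HTTP Content-Type" separated by whitespace, as in the excerpt in this thread; whether the full content-type-diff file [2] follows exactly that layout is an assumption. The function names and the set of "scripting" types are illustrative only, with the types taken from the second table above.

```python
# Rows copied from the second table in Sebastian's message:
# count, type detected by Tika 1.15, type sent in the HTTP Content-Type header.
SAMPLE = """\
2047739 text/x-php text/html
681629 text/asp text/html
193095 text/x-coldfusion text/html
172318 text/aspdotnet text/html
139033 text/x-jsp text/html
38415 text/x-cgi text/html
32092 text/x-php text/xml
18021 text/x-perl text/html
"""

# Assumption: these are the server-side scripting types worth flagging
# (exactly the ones that appear in the table above).
SCRIPTING_TYPES = {
    "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}


def parse_rows(text):
    """Parse 'count tika_type http_type' rows; skip headers and blank lines."""
    rows = []
    for line in text.splitlines():
        # Tolerate rows pasted from a quoted email (leading '>' markers).
        parts = line.lstrip("> ").split(None, 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        count, tika_type, http_type = parts
        # The HTTP value is kept whole, since header strings may contain spaces.
        rows.append((int(count), tika_type, http_type.strip()))
    return rows


def suspicious_pairs(rows, http_type="text/html"):
    """Rows where Tika reports a scripting type but the server sent http_type."""
    return [r for r in rows if r[1] in SCRIPTING_TYPES and r[2] == http_type]


pairs = suspicious_pairs(parse_rows(SAMPLE))
total = sum(count for count, _, _ in pairs)

for count, tika_type, http_type in pairs:
    print(f"{count:>10} {tika_type:<20} {http_type}")
print(f"{total:>10} total")
```

For the eight rows quoted in this thread, the filter keeps the seven pairs served as text/html and leaves out the text/x-php vs. text/xml row. Passing a different http_type to suspicious_pairs selects other confusion pairs.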
