Re: Adding a WARC parser to Tika

Sebastian Nagel Tue, 11 Jul 2017 11:13:13 -0700

FYI, for a similar task (testing the crawler-commons sitemaps.org parser) I've
started a small test tool which reads the sitemaps from WARC files:

   https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/pOLsCVwRsxY
   https://github.com/sebastian-nagel/sitemap-performance-test/


As it only takes what is necessary for testing, it's lean and "no overkill".

Sebastian

On 07/11/2017 12:06 PM, Jackson, Andy wrote:
> In case it helps, I'll try to summarise what we've done in this area.
> 
> Currently our webarchive-discovery indexing tool parses the WARC and then 
> passes the payload to Tika:
> 
> https://github.com/ukwa/webarchive-discovery
> https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/solr/TikaExtractor.java
> 
> This works fine, but along the way we've also experimented with adding WARC 
> parsing to Tika directly. The code is an extremely messy proof-of-concept but 
> I've pushed it here so you can see how it works:
> 
> https://github.com/ukwa/tika/tree/experimental-warc-parsing
> 
> The parser itself is fairly straightforward:
> 
> https://github.com/ukwa/tika/blob/5d89169151257a2696ceac2a4897527ea1b227a7/tika-parsers/src/main/java/org/apache/tika/parser/warc/WARCParser.java#L94
> 
> but it did require a few changes elsewhere...
> 
> 1. Needed to teach Tika to spot ARC/WARC:
> https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-a7a8080db8d7c69d9a66b875b4c5b9e7
> 
> 2. Added webarchive-commons as a dependency:
> https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-2426935affac837a5f8f7a84a15939f7
> 
> 3. Enabled concatenated block gunzip in order to parse WARC.GZ:
> https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-5ae41a78b18e2ca8481960cd5e02b860
> (given this was explicitly disabled before, this may be contentious?)
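
For readers unfamiliar with "concatenated block" gzip: a .warc.gz is a series of complete gzip members laid end to end, one per record, so a reader that stops after the first member sees only the first record. Below is a minimal sketch of member-by-member reading, in Python rather than Tika's Java and using only the standard library. The name iter_gzip_members and the stand-in records are hypothetical, and this is not the code from the branch above.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in `data`, in order."""
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        payload = decomp.decompress(data) + decomp.flush()
        yield payload
        # Bytes left over after this member's trailer start the next member.
        data = decomp.unused_data


# Two stand-in "records", each compressed as its own gzip member.
records = [
    b"WARC/1.0\r\nWARC-Type: request\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n",
]
blob = b"".join(gzip.compress(r) for r in records)
members = list(iter_gzip_members(blob))
```

Reading member by member keeps the record boundaries, which a plain whole-stream decompress would lose.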
> 
> There are a couple of bigger issues that would need resolving too.
> 
> Firstly, the WARC format is not a file archive, but primarily an HTTP 
> request/response archive. There are 8 different record types (see 
> https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-types
>  for details) that may or may not be of interest. The HTTP request and the 
> response get separate records, and of course the response might be 303 or 
> 404, not just 200. One strategy that is fairly widely used is to simply 
> ignore anything that is not a 200 response, but that does discard quite a lot 
> of information.
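
The "only keep 200 responses" strategy can be sketched as follows, again in plain Python and assuming each record has already been decompressed. This is a deliberately naive illustration: it ignores Content-Length framing, and the helper names (parse_warc_record, http_status, is_ok_response, make_record) are hypothetical rather than taken from webarchive-commons or Tika.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, block)."""
    head, _, block = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # WARC header names are case-insensitive, so normalise them.
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, block


def http_status(block: bytes):
    """Return the status code from an HTTP response block, or None."""
    status_line = block.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split(" ", 2)
    if len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit():
        return int(parts[1])
    return None


def is_ok_response(raw: bytes) -> bool:
    """True only for 'response' records whose HTTP status is 200."""
    _, headers, block = parse_warc_record(raw)
    return headers.get("warc-type") == "response" and http_status(block) == 200


def make_record(warc_type: str, block: bytes) -> bytes:
    # Hypothetical helper that builds a tiny stand-in record for the demo.
    head = "WARC/1.0\r\nWARC-Type: %s\r\nContent-Length: %d\r\n\r\n" % (
        warc_type, len(block))
    return head.encode("utf-8") + block + b"\r\n\r\n"


samples = [
    make_record("request", b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"),
    make_record("response", b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>"),
    make_record("response", b"HTTP/1.1 404 Not Found\r\n\r\n"),
]
kept = [is_ok_response(r) for r in samples]
```

Only the second sample passes the filter. The request record and the 404 response are dropped, which is exactly the information loss described above.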
> 
> Secondly, I'm not sure how many layers of embedded documents are appropriate. According 
> to the spec, I would argue that these are the layers:
> 
> - archive.warc.gz (a series of block-concatenated gzip records)
> - archive.warc.gz/record.warc (an individual WARC record)
> - archive.warc.gz/record.warc/http.response (the message/http in its entirety)
> - archive.warc.gz/record.warc/http.response/entity.body (the actual resource)
> 
> This is probably overkill (and gets worse if it's a gzipped HTTP response!). 
> We could just use:
> 
> - archive.warc.gz (a series of block-concatenated gzip records)
> - archive.warc.gz/record.warc (the parsed entity.body, with all relevant info 
> from WARC and HTTP headers attached as metadata)
> 
> Collapsing the layers down does make it less clear where some of the metadata 
> is coming from, but it’s probably worth it.
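
One way to keep the collapsed form readable is to prefix each metadata key with the layer it came from. The sketch below assumes a single uncompressed HTTP response record. The "warc:" and "http:" prefixes and the name flatten_record are made up for illustration and are not Tika metadata keys.

```python
def flatten_record(raw: bytes):
    """Return (metadata, body) for one uncompressed HTTP response record."""
    warc_head, _, http_message = raw.partition(b"\r\n\r\n")
    http_head, _, body = http_message.partition(b"\r\n\r\n")

    metadata = {}
    # Skip the first line ("WARC/1.x"); the rest are "Name: value" fields.
    for line in warc_head.decode("utf-8").split("\r\n")[1:]:
        name, _, value = line.partition(":")
        metadata["warc:" + name.strip().lower()] = value.strip()

    http_lines = http_head.decode("iso-8859-1").split("\r\n")
    metadata["http:status-line"] = http_lines[0]
    for line in http_lines[1:]:
        name, _, value = line.partition(":")
        metadata["http:" + name.strip().lower()] = value.strip()
    return metadata, body


record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<p>hi</p>"
)
metadata, body = flatten_record(record)
```

Both layers carry a Content-Type header here, and the prefixes stop one from silently overwriting the other.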
> 
> One final note - I've not put the test WARC files in that repo yet as I need 
> to create some new ones from an Apache 2 source.
> 
> I hope this is useful.
> 
> Best,
> Andy
> 
> 
> =-=-=-=-=-=-=-=
> Dr Andrew N. Jackson
> Web Archiving Technical Lead
> 01937 546602
> @UKWebArchive
> @anjacks0n
> Blog: http://britishlibrary.typepad.co.uk/webarchive/
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Nick Burch [mailto:[email protected]]
> Sent: 10 July 2017 19:45
> To: [email protected]
> Subject: Re: Adding a WARC parser to Tika
> 
> On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
>> Sorry, I can't tell if this is tongue-in-cheek...
> 
> No, I do think we should add a WARC parser to Tika Parsers.
> 
> Once done, I'd suggest we figure out a way for Tika Batch to run over a 
> collection of WARC files just as it does for directories, to make it easier 
> to run over crawl collections without having to unpack them first!
> 
> Nick
