FYI, for a similar task - testing the crawler-commons sitemaps.org parser - I've started a small test tool which reads the sitemaps from WARC files:

  https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/pOLsCVwRsxY
  https://github.com/sebastian-nagel/sitemap-performance-test/

As it only takes what is necessary for testing, it's lean and "no overkill".

Sebastian

On 07/11/2017 12:06 PM, Jackson, Andy wrote:
> In case it helps, I'll try to summarise what we've done in this area.
>
> Currently our webarchive-discovery indexing tool parses the WARC and then
> passes the payload to Tika:
>
> https://github.com/ukwa/webarchive-discovery
> https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/solr/TikaExtractor.java
>
> This works fine, but along the way we've also experimented with adding WARC
> parsing to Tika directly. The code is an extremely messy proof-of-concept but
> I've pushed it here so you can see how it works:
>
> https://github.com/ukwa/tika/tree/experimental-warc-parsing
>
> The parser itself is fairly straightforward:
>
> https://github.com/ukwa/tika/blob/5d89169151257a2696ceac2a4897527ea1b227a7/tika-parsers/src/main/java/org/apache/tika/parser/warc/WARCParser.java#L94
>
> but it did require a few changes elsewhere...
>
> 1. Needed to teach Tika to spot ARC/WARC:
> https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-a7a8080db8d7c69d9a66b875b4c5b9e7
>
> 2. Added webarchive-commons as a dependency:
> https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-2426935affac837a5f8f7a84a15939f7
>
> 3. Enable concatenated block gunzip in order to parse WARC.GZ:
> https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-5ae41a78b18e2ca8481960cd5e02b860
> (given this was explicitly disabled before, this may be contentious?)
>
> There's another couple of bigger issues that would need resolving too.
>
> Firstly, the WARC format is not a file archive, but primarily a HTTP
> request/response archive. There are 8 different record types (see
> https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-types
> for details) that may or may not be of interest.
> The HTTP request and the response get separate records, and of course the
> response might be 303 or 404, not just 200. One strategy that is fairly
> widely used is to simply ignore anything that is not a 200 response, but that
> does discard quite a lot of information.
>
> Secondly, I'm not sure how many layers of embedded are appropriate. According
> to the spec, I would argue that these are the layers:
>
> - archive.warc.gz (a series of block-concatenated gzip records)
> - archive.warc.gz/record.warc (an individual WARC record)
> - archive.warc.gz/record.warc/http.response (the message/http in its entirety)
> - archive.warc.gz/record.warc/http.response/entity.body (the actual resource)
>
> This is probably overkill (and gets worse if it's a gzipped HTTP response!).
> We could just use:
>
> - archive.warc.gz (a series of block-concatenated gzip records)
> - archive.warc.gz/record.warc (the parsed entity.body, with all relevant info
>   from WARC and HTTP headers attached as metadata)
>
> Collapsing the layers down does make it less clear where some of the metadata
> is coming from, but it’s probably worth it.
>
> One final note - I've not put the test WARC files in that repo yet as I need
> to create some new ones from an Apache 2 source.
>
> I hope this is useful.
>
> Best,
> Andy
>
> =-=-=-=-=-=-=-=
> Dr Andrew N. Jackson
> Web Archiving Technical Lead
> 01937 546602
> @UKWebArchive
> @anjacks0n
> Blog: http://britishlibrary.typepad.co.uk/webarchive/
>
> -----Original Message-----
> From: Nick Burch [mailto:[email protected]]
> Sent: 10 July 2017 19:45
> To: [email protected]
> Subject: Re: Adding a WARC parser to Tika
>
> On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
>> Sorry, I can't tell if this is tongue-in-cheek...
>
> No, I do think we should add a WARC parser to Tika Parsers.
>
> Once done, I'd suggest we figure out a way for Tika Batch to run over a
> collection of WARC files just as it does for directories, to make it easier
> to run over crawl collections without having to unpack them first!
>
> Nick
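
For anyone unsure what "concatenated block gunzip" (item 3 in Andy's list) means in practice: a .warc.gz is a series of independently gzipped members written back to back, so a reader that stops after the first member sees only the first record. Below is a minimal Python sketch of walking the members one by one, using only the standard library. The function name iter_gzip_members and the two sample payloads are made up for illustration; this is not the code from the experimental Tika branch.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in `data`, in order."""
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
        member = decomp.decompress(data) + decomp.flush()
        yield member
        # Whatever follows the end of this member is the start of the next one.
        data = decomp.unused_data


# Two made-up payloads standing in for WARC records. Each is gzipped on its
# own and the results are concatenated, mimicking the .warc.gz layout.
records = [
    b"WARC/1.0\r\nWARC-Type: request\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n",
]
blob = b"".join(gzip.compress(r) for r in records)

members = list(iter_gzip_members(blob))
print(len(members))
```

Reading member by member keeps the record boundaries, which matter if each record is to be treated as a separate embedded document. Python's gzip.decompress(blob) also reads all members, but returns them joined into one byte string.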

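
The collapsed two-layer layout Andy describes (entity body as the content, WARC and HTTP headers attached as metadata, optionally skipping anything that is not a 200 response) can be sketched roughly as follows. This is a hedged illustration in plain Python, not the Tika or webarchive-commons API: parse_head, collapse_record, make_record and the "warc:" / "http:" metadata prefixes are all hypothetical names, and the header parsing ignores folded and repeated headers and assumes exact header-name casing.

```python
def parse_head(raw: bytes):
    """Split 'first line, header lines, blank line, rest' into three parts."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8", "replace").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return lines[0], headers, rest


def collapse_record(record: bytes, only_ok: bool = True):
    """Return {'body', 'metadata'} for an HTTP response record, else None."""
    _version, warc_headers, rest = parse_head(record)
    if warc_headers.get("WARC-Type") != "response":
        return None  # request, metadata, revisit, ... are skipped in this sketch
    # The record block is exactly Content-Length bytes long.
    block = rest[: int(warc_headers["Content-Length"])]
    status_line, http_headers, body = parse_head(block)
    status = int(status_line.split()[1])
    if only_ok and status != 200:
        return None  # the widely used "200 only" strategy
    # Prefix the keys so it stays visible which layer each value came from.
    metadata = {"http:status": str(status)}
    metadata.update({"warc:" + k: v for k, v in warc_headers.items()})
    metadata.update({"http:" + k: v for k, v in http_headers.items()})
    return {"body": body, "metadata": metadata}


def make_record(warc_type: bytes, block: bytes) -> bytes:
    """Build a tiny, made-up WARC record around `block` for the demo below."""
    return (
        b"WARC/1.0\r\nWARC-Type: " + warc_type + b"\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        b"Content-Length: " + str(len(block)).encode() + b"\r\n\r\n"
        + block + b"\r\n\r\n"
    )


ok = make_record(b"response", b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>")
missing = make_record(b"response", b"HTTP/1.1 404 Not Found\r\n\r\ngone")
request = make_record(b"request", b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")

result = collapse_record(ok)
print(result["body"], result["metadata"]["warc:WARC-Target-URI"])
```

Prefixing the metadata keys is one way to soften the drawback Andy mentions, that collapsing the layers makes it less clear where a given piece of metadata came from. Passing only_ok=False keeps the 303/404 responses rather than discarding them.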