In case it helps, I'll try to summarise what we've done in this area.

Currently our webarchive-discovery indexing tool parses the WARC and then 
passes the payload to Tika:

https://github.com/ukwa/webarchive-discovery
https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/solr/TikaExtractor.java

This works fine, but along the way we've also experimented with adding WARC 
parsing to Tika directly. The code is an extremely messy proof-of-concept but 
I've pushed it here so you can see how it works:

https://github.com/ukwa/tika/tree/experimental-warc-parsing

The parser itself is fairly straightforward:

https://github.com/ukwa/tika/blob/5d89169151257a2696ceac2a4897527ea1b227a7/tika-parsers/src/main/java/org/apache/tika/parser/warc/WARCParser.java#L94

but it did require a few changes elsewhere...

1. Needed to teach Tika to spot ARC/WARC:
https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-a7a8080db8d7c69d9a66b875b4c5b9e7

2. Added webarchive-commons as a dependency:
https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-2426935affac837a5f8f7a84a15939f7

3. Enabled concatenated-block gunzip in order to parse WARC.GZ:
https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-5ae41a78b18e2ca8481960cd5e02b860
(given this was explicitly disabled before, this may be contentious?)
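
To illustrate what "concatenated block gunzip" means here, the following is a minimal Python sketch, not the Tika change itself. It assumes a WARC.GZ is simply a sequence of independent gzip members, one per record; the helper name iter_gzip_members and the two heavily abbreviated records are hypothetical.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in `data`, in order."""
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        member = decomp.decompress(data)
        member += decomp.flush()
        yield member
        # Whatever follows the end of this member is the start of the next one.
        data = decomp.unused_data


# Two hypothetical, heavily abbreviated WARC records, gzipped separately.
records = [
    b"WARC/1.1\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
    b"WARC/1.1\r\nWARC-Type: response\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
]
warc_gz = b"".join(gzip.compress(r) for r in records)

members = list(iter_gzip_members(warc_gz))
```

A decompressor that stops after the first member would only ever see the first record, which is why this setting matters for WARC.GZ. Reading member by member also keeps the record boundaries, whereas gzip.decompress(warc_gz) returns all the records joined into one byte string.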

There are a couple of bigger issues that would need resolving too.

Firstly, the WARC format is not a file archive, but primarily an HTTP 
request/response archive. There are 8 different record types (see 
https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-types
for details) that may or may not be of interest. The HTTP request and the 
response get separate records, and of course the response might be 303 or 404, 
not just 200. One strategy that is fairly widely used is to simply ignore 
anything that is not a 200 response, but that does discard quite a lot of 
information (the requests, redirects and error responses, for a start).
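
As a rough illustration of the "only keep 200 responses" strategy, here is a minimal Python sketch. It is not how webarchive-discovery or the experimental parser does it; the helper names parse_headers and http_status are hypothetical, and the sample records are abbreviated (real records carry more headers, such as Content-Length and WARC-Record-ID).

```python
def parse_headers(block: bytes):
    """Split a header block from its body; return (first_line, headers, body)."""
    head, _, body = block.partition(b"\r\n\r\n")
    # latin-1 never fails to decode, which is good enough for a sketch.
    lines = head.decode("latin-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body


def http_status(record: bytes):
    """Return the HTTP status code for a 'response' record, else None."""
    _, warc_headers, block = parse_headers(record)
    if warc_headers.get("warc-type") != "response":
        return None
    status_line, _, _ = parse_headers(block)
    parts = status_line.split(" ")
    if len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit():
        return int(parts[1])
    return None


def make_record(warc_type: str, payload: bytes) -> bytes:
    """Build an abbreviated, hypothetical WARC record for the example."""
    head = "WARC/1.1\r\nWARC-Type: %s\r\nWARC-Target-URI: http://example.org/\r\n\r\n"
    return (head % warc_type).encode("ascii") + payload


records = [
    make_record("request", b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"),
    make_record("response", b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>"),
    make_record("response", b"HTTP/1.1 303 See Other\r\nLocation: /new\r\n\r\n"),
    make_record("response", b"HTTP/1.1 404 Not Found\r\n\r\n"),
]

statuses = [http_status(r) for r in records]
kept = [r for r in records if http_status(r) == 200]
```

In this toy input only one of the four records survives the filter, which gives a feel for how much gets dropped.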

Secondly, I'm not sure how many layers of embedded documents are appropriate. 
According to the spec, I would argue that these are the layers:

- archive.warc.gz (a series of block-concatenated gzip records)
- archive.warc.gz/record.warc (an individual WARC record)
- archive.warc.gz/record.warc/http.response (the message/http in its entirety)
- archive.warc.gz/record.warc/http.response/entity.body (the actual resource)

This is probably overkill (and gets worse if it's a gzipped HTTP response!). We 
could just use:

- archive.warc.gz (a series of block-concatenated gzip records)
- archive.warc.gz/record.warc (the parsed entity.body, with all relevant info 
from WARC and HTTP headers attached as metadata)

Collapsing the layers down does make it less clear where some of the metadata 
is coming from, but it’s probably worth it.
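
One way to keep the provenance visible after collapsing the layers would be to prefix each metadata key with the layer it came from. This is only a sketch of the idea in Python: the "warc:" and "http:" prefixes and the flatten_metadata helper are assumptions for illustration, not existing Tika property names, and the header values are made up.

```python
def flatten_metadata(warc_headers: dict, http_headers: dict) -> dict:
    """Merge both header sets into one dict, prefixing keys by their origin."""
    merged = {}
    for name, value in warc_headers.items():
        merged["warc:" + name] = value
    for name, value in http_headers.items():
        merged["http:" + name] = value
    return merged


# Hypothetical headers for one response record. Note that both layers
# define Content-Type, with different meanings.
warc_headers = {
    "WARC-Type": "response",
    "Content-Type": "application/http; msgtype=response",
}
http_headers = {
    "Content-Type": "text/html",
    "Server": "ExampleServer",
}

merged = flatten_metadata(warc_headers, http_headers)
```

Without some kind of prefix, the two Content-Type values would collide and one of them would be lost.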

One final note - I've not put the test WARC files in that repo yet as I need to 
create some new ones from an Apache 2 source.

I hope this is useful.

Best,
Andy


=-=-=-=-=-=-=-=
Dr Andrew N. Jackson
Web Archiving Technical Lead
01937 546602
@UKWebArchive
@anjacks0n
Blog: http://britishlibrary.typepad.co.uk/webarchive/





-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: 10 July 2017 19:45
To: [email protected]
Subject: Re: Adding a WARC parser to Tika

On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
> Sorry, I can't tell if this is tongue-in-cheek...

No, I do think we should add a WARC parser to Tika Parsers.

Once done, I'd suggest we figure out a way for Tika Batch to run over a 
collection of WARC files just as it does for directories, to make it easier to 
run over crawl collections without having to unpack them first!

Nick

