Re: Adding a WARC parser to Tika

Jackson, Andy Tue, 11 Jul 2017 15:02:15 -0700

Nice.

Well, in case it's useful, I cleaned up my code somewhat, used Sebastian's
code to parse the HTTP headers for WARC files, and added (BSD licensed)
test files from DROID and some reasonably meaningful tests.


It's on this branch:

https://github.com/ukwa/tika/tree/experimental-warc-parsing

And the parser tests give some idea of the current behaviour:

https://github.com/ukwa/tika/blob/experimental-warc-parsing/tika-parsers/src/test/java/org/apache/tika/parser/warc/WARCParserTest.java

HTH,
Andy


On 11/07/2017, 19:11, "Sebastian Nagel" <[email protected]> wrote:

>FYI, for a similar task (testing the crawler-commons sitemaps.org parser),
>I've started a small test tool which reads the sitemaps from WARC files:
>
>https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/pOLsCVwRsxY
>https://github.com/sebastian-nagel/sitemap-performance-test/
>
>As it only takes what is necessary for testing, it's lean and "no
>overkill".
>
>Sebastian
>
>On 07/11/2017 12:06 PM, Jackson, Andy wrote:
>> In case it helps, I'll try to summarise what we've done in this area.
>>
>> Currently our webarchive-discovery indexing tool parses the WARC and
>>then passes the payload to Tika:
>>
>> https://github.com/ukwa/webarchive-discovery
>>
>>https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/solr/TikaExtractor.java
>>
>> This works fine, but along the way we've also experimented with adding
>>WARC parsing to Tika directly. The code is an extremely messy
>>proof-of-concept but I've pushed it here so you can see how it works:
>>
>> https://github.com/ukwa/tika/tree/experimental-warc-parsing
>>
>> The parser itself is fairly straightforward:
>>
>>
>>https://github.com/ukwa/tika/blob/5d89169151257a2696ceac2a4897527ea1b227a7/tika-parsers/src/main/java/org/apache/tika/parser/warc/WARCParser.java#L94
>>
>> but it did require a few changes elsewhere...
>>
>> 1. Needed to teach Tika to spot ARC/WARC:
>>
>>https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-a7a8080db8d7c69d9a66b875b4c5b9e7
>>
>> 2. Added webarchive-commons as a dependency:
>>
>>https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-2426935affac837a5f8f7a84a15939f7
>>
>> 3. Enable concatenated block gunzip in order to parse WARC.GZ:
>>
>>https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-5ae41a78b18e2ca8481960cd5e02b860
>> (given this was explicitly disabled before, this may be contentious?)
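
For illustration, here is a minimal sketch of what reading a block-concatenated gzip stream involves. It uses only java.util.zip, not the code in the linked diff, and the class and method names are made up for this example. Each WARC record is stood in for by a short string compressed as its own gzip member.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class ConcatenatedGzipSketch {

    // Compress one string as a complete, standalone gzip member.
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // Append members back to back, the way a .warc.gz stores its records.
    static byte[] concat(byte[]... members) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] member : members) {
            out.write(member, 0, member.length);
        }
        return out.toByteArray();
    }

    // Decompress the whole stream. java.util.zip.GZIPInputStream carries on
    // into the next member when more gzip data follows the current trailer.
    static String gunzipAll(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A reader that stops at the end of the first gzip member would only ever see the first record, which is why concatenated decoding matters for WARC.GZ.
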
>>
>> There's another couple of bigger issues that would need resolving too.
>>
>> Firstly, the WARC format is not a file archive, but primarily a HTTP
>>request/response archive. There are 8 different record types (see
>>https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-types
>>for details) that may or may not be of
>>interest. The HTTP request and the response get separate records, and of
>>course the response might be 303 or 404, not just 200. One strategy that
>>is fairly widely used is to simply ignore anything that is not a 200
>>response, but that does discard quite a lot of information.
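
The "only keep 200 responses" strategy can be sketched roughly as below. WarcRecord is a hypothetical stand-in for whatever the WARC reader returns, not a webarchive-commons or Tika type, and it is assumed that the HTTP status line has already been read from the start of the record payload.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; a sketch, not the parser on the branch.
final class RecordFilterSketch {

    // The eight WARC-Type values defined by WARC 1.1.
    enum WarcType {
        WARCINFO, RESPONSE, RESOURCE, REQUEST,
        METADATA, REVISIT, CONVERSION, CONTINUATION
    }

    // Minimal record model: the record type plus the HTTP status line,
    // e.g. "HTTP/1.1 200 OK", or null when the payload is not an HTTP response.
    static final class WarcRecord {
        final WarcType type;
        final String statusLine;

        WarcRecord(WarcType type, String statusLine) {
            this.type = type;
            this.statusLine = statusLine;
        }
    }

    // Extract the numeric status code, or -1 if the line is not parseable.
    static int statusCode(String statusLine) {
        if (statusLine == null) {
            return -1;
        }
        String[] parts = statusLine.trim().split("\\s+");
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Keep only response records whose HTTP status is exactly 200.
    static List<WarcRecord> keepOkResponses(List<WarcRecord> records) {
        List<WarcRecord> kept = new ArrayList<>();
        for (WarcRecord r : records) {
            if (r.type == WarcType.RESPONSE && statusCode(r.statusLine) == 200) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

Everything the filter drops (requests, redirects, 404s, revisit records and so on) is the information this strategy discards.
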
>>
>> Secondly, I'm not sure how many layers of embedded documents are appropriate.
>>According to the spec, I would argue that these are the layers:
>>
>> - archive.warc.gz (a series of block-concatenated gzip records)
>> - archive.warc.gz/record.warc (an individual WARC record)
>> - archive.warc.gz/record.warc/http.response (the message/http in its
>>entirety)
>> - archive.warc.gz/record.warc/http.response/entity.body (the actual
>>resource)
>>
>> This is probably overkill (and gets worse if it's a gzipped HTTP
>>response!). We could just use:
>>
>> - archive.warc.gz (a series of block-concatenated gzip records)
>> - archive.warc.gz/record.warc (the parsed entity.body, with all
>>relevant info from WARC and HTTP headers attached as metadata)
>>
>> Collapsing the layers down does make it less clear where some of the
>>metadata is coming from, but it's probably worth it.
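
One possible way to collapse the layers while keeping track of where each value came from is to prefix the keys when merging the headers into a single metadata map. The "warc:" and "http:" prefixes and the class below are assumptions made for this sketch, not what the branch does and not an existing Tika naming convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class CollapsedMetadataSketch {

    // Merge WARC record headers and HTTP response headers into one flat map.
    // Prefixing each key keeps same-named headers (such as Content-Type,
    // which appears at both levels) from overwriting each other.
    static Map<String, String> collapse(Map<String, String> warcHeaders,
                                        Map<String, String> httpHeaders) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : warcHeaders.entrySet()) {
            merged.put("warc:" + e.getKey(), e.getValue());
        }
        for (Map.Entry<String, String> e : httpHeaders.entrySet()) {
            merged.put("http:" + e.getKey(), e.getValue());
        }
        return merged;
    }
}
```

With this layout the record-level Content-Type (application/http; msgtype=response) and the HTTP-level Content-Type of the actual resource both survive in the metadata of the single embedded document.
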
>>
>> One final note - I've not put the test WARC files in that repo yet as I
>>need to create some new ones from an Apache 2 source.
>>
>> I hope this is useful.
>>
>> Best,
>> Andy
>>
>>
>> =-=-=-=-=-=-=-=
>> Dr Andrew N. Jackson
>> Web Archiving Technical Lead
>> 01937 546602
>> @UKWebArchive
>> @anjacks0n
>> Blog: http://britishlibrary.typepad.co.uk/webarchive/
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Nick Burch [mailto:[email protected]]
>> Sent: 10 July 2017 19:45
>> To: [email protected]
>> Subject: Re: Adding a WARC parser to Tika
>>
>> On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
>>> Sorry, I can't tell if this is tongue-in-cheek...
>>
>> No, I do think we should add a WARC parser to Tika Parsers.
>>
>> Once done, I'd suggest we figure out a way for Tika Batch to run over a
>>collection of WARC files just as it does for directories, to make it
>>easier to run over crawl collections without having to unpack them first!
>>
>> Nick
>>
>>
>>
>>*************************************************************************
>>*****************************************
>> Experience the British Library online at www.bl.uk<http://www.bl.uk/>
>> The British Library's latest Annual Report and Accounts :
>>www.bl.uk/aboutus/annrep/index.html<http://www.bl.uk/aboutus/annrep/index.html>
>> Help the British Library conserve the world's knowledge. Adopt a Book.
>>www.bl.uk/adoptabook<http://www.bl.uk/adoptabook>
>> The Library's St Pancras site is WiFi - enabled
>>
>>*************************************************************************
>>****************************************
>> The information contained in this e-mail is confidential and may be
>>legally privileged. It is intended for the addressee(s) only. If you are
>>not the intended recipient, please delete this e-mail and notify the
>>[email protected]<mailto:[email protected]> : The contents of this e-mail
>>must not be disclosed or copied without the sender's consent.
>> The statements and opinions expressed in this message are those of the
>>author and do not necessarily reflect those of the British Library. The
>>British Library does not take any responsibility for the views of the
>>author.
>>
>>*************************************************************************
>>****************************************
>> Think before you print
>>
>



