Hi David

> the resulting file contains no matching request records, or even a
> warcinfo record for that matter.


It wouldn't be too difficult to add at least the request records to
WARCExporter. Please open a JIRA issue; contributions are welcome as always.
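
To confirm which record types an exporter actually writes, something like the following could help. It is only a minimal sketch, not part of Nutch: the function name count_warc_types is made up for illustration, and it assumes an uncompressed WARC stream in which every record carries a Content-Length header.

```python
from collections import Counter


def count_warc_types(stream):
    """Count WARC-Type values in an uncompressed WARC byte stream.

    Hypothetical helper for illustration; assumes each record has a
    Content-Length header so the record block can be skipped safely.
    """
    counts = Counter()
    while True:
        line = stream.readline()
        if not line:
            break  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # blank lines that separate records
        headers = {}
        # Read the WARC header block up to the empty line that ends it.
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break
            name, _, value = raw.partition(b":")
            headers[name.strip().lower()] = value.strip()
        counts[headers.get(b"warc-type", b"unknown").decode("ascii")] += 1
        # Skip the record block so payload bytes are never parsed as headers.
        stream.read(int(headers.get(b"content-length", b"0")))
    return counts
```

Opening the exported file in binary mode and passing the file object to the function should give a tally per record type; a file with only response records would show no "request" or "warcinfo" entries.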

> I'm willing to move to nutch v2.x if it makes a difference.


2.x has neither of these tools, so you're better off staying on 1.x.

Julien


On 14 April 2016 at 16:51, Davíð Steinn Geirsson <[email protected]> wrote:

> Hi all,
>
> I'm trying to use Nutch v1.11 for an archival crawl and export
> the results to WARC files.
>
> It seems there are at least two separate WARC exporters in Nutch,
> but both have some problems.
>
> The first one is org.apache.nutch.tools.CommonCrawlDataDumper
> (invoked with 'nutch commoncrawldump'), which can export a WARC
> file with the appropriate option. The resulting WARC file looks
> good, except that the HTTP response body seems to have been
> mangled by removing the CR-LF between the HTTP response headers
> and the HTTP response body. The result is that it's not really
> possible to tell where the headers end and the body begins.
>
> The second one is org.apache.nutch.tools.warc.WARCExporter
> (invoked with 'nutch warc'). That one writes WARC response
> records properly, with the header separator. Unfortunately,
> that's *all* it writes - the resulting file contains no matching
> request records, or even a warcinfo record for that matter.
>
> So my question is, is it possible to use Nutch in its present
> state to export working WARC files containing both request and
> response records? I'm willing to move to nutch v2.x if it makes a
> difference.
>
> Best regards,
> Davíð
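
The header/body separator problem described in the quoted message can be checked on a record's payload with a few lines. This is a rough sketch under stated assumptions: split_http_message is a hypothetical helper, and it takes the raw HTTP bytes from a response record as input.

```python
def split_http_message(payload):
    """Split raw HTTP bytes into (headers, body) at the first blank line.

    Hypothetical helper for illustration. Returns None when the
    CR-LF CR-LF separator between headers and body is missing.
    """
    head, sep, body = payload.partition(b"\r\n\r\n")
    if not sep:
        return None  # no way to tell where the headers end
    return head, body
```

A None result points to the kind of mangling described above. A non-None result is not proof of a healthy record, since a body that happens to contain an empty line would also split.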




-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
