Hi all,

I'm trying to use Nutch v1.11 for an archival crawl and export
the results to WARC files.

It seems there are at least two separate WARC exporters in Nutch,
but both have some problems.

The first one is org.apache.nutch.tools.CommonCrawlDataDumper
(invoked with 'nutch commoncrawldump'), which can export a WARC
file with the appropriate option. The resulting WARC file looks
good, except that the HTTP response body seems to have been
mangled by removing the CR-LF between the HTTP response headers
and the HTTP response body. The result is that it's not really
possible to tell where the headers end and the body begins.
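
To make the problem concrete, here is a minimal Python sketch of
the check I have in mind. The two sample payloads are made up for
illustration (they are not real Nutch output), and the function
name is my own. It only assumes that an HTTP response separates
headers from body with an empty line, i.e. CR-LF CR-LF.

```python
def split_http_payload(payload: bytes):
    """Split an HTTP response into (headers, body) at the first empty line.

    Returns None when no CR-LF CR-LF separator is present, which is the
    situation described above.
    """
    head, sep, body = payload.partition(b"\r\n\r\n")
    if not sep:
        return None
    return head, body


# Hypothetical payloads, not actual exporter output.
good = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
mangled = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n<html></html>"

print(split_http_payload(good))
print(split_http_payload(mangled))
```

With the separator missing, a reader has no reliable way to decide
whether a given line is the last header or the first line of the
body.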

The second one is org.apache.nutch.tools.warc.WARCExporter
(invoked with 'nutch warc'). That one writes WARC response
records properly, with the header separator. Unfortunately,
that's *all* it writes - the resulting file contains no matching
request records, or even a warcinfo record for that matter.
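
One way to confirm which record types a file contains is to walk
the records and tally their WARC-Type headers. The following is a
rough sketch only: it assumes an uncompressed WARC whose records
each carry a Content-Length header and end with two CR-LFs, and
the helper make_record and the sample data are hypothetical, used
just to build test input.

```python
from collections import Counter


def count_warc_types(data: bytes) -> Counter:
    """Tally WARC-Type values in an uncompressed WARC byte string."""
    counts = Counter()
    pos = 0
    while pos < len(data):
        # Skip the CR-LFs that trail the previous record.
        while data.startswith(b"\r\n", pos):
            pos += 2
        if pos >= len(data):
            break
        end = data.find(b"\r\n\r\n", pos)  # end of the WARC header block
        if end == -1:
            break
        headers = {}
        for line in data[pos:end].split(b"\r\n")[1:]:  # skip the version line
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        counts[headers.get(b"warc-type", b"?").decode()] += 1
        # Jump over the record block using its declared length.
        pos = end + 4 + int(headers.get(b"content-length", b"0"))
    return counts


def make_record(warc_type: str, block: bytes) -> bytes:
    """Hypothetical helper: build a bare-bones record for testing."""
    head = f"WARC/1.0\r\nWARC-Type: {warc_type}\r\nContent-Length: {len(block)}\r\n\r\n"
    return head.encode() + block + b"\r\n\r\n"


sample = make_record("response", b"HTTP/1.1 200 OK\r\n\r\nhi")
print(count_warc_types(sample))
```

A file like the one described here would report only "response",
with no "request" or "warcinfo" entries.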

So my question is, is it possible to use Nutch in its present
state to export working WARC files containing both request and
response records? I'm willing to move to Nutch v2.x if it makes a
difference.

Best regards,
Davíð
