Hi all, I'm trying to use Nutch v1.11 for an archival crawl and export the results to WARC files.
It seems there are at least two separate WARC exporters in Nutch, but both have problems.

The first one is org.apache.nutch.tools.CommonCrawlDataDumper (invoked with 'nutch commoncrawldump'), which can export a WARC file with the appropriate option. The resulting WARC file looks good, except that the HTTP response seems to have been mangled: the CR-LF between the HTTP response headers and the HTTP response body has been removed. As a result, it's not really possible to tell where the headers end and the body begins.

The second one is org.apache.nutch.tools.warc.WARCExporter (invoked with 'nutch warc'). That one writes WARC response records properly, with the header separator. Unfortunately, that's *all* it writes - the resulting file contains no matching request records, or even a warcinfo record for that matter.

So my question is: is it possible to use Nutch in its present state to export working WARC files containing both request and response records? I'm willing to move to Nutch v2.x if it makes a difference.

Best regards,
Davíð
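
P.S. For anyone who wants to check a record payload for the missing header/body separator, here is a rough sketch. It is not part of Nutch; the function name and the sample bytes are made up for illustration, and it assumes you have already pulled the raw HTTP response bytes out of the WARC record.

```python
def split_http_payload(payload: bytes):
    """Split a raw HTTP message into (headers, body) at the first blank line.

    Returns None when no CRLF CRLF sequence is found, which is the symptom
    of the missing separator. Caveat: a body that itself contains
    CRLF CRLF would still be split, just at the wrong place, so a
    non-None result is not proof that the record is intact.
    """
    head, sep, body = payload.partition(b"\r\n\r\n")
    if not sep:
        return None
    return head, body


# Hypothetical sample payloads, not taken from a real crawl.
good = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
bad = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n<html></html>"

for name, payload in (("good", good), ("bad", bad)):
    result = split_http_payload(payload)
    print(name, "separator found" if result else "separator missing")
```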

