Hi David

> the resulting file contains no matching request records, or even a
> warcinfo record for that matter.

It wouldn't be too difficult to add at least the request records to WARCExporter. Please open a JIRA; contributions are welcome as always.

> I'm willing to move to nutch v2.x if it makes a difference.

2.x has neither of these resources; you're better off staying on 1.x.

Julien

On 14 April 2016 at 16:51, Davíð Steinn Geirsson <[email protected]> wrote:

> Hi all,
>
> I'm trying to use Nutch v1.11 for an archival crawl and export
> the results to WARC files.
>
> It seems there are at least two separate WARC exporters in Nutch,
> but both have some problems.
>
> The first one is org.apache.nutch.tools.CommonCrawlDataDumper
> (invoked with 'nutch commoncrawldump'), which can export a WARC
> file with the appropriate option. The resulting WARC file looks
> good, except that the HTTP response body seems to have been
> mangled by removing the CR-LF between the HTTP response headers
> and the HTTP response body. The result is that it's not really
> possible to tell where the headers end and the body begins.
>
> The second one is org.apache.nutch.tools.warc.WARCExporter
> (invoked with 'nutch warc'). That one writes WARC response
> records properly, with the header separator. Unfortunately,
> that's *all* it writes - the resulting file contains no matching
> request records, or even a warcinfo record for that matter.
>
> So my question is, is it possible to use Nutch in its present
> state to export working WARC files containing both request and
> response records? I'm willing to move to nutch v2.x if it makes a
> difference.
>
> Best regards,
> Davíð

--
*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
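
For anyone checking their own output for the missing header separator described in the quoted message, here is a minimal Python sketch. It is not part of Nutch; the function name `split_http_response` and the sample byte strings are made up for illustration, and it assumes you have already extracted the raw HTTP response payload from a WARC response record.

```python
def split_http_response(raw: bytes):
    """Split a raw HTTP response into (header_block, body).

    Returns None when the blank line (CRLF CRLF) that separates the
    headers from the body is missing, so the boundary cannot be found.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        return None
    return head, body


# Hypothetical payloads: one well-formed, one with the separator removed.
good = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
mangled = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n<html></html>"

print(split_http_response(good) is not None)   # True: separator found
print(split_http_response(mangled) is None)    # True: separator missing
```

A payload that returns None here shows the same symptom as the commoncrawldump output: the last header line runs straight into the body.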

