Hi Julien,

Julien Nioche <[email protected]> wrote:
> Hi David
>
> > the resulting file contains no matching request records, or even a
> > warcinfo record for that matter.
>
> It wouldn't be too difficult to add at least the request records to
> WARCExporter - please open a JIRA + contributions are welcome as always.

Thanks for the info, I'll open a ticket. I'm not familiar enough with
Java to take a crack at that, unfortunately.

I did manage to fix the response record output of the
CommonCrawlDataDumper, since it was only a tiny change. But given this
bug, I'm leery of trusting its WARC output, and I think I'll need to
find a good WARC test suite to run it through. If I do, I'll submit a
patch.

> > I'm willing to move to nutch v2.x if it makes a difference.
>
> 2.x has neither of these resources, you're better off being on 1.x

Good to know, thanks.

Best regards,
Davíð

> Julien
>
> On 14 April 2016 at 16:51, Davíð Steinn Geirsson <[email protected]>
> wrote:
>
> > Hi all,
> >
> > I'm trying to use Nutch v1.11 for an archival crawl and export
> > the results to WARC files.
> >
> > It seems there are at least two separate WARC exporters in Nutch,
> > but both have some problems.
> >
> > The first one is org.apache.nutch.tools.CommonCrawlDataDumper
> > (invoked with 'nutch commoncrawldump'), which can export a WARC
> > file with the appropriate option. The resulting WARC file looks
> > good, except that the HTTP response body seems to have been
> > mangled by removing the CR-LF between the HTTP response headers
> > and the HTTP response body. The result is that it's not really
> > possible to tell where the headers end and the body begins.
> >
> > The second one is org.apache.nutch.tools.warc.WARCExporter
> > (invoked with 'nutch warc'). That one writes WARC response
> > records properly, with the header separator. Unfortunately,
> > that's *all* it writes - the resulting file contains no matching
> > request records, or even a warcinfo record for that matter.
> >
> > So my question is, is it possible to use Nutch in its present
> > state to export working WARC files containing both request and
> > response records? I'm willing to move to nutch v2.x if it makes a
> > difference.
> >
> > Best regards,
> > Davíð
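
For anyone wanting a quick sanity check for the two problems described in this thread, here is a rough sketch in Python. It is not a WARC test suite and not part of Nutch. It assumes an uncompressed WARC file read fully into memory, and the function names (iter_warc_records, summarize) are made up for illustration. It counts records by WARC-Type, so a missing request or warcinfo record shows up as a zero count, and it lists response records whose payload contains no blank line (CR-LF CR-LF) between the HTTP headers and the body.

```python
from collections import Counter


def iter_warc_records(data: bytes):
    """Yield (headers, block) for each record in an uncompressed WARC byte string."""
    pos = 0
    while pos < len(data):
        # Skip the CRLF pairs that terminate the previous record.
        while data.startswith(b"\r\n", pos):
            pos += 2
        if pos >= len(data):
            break
        # The WARC header block ends at the first empty line.
        head_end = data.find(b"\r\n\r\n", pos)
        if head_end == -1:
            raise ValueError("truncated WARC header at offset %d" % pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        if not lines[0].startswith("WARC/"):
            raise ValueError("expected WARC version line at offset %d" % pos)
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the size of the record block in bytes.
        length = int(headers["content-length"])
        block_start = head_end + 4
        yield headers, data[block_start:block_start + length]
        pos = block_start + length


def summarize(data: bytes):
    """Return (record counts by WARC-Type, URIs of suspicious response records)."""
    counts = Counter()
    broken = []
    for headers, block in iter_warc_records(data):
        rtype = headers.get("warc-type", "unknown")
        counts[rtype] += 1
        # Heuristic only: a response block with no empty line at all cannot
        # have a proper separator between HTTP headers and body.
        if rtype == "response" and b"\r\n\r\n" not in block:
            broken.append(headers.get("warc-target-uri", "?"))
    return counts, broken
```

To use it, read the exported file in binary mode and pass the bytes to summarize(); a gzipped WARC would need to be decompressed first. Note the separator check is a heuristic: a body that happens to contain an empty line would hide a missing separator, so a clean result here does not prove the file is valid.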

