Hi Julien,

Julien Nioche <[email protected]> wrote:
> Hi David
> 
> > the resulting file contains no matching request records, or even a
> > warcinfo record for that matter.
> 
> 
> It wouldn't be too difficult to add at least the request records to
> WARCExporter. Please open a JIRA issue; contributions are welcome as
> always.

Thanks for the info, I'll open a ticket. Unfortunately, I'm not familiar
enough with Java to take a crack at that myself.

I did manage to fix the response record output of the
CommonCrawlDataDumper, since it was only a tiny change. But given
this bug, I'm leery of trusting its WARC output and I think I'll
need to find some good WARC test suite to run it through. If I
do, I'll submit a patch.
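
In the meantime, a rough sanity check is possible without a full test
suite. The sketch below is Python rather than Java and is not part of
Nutch; the function names (iter_warc_records, check_warc) are made up
for illustration. It assumes an uncompressed WARC file whose records
carry a correct Content-Length. It counts records per WARC-Type and
flags response records whose block contains no blank line (CR-LF CR-LF)
between the HTTP headers and the body. This is only a heuristic: a
mangled record whose body happens to contain a blank line would pass.

```python
from collections import Counter


def iter_warc_records(data: bytes):
    """Yield (headers, block) for each record in uncompressed WARC bytes."""
    pos = 0
    while pos < len(data):
        # Skip the CR-LF padding that follows each record block.
        while data.startswith(b"\r\n", pos):
            pos += 2
        if pos >= len(data):
            break
        head_end = data.find(b"\r\n\r\n", pos)
        if head_end == -1:
            raise ValueError(f"unterminated WARC header at offset {pos}")
        lines = data[pos:head_end].decode("utf-8", "replace").split("\r\n")
        if not lines[0].startswith("WARC/"):
            raise ValueError(f"expected WARC version line at offset {pos}")
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        start = head_end + 4
        yield headers, data[start:start + length]
        pos = start + length


def check_warc(data: bytes):
    """Return (record counts per WARC-Type, URIs of suspect responses)."""
    counts = Counter()
    suspect = []
    for headers, block in iter_warc_records(data):
        rtype = headers.get("warc-type", "?")
        counts[rtype] += 1
        # A response block should hold HTTP headers, a blank line, then the body.
        if rtype == "response" and b"\r\n\r\n" not in block:
            suspect.append(headers.get("warc-target-uri", "?"))
    return counts, suspect
```

Reading the file with open(path, "rb").read() and passing the bytes to
check_warc() should then show at a glance whether request and warcinfo
records are present at all, and which response records look mangled.
For a gzipped file, reading it through Python's gzip module first
should yield the same uncompressed bytes.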

> 
> > I'm willing to move to nutch v2.x if it makes a difference.
> 
> 
> 2.x has neither tool, so you're better off staying on 1.x.

Good to know, thanks.

Best regards,
Davíð



> 
> Julien
> 
> 
> On 14 April 2016 at 16:51, Davíð Steinn Geirsson <[email protected]>
> wrote:
> 
> > Hi all,
> >
> > I'm trying to use Nutch v1.11 for an archival crawl and export
> > the results to WARC files.
> >
> > It seems there are at least two separate WARC exporters in Nutch,
> > but both have some problems.
> >
> > The first one is org.apache.nutch.tools.CommonCrawlDataDumper
> > (invoked with 'nutch commoncrawldump'), which can export a WARC
> > file with the appropriate option. The resulting WARC file looks
> > good, except that the HTTP response body seems to have been
> > mangled by removing the CR-LF between the HTTP response headers
> > and the HTTP response body. The result is that it's not really
> > possible to tell where the headers end and the body begins.
> >
> > The second one is org.apache.nutch.tools.warc.WARCExporter
> > (invoked with 'nutch warc'). That one writes WARC response
> > records properly, with the header separator. Unfortunately,
> > that's *all* it writes - the resulting file contains no matching
> > request records, or even a warcinfo record for that matter.
> >
> > So my question is, is it possible to use Nutch in its present
> > state to export working WARC files containing both request and
> > response records? I'm willing to move to nutch v2.x if it makes a
> > difference.
> >
> > Best regards,
> > Davíð
> 
> 
> 
> 
