Yeah, I've overwritten the Content-Length header with the length of the
decompressed content byte array.
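
Roughly like this (a sketch; the regex approach and method name are
assumed here, not the exact patch):

    // Replace the Content-Length value in a raw HTTP header block with
    // the length of the decompressed payload.
    static String fixContentLength(String rawHeaders, byte[] decompressed) {
      return rawHeaders.replaceFirst(
          "(?mi)^Content-Length:\\s*\\d+",
          "Content-Length: " + decompressed.length);
    }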

Luckily, our clients are modest in what they demand of their WARCs.

Many thanks,
Markus

On Wed, 31 Jul 2024 at 14:22, Sebastian Nagel
<wastl.na...@googlemail.com.invalid> wrote:

> Hi Markus,
>
>  >> And I do not agree with it. Almost all content is compressed now,
>  >> so this will never work. We need the headers and response code
>  >> stored for WARC export and do not care about an incorrect length
>  >> header.
>
> No, don't do this. You need to rewrite the header. Many WARC readers
> simply fail in the following situations:
> - the HTTP Content-Length header does not match the length of the WARC
>    payload
> - there is a Content-Encoding or Transfer-Encoding header, but the
>    payload is not stored using these encodings
> A fully functional WARC parser needs to understand chunked transfer
> encoding as well as the gzip, deflate and brotli content encodings.
> However, many WARC parsers do not: they just fail, or pass the chunked
> or encoded content forward to the user.
>
> That's why at Common Crawl we store the payload with all HTTP-level
> encodings removed. With Nutch it would also be difficult to store the
> HTTP stream unmodified.
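>
> For example, a gzip Content-Encoding can be stripped before storing the
> payload (a minimal sketch; deflate and brotli need their own decoders):
>
>      import java.io.ByteArrayInputStream;
>      import java.io.IOException;
>      import java.io.InputStream;
>      import java.util.zip.GZIPInputStream;
>
>      // Decode a gzip-encoded HTTP payload so that the stored WARC
>      // payload carries no Content-Encoding.
>      static byte[] decodeGzip(byte[] encoded) throws IOException {
>        try (InputStream in =
>            new GZIPInputStream(new ByteArrayInputStream(encoded))) {
>          return in.readAllBytes();
>        }
>      }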
>
> The encoding headers are rewritten to "X-Crawler-Content-Length",
> "X-Crawler-Content-Encoding" and "X-Crawler-Transfer-Encoding". The
> "Content-Length" header then shows the length of the decoded (and
> possibly truncated) payload.
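>
> A sketch of that rewrite (a hypothetical helper, not the actual Common
> Crawl code):
>
>      import java.util.Map;
>
>      // Preserve the original header values under X-Crawler-* names and
>      // set Content-Length to the decoded payload length.
>      static void rewriteHeaders(Map<String, String> headers,
>          int decodedLength) {
>        for (String name : new String[] { "Content-Length",
>            "Content-Encoding", "Transfer-Encoding" }) {
>          String value = headers.remove(name);
>          if (value != null) {
>            headers.put("X-Crawler-" + name, value);
>          }
>        }
>        headers.put("Content-Length", Integer.toString(decodedLength));
>      }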
>
> If you need more information and pointers about that, please ping me.
> I've discussed this with other web archiving people.
>
> WARC validators ([1,2,3] each include one) will also complain about
> invalid HTTP header values if the payload is decoded but the headers
> are not rewritten.
>
>
>  >> I don't see okhttp having the same condition.
>
> We use protocol-okhttp: in the Nutch master branch it meanwhile has much
> better support for WARC writing and, more importantly, supports HTTP/2.
>
>
> Common Crawl also uses a custom WARC writer [4,5].
>
> Unfortunately, neither WarcExporter nor CommonCrawlDataDumper can be
> used, because they
> - cannot write gzip-compressed WARC files
>    Note: writing WARCs with
>     -Dmapreduce.output.fileoutputformat.compress=true
>     -Dmapreduce.output.fileoutputformat.compress.codec=gzip
>    results in invalid WARC files, because the output is not compressed
>    per record (see the sketch below)
> - have the header issues described above
> - write no WARC request records (NUTCH-2255)
> - write no WARC digests
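>
> A minimal sketch of valid per-record gzip compression (each WARC record
> is written as its own gzip member, appended to the file):
>
>      import java.io.IOException;
>      import java.io.OutputStream;
>      import java.util.zip.GZIPOutputStream;
>
>      // Append one WARC record as a separate gzip member; finish()
>      // flushes the member without closing the underlying stream.
>      static void writeGzippedRecord(OutputStream warcFile, byte[] record)
>          throws IOException {
>        GZIPOutputStream gz = new GZIPOutputStream(warcFile);
>        gz.write(record);
>        gz.finish();
>      }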
>
>
> For a long time I have hoped to find the time to write a clean and lean
> WARC writer based on jwarc [2], one that writes perfect WARC files
> combined with the flexibility of WarcExporter.
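>
> Roughly, writing one response record with jwarc could look like the
> sketch below (based on jwarc's builder API as I understand it; the
> wiring into Nutch is left out):
>
>      import java.net.URI;
>      import java.nio.channels.FileChannel;
>      import java.nio.charset.StandardCharsets;
>      import java.nio.file.Paths;
>      import java.nio.file.StandardOpenOption;
>      import java.time.Instant;
>      import org.netpreserve.jwarc.*;
>
>      public class JwarcSketch {
>        public static void main(String[] args) throws Exception {
>          byte[] http = "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
>              .getBytes(StandardCharsets.US_ASCII);
>          // WarcCompression.GZIP writes one gzip member per record,
>          // as required for valid .warc.gz files
>          try (WarcWriter writer = new WarcWriter(
>              FileChannel.open(Paths.get("out.warc.gz"),
>                  StandardOpenOption.CREATE, StandardOpenOption.WRITE),
>              WarcCompression.GZIP)) {
>            writer.write(
>                new WarcResponse.Builder(URI.create("http://example.org/"))
>                    .date(Instant.now())
>                    .body(MediaType.HTTP_RESPONSE, http)
>                    .build());
>          }
>        }
>      }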
>
>
> Best,
> Sebastian
>
>
> [1] https://pypi.org/project/warcio/
> [2] https://github.com/iipc/jwarc
> [3] https://resiliparse.chatnoir.eu/en/latest/api/fastwarc.html
> [4] https://github.com/commoncrawl/nutch/
> [5]
> https://github.com/commoncrawl/nutch/tree/cc/src/java/org/commoncrawl/util
>
>
>
> On 7/31/24 10:15, Markus Jelsma wrote:
> > Aah, thanks Lewis. We're still on 1.15; glad to see this was fixed
> > already, and that I would have patched it in exactly the same way.
> >
> > Thanks!
> >
> > On Tue, 30 Jul 2024 at 18:42, lewis john mcgibbney
> > <lewi...@apache.org> wrote:
> >
> >> Hi Markus,
> >>
> >> Which version of Nutch are you referring to? I'm not seeing this exact
> >> code in the master branch.
> >> Is this roughly the code you are referencing?
> >>
> >>
> https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L304-L318
> >>
> >> Thanks
> >> lewismc
> >>
> >> On Tue, Jul 30, 2024 at 8:14 AM <user-digest-h...@nutch.apache.org>
> wrote:
> >>
> >>> ---------- Forwarded message ----------
> >>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>> To: user <user@nutch.apache.org>
> >>> Cc:
> >>> Bcc:
> >>> Date: Tue, 30 Jul 2024 17:13:01 +0200
> >>> Subject: Protocol-http not storing response headers
> >>> Hi,
> >>>
> >>> Protocol-http does this (not storing the HTTP response headers if
> >>> the response is compressed):
> >>>
> >>>     // store the headers verbatim only if the response was not
> >>>     // compressed, as the content length reported does not match
> >>>     // otherwise
> >>>     if (httpHeaders != null) {
> >>>       headers.add(Response.RESPONSE_HEADERS, httpHeaders.toString());
> >>>     }
> >>>     if (Http.LOG.isTraceEnabled()) {
> >>>       Http.LOG.trace("fetched " + content.length + " bytes from " + url);
> >>>     }
> >>>
> >>> And I do not agree with it. Almost all content is compressed now,
> >>> so this will never work. We need the headers and response code
> >>> stored for WARC export and do not care about an incorrect length
> >>> header.
> >>>
> >>> Before patching this up and breaking that code out of the
> >>> compression condition, I ask myself: is that a good idea? I don't
> >>> see okhttp having the same condition.
> >>>
> >>> Markus
> >>
> >> --
> >> http://people.apache.org/keys/committer/lewismc
> >>
> >
>
>
