Hi,
> Can Nutch index custom HTTP headers?
Nutch stores the HTTP response headers if the property
`store.http.headers` is set to true. The headers are saved as a
single string, joined by `\r\n`, under the key
`_response.headers_` in the content metadata.
You can send the complete HTTP headers to the indexer by enabling
the plugin index-metadata and adding `_response.headers_`
to `index.content.md`. This adds a field `_response.headers_`
to the index:
% bin/nutch indexchecker \
-Dplugin.includes='protocol-okhttp|parse-html|index-metadata' \
-Dstore.http.headers=true \
-Dindex.content.md=_response.headers_ \
'http://localhost/'
fetching: http://localhost/
...
_response.headers_ : HTTP/1.1 200 OK
Date: Mon, 11 Mar 2019 16:03:41 GMT
Server: Apache/2.4.29 (Ubuntu)
Last-Modified: ...
But there is no standard way to pick individual headers and send
them to the indexer as separate, arbitrarily named fields.
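
If you need single headers, one option is to split the stored string
yourself, for instance inside a custom indexing filter plugin. The
following is only a rough sketch of the splitting step, assuming the
value looks like the output above (status line first, then one
`Name: value` pair per line, joined by `\r\n`). The class and method
names are made up for illustration and are not part of Nutch; the
wiring into a plugin is left out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: splits the raw header string
// stored under _response.headers_ into individual name/value pairs.
class ResponseHeaderSplitter {

    static Map<String, String> parseHeaders(String raw) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (raw == null) {
            return headers;
        }
        for (String line : raw.split("\r\n")) {
            int colon = line.indexOf(':');
            // Skip the status line ("HTTP/1.1 200 OK") and lines without a name.
            if (line.startsWith("HTTP/") || colon <= 0) {
                continue;
            }
            // Header names are case-insensitive, so normalize to lower case.
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            // Repeated headers are merged into one comma-separated value.
            headers.merge(name, value, (a, b) -> a + ", " + b);
        }
        return headers;
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK\r\n"
            + "Date: Mon, 11 Mar 2019 16:03:41 GMT\r\n"
            + "Server: Apache/2.4.29 (Ubuntu)";
        System.out.println(parseHeaders(raw).get("server"));
    }
}
```

Each entry of the resulting map could then be added to the index
document as its own field, under whatever field name you choose.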
Best,
Sebastian
On 3/11/19 4:21 PM, [email protected] wrote:
> Hello,
>
> Can Nutch index custom HTTP headers?
>
> Kind regards,
> Hany Shehata