Hi,

> Can Nutch index custom HTTP headers?

Nutch stores the HTTP response headers if the property
`store.http.headers` is set to true.  The headers are saved as
a single string, joined by `\r\n`, under the key
`_response.headers_` in the content metadata.

You can send the complete set of HTTP headers to the indexer by
enabling the plugin index-metadata and adding `_response.headers_`
to `index.content.md`.  This adds a field `_response.headers_`
to the index:

 % bin/nutch indexchecker \
    -Dplugin.includes='protocol-okhttp|parse-html|index-metadata' \
    -Dstore.http.headers=true \
    -Dindex.content.md=_response.headers_ \
   'http://localhost/'
 fetching: http://localhost/
 ...
 _response.headers_ :    HTTP/1.1 200 OK
 Date: Mon, 11 Mar 2019 16:03:41 GMT
 Server: Apache/2.4.29 (Ubuntu)
 Last-Modified: ...

However, there is no standard way to pick individual headers and
send them to the indexer as fields of their own.
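
If you need individual headers, one option is to split the
`_response.headers_` value yourself, either downstream of the index
or in custom code.  The following is only a minimal sketch in plain
Python, not Nutch code: the function name `split_response_headers`
is made up for illustration, and it assumes the value looks like the
output above (a status line followed by `Name: value` lines joined
by `\r\n`).

```python
def split_response_headers(raw):
    """Split a raw header string into (status_line, {name: [values]}).

    Assumes lines are joined by CRLF, as in the indexchecker output.
    Header names are lower-cased because HTTP header names are
    case-insensitive; values are kept in a list since a header may repeat.
    """
    lines = [line for line in raw.split("\r\n") if line]
    status_line = ""
    # Treat the first line as the status line only if it looks like one.
    if lines and lines[0].startswith("HTTP/"):
        status_line = lines.pop(0)
    headers = {}
    for line in lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line, skip it
        headers.setdefault(name.strip().lower(), []).append(value.strip())
    return status_line, headers


# Sample value copied from the indexchecker output above.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Mon, 11 Mar 2019 16:03:41 GMT\r\n"
    "Server: Apache/2.4.29 (Ubuntu)\r\n"
)
status, headers = split_response_headers(raw)
print(status)             # HTTP/1.1 200 OK
print(headers["server"])  # ['Apache/2.4.29 (Ubuntu)']
```

Doing the same inside Nutch would presumably require a custom
indexing filter plugin; the sketch only shows the string handling.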

Best,
Sebastian


On 3/11/19 4:21 PM, hany.n...@hsbc.com.INVALID wrote:
> Hello,
> 
> Can Nutch index custom HTTP headers?
> 
> Kind regards,
> Hany Shehata
