Hi,

> How can I index this value on Solr?

 1. add the field "_response.headers_" to the Solr schema, see
      http://localhost:8983/solr/#/nutch/schema

 2. set the property store.http.headers = true

 3. you can test it by sending a single document with the indexchecker:

   % bin/nutch indexchecker \
      -Dplugin.includes='protocol-okhttp|parse-html|index-metadata|indexer-solr' \
      -Dstore.http.headers=true \
      -Dindex.content.md=_response.headers_ \
      -DdoIndex=true \
     'http://localhost/'
   fetching: http://localhost/
   ...
   Indexing 1/1 documents
   Deleting 0 documents

 4. Solr should contain the document including the header

   "response":{"numFound":1,"start":0,"docs":[
      {
        "digest":"3526531ccd6c6a1d2340574a305a18f8",
        "id":"http://localhost/",
        "_response.headers_":"HTTP/1.1 200 OK\r\nDate: Wed, 13 Mar 2019 17:29:49 ..."
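
If the Solr core uses a managed (mutable) schema, step 1 could also be done
through Solr's Schema API instead of editing the schema by hand. The following
is only a sketch under assumptions: the core is named "nutch" (as in the schema
URL above), a field type called "string" exists in the schema, and the helper
names add_field_payload and add_solr_field are made up for illustration. With a
classic schema.xml the Schema API is read-only and the field has to be added to
the file instead.

```shell
# Assumption: core "nutch" on localhost:8983 with a managed (mutable) schema.
SOLR_SCHEMA_URL="${SOLR_SCHEMA_URL:-http://localhost:8983/solr/nutch/schema}"

# Build the Schema API "add-field" request body for a stored string field.
# Assumption: the schema defines a field type named "string".
add_field_payload() {
  printf '{"add-field":{"name":"%s","type":"string","indexed":true,"stored":true}}' "$1"
}

# POST the request body to the Schema API endpoint of the core.
add_solr_field() {
  curl -sS -X POST -H 'Content-type: application/json' \
       --data-binary "$(add_field_payload "$1")" \
       "$SOLR_SCHEMA_URL"
}
```

Calling `add_solr_field _response.headers_` would then send the request; the
block itself only defines the two functions.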


> What is the difference between protocol-okhttp and protocol-http?

There are a few differences, see NUTCH-2576.

For historical reasons (NUTCH-2213), protocol-http does not always keep the
original HTTP header, while protocol-okhttp does.  I think we can remove this
restriction; feel free to open a Jira issue for this.
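
Since the indexed field holds all headers as one string joined by `\r\n`, and
there is no standard way to send single headers as separate fields, a single
header can be picked out on the client side after reading the field from Solr.
A minimal sketch, assuming the value has already been JSON-decoded so that it
contains real CR/LF characters; get_header is a hypothetical helper, not part
of Nutch:

```shell
# Print the value of one header from a "\r\n"-joined header string.
# $1 = header name (matched case-insensitively), $2 = raw header string
get_header() {
  printf '%s' "$2" | tr -d '\r' | awk -v name="$1" '
    BEGIN { name = tolower(name) }
    {
      i = index($0, ":")                      # first colon separates name and value
      if (i > 0 && tolower(substr($0, 1, i - 1)) == name) {
        value = substr($0, i + 1)
        sub(/^[ \t]+/, "", value)             # drop whitespace after the colon
        print value
        exit                                  # only the first match is reported
      }
    }'
}

# Sample input taken from the indexchecker output quoted in this thread.
headers=$(printf 'HTTP/1.1 200 OK\r\nDate: Mon, 11 Mar 2019 16:03:41 GMT\r\nServer: Apache/2.4.29 (Ubuntu)\r\n')
get_header server "$headers"    # prints: Apache/2.4.29 (Ubuntu)
```

The status line has no colon and is skipped; a header that occurs more than
once is reported only for its first occurrence.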

Best,
Sebastian



On 3/13/19 9:21 AM, [email protected] wrote:
> Thank you Sebastian.
> 
> I'm able to get the HTTP headers as you explained below.
> 
> How can I index this value on Solr?
> What is the difference between protocol-okhttp and protocol-http?
> 
> Kind regards, 
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT 
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __________________________________________________________________ 
> 
> Tie line: 7148 7689 4698 
> External: +48 123 42 0698 
> Mobile: +48 723 680 278 
> E-mail: [email protected] 
> __________________________________________________________________ 
> Protect our environment - please only print this if you have to!
> 
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]] 
> Sent: 11 March 2019 17:06
> To: [email protected]
> Subject: Re: Nutch and HTTP headers
> 
> Hi,
> 
>> Can Nutch index custom HTTP headers?
> 
> Nutch stores the HTTP response headers if the property `store.http.headers` 
> is true.  The headers are saved as string concatenated by `\r\n` under the 
> key `_response.headers_` in the content metadata.
> 
> You can send the entire HTTP headers to the indexer using the plugin 
> index-metadata and adding `_response.headers_` to `index.content.md`.  It 
> will add a field `_response.headers_` to the index:
> 
>  % bin/nutch indexchecker \
>     -Dplugin.includes='protocol-okhttp|parse-html|index-metadata' \
>     -Dstore.http.headers=true \
>     -Dindex.content.md=_response.headers_ \
>    'http://localhost/'
>  fetching: http://localhost/
>  ...
>  _response.headers_ :    HTTP/1.1 200 OK
>  Date: Mon, 11 Mar 2019 16:03:41 GMT
>  Server: Apache/2.4.29 (Ubuntu)
>  Last-Modified: ...
> 
> But there is no standard way to pick single headers and send them to the 
> indexer as arbitrary fields.
> 
> Best,
> Sebastian
> 
> 
> On 3/11/19 4:21 PM, [email protected] wrote:
>> Hello,
>>
>> Can Nutch index custom HTTP headers?
>>
>> Kind regards,
>> Hany Shehata
>>
>>
>> -----------------------------------------
>> SAVE PAPER - THINK BEFORE YOU PRINT!
>>
>> This E-mail is confidential.  
>>
>> It may also be legally privileged. If you are not the addressee you 
>> may not copy, forward, disclose or use any part of it. If you have 
>> received this message in error, please delete it and all copies from 
>> your system and notify the sender immediately by return E-mail.
>>
>> Internet communications cannot be guaranteed to be timely secure, error or 
>> virus-free.
>> The sender does not accept liability for any errors or omissions.
>>
> 
> 
> 
