Hi,
> How can I index this value on Solr?
1. add the field "_response.headers_" to the Solr schema, see
http://localhost:8983/solr/#/nutch/schema
2. set the property store.http.headers = true
3. you can test it by sending a single document with the indexchecker:
% bin/nutch indexchecker \
-Dplugin.includes='protocol-okhttp|parse-html|index-metadata|indexer-solr' \
-Dstore.http.headers=true \
-Dindex.content.md=_response.headers_ \
-DdoIndex=true \
'http://localhost/'
fetching: http://localhost/
...
Indexing 1/1 documents
Deleting 0 documents
4. Solr should now contain the document, including the headers:
"response":{"numFound":1,"start":0,"docs":[
{
"digest":"3526531ccd6c6a1d2340574a305a18f8",
"id":"http://localhost/",
"_response.headers_":"HTTP/1.1 200 OK\r\nDate: Wed, 13 Mar 2019 17:29:49 ..."
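Because the field holds all headers as one string joined by `\r\n`, a client that reads the document back from Solr has to split it to get at a single header. A minimal sketch in Python, assuming the value has already been JSON-decoded so that `\r\n` is a real CR LF; the function name `split_response_headers` is made up for this example and is not part of Nutch or Solr:

```python
def split_response_headers(raw):
    """Split a `_response.headers_` value into (status_line, headers).

    Hypothetical helper: `raw` is the stored string, e.g.
    "HTTP/1.1 200 OK\r\nDate: ...\r\nServer: ...".
    """
    lines = [line for line in raw.split("\r\n") if line]
    if not lines:
        return "", {}
    status_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        # Header names are case-insensitive, so store them lowercased;
        # repeated headers are collected in a list.
        headers.setdefault(name.strip().lower(), []).append(value.strip())
    return status_line, headers


raw = ("HTTP/1.1 200 OK\r\n"
       "Date: Mon, 11 Mar 2019 16:03:41 GMT\r\n"
       "Server: Apache/2.4.29 (Ubuntu)")
status_line, headers = split_response_headers(raw)
print(status_line)        # HTTP/1.1 200 OK
print(headers["server"])  # ['Apache/2.4.29 (Ubuntu)']
```

This only post-processes the stored string on the client side; it does not change what Nutch sends to the index.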
> What is the difference between protocol-okhttp and protocol-http?
There are a few differences, see NUTCH-2576.
For historical reasons (NUTCH-2213), protocol-http does not always keep the
original HTTP headers, while protocol-okhttp does. I think we can remove this
restriction, feel free to open a Jira issue for this.
Best,
Sebastian
On 3/13/19 9:21 AM, [email protected] wrote:
> Thank you Sebastian.
>
> I'm able to get the HTTP headers as you explained below.
>
> How can I index this value on Solr?
> What is the difference between protocol-okhttp and protocol-http?
>
> Kind regards,
> Hany Shehata
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: 11 March 2019 17:06
> To: [email protected]
> Subject: Re: Nutch and HTTP headers
>
> Hi,
>
>> Can Nutch index custom HTTP headers?
>
> Nutch stores the HTTP response headers if the property `store.http.headers`
> is true. The headers are saved as a single string, joined by `\r\n`, under the
> key `_response.headers_` in the content metadata.
>
> You can send the complete HTTP headers to the indexer by using the plugin
> index-metadata and adding `_response.headers_` to `index.content.md`. This
> adds a field `_response.headers_` to the index:
>
> % bin/nutch indexchecker \
> -Dplugin.includes='protocol-okhttp|parse-html|index-metadata' \
> -Dstore.http.headers=true \
> -Dindex.content.md=_response.headers_ \
> 'http://localhost/'
> fetching: http://localhost/
> ...
> _response.headers_ : HTTP/1.1 200 OK
> Date: Mon, 11 Mar 2019 16:03:41 GMT
> Server: Apache/2.4.29 (Ubuntu)
> Last-Modified: ...
>
> But there is no standard way to pick single headers and send them to the
> indexer as arbitrary fields.
>
> Best,
> Sebastian
>
>
> On 3/11/19 4:21 PM, [email protected] wrote:
>> Hello,
>>
>> Can Nutch index custom HTTP headers?
>>
>> Kind regards,
>> Hany Shehata
>>
>
>
>