RE: Nutch and HTTP headers

2019-03-14 Thread hany . nasr
Thank you so much.

I'm able to index the http headers.

I can't imagine my life without this group :)

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!

Re: Nutch and HTTP headers

2019-03-13 Thread Sebastian Nagel
Hi,

> How can I index this value on Solr?

 1. add the field "_response.headers_" to the Solr schema, see
  http://localhost:8983/solr/#/nutch/schema

 2. set the property store.http.headers = true

 3. you can test it sending a single document using the indexchecker:

   % bin/nutch indexchecker \
       -Dplugin.includes='protocol-okhttp|parse-html|index-metadata|indexer-solr' \
       -Dstore.http.headers=true \
       -Dindex.content.md=_response.headers_ \
       -DdoIndex=true \
       'http://localhost/'
   fetching: http://localhost/
   ...
   Indexing 1/1 documents
   Deleting 0 documents

 4. Solr should contain the document including the header

   "response":{"numFound":1,"start":0,"docs":[
     {
       "digest":"3526531ccd6c6a1d2340574a305a18f8",
       "id":"http://localhost/",
       "_response.headers_":"HTTP/1.1 200 OK\r\nDate: Wed, 13 Mar 2019 17:29:49 ..."
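
Step 1 can also be done without the admin UI. What follows is only a sketch in Python, assuming the core is named "nutch" (as in the URL above), that it uses a managed schema so the Solr Schema API accepts changes, and that a plain "string" field type is acceptable for the header block. The helper names are hypothetical and are not part of Nutch or Solr.

```python
import json
import urllib.request

# Assumption: Solr runs locally and the core is called "nutch".
SOLR_SCHEMA_URL = "http://localhost:8983/solr/nutch/schema"


def build_add_field_command(name, field_type="string"):
    """Build a Schema API "add-field" command for one stored, indexed field."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,  # assumption: "string" exists in the schema
            "stored": True,
            "indexed": True,
        }
    }


def add_field(url, command):
    """POST the command to the Solr Schema API and return the parsed reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    command = build_add_field_command("_response.headers_")
    print(json.dumps(command, indent=2))
    # Uncomment only with a running Solr instance:
    # print(add_field(SOLR_SCHEMA_URL, command))
```

If the core is configured with a hand-edited schema.xml instead, the Schema API will reject the change and the field has to be added to that file directly.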


> What is the difference between protocol-okhttp and protocol-http?

There are a few differences, see NUTCH-2576.

For historic reasons (NUTCH-2213), protocol-http does not always keep the
original HTTP headers, while protocol-okhttp does.  I think we can remove this
restriction; feel free to open a Jira issue for this.

Best,
Sebastian



RE: Nutch and HTTP headers

2019-03-13 Thread hany . nasr
Thank you Sebastian.

I'm able to get the HTTP headers as you explained below.

How can I index this value on Solr?
What is the difference between protocol-okhttp and protocol-http?

Kind regards, 
Hany Shehata



Re: Nutch and HTTP headers

2019-03-11 Thread Sebastian Nagel
Hi,

> Can Nutch index custom HTTP headers?

Nutch stores the HTTP response headers if the property
`store.http.headers` is true.  The headers are saved as a
single string, joined by `\r\n`, under the key
`_response.headers_` in the content metadata.

You can send the entire HTTP header block to the indexer by
enabling the plugin index-metadata and adding `_response.headers_`
to `index.content.md`.  This adds a field `_response.headers_`
to the index:

 % bin/nutch indexchecker \
     -Dplugin.includes='protocol-okhttp|parse-html|index-metadata' \
     -Dstore.http.headers=true \
     -Dindex.content.md=_response.headers_ \
     'http://localhost/'
 fetching: http://localhost/
 ...
 _response.headers_ :HTTP/1.1 200 OK
 Date: Mon, 11 Mar 2019 16:03:41 GMT
 Server: Apache/2.4.29 (Ubuntu)
 Last-Modified: ...
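
For a real crawl the two properties would normally go into conf/nutch-site.xml rather than being passed as -D flags. As a rough sketch only, the Python snippet below prints such a fragment. The helper names are hypothetical, and plugin.includes is left out on purpose: it would have to be extended by hand so that it still lists every plugin the crawl already uses, plus index-metadata.

```python
from xml.sax.saxutils import escape


def property_xml(name, value):
    """Render one <property> element in the Hadoop-style config format."""
    return (
        "  <property>\n"
        f"    <name>{escape(name)}</name>\n"
        f"    <value>{escape(value)}</value>\n"
        "  </property>"
    )


def site_fragment(properties):
    """Wrap the given name/value pairs in a <configuration> element."""
    body = "\n".join(property_xml(n, v) for n, v in properties.items())
    return f"<configuration>\n{body}\n</configuration>"


# The two settings used in the indexchecker call above.
settings = {
    "store.http.headers": "true",
    "index.content.md": "_response.headers_",
}

print(site_fragment(settings))
```

The printed properties can be merged into an existing nutch-site.xml; the file should end up with a single <configuration> root element.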

But there is no standard way to pick single headers and send
them to the indexer as arbitrary fields.
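
Since the whole header block arrives as one `\r\n`-joined string, single headers can still be picked out after the fact, for example on the client side or in a custom plugin. The Python sketch below is not a Nutch feature, just an illustration of the string format. It assumes the first line is the status line (as in the output above), and the function name is made up for this example.

```python
from typing import Dict, List, Tuple


def parse_response_headers(raw: str) -> Tuple[str, Dict[str, List[str]]]:
    """Split a `_response.headers_` value into status line and header fields.

    Header names are lower-cased; repeated headers keep all their values.
    """
    lines = [line for line in raw.split("\r\n") if line]
    if not lines:
        return "", {}
    status_line = lines[0]  # assumption: first line is the HTTP status line
    fields: Dict[str, List[str]] = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line, skip it
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return status_line, fields


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Mon, 11 Mar 2019 16:03:41 GMT\r\n"
    "Server: Apache/2.4.29 (Ubuntu)"
)
status, headers = parse_response_headers(raw)
print(status)
print(headers["server"])
```

Splitting on the first colon only keeps values such as the time in the Date header intact.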

Best,
Sebastian


On 3/11/19 4:21 PM, hany.n...@hsbc.com.INVALID wrote:
> Hello,
> 
> Can Nutch index custom HTTP headers?
> 
> Kind regards,
> Hany Shehata