[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2008-02-05 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
role of http.agent.host in NTLM and patch committed

--
  == Necessity ==
  There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 
'protocol-httpclient' supported HTTPS and had code for NTLM authentication, 
but the NTLM authentication didn't work due to a bug. Some portions of 
'protocol-httpclient' were re-written to solve these problems and to provide 
additional features, such as authentication support for proxy servers and 
better inline documentation for the properties used to configure 
authentication. The author of these features (Susam Pal) has tested them at 
Infosys Technologies Limited by crawling the corporate intranet, which 
requires NTLM authentication, and found them to work well.
  
- == Download ==
- Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk. The latest patch is named as 
[https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch 
NUTCH-559v0.5.patch].
+ == JIRA NUTCH-559 ==
+ These features were submitted as 
[https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559]. 
If you have checked out the latest Nutch trunk, you don't need to apply the 
patches. These features were included in the Nutch Subversion repository in 
[http://svn.apache.org/viewvc?view=rev&revision=608972 revision #608972].
  
  == Introduction to Authentication Scope ==
  Different credentials for different authentication scopes can be configured 
in 'conf/httpclient-auth.xml'. If a set of credentials is configured for a 
particular authentication scope (i.e. particular host, port number, realm 
and/or scheme), then that set of credentials would be sent only to pages 
falling under the specified authentication scope.
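
  As a rough illustration of the paragraph above, the sketch below binds one 
set of credentials to a single authentication scope. The host name, port, 
realm, username and password are placeholder values chosen for this example, 
and the element layout is assumed to follow the comments in 
'conf/httpclient-auth.xml'.

  {{{<auth-configuration>
    <!-- placeholder values; replace with your own host, port and realm -->
    <credentials username="susam" password="masus">
      <authscope host="intranet.example.com" port="80" realm="login"/>
    </credentials>
  </auth-configuration>}}}

  With a configuration of this shape, the credentials would be offered only to 
pages on that host and port that request authentication for the 'login' realm.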
@@ -82, +82 @@

   1. For the authscope tag, the 'host' and 'port' attributes should always be 
specified. The 'realm' and 'scheme' attributes may or may not be specified 
depending on your needs. If you are tempted to omit the 'host' and 'port' 
attributes because you want the credentials to be used for any host and any 
port for that realm/scheme, please use the 'default' tag instead. That's what 
the 'default' tag is meant for.
   1. One authentication scope should not be defined twice as different 
authscope tags for different credentials tags. However, if this is done by 
mistake, the credentials for the last defined authscope tag would be used. 
This is because the XML parsing code reads the file from top to bottom and 
sets the credentials for authentication scopes. If the same authentication 
scope is encountered again, it will be overwritten with the new 
credentials. However, one should not rely on this behavior as it might change 
with further developments.
   1. Do not define multiple authscope tags with the same host and port but 
different realms if the server requires NTLM authentication. This means 
there should not be multiple tags with the same host, port and scheme=NTLM but 
different realms. If you are omitting the scheme attribute and the server 
requires NTLM authentication, then there should not be multiple tags with the 
same host and port but different realms. This is discussed further in the next 
section.
+  1. If you are using the NTLM scheme, you should also set the 
'http.agent.host' property in 'conf/nutch-site.xml'.
  
  === A note on NTLM domains ===
- NTLM does not use the concept of realms. Therefore, multiple realms for a 
web-server can not be defined as different authentication scopes for the same 
web-server requiring NTLM authentication. There should be exactly one authscope 
tag for NTLM scheme authentication scope for a particular web-server. The 
authentication domain should be specified as the value of the 'realm' attribute.
+ NTLM does not use the concept of realms. Therefore, multiple realms for a 
web-server can not be defined as different authentication scopes for the same 
web-server requiring NTLM authentication. There should be exactly one authscope 
tag for NTLM scheme authentication scope for a particular web-server. The 
authentication domain should be specified as the value of the 'realm' 
attribute. NTLM authentication also requires the name or IP address of the host 
on which the crawler is running. Thus, 'http.agent.host' should be set properly.
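
  A minimal sketch of the two pieces described above is given below, assuming 
the usual Nutch property layout. The domain name 'EXAMPLEDOMAIN', the host 
names and the credentials are hypothetical placeholders. The first fragment 
would go in 'conf/httpclient-auth.xml' and the second in 
'conf/nutch-site.xml'.

  {{{<!-- exactly one NTLM authscope for this web-server; realm carries the NTLM domain -->
  <credentials username="susam" password="masus">
    <authscope host="intranet.example.com" port="80" realm="EXAMPLEDOMAIN" scheme="ntlm"/>
  </credentials>}}}

  {{{<!-- name or IP address of the machine running the crawler (placeholder) -->
  <property>
    <name>http.agent.host</name>
    <value>crawler-host.example.com</value>
  </property>}}}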
  
  == Underlying HttpClient Library ==
  'protocol-httpclient' is based on 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons 
HttpClient]. Some servers support multiple schemes for 

[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2007-12-31 Thread Apache Wiki

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
root element omitted - restructured the sentence

--
  When authentication is required to fetch a resource from a web-server, the 
authentication scope is determined from the host and port obtained from the URL 
of the page. If it matches any 'authscope' in this configuration file, then the 
'credentials' for that 'authscope' are used for authentication.
  
  == Configuration ==
- Since the example and explanation provided as comments in 
'conf/httpclient-auth.xml' are very brief, this section explains them in a 
little more detail. The root element is auth-configuration for all the 
examples below which has been omitted for the sake of clarity.
+ Since the example and explanation provided as comments in 
'conf/httpclient-auth.xml' are very brief, this section explains them in a 
little more detail. In all the examples below, the root element 
auth-configuration has been omitted for the sake of clarity.
  
  === Crawling an Intranet with Default Authentication Scope ===
  Let's say all pages of an intranet are protected by Basic, Digest or NTLM 
authentication and there is only one set of credentials to be used for all web 
pages in the intranet. Then a configuration as described below is enough. This 
is also the simplest possible configuration for authentication schemes.


[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2007-12-05 Thread Apache Wiki

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
re-writing document as per latest v0.5 patch

--
  'protocol-httpclient' is a protocol plugin which supports retrieving 
documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with 
Basic, Digest and NTLM authentication schemes for web server as well as proxy 
server.
  
  == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 
'protocol-httpclient' supported HTTPS and had code for NTLM authentication, 
but the NTLM authentication didn't work due to a bug. Some portions of 
'protocol-httpclient' were re-written to solve these problems and to provide 
additional features, such as authentication support for proxy servers and 
better inline documentation for the properties to be used in 'nutch-site.xml' 
to enable 'protocol-httpclient' and use its authentication features. The 
author of these features (Susam Pal) has tested them at Infosys Technologies 
Limited by crawling the corporate intranet, which requires NTLM 
authentication, and found them to work well.
+ There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 
'protocol-httpclient' supported HTTPS and had code for NTLM authentication, 
but the NTLM authentication didn't work due to a bug. Some portions of 
'protocol-httpclient' were re-written to solve these problems and to provide 
additional features, such as authentication support for proxy servers and 
better inline documentation for the properties to be used in 
'httpclient-auth.xml' to enable 'protocol-httpclient' and use its 
authentication features. The author of these features (Susam Pal) has tested 
them at Infosys Technologies Limited by crawling the corporate intranet, which 
requires NTLM authentication, and found them to work well.
  
  == Download ==
- Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk.
+ Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk. The latest patch is named as 
[https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch 
NUTCH-559v0.5.patch].
  
  == Configuration ==
- This is an advanced feature that lets the user specify different credentials 
for different authentication scopes. This section does not describe the default 
configuration. Some parts of this section might be outdated. It is better to 
read the guidelines in 'conf/httpclient-auth.xml' because they are correct. 
This section will be improved later when time permits.
+ Since the example and explanation provided as comments in 
'conf/httpclient-auth.xml' are very brief, this section explains them in more 
detail. The section starts with a few very simple examples which would suffice 
for most real-life situations. Complex cases are described later in this 
article. The root element, auth-configuration, has been omitted from all the 
examples below for the sake of clarity.
+ 
+ === Crawling an intranet with default authentication scope ===
+ Let's say all pages of an intranet are protected by Basic, Digest or NTLM 
authentication and there is only one set of credentials to be used for all web 
pages in the intranet. Then a configuration as described below is enough. This 
is also the simplest possible configuration for authentication schemes.
+ 
+ {{{<credentials username="susam" password="masus">
+  <default/>
+ </credentials>}}}
+ 
+ The credentials specified above would be sent to any page requesting 
authentication. Though it is extremely simple, the default authentication 
scope should be used with caution. Because this set of credentials would be 
sent to any web page requesting authentication, a malicious user can steal the 
credentials used in the configuration by setting up a web page requiring Basic 
authentication. Therefore, we usually use credentials set apart for crawling 
only, so that even if a user steals the credentials, he wouldn't be able to do 
anything harmful. If you are sure that all pages in the intranet use a 
particular authentication scheme, say, NTLM, then this situation can be 
improved a little in this manner.
+ 
+ {{{<credentials username="susam" password="masus">
+  <default scheme="ntlm"/>
+ </credentials>}}}
+ 
+ Thus, this set of credentials would be sent to pages 

[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2007-11-04 Thread Apache Wiki

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
removed conf/nutch-site.xml conf

--
  == Download ==
  Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk.
  
+ == Configuration ==
+ This is an advanced feature that lets the user specify different credentials 
for different authentication scopes. This section does not describe the default 
configuration. Some parts of this section might be outdated. It is better to 
read the guidelines in 'conf/httpclient-auth.xml' because they are correct. 
This section will be improved later when time permits.
- == Common Credentials Configuration ==
- This is the simplest possible configuration which involves setting just one 
set of credentials. It is useful in trusted Intranets where all sites are 
trusted and require the same username/password for authentication.
- 
- === Quick Guide ===
-  1. Include 'protocol-httpclient' in 'plugin.includes'.
-  1. For basic or digest authentication in proxy server, set 
'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' 
if you want to specify a realm as the authentication scope.
-  1. For NTLM authentication in proxy server, set 'http.proxy.username', 
'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 
'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where 
the crawler is running.
-  1. For basic or digest authentication in web servers, set 
'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if 
you want to specify a realm as the authentication scope.
-  1. For NTLM authentication in web servers, set 'http.auth.username', 
'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' 
is the NTLM domain name. 'http.auth.host' is the host where the crawler is 
running.
- 
- This is explained in details in the following section.
- 
- === Details ===
- To use 'protocol-httpclient', 'conf/nutch-site.xml' has to be edited to 
include some properties, which are explained in this section. First and 
foremost, to enable the plugin, it must be added to the 'plugin.includes' 
property of 'nutch-site.xml'. So, this property would typically look like:
- 
- {{{<property>
-   <name>plugin.includes</name>
-   <value>protocol-httpclient|urlfilter-regex|...</value>
-   <description>...</description>
- </property>}}}
- 
- (... indicates a long line truncated)
- 
- Next, if authentication is required for proxy server, the following 
properties need to be set in 'conf/nutch-site.xml'.
- 
-  * http.proxy.username
-  * http.proxy.password
-  * http.proxy.realm (If a realm needs to be provided. In case of NTLM 
authentication, the domain name should be provided as its value.)
-  * http.auth.host (This is required in case of NTLM authentication only. This 
is the host where the crawler would be running.)
- 
- If the web servers of the intranet are in a particular domain or realm and 
require authentication, these properties should be set in 
'conf/nutch-site.xml'.
- 
-  * http.auth.username
-  * http.auth.password
-  * http.auth.realm
-  * http.auth.host
- 
- The explanation for these properties is similar to that of the proxy 
authentication properties. As you might have noticed, 'http.auth.host' is used 
for proxy NTLM authentication as well as web server NTLM authentication. Since 
the host from which the HTTP requests originate is the same for both, the 
same property is used for both and two different properties were not created.
- 
- Even though the 'http.auth.host' property is required only for NTLM 
authentication, it is advisable to set it in all cases so that, if the 
crawler comes across a server which requires NTLM authentication (which you 
might not have anticipated), the crawler can still fetch the page.
- 
- == Authentication Scope Specific Credentials ==
- This is an advanced feature that lets the user specify different credentials 
for different authentication scopes.
  
  === Quick Guide ===
  An example of 'conf/httpclient-auth.xml' configuration is provided below:
@@ -98, +57 @@

  
  The 'realm' attribute is optional in the authscope tag and can be omitted if 
you want the credentials to be used for all realms on a particular web-server 
(or all remaining realms as shown in the Quick Guide section above). One 
authentication scope should not be defined twice as different authscope tags 
for different credentials tags. However, if this is done by mistake, the 
credentials for the last defined authscope tag would be used. This is 
because the XML parsing code reads the file from top to bottom and sets the 
credentials for 

[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2007-10-31 Thread Apache Wiki

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
typo fixes

--
  Even though the 'http.auth.host' property is required only for NTLM 
authentication, it is advisable to set it in all cases so that, if the 
crawler comes across a server which requires NTLM authentication (which you 
might not have anticipated), the crawler can still fetch the page.
  
  == Authentication Scope Specific Credentials ==
- This is an advanced feature that lets the user specify different credentials 
for different authentication scopes. After that you might want to try this out 
and appreciate the advantages.
+ This is an advanced feature that lets the user specify different credentials 
for different authentication scopes.
  
  === Quick Guide ===
  An example of 'conf/httpclient-auth.xml' configuration is provided below:
@@ -96, +96 @@

  
  If a page, say, 'http://192.168.101.34/index.jsp' requires authentication, 
then the common credentials would be used since there is no credential defined 
for this scope.
  
- The 'realm' attribute is optional in the authscope tag and can be omitted if 
you want the credentials to be used for all realms on a particular web-server 
(or all remaining realms as shown in the Quick Guide section above). One 
authentication scope should not be defined twice as different authscope tags 
for different credentials tags. However, if this is done by mistake, The 
credentials for the last defined authscope tag would be used. This is 
because the XML parsing code reads the file from top to bottom and sets the 
credentials for authentication scopes. If the same authentication scope is 
encountered again, it will be overwritten with the new credentials. 
However, one should not rely on this behavior as it might change with further 
developments.
+ The 'realm' attribute is optional in the authscope tag and can be omitted if 
you want the credentials to be used for all realms on a particular web-server 
(or all remaining realms as shown in the Quick Guide section above). One 
authentication scope should not be defined twice as different authscope tags 
for different credentials tags. However, if this is done by mistake, the 
credentials for the last defined authscope tag would be used. This is 
because the XML parsing code reads the file from top to bottom and sets the 
credentials for authentication scopes. If the same authentication scope is 
encountered again, it will be overwritten with the new credentials. 
However, one should not rely on this behavior as it might change with further 
developments.
  
  
  == Underlying HttpClient Library ==


[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2007-09-24 Thread Apache Wiki

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
Initial draft copied from protocol-http11

New page:
== Introduction ==
'protocol-httpclient' is a protocol plugin which supports retrieving documents 
via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest 
and NTLM authentication schemes for web server as well as proxy server.

== Author of Authentication Features ==
Susam Pal, Infosys Technologies Limited

== Necessity ==
There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 
'protocol-httpclient' supported HTTPS and had code for NTLM authentication, 
but the NTLM authentication didn't work due to a bug. Some portions of 
'protocol-httpclient' were written to solve these problems and to provide 
additional features, such as authentication support for proxy servers and 
better inline documentation for the properties to be used in 
'nutch-site.xml' to enable 'protocol-http11' and use its authentication 
features. This is an improvement on the previous two plugins. The author of 
the authentication features has tested them at Infosys Technologies Limited by 
crawling the corporate intranet, which requires NTLM authentication, and found 
them to work well.

== Download ==
Currently, this plugin is in the form of a patch in JIRA. Download the patch from 
[https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it 
to trunk.

== Quick Guide ==
This section is a quick guide to configure authentication related properties 
for 'protocol-httpclient'.

 1. Include 'protocol-httpclient' in 'plugin.includes'.
 1. For basic or digest authentication in proxy server, set 
'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' 
if you want to specify a realm as the authentication scope.
 1. For NTLM authentication in proxy server, set 'http.proxy.username', 
'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 
'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where 
the crawler is running.
 1. For basic or digest authentication in web servers, set 'http.auth.username' 
and 'http.auth.password'. Also, set 'http.auth.realm' if you want to specify a 
realm as the authentication scope.
 1. For NTLM authentication in web servers, set 'http.auth.username', 
'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' 
is the NTLM domain name. 'http.auth.host' is the host where the crawler is 
running.
 1. It is recommended that 'http.useHttp11' be set to true.

This is explained in a little more detail in the next section.

== Nutch Configuration ==
To use 'protocol-httpclient', 'conf/nutch-site.xml' has to be edited to include 
some properties, which are explained in this section. First and foremost, to 
enable the plugin, it must be added to the 'plugin.includes' property of 
'nutch-site.xml'. So, this property would typically look like:

{{{<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|...</value>
  <description>...</description>
</property>}}}

(... indicates a long line truncated)

Next, if authentication is required for proxy server, the following properties 
need to be set in 'conf/nutch-site.xml'.

 * http.proxy.username
 * http.proxy.password
 * http.proxy.realm (If a realm needs to be provided. In case of NTLM 
authentication, the domain name should be provided as its value.)
 * http.auth.host (This is required in case of NTLM authentication only. This 
is the host where the crawler would be running.)

If the web servers of the intranet are in a particular domain or realm and 
require authentication, these properties should be set in 
'conf/nutch-site.xml'.

 * http.auth.username
 * http.auth.password
 * http.auth.realm
 * http.auth.host

The explanation for these properties is similar to that of the proxy 
authentication properties. As you might have noticed, 'http.auth.host' is used 
both for proxy NTLM authentication and web server NTLM authentication. Since 
the host from which the HTTP requests originate is the same for both, the 
same property is used for both and two different properties were not created.

Even though the 'http.auth.host' property is required only for NTLM 
authentication, it is advisable to set it in all cases so that, if the 
crawler comes across a server which requires NTLM authentication (which you 
might not have anticipated), the crawler can still fetch the page.
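
As a rough sketch of the web server case described above, the fragment below 
shows how the four properties might look in 'conf/nutch-site.xml' for NTLM 
authentication. The username, password, domain name 'EXAMPLEDOMAIN' and host 
name are hypothetical placeholders, and the property layout is assumed to 
match the 'plugin.includes' example earlier in this section.

{{{<!-- all values below are placeholders for illustration -->
<property>
  <name>http.auth.username</name>
  <value>susam</value>
</property>
<property>
  <name>http.auth.password</name>
  <value>masus</value>
</property>
<property>
  <!-- for NTLM, the realm property carries the NTLM domain name -->
  <name>http.auth.realm</name>
  <value>EXAMPLEDOMAIN</value>
</property>
<property>
  <!-- host on which the crawler is running -->
  <name>http.auth.host</name>
  <value>crawler-host.example.com</value>
</property>}}}

The proxy case would presumably look the same with the 'http.proxy.*' 
properties in place of 'http.auth.username', 'http.auth.password' and 
'http.auth.realm', while 'http.auth.host' stays shared between the two.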

== Underlying HttpClient Library ==
'protocol-httpclient' is based on 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons 
HttpClient]. Some servers support multiple schemes for authenticating users. 
Given that only one