[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.
The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
The comment on the change is:
role of http.agent.host in NTLM and patch committed
--
  == Necessity ==
  There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used to configure authentication. The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well.
- == Download ==
- Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. The latest patch is named as [https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch NUTCH-559v0.5.patch].
+ == JIRA NUTCH-559 ==
+ These features were submitted as [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] in the JIRA. If you have checked out the latest Nutch trunk, you don't need to apply the patches. These features were included in the Nutch subversion repository in [http://svn.apache.org/viewvc?view=rev&revision=608972 revision #608972]
  == Introduction to Authentication Scope ==
  Different credentials for different authentication scopes can be configured in 'conf/httpclient-auth.xml'. If a set of credentials is configured for a particular authentication scope (i.e. particular host, port number, realm and/or scheme), then that set of credentials would be sent only to pages falling under the specified authentication scope.
@@ -82, +82 @@
  1. For authscope tag, 'host' and 'port' attribute should always be specified. 'realm' and 'scheme' attributes may or may not be specified depending on your needs. If you are tempted to omit the 'host' and 'port' attribute, because you want the credentials to be used for any host and any port for that realm/scheme, please use the 'default' tag instead. That's what 'default' tag is meant for.
  1. One authentication scope should not be defined twice as different authscope tags for different credentials tag. However, if this is done by mistake, the credentials for the last defined authscope tag would be used. This is because, the XML parsing code, reads the file from top to bottom and sets the credentials for authentication-scopes. If the same authentication scope is encountered once again, it will be overwritten with the new credentials. However, one should not rely on this behavior as this might change with further developments.
  1. Do not define multiple authscope tags with the same host, port but different realms if the server requires NTLM authentication. This can means there should not be multiple tags with same host, port, scheme=NTLM but different realms. If you are omitting the scheme attribute and the server requires NTLM authentication, then there should not be multiple tags with same host, port but different realms. This is discussed more in the next section.
+ 1. If you are using NTLM scheme, you should also set the 'http.agent.host' property in conf/nutch-site.xml
  === A note on NTLM domains ===
- NTLM does not use the concept of realms. Therefore, multiple realms for a web-server can not be defined as different authentication scopes for the same web-server requiring NTLM authentication. There should be exactly one authscope tag for NTLM scheme authentication scope for a particular web-server. The authentication domain should be specified as the value of the 'realm' attribute.
+ NTLM does not use the concept of realms. Therefore, multiple realms for a web-server can not be defined as different authentication scopes for the same web-server requiring NTLM authentication. There should be exactly one authscope tag for NTLM scheme authentication scope for a particular web-server. The authentication domain should be specified as the value of the 'realm' attribute. NTLM authentication also requires the name of IP address of the host on which the crawler is running. Thus, 'http.agent.host' should be set properly.
  == Underlying HttpClient Library ==
  'protocol-httpclient' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.
The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
The comment on the change is:
root element omitted - restructured the sentence
--
  When authentication is required to fetch a resource from a web-server, the authentication-scope is determined from the host and port obtained from the URL of the page. If it matches any 'authscope' in this configuration file, then the 'credentials' for that 'authscope' is used for authentication.
  == Configuration ==
- Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' is very brief, therefore this section would explain it in a little more detail. The root element is auth-configuration for all the examples below which has been omitted for the sake of clarity.
+ Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' is very brief, therefore this section would explain it in a little more detail. In all the examples below, the root element auth-configuration has been omitted for the sake of clarity.
  === Crawling an Intranet with Default Authentication Scope ===
  Let's say all pages of an intranet are protected by basic, digest or ntlm authentication and there is only one set of credentials to be used for all web pages in the intranet, then a configuration as described below is enough. This is also the simplest possible configuration possible for authentication schemes.
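
The scope lookup described in this change (take the host and port from the page URL, use the credentials of a matching 'authscope', otherwise fall back to the 'default' credentials) can be illustrated with a small sketch. This is only an illustration in Python, not the plugin's actual Java code. The XML layout is reconstructed from the examples on the wiki page; the second set of credentials, the host 192.168.101.33 and the function names load_scopes and credentials_for are hypothetical. Realm and scheme matching are left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Illustrative configuration only; the second user and the host are made up.
AUTH_XML = """
<auth-configuration>
  <credentials username="susam" password="masus">
    <default/>
  </credentials>
  <credentials username="admin" password="nimda">
    <authscope host="192.168.101.33" port="80"/>
  </credentials>
</auth-configuration>
"""

# Ports assumed when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def load_scopes(xml_text):
    """Return ({(host, port): (user, password)}, default credentials or None)."""
    scopes, default = {}, None
    # Read the file from top to bottom, one 'credentials' element at a time.
    for cred in ET.fromstring(xml_text).iter("credentials"):
        pair = (cred.get("username"), cred.get("password"))
        for child in cred:
            if child.tag == "default":
                default = pair
            elif child.tag == "authscope":
                # A scope defined twice is overwritten by the later entry.
                scopes[(child.get("host"), int(child.get("port")))] = pair
    return scopes, default


def credentials_for(url, scopes, default):
    """Pick credentials from the host and port of the URL, else the default."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return scopes.get((parts.hostname, port), default)


scopes, default = load_scopes(AUTH_XML)
print(credentials_for("http://192.168.101.33/index.jsp", scopes, default))
print(credentials_for("http://192.168.101.34/index.jsp", scopes, default))
```

In this sketch the first URL falls under the configured scope and the second one does not, so the second lookup returns the default credentials. That fallback is also why the default scope should be used with caution: any page that asks for authentication receives those credentials.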
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.
The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
The comment on the change is:
re-writing document as per latest v0.5 patch
--
  'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
  == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-httpclient' and use its authentication features. The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well.
+ There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'httpclient-auth.xml' to enable 'protocol-httpclient' and use its authentication features. The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well.
  == Download ==
- Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk.
+ Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. The latest patch is named as [https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch NUTCH-559v0.5.patch].
  == Configuration ==
- This is an advanced feature that lets the user specify different credentials for different authentication scopes. This section does not describe the default configuration. Some parts of this section might be outdated. It is better to read the guidelines in 'conf/httpclient-auth.xml' because they are correct. This section will be improved later when time permits.
+ Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' is very crisp, therefore this section would explain it in more details. The section starts with a few very simple examples which would suffice for most real life situations. Complex cases are described later in this article. The root element is auth-configuration for all the examples below which has been omitted for the sake of clarity.
+
+ === Crawling an intranet with default authentication scope ===
+ Let's say all pages of an intranet are protected by basic, digest or ntlm authentication and there is only one set of credentials to be used for all web pages in the intranet, then a configuration as described below is enough. This is also the simplest possible configuration possible for authentication schemes.
+
+ {{{<credentials username="susam" password="masus">
+ <default/>
+ </credentials>}}}
+
+ The credentials specified above would be sent to any page requesting authentication. Though it is extremely simple, default authentication scope should be used with caution. This set of credentials would be sent to any web-page requesting for authentication and therefore, a malicious user can steal the credentials used in the configuration by setting up a web-page requiring Basic authentication. Therefore, we usually use credentials set apart for crawling only so that even if a user steals the credentials, he wouldn't be able to do anything harmful. If you are sure, that all pages in the intranet use a particular authentication scheme, say, NTLM, then this situation can be improved a little in this manner.
+
+ {{{<credentials username="susam" password="masus">
+ <default scheme="ntlm"/>
+ </credentials>}}}
+
+ Thus, these set of credentials would be sent to pages
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.
The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
The comment on the change is:
removed conf/nutch-site.xml conf
--
  == Download ==
  Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk.
+ == Configuration ==
+ This is an advanced feature that lets the user specify different credentials for different authentication scopes. This section does not describe the default configuration. Some parts of this section might be outdated. It is better to read the guidelines in 'conf/httpclient-auth.xml' because they are correct. This section will be improved later when time permits.
- == Common Credentials Configuration ==
- This is the simplest possible configuration which involves setting just one set of credentials. It is useful in trusted Intranets where all sites are trusted and require the same username/password for authentication.
-
- === Quick Guide ===
- 1. Include 'protocol-httpclient' in 'plugin.includes'.
- 1. For basic or digest authentication in proxy server, set 'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' if you want to specify a realm as the authentication scope.
- 1. For NTLM authentication in proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running.
- 1. For basic or digest authentication in web servers, set 'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if you want to specify a realm as the authentication scope.
- 1. For NTLM authentication in proxy server, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running.
-
- This is explained in details in the following section.
-
- === Details ===
- To use 'protocol-httpclient', 'conf/nutch-site.xml has to be edited to include some properties which is explained in this section. First and foremost, to enable the plugin, this plugin must be added in the 'plugin.includes' of 'nutch-site.xml'. So, this property would typically look like:-
-
- {{{<property>
- <name>plugin.includes</name>
- <value>protocol-httpclient|urlfilter-regex|...</value>
- <description>...</description>
- </property>}}}
-
- (... indicates a long line truncated)
-
- Next, if authentication is required for proxy server, the following properties need to be set in 'conf/nutch-site.xml'.
-
- * http.proxy.username
- * http.proxy.password
- * http.proxy.realm (If a realm needs to be provided. In case of NTLM authentication, the domain name should be provided as its value.)
- * http.auth.host (This is required in case of NTLM authentication only. This is the host where the crawler would be running.)
-
- If the web servers of the intranet are in a particular domain or realm and requires authentication, these properties should be set in 'conf/nutch-site.xml'.
-
- * http.auth.username
- * http.auth.password
- * http.auth.realm
- * http.auth.host
-
- The explanation for these properties are similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used for proxy NTLM authentication as well as web server NTLM authentication. Since, the host at which the HTTP requests are originating are same for both, so the same property is used for both and two different properties were not created.
-
- Even though, the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set this for all cases, because, in case the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), the crawler can still fetch the page.
-
- == Authentication Scope Specific Credentials ==
- This is an advanced feature that lets the user specify different credentials for different authentication scopes.
  === Quick Guide ===
  An example of 'conf/httpclient-auth.xml' configuration is provided below:
@@ -98, +57 @@
  The 'realm' attribute is optional in authscope tag and it can be omitted if you want the credentials to be used for all realms on a particular web-server (or all remaining realms as shown in the Quick Guide section above). One authentication scope should not be defined twice as different authscope tags for different credentials tag. However, if this is done by mistake, the credentials for the last defined authscope tag would be used. This is because, the XML parsing code, reads the file from top to bottom and sets the credentials for
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.
The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
The comment on the change is:
typo fixes
--
  Even though, the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set this for all cases, because, in case the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), the crawler can still fetch the page.
  == Authentication Scope Specific Credentials ==
- This is an advanced feature that lets the user specify different credentials for different authentication scopes. After that you might want to try this out and appreciate the advantages.
+ This is an advanced feature that lets the user specify different credentials for different authentication scopes.
  === Quick Guide ===
  An example of 'conf/httpclient-auth.xml' configuration is provided below:
@@ -96, +96 @@
  If a page, say, 'http://192.168.101.34/index.jsp' requires authentication, then the common credentials would be used since there is no credential defined for this scope.
- The 'realm' attribute is optional in authscope tag and it can be omitted if you want the credentials to be used for all realms on a particular web-server (or all remaining realms as shown in the Quick Guide section above). One authentication scope should not be defined twice as different authscope tags for different credentials tag. However, if this is done by mistake, The credentials for the last defined authscope tag would be used. This is because, the XML parsing code, reads the file from top to bottom and sets the credentials for authentication-scopes. If the same authentication scope is encountered once again, it will be overwritten with the new credentials. However, one should not rely on this behavior as this might change with further developments.
+ The 'realm' attribute is optional in authscope tag and it can be omitted if you want the credentials to be used for all realms on a particular web-server (or all remaining realms as shown in the Quick Guide section above). One authentication scope should not be defined twice as different authscope tags for different credentials tag. However, if this is done by mistake, the credentials for the last defined authscope tag would be used. This is because, the XML parsing code, reads the file from top to bottom and sets the credentials for authentication-scopes. If the same authentication scope is encountered once again, it will be overwritten with the new credentials. However, one should not rely on this behavior as this might change with further developments.
  == Underlying HttpClient Library ==
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.
The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
The comment on the change is:
Initial draft copied from protocol-http11

New page:
== Introduction ==
'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.

== Author of Authentication Features ==
Susam Pal, Infosys Technologies Limited

== Necessity ==
There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. This is an improvement on the previous two plugins. The author of the authentication features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well.

== Download ==
Currently, this plugin is in the form of patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk.

== Quick Guide ==
This section is a quick guide to configure authentication related properties for 'protocol-httpclient'.
 1. Include 'protocol-httpclient' in 'plugin.includes'.
 1. For basic or digest authentication in proxy server, set 'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' if you want to specify a realm as the authentication scope.
 1. For NTLM authentication in proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running.
 1. For basic or digest authentication in web servers, set 'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if you want to specify a realm as the authentication scope.
 1. For NTLM authentication in proxy server, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running.
 1. It is recommended that 'http.useHttp11' be set to true.

This is explained in a little more detail in the next section.

== Nutch Configuration ==
To use 'protocol-httpclient', 'conf/nutch-site.xml has to be edited to include some properties which is explained in this section. First and foremost, to enable the plugin, this plugin must be added in the 'plugin.includes' of 'nutch-site.xml'. So, this property would typically look like:-

{{{<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|...</value>
<description>...</description>
</property>}}}

(... indicates a long line truncated)

Next, if authentication is required for proxy server, the following properties need to be set in 'conf/nutch-site.xml'.

 * http.proxy.username
 * http.proxy.password
 * http.proxy.realm (If a realm needs to be provided. In case of NTLM authentication, the domain name should be provided as its value.)
 * http.auth.host (This is required in case of NTLM authentication only. This is the host where the crawler would be running.)

If the web servers of the intranet are in a particular domain or realm and requires authentication, these properties should be set in 'conf/nutch-site.xml'.

 * http.auth.username
 * http.auth.password
 * http.auth.realm
 * http.auth.host

The explanation for these properties are similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used both for proxy NTLM authentication and web server NTLM authentication. Since, the host at which the HTTP requests are originating are same for both, so the same property is used for both and two different properties were not created.

Even though, the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set this for all cases, because, in case the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), the crawler can still fetch the page.

== Underlying HttpClient Library ==
'protocol-httpclient' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one