The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpPostAuthentication

------------------------------------------------------------------------------
  == Introduction ==
- Often, Nutch has to crawl websites with pages protected by authentication. Therefore, to crawl such web-pages, Nutch must authenticate itself to the website and then proceed with fetching the pages from it. Currently, the development version of Nutch can do Basic, Digest and NTLM] based authentication. This is documented in HttpAuthenticationSchemes. In this project, we would be adding HTTP POST based authentication, which is the most popular form of authentication on most websites. It should be possible to configure different credentials for different websites.
+ Often, Nutch has to crawl websites with pages protected by authentication. Therefore, to crawl such web-pages, Nutch must authenticate itself to the website and then proceed with fetching the pages from it. Currently, the development version of Nutch can do Basic, Digest and NTLM based authentication. This is documented in HttpAuthenticationSchemes. In this project, we would be adding HTTP POST based authentication, which is the most popular form of authentication on most websites. It should be possible to configure different credentials for different websites.

  == Configuration ==
  A configuration file with a list of domains for which authentication should be done along with the login URL and POST data. If possible, the configuration should also allow the user to mention a session timeout value for websites as an optional parameter. This would be helpful if some website is known to timeout very quickly, or when the duration of the fetch cycle would be too long as compared to the session's life.

@@ -23, +23 @@
   1. We use pattern matching to find out whether the contents of the page indicates it as an authentication failure page or not, for the website.
      But it is an unnecessary waste of time because for most cases the page wouldn't be an error page.
   1. We perform an authentication by sending POST data to login URL every time we fetch a page from that domain. By this, we are almost doubling the bandwidth requirement to crawl that website.
-  1. For those sites, where authentication failure page comes from a known URL, we can add which URLs mean authentication failure along with the login URL and POST data in the configuration file. There wouldn't be too many such URLs for a particular domain and so a regex match or a complete string match for the URLs after every response
+  1. For those sites, where authentication failure page comes from a known URL, we can add which URLs mean authentication failure along with the login URL and POST data in the configuration file. There wouldn't be too many such URLs for a particular domain and so a regex match or a complete string match for the URLs after every response from that domain shouldn't consume much time.
- from that domain shouldn't consume much time.

  However, even without taking care of these points, and simply getting the fetcher behavior right as discussed in the previous section, we'll have a solution that may be useful to many.
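
To make the configuration idea and the URL-based failure check above more concrete, here is a minimal sketch in Python (Nutch itself is written in Java; Python is used here only for brevity). Everything named in it is an assumption made for illustration: `SiteAuth`, `AUTH_CONFIG`, `needs_login`, the field names, and the `example.com` URLs and credentials are hypothetical and are not part of Nutch or of this proposal.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from urllib.parse import urlsplit


@dataclass
class SiteAuth:
    """Hypothetical per-domain entry; all field names are illustrative."""
    login_url: str
    post_data: Dict[str, str]
    # URLs (as regexes) that this site is known to serve on authentication failure.
    failure_url_patterns: List[str] = field(default_factory=list)
    # Optional session timeout, as suggested in the Configuration section.
    session_timeout_secs: Optional[int] = None

    def is_auth_failure(self, response_url: str) -> bool:
        # A full regex match against a handful of patterns is cheap per response.
        return any(re.fullmatch(p, response_url) for p in self.failure_url_patterns)


# Hypothetical configuration: one entry per domain that requires a POST login.
AUTH_CONFIG = {
    "example.com": SiteAuth(
        login_url="http://example.com/login",
        post_data={"user": "crawler", "pass": "secret"},
        failure_url_patterns=[
            r"http://example\.com/login(\?.*)?",
            r"http://example\.com/denied",
        ],
        session_timeout_secs=1800,
    ),
}


def needs_login(response_url: str, config: Dict[str, SiteAuth] = AUTH_CONFIG) -> bool:
    """Return True if the final response URL signals an authentication failure."""
    host = urlsplit(response_url).hostname
    site = config.get(host)
    return site is not None and site.is_auth_failure(response_url)
```

With this shape, a fetcher would only re-send the POST data to `login_url` when `needs_login` returns True for a response, instead of logging in before every fetch or scanning every page body.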