[Nutch Wiki] Trivial Update of "HttpPostAuthentication" by susam

Apache Wiki Fri, 05 Dec 2008 10:40:46 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpPostAuthentication

------------------------------------------------------------------------------
  == Introduction ==
- Often, Nutch has to crawl websites with pages protected by authentication. 
Therefore, to crawl such web-pages, Nutch must authenticate itself to the 
website and then proceed with fetching the pages from it. Currently, the 
development version of Nutch can do Basic, Digest and NTLM] based 
authentication. This is documented in HttpAuthenticationSchemes. In this 
project, we would be adding HTTP POST based authentication, which is the most 
popular form of authentication on most websites. It should be possible to 
configure different credentials for different websites.
+ Often, Nutch has to crawl websites with pages protected by authentication. 
Therefore, to crawl such web-pages, Nutch must authenticate itself to the 
website and then proceed with fetching the pages from it. Currently, the 
development version of Nutch can do Basic, Digest and NTLM based 
authentication. This is documented in HttpAuthenticationSchemes. In this 
project, we would be adding HTTP POST based authentication, which is the most 
popular form of authentication on most websites. It should be possible to 
configure different credentials for different websites.
  
  == Configuration ==
  A configuration file lists the domains that require authentication, along 
with the login URL and POST data for each. If possible, the configuration 
should also let the user specify a session timeout for a website as an 
optional parameter. This would help when a website is known to time out very 
quickly, or when the fetch cycle lasts much longer than the session's 
lifetime.
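
As a rough illustration of what such a configuration might hold, the sketch below models one entry per domain with a login URL, POST data, and an optional session timeout. The field names (`login_url`, `post_data`, `session_timeout`), the dictionary layout, and the example domains are assumptions made for illustration. The page does not define a file format, and this is not Nutch code.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import urlencode


@dataclass
class SiteAuth:
    """Hypothetical per-domain login settings (all names are illustrative)."""
    login_url: str
    post_data: Dict[str, str]
    session_timeout: Optional[int] = None  # seconds; None means "not configured"

    def encoded_post_body(self) -> str:
        # Form-encode the POST data as it would be sent to the login URL.
        return urlencode(self.post_data)


def load_auth_config(raw: Dict[str, dict]) -> Dict[str, SiteAuth]:
    """Build a domain -> SiteAuth lookup from plain dictionaries."""
    return {
        domain: SiteAuth(
            login_url=entry["login_url"],
            post_data=dict(entry["post_data"]),
            session_timeout=entry.get("session_timeout"),
        )
        for domain, entry in raw.items()
    }


# Example entries; the domains and credentials are placeholders.
RAW_CONFIG = {
    "example.com": {
        "login_url": "http://example.com/login",
        "post_data": {"user": "crawler", "password": "secret"},
        "session_timeout": 600,
    },
    "example.org": {
        "login_url": "http://example.org/signin",
        "post_data": {"name": "crawler", "pass": "secret"},
    },
}

AUTH_CONFIG = load_auth_config(RAW_CONFIG)
```

Keeping the timeout optional means a site without a known session lifetime needs no extra configuration, which matches the "optional parameter" wording above.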
@@ -23, +23 @@

  
   1. We use pattern matching on the page contents to decide whether the page 
is an authentication failure page for that website. But this wastes time, 
because in most cases the page wouldn't be an error page.
   1. We authenticate by sending POST data to the login URL every time we 
fetch a page from that domain. This almost doubles the bandwidth required to 
crawl that website.
-  1. For those sites, where authentication failure page comes from a known 
URL, we can add which URLs mean authentication failure along with the login URL 
and POST data in the configuration file. There wouldn't be too many such URLs 
for a particular domain and so a regex match or a complete string match for the 
URLs after every response
+  1. For those sites, where authentication failure page comes from a known 
URL, we can add which URLs mean authentication failure along with the login URL 
and POST data in the configuration file. There wouldn't be too many such URLs 
for a particular domain and so a regex match or a complete string match for the 
URLs after every response from that domain shouldn't consume much time.
- from that domain shouldn't consume much time.
  
  However, even without taking care of these points, and simply getting the 
fetcher behavior right as discussed in the previous section, we'll have a 
solution that may be useful to many.
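
The URL-based check from the last list item can be sketched as follows. This is a minimal sketch, assuming the crawler can see the URL each response finally came from. The pattern table, the function name `is_auth_failure`, and the example URLs are hypothetical and are not part of Nutch.

```python
import re
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical per-domain patterns for URLs that signal an authentication
# failure. In practice these would sit next to the login URL and POST data
# in the configuration file.
FAILURE_URL_PATTERNS: Dict[str, List[re.Pattern]] = {
    "example.com": [
        re.compile(r"^http://example\.com/login(\?.*)?$"),
        re.compile(r"^http://example\.com/session-expired$"),
    ],
}


def is_auth_failure(response_url: str) -> bool:
    """Return True if the response URL matches a failure pattern for its domain."""
    domain = urlparse(response_url).hostname or ""
    # Only the few patterns registered for this domain are tried, so the
    # per-response cost stays small.
    patterns = FAILURE_URL_PATTERNS.get(domain, [])
    return any(p.match(response_url) for p in patterns)
```

A fetcher could call `is_auth_failure` after each response and re-send the login POST only when it returns True, instead of logging in before every fetch.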
