Hi,

I need to crawl a website that requires authentication.
While going through the Nutch wiki I found this page:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
It describes how to connect to a site protected by Basic, Digest or NTLM
authentication.
I followed all the steps and tried to crawl the website, but it did not
help: many of the pages are still redirected to the login page.
Further, while checking the logs for httpclient and httpclient.auth, I found
that the following exception was thrown:
*org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout
waiting for connection*
Can someone please explain what is wrong here?

I also found another page,
http://wiki.apache.org/nutch/HttpPostAuthentication
which describes the steps to build a crawler that can crawl pages protected
by HTTP POST authentication.
Is there any new development on this?

Regards,
Sourabh
