Is there an alternative to protocol-httpclient that can do basic auth? I am running into a wall trying to get Nutch to fetch anything past the seed URL of my site. The site requires auth, so I configured httpclient, which (according to the Apache logs) correctly sends credentials when the server returns a 401 auth request. But after getting '/', it quits with:
Stopping at depth=1 - no more URLs to fetch.

Running again stops at depth=0. The target page is an Apache mod_autoindex page with 15 or so directories listed, so it should not be hitting any limit, since it is only fetching 1 page in total. (I turned off the db.ignore.internal.links option, even though I think I read that it only applies to index scoring, not the crawlDB.) I thought one of the regexp filters might be blocking the links, so I trimmed them down to +.*, and still nothing.

I pointed it at a server that does not require auth, and it spat out an "unzipBestEffort returned null" error, even though nothing on the page is a zip/gz/tgz and server compression is not on. I traced this to NUTCH-990, which is marked "won't fix", and everything pointing at upgrading to httpclient4 says it won't happen. So is there an alternative, or some way to get this working? Crawling the non-auth site with protocol-http works as expected: Nutch starts crawling the autoindex pages, and I can watch from the console or the Apache access log.

-T

