You could open an issue for rewriting protocol-httpclient against HttpClient 4, and maybe submit 
a patch ;) 

A dirty fix would be to hack protocol-http to send a cookie or HTTP Basic auth 
credentials along with its requests.
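If you go the dirty-fix route, the header value itself is trivial to build; here is a minimal sketch (the class and credential names are mine, not anything in Nutch) of the Basic `Authorization` header you would have to append wherever protocol-http writes out its request headers:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {

    // Build the value of an HTTP "Authorization" header for Basic auth:
    // "Basic " followed by base64("user:password").
    static String basicAuth(String user, String pass) {
        byte[] raw = (user + ":" + pass).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // A patched protocol-http would emit a line like:
        // Authorization: Basic dXNlcjpzZWNyZXQ=
        System.out.println("Authorization: " + basicAuth("user", "secret"));
    }
}
```

Hard-coding credentials like this is exactly why it's a dirty fix, but it gets you past a 401 without touching the HttpClient 3 dependency.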

> Is there an alternative to protocol-httpclient that can do basic auth? I am
> running into a wall right now trying to get Nutch to fetch anything past the
> seed URL of my site. It requires auth, so I configured httpclient, which
> (according to the apache logs) is correctly sending credentials when the
> server returns a 401 auth challenge, but after getting '/', it quits
> with:
> 
> Stopping at depth=1 - no more URLs to fetch.
> 
> Running again stops at depth=0. The target page is an apache mod_autoindex
> page with 15 or so directories listed, so it should not be hitting any limit
> since it is only fetching 1 page total (I turned off the
> ignore.db.internal.links option even though I think I read it only applies
> to index scoring, not the crawlDB). I thought it might be one of the regexp
> filters blocking, so I trimmed them down to +.*, still nothing. I pointed
> it at a server that does not require auth, and it spat out an
> "unzipBestEffort returned null" error, even though nothing on the page is
> a zip/gz/tgz, and server compression is not on. I traced this to
> NUTCH-990, which is marked "won't fix", and everything pointing at
> upgrading to httpclient4 says it won't happen... so is there an
> alternative, or some way to get this working? Crawling the non-auth site
> with protocol-http works as expected: nutch starts crawling the autoindex
> pages and I can watch from the console or the apache access log.
> 
> -T
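For comparison, the non-dirty path is the one you already tried: protocol-httpclient reads its per-host credentials from conf/httpclient-auth.xml. If I remember the schema correctly, a minimal entry (host, port, and credentials below are placeholders) looks roughly like:

```xml
<auth-configuration>
  <credentials username="user" password="secret">
    <!-- Scope the credentials to the server that issues the 401 -->
    <authscope host="example.com" port="80"/>
  </credentials>
</auth-configuration>
```

Since your apache logs show the credentials going out correctly, that part of your setup is probably fine, and the stall at depth=1 is more likely the NUTCH-990 decompression bug than an auth problem.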
