Alternative to httpclient for crawling with basic auth?

Theral Mackey Fri, 08 Jul 2011 14:29:22 -0700

Is there an alternative to protocol-httpclient that can do basic auth? I am
running into a wall trying to get Nutch to fetch anything past the seed URL
of my site. The site requires auth, so I configured protocol-httpclient,
which (according to the Apache logs) correctly sends credentials when the
server returns a 401, but after fetching '/', the crawl quits with:


Stopping at depth=1 - no more URLs to fetch.

Running it again stops at depth=0. The target page is an Apache mod_autoindex
page with 15 or so directories listed, so it should not be hitting any limit,
since it is only fetching the one page in total. (I turned off the
db.ignore.internal.links option, even though I think I read that it only
applies to index scoring, not the crawlDB.) I thought one of the regexp
filters might be blocking, so I trimmed them down to +.*; still nothing.

I then pointed it at a server that does not require auth, and it spat out an
"unzipBestEffort returned null" error, even though nothing on the page is a
zip/gz/tgz and server compression is not on. I traced this to NUTCH-990,
which is marked "won't fix", and everything I can find about upgrading to
httpclient4 says it won't happen. So is there an alternative, or some way to
get this working? Crawling the non-auth site with protocol-http works as
expected: Nutch starts crawling the autoindex pages and I can watch it from
the console or the Apache access log.
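
For anyone trying to reproduce this outside Nutch, here is a minimal Python
sketch of the check described above: fetch the seed page with basic auth and
list the links it exposes. It is only a sanity check under assumptions, not
what Nutch does internally. The function names are made up for illustration,
the URL and credentials passed in are placeholders, and unlike the behaviour
seen in the Apache logs (credentials sent after a 401), this sketch sends the
Authorization header up front.

```python
import base64
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs (as on autoindex pages) against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the given HTML, made absolute."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch_links(url, username, password):
    """Fetch url with credentials sent preemptively and return its outlinks."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(html, url)
```

If fetch_links() on the seed URL returns the directory links, the page itself
is reachable and parseable with basic auth, which would point the problem back
at the crawler configuration rather than at the server.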

-T
