Maybe you can use bin/nutch plugin to test the httpclient plugin first. It uses HttpClient, which handles authenticating with servers almost transparently; the only thing a developer must do is actually provide the login credentials. You can see this in [0].
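
Something like the following might be a starting point for that test. It is only a sketch: it assumes you run it from your Nutch runtime directory (or set NUTCH_HOME, a variable I am introducing here), that the protocol class is org.apache.nutch.protocol.httpclient.Http as in the 1.x sources, and it reuses the URL from your mail without the os_username/os_password parameters. Please check the class name against your version.

```shell
#!/bin/sh
# Sketch: fetch a single page through the protocol-httpclient plugin,
# outside of a full crawl, to see whether authentication works.
# Assumption: NUTCH_HOME points at the Nutch runtime directory
# (defaults to the current directory).
NUTCH_HOME="${NUTCH_HOME:-.}"

# URL taken from the original question, without the credential parameters.
URL="https://abc.xyz.com/pages/viewpage.action"

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  # "bin/nutch plugin <pluginId> <className> [args...]" runs the main()
  # of a class that lives inside the named plugin.
  "$NUTCH_HOME/bin/nutch" plugin protocol-httpclient \
    org.apache.nutch.protocol.httpclient.Http "$URL"
else
  echo "bin/nutch not found under $NUTCH_HOME" >&2
fi
```

If the page content comes back, the credentials in httpclient-auth.xml are being picked up; if you get a login page or an error instead, the problem is in the authentication step rather than in the crawl itself.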
[0] http://hc.apache.org/httpclient-legacy/authentication.html

On Thu, Apr 18, 2013 at 12:06 AM, kneerosh <[email protected]> wrote:

> My task is to make available an intranet site for searching. I crawl the
> site in nutch and index in solr. I have nutch installed it works great for
> sites without authentication. However , for an https site, its just not
> working. I have modified the nutch site.xml
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> And my httpclient-auth.xml is like this :--
>
> <auth-configuration>
>   <credentials username="myuserid" password="mypasswd">
>     <default/>
>   </credentials>
> </auth-configuration>
>
> However the urls are not getting any content, as authentication is not
> happening.
>
> A url format which works is :-
>
> https://abc.xyz.com/pages/viewpage.action?&os_username=myuserid&os_password=mypasswd
>
> However once it crawls this page, the links it finds dont have the
> &os_username=myuserid&os_password=mypasswd
> appended to the url, and so it doesn't get any content
>
> Is there a way to append parameters to every url found by nutch? Or how can
> I pass request parameters for the https request?'
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Send-parameters-to-a-url-tp4056721.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)

