Maybe you can use bin/nutch plugin to test the httpclient plugin first. It
uses HttpClient, which handles authenticating with servers almost
transparently; the only thing a developer must do is actually provide the
login credentials. You can see this in [0].

[0] http://hc.apache.org/httpclient-legacy/authentication.html
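
As a rough sketch of what providing the credentials could look like in
httpclient-auth.xml, here is a variant that binds them to an explicit scope
instead of <default/>. The host is taken from the example URL in the question,
and port 443 is only an assumption for https. This can only help if the server
actually answers with an HTTP Basic, Digest or NTLM challenge, which a
form-style login via os_username/os_password may not do.

```xml
<auth-configuration>
  <!-- Credentials copied from the question; replace with real values. -->
  <credentials username="myuserid" password="mypasswd">
    <!-- Assumed scope: host from the example URL, default https port. -->
    <authscope host="abc.xyz.com" port="443"/>
  </credentials>
</auth-configuration>
```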


On Thu, Apr 18, 2013 at 12:06 AM, kneerosh <[email protected]> wrote:

> My task is to make an intranet site available for searching. I crawl the
> site in Nutch and index it in Solr. I have Nutch installed and it works
> great for sites without authentication. However, for an https site, it's
> just not working. I have modified nutch-site.xml:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> And my httpclient-auth.xml looks like this:
>
> <auth-configuration>
> <credentials username="myuserid" password="mypasswd">
>   <default/>
> </credentials>
> </auth-configuration>
>
> However, the URLs are not getting any content, as authentication is not
> happening.
>
> A URL format which works is:
>
> https://abc.xyz.com/pages/viewpage.action?&os_username=myuserid&os_password=mypasswd
> However, once it crawls this page, the links it finds don't have
> &os_username=myuserid&os_password=mypasswd appended to the URL, and so it
> doesn't get any content.
>
>
> Is there a way to append parameters to every URL found by Nutch? Or how can
> I pass request parameters for the https request?
>



-- 
Don't Grow Old, Grow Up... :-)
