My task is to make an intranet site searchable. I crawl the site with Nutch
and index it in Solr. Nutch is installed and works well for sites without
authentication. However, for an HTTPS site that requires a login, it is not
working. I have modified nutch-site.xml as follows:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

My httpclient-auth.xml looks like this:

<auth-configuration>
<credentials username="myuserid" password="mypasswd">
  <default/>
</credentials>
</auth-configuration>
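
For comparison, here is a minimal sketch of a host-scoped variant of the same
file. The host is taken from the URL further down; port 443 is an assumption
based on the site being HTTPS. As far as I understand, protocol-httpclient
only sends these credentials in response to an HTTP authentication challenge
(Basic, Digest or NTLM), so this file would not help if the site expects a
login form or login parameters in the URL.

```xml
<auth-configuration>
  <!-- Credentials used only for the host/port given in authscope.
       Port 443 is assumed; realm and scheme attributes are optional. -->
  <credentials username="myuserid" password="mypasswd">
    <authscope host="abc.xyz.com" port="443"/>
  </credentials>
</auth-configuration>
```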

However, the fetched URLs return no content, because authentication is not
happening.

A URL format that does work is:
https://abc.xyz.com/pages/viewpage.action?&os_username=myuserid&os_password=mypasswd
However, once Nutch crawls this page, the links it finds do not have
&os_username=myuserid&os_password=mypasswd
appended to the URL, so fetching them returns no content.


Is there a way to append parameters to every URL found by Nutch? Or how can
I pass request parameters with the HTTPS request?
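
One approach that might work, sketched here as an untested assumption, is to
let the urlnormalizer-regex plugin (already listed in plugin.includes above)
append the parameters through rules in regex-normalize.xml. The host name and
parameter names come from the working URL above; the two patterns are my own
guess at what is needed. Note that & has to be written as &amp; inside the
XML file.

```xml
<regex-normalize>
  <!-- URLs on this host with no query string: start one with the login parameters -->
  <regex>
    <pattern>^(https://abc\.xyz\.com/[^?#]*)$</pattern>
    <substitution>$1?os_username=myuserid&amp;os_password=mypasswd</substitution>
  </regex>
  <!-- URLs on this host that have a query string but no os_username yet: append -->
  <regex>
    <pattern>^(https://abc\.xyz\.com/(?!.*os_username=)[^?#]*\?[^#]*)$</pattern>
    <substitution>$1&amp;os_username=myuserid&amp;os_password=mypasswd</substitution>
  </regex>
</regex-normalize>
```

The negative lookahead keeps the second rule from appending the parameters
twice. One drawback is that the password would then be stored in every URL in
the crawl db and in the Solr index, so this is probably only acceptable for a
dedicated read-only crawl account.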

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Send-parameters-to-a-url-tp4056721.html
Sent from the Nutch - User mailing list archive at Nabble.com.
