Hello everyone, I am using Apache Nutch to crawl HTTP websites, but when I try to crawl an HTTPS site it throws the following error:
"failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https"

I searched the forums and found some suggestions to enable the "protocol-httpclient" plugin in nutch-site.xml. Here is the plugin configuration that I use in my nutch-site.xml file:

<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin.
    By default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please
    enable protocol-httpclient, but be aware of possible intermittent
    problems with the underlying commons-httpclient library.
    </description>
  </property>
</configuration>

Do you have any other suggestions on how to crawl HTTPS websites?

Your help is greatly appreciated. Thanks a lot in advance,
-Kay
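
P.S. For anyone checking the same setup, here is a minimal sketch of how to see which plugin directory names the plugin.includes value above would select. It is written in Python rather than Java purely for brevity, and it assumes (this is an assumption, not something confirmed in this thread) that Nutch requires the expression to match the whole directory name. The function name `included` and the candidate list are made up for illustration.

```python
import re

# plugin.includes value copied from the nutch-site.xml shown above.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def included(plugin_dir: str) -> bool:
    """Return True if the directory name is selected by plugin.includes.

    Assumption: the expression has to match the entire directory name,
    so re.fullmatch is used instead of re.search.
    """
    return re.fullmatch(PLUGIN_INCLUDES, plugin_dir) is not None


if __name__ == "__main__":
    # Hypothetical candidate names, taken from the expression and its description.
    candidates = [
        "protocol-http",
        "protocol-httpclient",
        "parse-html",
        "parse-tika",
        "index-basic",
        "nutch-extensionpoints",
    ]
    for name in candidates:
        status = "included" if included(name) else "excluded"
        print(f"{name}: {status}")
```

This only checks the regular expression itself. It says nothing about whether the edited nutch-site.xml is the copy that the running crawl actually reads.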

