So did it fail before or after you used protocol-httpclient?

On 7/24/12, Kay <[email protected]> wrote:
> Hello Everyone,
>
> I am using apache nutch to crawl HTTP websites. But when I try to crawl
> HTTPS site its throwing the following error.
>
> “failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not
> found for url=https:
>
> I searched the forums and there were some suggestions to enable the
> “protocol-httpclient” plugin in nutch-site.xml. Here is the plugin that I
> use in my nutch-site .xml file
>
> <configuration>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>     <description>Regular expression naming plugin directory names to
>     include. Any plugin not matching this expression is excluded.
>     In any case you need at least include the nutch-extensionpoints plugin.
>     By default Nutch includes crawling just HTML and plain text via HTTP,
>     and basic indexing and search plugins. In order to use HTTPS please
>     enable protocol-httpclient, but be aware of possible intermittent
>     problems with the underlying commons-httpclient library.
>     </description>
>   </property>
> </configuration>
>
> Do you guys have any other suggestions on how to crawl https websites ?
>
> Your help is greatly appreciated,
>
> Thanks a lot in advance,
> -Kay
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Crawl-HTTPS-websites-Enable-Plugin-tp3996861.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
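
As a quick sanity check on the quoted plugin.includes value, here is a rough sketch (plain Python, not Nutch code) of which plugin directory names that regular expression would select. It assumes each directory name has to match the whole expression, which is how I read "Any plugin not matching this expression is excluded". The candidate names and the included_plugins helper are only illustrative. It says nothing about whether Nutch actually loaded the plugin or picked up the edited nutch-site.xml.

```python
import re

# plugin.includes value copied from the quoted nutch-site.xml
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)

# Illustrative plugin directory names to test against the expression
CANDIDATES = [
    "protocol-http",
    "protocol-httpclient",
    "parse-tika",
    "index-basic",
    "nutch-extensionpoints",
]


def included_plugins(pattern, names):
    """Return the names that match the whole pattern (assumed semantics)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


if __name__ == "__main__":
    for name in included_plugins(PLUGIN_INCLUDES, CANDIDATES):
        print(name)
```

Under that assumption protocol-httpclient is selected by the expression and protocol-http is not, so the value itself looks plausible for HTTPS.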
-- Remi Tassing

