Re: Crawl HTTPS websites/Enable Plugin

remi tassing Mon, 23 Jul 2012 22:20:56 -0700

So did it fail before or after you used protocol-httpclient?

On 7/24/12, Kay wrote:
> Hello Everyone,
>
> I am using Apache Nutch to crawl HTTP websites. But when I try to crawl
> an HTTPS site, it throws the following error.
>
> “failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https:
>
> I searched the forums and there were some suggestions to enable the
> “protocol-httpclient” plugin in nutch-site.xml. Here is the plugin setting
> that I use in my nutch-site.xml file:
> <configuration>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
>
> </configuration>
>
> Do you guys have any other suggestions on how to crawl HTTPS websites?
>
> Your help is greatly appreciated,
>
>
> Thanks a lot in advance,
> -Kay
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Crawl-HTTPS-websites-Enable-Plugin-tp3996861.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
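
For reference, the plugin.includes value quoted above is a regular
expression over plugin directory names, so it can be checked outside Nutch.
Below is a minimal Python sketch, assuming the expression has to match the
whole plugin name; is_included is a hypothetical helper for illustration,
not part of Nutch, and Nutch itself uses Java regular expressions.

```python
import re

# The plugin.includes value from the quoted nutch-site.xml.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_name, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin name matches the includes regex.

    Assumption: the expression is applied to the full directory name,
    not to a substring of it.
    """
    return re.fullmatch(includes, plugin_name) is not None


if __name__ == "__main__":
    # Print which of a few plugin directory names the expression selects.
    for name in ["protocol-httpclient", "protocol-http", "parse-tika", "index-more"]:
        print(name, is_included(name))
```

Under that assumption, this value selects protocol-httpclient but not
protocol-http, which is the combination suggested for HTTPS in the
description text.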



-- 
Remi Tassing
