Hello, could a firewall be blocking the request, or was there a temporary
network issue? We have no trouble fetching that site with Nutch.
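
One way to tell a network problem from a Nutch problem is to request the URL
once from the same machine that runs the crawl. The following is only a
minimal sketch in Python, not anything Nutch ships: the function name `probe`,
its parameters, and the default agent string are made up for illustration. It
connects directly (the nutch-site.xml quoted below configures no proxy) and
reports whether the server answered at all. It does not use Commons
HttpClient, so a success here does not prove that protocol-httpclient will
behave the same way.

```python
import http.client
import urllib.error
import urllib.request

# Direct connections only: the nutch-site.xml quoted below sets no proxy,
# so proxy environment variables are deliberately ignored here.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, user_agent="connectivity-check", timeout=10.0):
    """Request `url` once; return (server_answered, detail) without raising."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with _OPENER.open(request, timeout=timeout) as response:
            return True, "HTTP %d" % response.status
    except urllib.error.HTTPError as err:
        # The server did answer, just with an error status (4xx/5xx).
        return True, "HTTP %d" % err.code
    except (OSError, http.client.HTTPException) as err:
        # DNS failure, refused or reset connection, TLS error, timeout,
        # or a connection closed before any response was sent.
        return False, "%s: %s" % (type(err).__name__, err)
```

Calling `probe("https://datafireball.com/")` on the crawl host and getting
`(False, ...)` would point towards a firewall or network issue. Getting
`(True, ...)` would suggest looking at the Nutch side instead, for example by
passing the configured http.agent.name as `user_agent` to see whether the
server treats that agent string differently.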
Markus
 
 
-----Original message-----
> From:Bin Wang <binwang...@gmail.com>
> Sent: Tuesday 12th April 2016 21:41
> To: Apache.Nutch.User <user@nutch.apache.org>
> Subject: HTTPS Problem even using httpclient
> 
> Hi there,
> 
> I am testing Nutch against a blog. https://datafireball.com/
> 
> I added the link to seed.txt and left the regex-urlfilter unchanged. I
> replaced protocol-http with protocol-httpclient, expecting that to make
> Nutch capable of fetching HTTPS links. However, the crawl command failed
> with the following error:
> 
> $ bin/crawl urls/ crawldir 3
> 
> fetcher.maxNum.threads can't be < than 50 : using 50 instead
> robots.txt whitelist not configured.
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0,
> fetchQueues.getQueueCount=1
> fetch of https://datafireball.com/ failed with:
> org.apache.commons.httpclient.NoHttpResponseException: The server
> datafireball.com failed to respond
> Thread FetcherThread has no more work available
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0,
> fetchQueues.getQueueCount=0
> -activeThreads=0
> 
> I am fairly sure the blog was up and responding normally at the time, but
> I could not find much help online.
> 
> Can anyone give me some guidance?
> 
> Below is the nutch-site.xml that I was using.
> 
> Best regards,
> 
> Bin
> 
> 
> 
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36</value>
>   </property>
>   <property>
>     <name>db.ignore.internal.links</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   </property>
>   <property>
>     <name>http.content.limit</name>
>     <value>-1</value>
>   </property>
>   <property>
>     <name>fetcher.server.delay</name>
>     <value>0</value>
>   </property>
>   <property>
>     <name>http.redirect.max</name>
>     <value>5</value>
>   </property>
>   <property>
>     <name>db.max.anchor.length</name>
>     <value>1000</value>
>   </property>
> </configuration>
> 
