Hi there,

I am testing Nutch against a blog: https://datafireball.com/

I added the link to seed.txt and left regex-urlfilter.txt unchanged. I
replaced protocol-http with protocol-httpclient in plugin.includes, expecting
that to make Nutch capable of fetching https links. However, the fetch failed
with the following error when I ran the crawl command:

$ bin/crawl urls/ crawldir 3

fetcher.maxNum.threads can't be < than 50 : using 50 instead
robots.txt whitelist not configured.
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
fetch of https://datafireball.com/ failed with: org.apache.commons.httpclient.NoHttpResponseException: The server datafireball.com failed to respond
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0

I am fairly sure the blog itself is up and responding normally, but I could
not find much help for this error online.
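
For anyone who wants to confirm the site's availability independently of
Nutch, a direct HTTPS request is one option. The following is only a minimal
sketch, assuming Python 3 is available; the function name check_https, the
shortened User-Agent string, and the 10-second timeout are illustrative
choices, not anything taken from Nutch.

```python
import urllib.request


def check_https(url, opener=urllib.request.urlopen, timeout=10):
    """Send one GET request to url and return the HTTP status code.

    The opener parameter is a hypothetical hook so the function can be
    exercised without network access; by default it is urllib's urlopen.
    """
    # Assumption: a browser-like User-Agent, since some hosts reject unknown agents.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # urlopen raises URLError/HTTPError on connection or HTTP failures,
    # so reaching the return line means the server answered.
    with opener(request, timeout=timeout) as response:
        return response.status
```

Calling check_https("https://datafireball.com/") from the machine that runs
Nutch would show whether the server answers a plain HTTPS request there, which
helps separate a network or TLS problem from a Nutch configuration problem.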

Can anyone give me some guidance?

Below is the nutch-site.xml that I was using.

Best regards,

Bin



<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36</value>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>0</value>
</property>

<property>
  <name>http.redirect.max</name>
  <value>5</value>
</property>

<property>
  <name>db.max.anchor.length</name>
  <value>1000</value>
</property>

</configuration>
