Hi all,
So I have been trying to crawl the following link,
"https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3", using
Nutch 1.7.
I somehow got it to work after switching to Unix, and it crawls http links perfectly.
After reading around, I found that, in order to crawl https links, we need
to add the following to nutch-site.xml:
"<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>"
I also changed the following properties in nutch-default.xml, giving each an
arbitrary value:
"<property>
<name>http.agent.name</name>
<value>Blah</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>Blah</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>"
After that I proceeded to crawl with the following command:
bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 10
The logs are present at the following link,
http://pastebin.com/e7JEcEjV
My stats show that only one link was crawled, and its min and max scores are both 1.
When I read the segment that was crawled, I got the following,
http://pastebin.com/D83D5BeX
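In case it helps, I got the stats and the segment dump with the standard
readdb/readseg tools, along the lines of,
bin/nutch readdb crawl/crawldb -stats
bin/nutch readseg -dump crawl/segments/<segment name> segdump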
I have also checked the website's robots.txt file. My friend is doing the
same thing using Nutch 1.2 on Windows, with the exact same changes as mine,
and it's working.
Hoping for a really quick reply, as this is urgent.
Regards,
Vincent Anup Kuri