Hi all,
So I have been trying to crawl the following link,
"https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3", using
Nutch 1.7.
I somehow got it to work after switching to Unix, and it crawls http links perfectly.
After reading around, I found that, in order to crawl https links, we need
to add the following to nutch-site.xml:
"<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>"
I also changed the following properties in nutch-default.xml, giving each an
arbitrary value:
"<property>
<name>http.agent.name</name>
<value>Blah</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>Blah</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>"
After that I proceeded to crawl with the following command:
bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 10
The logs are present at the following link,
http://pastebin.com/e7JEcEjV
My stats show that only one link was crawled, and its min and max scores are both 1.
When I read the segment that was crawled, I got the following,
http://pastebin.com/D83D5BeX
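In case it helps, I got the stats and the segment dump with the standard
readdb/readseg tools, along the lines of,
bin/nutch readdb crawl/crawldb -stats
bin/nutch readseg -dump crawl/segments/<segment name> segdump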
I have also checked the website's robots.txt file. My friend is doing the
same thing using Nutch 1.2 on Windows, with the exact same changes as mine,
and it's working.
Hoping for a really quick reply, as this is urgent.
Regards,
Vincent Anup Kuri