That's for the .asp file. When I use ParserChecker, it works perfectly:

bin/nutch org.apache.nutch.parse.ParserChecker "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3"
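To double-check which paths the site actually blocks, the robots.txt can be pulled and filtered straight from the shell. This is only a quick sketch (it assumes curl and grep are available on the crawl machine; the rules printed are whatever the site serves, so they may differ from the snippet quoted below):

curl -s https://intuitmarket.intuit.com/robots.txt | grep -iE '^(User-agent|Allow|Disallow):'

If the URL's path falls under a Disallow rule for User-agent: * (or for an agent named in http.robots.agents), the fetcher will skip it during the crawl regardless of the other settings.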
Regards,
Vincent Anup Kuri

-----Original Message-----
From: Canan GİRGİN [mailto:[email protected]]
Sent: Tuesday, July 09, 2013 2:19 PM
To: [email protected]
Subject: Re: Regarding crawling https links

I think the problem is the robots.txt: the robots.txt file [1] for this website disallows https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3

Disallow: /fsg/Home.asp

[1]: https://intuitmarket.intuit.com/robots.txt

On Tue, Jul 9, 2013 at 6:50 AM, Anup Kuri, Vincent <[email protected]> wrote:
> Hi all,
>
> I have been trying to crawl the following link,
> "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3",
> using Nutch 1.7. I somehow got it to work after switching to Unix; it
> crawls http links perfectly. After reading around, I found that, in order
> to crawl https links, we need to add the following to nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
>
> I also changed the following in nutch-default.xml, giving an arbitrary
> value to each property:
>
> <property>
>   <name>http.agent.name</name>
>   <value>Blah</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your organization.
>
>   NOTE: You should also check other related properties:
>
>   http.robots.agents
>   http.agent.description
>   http.agent.url
>   http.agent.email
>   http.agent.version
>
>   and set their values appropriately.
>   </description>
> </property>
>
> <property>
>   <name>http.robots.agents</name>
>   <value>Blah</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
>
> After that, I started the crawl with the following command:
>
> bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 10
>
> The logs are at the following link:
>
> http://pastebin.com/e7JEcEjV
>
> My stats show that only one link was crawled, whose min and max scores
> are both 1. When I read the segment that was crawled, I got the following:
>
> http://pastebin.com/D83D5BeX
>
> I have also checked the website's robots.txt file. My friend is doing the
> same thing, but using Nutch 1.2 on Windows, with the exact same changes
> as mine, and it's working.
>
> Hoping for a really quick reply as this is urgent.
>
> Regards,
> Vincent Anup Kuri
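For inspecting what the crawl actually fetched, Nutch's own read tools can be run against the crawl directory. A minimal sketch, assuming the layout produced by the crawl command quoted above (crawl/crawldb and crawl/segments/<timestamp>, where <timestamp> is a placeholder to be filled in from your own run):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readseg -dump crawl/segments/<timestamp> segdump

The first command prints the URL counts and min/max/avg scores referred to above; the second dumps a segment's fetched content and parse data into the segdump directory, so you can see exactly which URLs were fetched and with what status.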

