How can I make Nutch ignore the robots.txt file?

Regards,
Vincent Anup Kuri
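As far as I know there is no configuration property in Nutch 1.x that turns
robots.txt handling off; the workaround usually mentioned is to patch the
robots handling in the HTTP protocol plugin so that it always returns an
allow-all rule set, and then rebuild. A rough sketch follows -- the file path,
method signature and EMPTY_RULES constant are taken from a Nutch 1.x source
tree and should be verified against your own copy before relying on them (and
ignoring robots.txt is of course discouraged on sites you do not control):

// File (assumed path): src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
// Unsupported workaround sketch: short-circuit robots.txt handling so every URL is allowed.
public BaseRobotRules getRobotRulesSet(Protocol http, URL url) {
  // EMPTY_RULES is the allow-all rule set defined in RobotRulesParser;
  // returning it here means robots.txt is never fetched or consulted.
  return EMPTY_RULES;
}

After the change, rebuild with ant and make sure the rebuilt runtime is the one
your crawl actually uses.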
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, July 09, 2013 3:46 PM
To: [email protected]
Subject: RE: Regarding crawling https links

That's because the checker tools do not use robots.txt.

-----Original message-----
> From: Anup Kuri, Vincent <[email protected]>
> Sent: Tuesday 9th July 2013 12:14
> To: [email protected]
> Subject: RE: Regarding crawling https links
>
> That's for the asp file. When I used the ParserChecker, it works perfectly:
>
> bin/nutch org.apache.nutch.parse.ParserChecker "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3"
>
> Regards,
> Vincent Anup Kuri
>
> -----Original Message-----
> From: Canan GİRGİN [mailto:[email protected]]
> Sent: Tuesday, July 09, 2013 2:19 PM
> To: [email protected]
> Subject: Re: Regarding crawling https links
>
> I think the problem is about robots.txt: the robots.txt file [1] for this
> website denies https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3
>
> Disallow: /fsg/Home.asp
>
> [1]: https://intuitmarket.intuit.com/robots.txt
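To see what those rules actually resolve to for that URL, a small standalone
check with crawler-commons (the robots.txt parser shipped with Nutch 1.7) can
help. Treat it as a sketch, not what Nutch does internally: the
parseContent/isAllowed calls are from crawler-commons, the agent name "Blah"
is the http.agent.name value from the configuration further down, and the
fetch here uses plain java.net rather than Nutch's protocol plugin.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
  public static void main(String[] args) throws Exception {
    String robotsUrl = "https://intuitmarket.intuit.com/robots.txt";
    String pageUrl =
        "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3";

    // Fetch robots.txt with plain java.net (Nutch itself goes through its protocol plugin).
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    InputStream in = new URL(robotsUrl).openStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buf.write(chunk, 0, n);
    }
    in.close();

    // "Blah" is the agent name configured in http.agent.name / http.robots.agents below.
    BaseRobotRules rules = new SimpleRobotRulesParser()
        .parseContent(robotsUrl, buf.toByteArray(), "text/plain", "Blah");

    System.out.println("allowed for Blah: " + rules.isAllowed(pageUrl));
  }
}

Whether the lower-case /fsg/home.aspx URL really matches the
Disallow: /fsg/Home.asp entry depends on the parser's case handling, which is
exactly what this check answers; running it with "*" as well as the configured
agent name shows whether robots rules are the likely reason the fetcher skips
the page.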
> On Tue, Jul 9, 2013 at 6:50 AM, Anup Kuri, Vincent <[email protected]> wrote:
>
> > Hi all,
> >
> > So I have been trying to crawl the following link,
> > "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3",
> > using Nutch 1.7. Somehow I got it to work after switching to Unix. It crawls
> > http links perfectly. After reading around, I found that, in order to crawl
> > https links, we need to add the following to nutch-site.xml:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description>Regular expression naming plugin directory names to
> >   include. Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin. By
> >   default Nutch includes crawling just HTML and plain text via HTTP,
> >   and basic indexing and search plugins. In order to use HTTPS please enable
> >   protocol-httpclient, but be aware of possible intermittent problems with the
> >   underlying commons-httpclient library.
> >   </description>
> > </property>
> >
> > I also changed the following in nutch-default.xml, giving some
> > arbitrary value to each property:
> >
> > <property>
> >   <name>http.agent.name</name>
> >   <value>Blah</value>
> >   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
> >   please set this to a single word uniquely related to your organization.
> >
> >   NOTE: You should also check other related properties:
> >
> >     http.robots.agents
> >     http.agent.description
> >     http.agent.url
> >     http.agent.email
> >     http.agent.version
> >
> >   and set their values appropriately.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.robots.agents</name>
> >   <value>Blah</value>
> >   <description>The agent strings we'll look for in robots.txt files,
> >   comma-separated, in decreasing order of precedence. You should
> >   put the value of http.agent.name as the first agent name, and keep the
> >   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> >   </description>
> > </property>
> >
> > After that I proceeded to crawl with the following command:
> >
> > bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 10
> >
> > The logs are available at the following link:
> >
> > http://pastebin.com/e7JEcEjV
> >
> > My stats show that only one link was crawled, whose min and max scores
> > are all 1. When I read the segment that was crawled, I got the following:
> >
> > http://pastebin.com/D83D5BeX
> >
> > I have also checked the robots.txt file of the website. My friend
> > is doing the same thing, but using Nutch 1.2 on Windows, with the
> > exact same changes as mine, and it's working.
> >
> > Hoping for a really quick reply as this is urgent.
> >
> > Regards,
> > Vincent Anup Kuri

