How can I make Nutch ignore the robots.txt file?

Regards,
Vincent Anup Kuri
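As far as I know there is no configuration property in Nutch 1.x that turns
robots.txt handling off; the workaround usually mentioned is to patch the
robots handling in the HTTP protocol plugin so that it always returns an
allow-all rule set, and then rebuild. A rough sketch follows -- the file path,
method signature and EMPTY_RULES constant are taken from a Nutch 1.x source
tree and should be verified against your own copy before relying on them (and
ignoring robots.txt is of course discouraged on sites you do not control):

// File (assumed path): src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
// Unsupported workaround sketch: short-circuit robots.txt handling so every URL is allowed.
public BaseRobotRules getRobotRulesSet(Protocol http, URL url) {
  // EMPTY_RULES is the allow-all rule set defined in RobotRulesParser;
  // returning it here means robots.txt is never fetched or consulted.
  return EMPTY_RULES;
}

After the change, rebuild with ant and make sure the rebuilt runtime is the one
your crawl actually uses.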
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, July 09, 2013 3:46 PM
To: [email protected]
Subject: RE: Regarding crawling https links

That's because the checker tools do not use robots.txt.

-----Original message-----
> From: Anup Kuri, Vincent <[email protected]>
> Sent: Tuesday 9th July 2013 12:14
> To: [email protected]
> Subject: RE: Regarding crawling https links
>
> That's for the asp file. When I used the ParserChecker, it works perfectly:
>
> bin/nutch org.apache.nutch.parse.ParserChecker "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3"
>
> Regards,
> Vincent Anup Kuri
>
> -----Original Message-----
> From: Canan GİRGİN [mailto:[email protected]]
> Sent: Tuesday, July 09, 2013 2:19 PM
> To: [email protected]
> Subject: Re: Regarding crawling https links
>
> I think the problem is about robots.txt: the robots.txt file [1] for this
> website denies https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3
>
> Disallow: /fsg/Home.asp
>
> [1]: https://intuitmarket.intuit.com/robots.txt
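To see what those rules actually resolve to for that URL, a small standalone
check with crawler-commons (the robots.txt parser shipped with Nutch 1.7) can
help. Treat it as a sketch, not what Nutch does internally: the
parseContent/isAllowed calls are from crawler-commons, the agent name "Blah"
is the http.agent.name value from the configuration further down, and the
fetch here uses plain java.net rather than Nutch's protocol plugin.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
  public static void main(String[] args) throws Exception {
    String robotsUrl = "https://intuitmarket.intuit.com/robots.txt";
    String pageUrl =
        "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3";

    // Fetch robots.txt with plain java.net (Nutch itself goes through its protocol plugin).
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    InputStream in = new URL(robotsUrl).openStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buf.write(chunk, 0, n);
    }
    in.close();

    // "Blah" is the agent name configured in http.agent.name / http.robots.agents below.
    BaseRobotRules rules = new SimpleRobotRulesParser()
        .parseContent(robotsUrl, buf.toByteArray(), "text/plain", "Blah");

    System.out.println("allowed for Blah: " + rules.isAllowed(pageUrl));
  }
}

Whether the lower-case /fsg/home.aspx URL really matches the
Disallow: /fsg/Home.asp entry depends on the parser's case handling, which is
exactly what this check answers; running it with "*" as well as the configured
agent name shows whether robots rules are the likely reason the fetcher skips
the page.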
> On Tue, Jul 9, 2013 at 6:50 AM, Anup Kuri, Vincent <[email protected]> wrote:
>
> > Hi all,
> >
> > So I have been trying to crawl the following link,
> > "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3",
> > using Nutch 1.7. Somehow I got it to work after switching to Unix. It crawls
> > http links perfectly. After reading around, I found that, in order to crawl
> > https links, we need to add the following to nutch-site.xml:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description>Regular expression naming plugin directory names to
> >   include. Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin. By
> >   default Nutch includes crawling just HTML and plain text via HTTP,
> >   and basic indexing and search plugins. In order to use HTTPS please enable
> >   protocol-httpclient, but be aware of possible intermittent problems with the
> >   underlying commons-httpclient library.
> >   </description>
> > </property>
> >
> > I also changed the following in nutch-default.xml, giving some
> > arbitrary value to each property:
> >
> > <property>
> >   <name>http.agent.name</name>
> >   <value>Blah</value>
> >   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
> >   please set this to a single word uniquely related to your organization.
> >
> >   NOTE: You should also check other related properties:
> >
> >     http.robots.agents
> >     http.agent.description
> >     http.agent.url
> >     http.agent.email
> >     http.agent.version
> >
> >   and set their values appropriately.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.robots.agents</name>
> >   <value>Blah</value>
> >   <description>The agent strings we'll look for in robots.txt files,
> >   comma-separated, in decreasing order of precedence. You should
> >   put the value of http.agent.name as the first agent name, and keep the
> >   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> >   </description>
> > </property>
> >
> > After that I proceeded to crawl with the following command:
> >
> > bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 10
> >
> > The logs are available at the following link:
> >
> > http://pastebin.com/e7JEcEjV
> >
> > My stats show that only one link was crawled, whose min and max scores
> > are all 1. When I read the segment that was crawled, I got the following:
> >
> > http://pastebin.com/D83D5BeX
> >
> > I have also checked the robots.txt file of the website. My friend
> > is doing the same thing, but using Nutch 1.2 on Windows, with the
> > exact same changes as mine, and it's working.
> >
> > Hoping for a really quick reply as this is urgent.
> >
> > Regards,
> > Vincent Anup Kuri

