That's for the .asp file. When I use ParserChecker, it works perfectly:

bin/nutch org.apache.nutch.parse.ParserChecker "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3"
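To double-check which paths the site actually blocks, the robots.txt can be pulled and filtered straight from the shell. This is only a quick sketch (it assumes curl and grep are available on the crawl machine; the rules printed are whatever the site serves, so they may differ from the snippet quoted below):

curl -s https://intuitmarket.intuit.com/robots.txt | grep -iE '^(User-agent|Allow|Disallow):'

If the URL's path falls under a Disallow rule for User-agent: * (or for an agent named in http.robots.agents), the fetcher will skip it during the crawl regardless of the other settings.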
Regards,
Vincent Anup Kuri

-----Original Message-----
From: Canan GİRGİN [mailto:[email protected]]
Sent: Tuesday, July 09, 2013 2:19 PM
To: [email protected]
Subject: Re: Regarding crawling https links

I think the problem is the robots.txt: the robots.txt file [1] for this website disallows https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3

Disallow: /fsg/Home.asp

[1]: https://intuitmarket.intuit.com/robots.txt

On Tue, Jul 9, 2013 at 6:50 AM, Anup Kuri, Vincent <[email protected]> wrote:
> Hi all,
>
> I have been trying to crawl the following link,
> "https://intuitmarket.intuit.com/fsg/home.aspx?page_id=152&brand=3",
> using Nutch 1.7. I somehow got it to work after switching to Unix; it
> crawls http links perfectly. After reading around, I found that, in order
> to crawl https links, we need to add the following to nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
>
> I also changed the following in nutch-default.xml, giving an arbitrary
> value to each property:
>
> <property>
>   <name>http.agent.name</name>
>   <value>Blah</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your organization.
>
>   NOTE: You should also check other related properties:
>
>   http.robots.agents
>   http.agent.description
>   http.agent.url
>   http.agent.email
>   http.agent.version
>
>   and set their values appropriately.
>   </description>
> </property>
>
> <property>
>   <name>http.robots.agents</name>
>   <value>Blah</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
>
> After that, I started the crawl with the following command:
>
> bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 10
>
> The logs are at the following link:
>
> http://pastebin.com/e7JEcEjV
>
> My stats show that only one link was crawled, whose min and max scores
> are both 1. When I read the segment that was crawled, I got the following:
>
> http://pastebin.com/D83D5BeX
>
> I have also checked the website's robots.txt file. My friend is doing the
> same thing, but using Nutch 1.2 on Windows, with the exact same changes
> as mine, and it's working.
>
> Hoping for a really quick reply as this is urgent.
>
> Regards,
> Vincent Anup Kuri
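For inspecting what the crawl actually fetched, Nutch's own read tools can be run against the crawl directory. A minimal sketch, assuming the layout produced by the crawl command quoted above (crawl/crawldb and crawl/segments/<timestamp>, where <timestamp> is a placeholder to be filled in from your own run):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readseg -dump crawl/segments/<timestamp> segdump

The first command prints the URL counts and min/max/avg scores referred to above; the second dumps a segment's fetched content and parse data into the segdump directory, so you can see exactly which URLs were fetched and with what status.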

