Hi,

You can test the robots parsing with:

./nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser \
    ~/testRobots.txt ~/testURL PHFAWS/Nutch-1.3

where testRobots.txt contains the robots.txt file that you want to test,
testURL contains the URLs to check (one per line), and the last argument is
your user agent.
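As a sketch, the two input files could look like this (file contents are illustrative, and /tmp is used here instead of the home directory mentioned above):

```shell
# Create a sample robots.txt to test (contents are illustrative).
cat > /tmp/testRobots.txt <<'EOF'
User-agent: PHFAWS/Nutch-1.3
Disallow:

User-agent: *
Disallow: /
EOF

# One URL per line to check against the rules.
cat > /tmp/testURL <<'EOF'
http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
EOF

# Then, from the Nutch bin directory:
# ./nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser \
#     /tmp/testRobots.txt /tmp/testURL PHFAWS/Nutch-1.3
```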

HTH

Julien



On 30 September 2011 21:15, Chip Calhoun <[email protected]> wrote:

> I've been able to run the ParserChecker now, but I'm not sure how to
> understand the results. Here's what I got:
>
> # bin/nutch org.apache.nutch.parse.ParserChecker http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
> ---------
> Url
> ---------------
> http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 1
>  outlink: toUrl: GR:32:A:128 anchor:
> Content Metadata: ETag="1fa962a-56f20-485df79c50980" Date=Fri, 30 Sep 2011 19:54:14 GMT Content-Length=356128 Last-Modified=Wed, 05 May 2010 21:26:14 GMT Content-Type=text/xml Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
> Parse Metadata: Content-Type=application/xml
>
> Curl also retrieves this file, and yet I can't get my crawl to pick it up.
>
> Could it be an issue with robots.txt? The robots file for this site reads
> as follows:
> User-agent: PHFAWS/Nutch-1.3
> Disallow:
>
> User-agent: archive.org_bot
> Disallow:
>
> User-agent: *
> Disallow: /
>
> That first user-agent is, as near as I can tell, what I'm sending. My log
> shows the following:
> 2011-09-30 15:54:17,712 INFO  http.Http - http.agent = PHFAWS/Nutch-1.3
> (American Institute of Physics: Physics History Finding Aids Web Site;
> http://www.aip.org/history/nbl/findingaids.html; [email protected])
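One thing worth checking here: many robots.txt parsers match a User-agent line against only the product token (the part before the "/"), so a rule written as "User-agent: PHFAWS/Nutch-1.3" may never match the crawler at all, letting the request fall through to the "User-agent: *" block. The effect can be seen with Python's urllib.robotparser, used purely as an illustration (Nutch's RobotRulesParser has its own matching rules, which may differ):

```python
# Illustration with Python's urllib.robotparser (NOT Nutch's parser --
# matching heuristics differ between implementations; this only shows
# the general product-token pitfall).
from urllib import robotparser

ROBOTS = """\
User-agent: PHFAWS/Nutch-1.3
Disallow:

User-agent: *
Disallow: /
"""

URL = "http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml"

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Python compares only the product token before "/", so the slash-bearing
# User-agent line never matches and the request falls through to "*".
print(rp.can_fetch("PHFAWS/Nutch-1.3", URL))   # False

# An entry naming just the product token does match:
rp2 = robotparser.RobotFileParser()
rp2.parse(ROBOTS.replace("PHFAWS/Nutch-1.3", "PHFAWS").splitlines())
print(rp2.can_fetch("PHFAWS/Nutch-1.3", URL))  # True
```

If something similar is happening in Nutch, a robots.txt line of "User-agent: PHFAWS" (or whatever agent name Nutch is configured to send) may be the safer form.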
>
> Can anyone tell what I'm missing? Thanks.
>
> Chip
>
>
> -----Original Message-----
> From: Chip Calhoun [mailto:[email protected]]
> Sent: Thursday, September 29, 2011 4:12 PM
> To: [email protected]
> Subject: RE: What could be blocking me, if not robots.txt?
>
> Ah, sorry. I had already deleted the local copy from my server (aip.org)
> to avoid clutter. So yeah, that will definitely 404 now.
>
> Curl retrieves the whole file with no problems. I can't try the
> ParserChecker today as I'm stuck away from my own machine, but I will try it
> tomorrow. The fact that I can curl it at least tells me this is a problem I
> need to fix in Nutch.
>
> Chip
>
> ________________________________________
> From: Markus Jelsma [[email protected]]
> Sent: Thursday, September 29, 2011 1:01 PM
> To: [email protected]
> Cc: Chip Calhoun
> Subject: Re: What could be blocking me, if not robots.txt?
>
> Oh, it's a 404. That makes sense.
>
> > Hi everyone,
> >
> > I'm using Nutch to crawl a few friendly sites, and am having trouble
> > with some of them. One site in particular has created an exception for
> > me in its robots.txt, and yet I can't crawl any of its pages. I've
> > tried copying the files I want to index (3 XML documents) to my own
> > server and crawling that, and it works fine that way; so something is
> > keeping me from indexing any files on this other site.
> >
> > I compared the logs of my attempt to crawl the friendly site with my
> > attempt to crawl my own site, and I've found few differences. Most
> > differences come from the fact that my own site requires a crawlDelay,
> > so there are many log sections along the lines of:
> >
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - * queue: http://www.aip.org
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   maxThreads    = 1
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   inProgress    = 0
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   crawlDelay    = 5000
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   minCrawlDelay = 0
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   nextFetchTime = 1317308262122
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   now           = 1317308257529
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
> >
> > That strikes me as probably irrelevant, but I figured I should mention it.
> > The main difference I see in the logs is that the crawl of my own site
> > (the crawl that worked) has the following two lines which do not
> > appear in the log of my failed crawl:
> >
> > 2011-09-29 10:57:50,497 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> > 2011-09-29 10:58:23,559 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> >
> > Also, while my successful crawl has three lines like the following, my
> > failed one only has two:
> >
> > 2011-09-29 10:58:44,824 WARN  regex.RegexURLNormalizer - can't find
> > rules for scope 'crawldb', using default
> >
> > Can anyone think of something I might have missed?
> >
> > Chip
>
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Thursday, September 29, 2011 1:00 PM
> To: [email protected]
> Cc: Chip Calhoun
> Subject: Re: What could be blocking me, if not robots.txt?
>
>
> > Hi everyone,
> >
> > I'm using Nutch to crawl a few friendly sites, and am having trouble
> > with some of them. One site in particular has created an exception for
> > me in its robots.txt, and yet I can't crawl any of its pages. I've
> > tried copying the files I want to index (3 XML documents) to my own
> > server and crawling that, and it works fine that way; so something is
> > keeping me from indexing any files on this other site.
> >
> > I compared the logs of my attempt to crawl the friendly site with my
> > attempt to crawl my own site, and I've found few differences. Most
> > differences come from the fact that my own site requires a crawlDelay,
> > so there are many log sections along the lines of:
> >
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - * queue: http://www.aip.org
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   maxThreads    = 1
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   inProgress    = 0
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   crawlDelay    = 5000
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   minCrawlDelay = 0
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   nextFetchTime = 1317308262122
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   now           = 1317308257529
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
> >
> > That strikes me as probably irrelevant, but I figured I should mention it.
> > The main difference I see in the logs is that the crawl of my own site
> > (the crawl that worked) has the following two lines which do not
> > appear in the log of my failed crawl:
> >
> > 2011-09-29 10:57:50,497 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> > 2011-09-29 10:58:23,559 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
>
> If this doesn't pop up when crawling the site, that means it's not fetched
> (properly). Can you try using the parser checker to download it? Can you
> curl it? The fetcher should throw an exception if there's trouble, but it
> may also be stopped by the http.content.limit setting.
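One detail from the ParserChecker output earlier in the thread: the response has Content-Length=356128, which is well above Nutch's default http.content.limit (65536 bytes), so the fetch may be getting truncated. A possible nutch-site.xml override (a sketch; the property name is standard Nutch configuration, the value here is illustrative, and -1 disables the limit):

```xml
<!-- Sketch of a nutch-site.xml override; -1 disables the content
     length limit (the exact default may vary by Nutch version). -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```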
>
> >
> > Also, while my successful crawl has three lines like the following, my
> > failed one only has two:
> >
> > 2011-09-29 10:58:44,824 WARN  regex.RegexURLNormalizer - can't find rules
> > for scope 'crawldb', using default
> >
> > Can anyone think of something I might have missed?
> >
> > Chip
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
