Ah, sorry. I had already deleted the local copy from my server (aip.org) to avoid clutter. So yeah, that will definitely 404 now.
Curl retrieves the whole file with no problems. I can't try the ParserChecker today as I'm stuck away from my own machine, but I will try it tomorrow. The fact that I can curl it at least tells me this is a problem I need to fix in Nutch.

Chip

________________________________________
From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Thursday, September 29, 2011 1:01 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

Oh, it's a 404. That makes sense.

> Hi everyone,
>
> I'm using Nutch to crawl a few friendly sites, and am having trouble with
> some of them. One site in particular has created an exception for me in
> its robots.txt, and yet I can't crawl any of its pages. I've tried copying
> the files I want to index (3 XML documents) to my own server and crawling
> that, and it works fine that way; so something is keeping me from indexing
> any files on this other site.
>
> I compared the logs of my attempt to crawl the friendly site with my
> attempt to crawl my own site, and I've found few differences. Most
> differences come from the fact that my own site requires a crawlDelay, so
> there are many log sections along the lines of:
>
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - maxThreads    = 1
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - inProgress    = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - crawlDelay    = 5000
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - minCrawlDelay = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - nextFetchTime = 1317308262122
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - now           = 1317308257529
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - 0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - 1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
>
> That strikes me as probably irrelevant, but I figured I should mention it.
> The main difference I see in the logs is that the crawl of my own site
> (the crawl that worked) has the following two lines which do not appear in
> the log of my failed crawl:
>
> 2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> application/xml, but they are not mapped to it in the parse-plugins.xml file
> 2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl:
> org.apache.nutch.crawl.MD5Signature
>
> Also, while my successful crawl has three lines like the following, my
> failed one only has two:
>
> 2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules
> for scope 'crawldb', using default
>
> Can anyone think of something I might have missed?
>
> Chip
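
P.S. For reference, when I'm back at my machine I plan to check the fetch and parse roughly like this (the URL below is just a stand-in for a document on the problem site, and the paths assume a stock Nutch 1.x install):

    # confirm the document is reachable and see what Content-Type the server reports
    curl -I http://example.org/path/to/finding-aid.xml

    # run Nutch's own fetch + parse on the single URL to see whether Tika picks it up
    bin/nutch org.apache.nutch.parse.ParserChecker http://example.org/path/to/finding-aid.xml

And if the missing application/xml mapping mentioned in that ParserFactory log line turns out to matter, I'd guess the fix would be adding something along these lines to conf/parse-plugins.xml (just a sketch, not tested yet):

    <mimeType name="application/xml">
      <plugin id="parse-tika" />
    </mimeType>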