You're the victim of the default regex URL filter (conf/regex-urlfilter.txt):

31      # skip URLs containing certain characters as probable queries, etc.
32      -[?*!@=] 
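If you actually want query-string URLs crawled, the usual workaround is to add an accept rule *before* the reject rule (rules are applied in order, first match wins). A sketch against the stock conf/regex-urlfilter.txt; the `abcnews` pattern below is just an example for this seed, adjust to taste:

```
# accept story URLs with a query string BEFORE the reject rule below
+^http://abcnews\.go\.com/.*\?id=\d+

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```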

The injector won't inject that URL because of the `?` in the query string. This 
can be tricky to diagnose indeed, as the filters don't log rejected URLs. 
Percent-encoding the whole URL doesn't help either, since the result is no 
longer a valid absolute URL.
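As a quick sanity check, the reject rule can be emulated outside Nutch. A minimal sketch (this mimics only the single `-[?*!@=]` rule, not the full filter chain):

```python
import re

# The stock rule "-[?*!@=]" rejects any URL containing one of these
# characters anywhere; a substring search is enough to emulate it.
REJECT = re.compile(r'[?*!@=]')

def passes_default_filter(url):
    """Return True if the URL would survive the stock reject rule."""
    return REJECT.search(url) is None

url = ("http://abcnews.go.com/Technology/"
       "google-chromebook-works-great-long-online/story?id=13850997")
print(passes_default_filter(url))  # the '?' in the query string trips the rule
```

To test a URL against your real, configured filter chain, Nutch also ships a checker tool (if memory serves, `bin/nutch org.apache.nutch.net.URLFilterChecker`), which reads URLs from stdin and prints whether each one passes.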

> Hello,
> 
> I've noticed that some URLs don't make it into my index. While debugging,
> I've created a seed file that has only one of them (
> http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/
> story?id=13850997) and tried to crawl for it on an empty crawldb. However I
> notice that already at the bin/nutch generate stage the script exits,
> reporting that there are no URLs to fetch. So it has nothing to do with
> parsing or fetching (we don't even reach the host yet). What could it be?
> I've tried encoding it into
> http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-lo
> ng-online%2Fstory%3Fid%3D13850997, but that didn't help.
> 
> STEPS TO REPRODUCE:
> 
> wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> unzip apache-nutch-1.3-src.zip
> ant
> cat > urls << __EOF__
> http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> __EOF__
> runtime/local/bin/nutch inject crawl urls
> runtime/local/bin/nutch generate crawl crawl/segs -topN 1
> # you will get the same even without -topN
> # Generator: 0 records selected for fetching, exiting ...
