You're the victim of the default regex URL filter (conf/regex-urlfilter.txt):

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

The injector won't inject that URL. This can be tricky indeed, as the filters don't log rejected URLs.

> Hello,
>
> I've noticed that some URLs don't make it into my index. While debugging, I
> created a seed file containing only one of them
> (http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> and tried to crawl it on an empty crawldb. However, I notice that already at
> the bin/nutch generate stage the script exits, reporting that there are no
> URLs to fetch. So it has nothing to do with parsing or fetching (we don't
> even reach the host yet). What could it be? I've tried encoding it as
> http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> but that didn't help.
>
> STEPS TO REPRODUCE:
>
> wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> unzip apache-nutch-1.3-src.zip
> ant
> cat > urls << __EOF__
> http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> __EOF__
> runtime/local/bin/nutch inject crawl urls
> runtime/local/bin/nutch generate crawl crawl/segs -topN 1  # even w/o -topN you will get the same
> # Generator: 0 records selected for fetching, exiting ...
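Since the filters don't log rejections, you can simulate the filter's decision outside Nutch. This is just an illustrative sketch with plain grep: the character class is copied from the default rule, but Nutch's RegexURLFilter actually applies Java regexes, so treat this as a quick check rather than the real mechanism.

```shell
# Sketch: reproduce the default rule -[?*!@=] locally with grep.
url='http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997'
if printf '%s\n' "$url" | grep -q '[?*!@=]'; then
  # The '?' (and '=') in 'story?id=13850997' match the class, so the rule fires.
  echo "rejected by -[?*!@=]"
else
  echo "accepted"
fi
```

If you do want to crawl query URLs, comment out (or reorder) that rule in runtime/local/conf/regex-urlfilter.txt and re-run the inject step on a fresh crawldb.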

