Did you rebuild the Nutch job file with the updated configuration?
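Not part of the original thread, but a hedged sketch of the usual fix (file names assume the stock Nutch 1.3 source tree): in deploy mode Nutch reads its configuration out of the .job artifact, so edits under conf/ only reach the server after that artifact is rebuilt with ant and redeployed. To let query-style URLs through at all, the default reject rule in conf/regex-urlfilter.txt can be relaxed before rebuilding:

```text
# conf/regex-urlfilter.txt (sketch, not the shipped default):
# drop '?' and '=' from the reject class so URLs carrying query
# strings such as story?id=13850997 are no longer filtered out
-[*!@]
# then rerun `ant` so the rebuilt job file (e.g.
# runtime/deploy/apache-nutch-1.3.job -- path is an assumption)
# carries the change to the server
```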
> You were right, and indeed fixing that it now works locally. However,
> trying it on the server it seems the configuration won't update. I'm not
> sure why! Where is that documented?
>
> On Mon, Jun 20, 2011 at 11:22 PM, Markus Jelsma <[email protected]> wrote:
>
> > You're the victim of the default regex url filter.
> >
> >   31 # skip URLs containing certain characters as probable queries, etc.
> >   32 -[?*!@=]
> >
> > The injector won't inject that URL. This can be tricky indeed as the
> > filters don't log rejected URLs.
> >
> > > Hello,
> > >
> > > I've noticed that some urls don't make it into my index. Debugging,
> > > I've created a seed file that has only one of them
> > > (http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> > > and tried to crawl for it on an empty crawldb. However, I notice that
> > > already at the bin/nutch generate stage the script exits, reporting
> > > that there are no urls to fetch. So it has nothing to do with parsing
> > > or fetching (we don't even reach the host yet). What could it be?
> > > I've tried encoding it as
> > > http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> > > but that didn't help.
> > >
> > > STEPS TO REPRODUCE:
> > >
> > > wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> > > unzip apache-nutch-1.3-src.zip
> > > ant
> > > cat > urls << __EOF__
> > > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> > > __EOF__
> > > runtime/local/bin/nutch inject crawl urls
> > > runtime/local/bin/nutch generate crawl crawl/segs -topN 1  # even w/o -topN you will get the same
> > > # Generator: 0 records selected for fetching, exiting ...
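The rejection can be checked outside Nutch: the quoted rule -[?*!@=] is just a character class, and the seed URL's query string contains a '?'. A quick grep sketch (plain shell, nothing Nutch-specific) reproduces the match:

```shell
# The default rule -[?*!@=] rejects any URL containing one of ? * ! @ =.
# grep with the same character class flags the seed URL from the report:
url='http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997'
if printf '%s\n' "$url" | grep -q '[?*!@=]'; then
  echo "rejected: URL matches [?*!@=]"
else
  echo "accepted"
fi
```

A URL without any of those characters (e.g. a plain article path) passes the same check, which is why only the query-string seeds vanish silently at inject time.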

