Did you rebuild the Nutch job file with the updated configuration?
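
(If it helps, the rejection itself is easy to reproduce outside Nutch. Below is a minimal sketch, in plain Python rather than Nutch's own filter plugin, of what the stock `-[?*!@=]` rule does to your URL:)

```python
import re

# Default rule from conf/regex-urlfilter.txt: skip URLs containing
# characters that usually indicate queries, sessions, etc.
probable_query = re.compile(r"[?*!@=]")

url = ("http://abcnews.go.com/Technology/"
       "google-chromebook-works-great-long-online/story?id=13850997")

# The '?' (and '=') in the query string trips the rule, so the
# injector silently drops the URL before generate/fetch ever run.
print(bool(probable_query.search(url)))  # True -> URL is rejected
```

Any URL carrying a query string will match this character class, which is why percent-encoding the seed didn't help either: the `%` survives but the filter still sees the literal URL after normalization.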

> You were right; after fixing that it now works locally. However, when I try
> it on the server the configuration doesn't seem to update. I'm not sure why!
> Where is that documented?
> 
> On Mon, Jun 20, 2011 at 11:22 PM, Markus Jelsma
> 
> <[email protected]> wrote:
> > You're the victim of the default regex url filter:
> > 
> > # skip URLs containing certain characters as probable queries, etc.
> > -[?*!@=]
> > 
> > The injector won't inject that URL. This can be tricky indeed, as the
> > filters don't log rejected URLs.
> > 
> > > Hello,
> > > 
> > > I've noticed that some URLs don't make it into my index. While debugging,
> > > I've created a seed file that contains only one of them
> > > (http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> > > and tried to crawl it on an empty crawldb. However, I notice that already
> > > at the bin/nutch generate stage the script exits, reporting that there
> > > are no URLs to fetch. So it has nothing to do with parsing or fetching
> > > (we don't even reach the host yet). What could it be?
> > > 
> > > I've tried encoding it as
> > > http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> > > but that didn't help.
> > > 
> > > STEPS TO REPRODUCE:
> > > 
> > > wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> > > unzip apache-nutch-1.3-src.zip
> > > ant
> > > cat > urls << __EOF__
> > > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> > > __EOF__
> > > runtime/local/bin/nutch inject crawl urls
> > > runtime/local/bin/nutch generate crawl crawl/segs -topN 1 # even w/o -topN you will get the same
> > > # Generator: 0 records selected for fetching, exiting ...
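
For the archives: one workaround, assuming you genuinely want query-string URLs crawled, is to relax that rule in conf/regex-urlfilter.txt. On a deployed cluster you then need to rebuild the job file (via ant) so the new conf is packed into it; editing conf/ alone won't change what the running job sees. A sketch of the edited rule:

```
# conf/regex-urlfilter.txt (sketch): allow '?' and '=' so query URLs
# survive injection; still skip the other probable-query characters
-[*!@]

# accept anything else (the default catch-all rule; keep it last)
+.
```

Be aware this opens the door to session IDs and other query noise, so a tighter per-host rule may be preferable in production.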
