I tried that, but my code won't compile anymore. I was convinced the conf dir was external.
On Tue, Jun 21, 2011 at 8:16 PM, Markus Jelsma <[email protected]> wrote:

> Did you rebuild the Nutch job file with the updated configuration?
>
> > You were right, and after fixing that it now works locally. However,
> > when I try it on the server the configuration doesn't seem to update.
> > I'm not sure why! Where is that documented?
> >
> > On Mon, Jun 20, 2011 at 11:22 PM, Markus Jelsma <[email protected]> wrote:
> > > You're the victim of the default regex URL filter.
> > >
> > > # skip URLs containing certain characters as probable queries, etc.
> > > -[?*!@=]
> > >
> > > The injector won't inject that URL. This can be tricky indeed, as the
> > > filters don't log rejected URLs.
> > >
> > > > Hello,
> > > >
> > > > I've noticed that some URLs don't make it into my index. While
> > > > debugging, I created a seed file that has only one of them
> > > > (http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> > > > and tried to crawl it on an empty crawldb. However, I notice that
> > > > already at the bin/nutch generate stage the script exits, reporting
> > > > that there are no URLs to fetch. So it has nothing to do with parsing
> > > > or fetching (we don't even reach the host yet). What could it be?
> > > > I've tried encoding it into
> > > > http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> > > > but that didn't help.
> > > >
> > > > STEPS TO REPRODUCE:
> > > >
> > > > wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> > > > unzip apache-nutch-1.3-src.zip
> > > > ant
> > > > cat > urls << __EOF__
> > > > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> > > > __EOF__
> > > > runtime/local/bin/nutch inject crawl urls
> > > > runtime/local/bin/nutch generate crawl crawl/segs -topN 1 #even w/o -topN you will get the same
> > > > # Generator: 0 records selected for fetching, exiting ...

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
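
For readers of the archive, the effect of the filter rule quoted in this thread can be reproduced outside Nutch. The following is only a rough sketch in Python, not Nutch's own code. It assumes rules are tried in order, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is dropped. Only the `-[?*!@=]` rule comes from the thread; the `+.` catch-all, the `RULES` list and the `accepts` helper are hypothetical names introduced for illustration.

```python
import re

# Only the "-[?*!@=]" rule is quoted in the thread. The "+." catch-all is an
# assumption about what the rest of the filter file might contain.
RULES = [
    ("-", re.compile(r"[?*!@=]")),  # reject probable query URLs
    ("+", re.compile(r".")),        # assumed: accept anything else
]


def accepts(url):
    """Return True if the first rule whose pattern occurs in url is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


SEED = ("http://abcnews.go.com/Technology/"
        "google-chromebook-works-great-long-online/story?id=13850997")

if __name__ == "__main__":
    # The seed URL contains "?" and "=", so the reject rule fires first.
    print(accepts(SEED))
    # A URL without any of the listed characters falls through to "+.".
    print(accepts("http://abcnews.go.com/Technology/"))
```

Under these assumptions the seed URL is rejected because of the `?` and `=` in its query string, which is consistent with the generator reporting 0 records selected.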

