You were right: after fixing that, it now works locally. However, when I try it on the server, the configuration doesn't seem to update, and I'm not sure why. Where is that documented?
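For the archive, here is a rough Python sketch of how I understand the rule quoted below to act on my seed URL. It is not Nutch code: the `accepts` helper and the trailing `+.` catch-all are my own illustration. I'm assuming that the first matching rule decides and that a URL matching no rule is dropped.

```python
import re

# Rules in the style of regex-urlfilter.txt: a leading '-' rejects, '+' accepts.
# The first rule is the one quoted in this thread; the catch-all '+.' is an
# assumption added for illustration.
RULES = [
    "-[?*!@=]",
    "+.",
]


def accepts(url, rules=RULES):
    """Hypothetical helper: True if the first matching rule accepts the URL."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is dropped.
    return False


seed = (
    "http://abcnews.go.com/Technology/"
    "google-chromebook-works-great-long-online/story?id=13850997"
)

print(accepts(seed))                # rejected: the URL contains '?' and '='
print(accepts(seed, rules=["+."]))  # accepted once the '-[?*!@=]' rule is gone
```

If that reading is right, removing or narrowing the `-[?*!@=]` line is what lets the seed through, which matches what I see locally.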
On Mon, Jun 20, 2011 at 11:22 PM, Markus Jelsma <[email protected]> wrote:

> You're the victim of the default regex url filter.
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> The injector won't inject that URL. This can be tricky indeed as the
> filters don't log rejected URLs.
>
> > Hello,
> >
> > I've noticed that some URLs don't make it into my index. While debugging,
> > I created a seed file that has only one of them
> > (http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> > and tried to crawl it on an empty crawldb. However, I notice that
> > already at the bin/nutch generate stage the script exits, reporting
> > that there are no URLs to fetch. So it has nothing to do with parsing
> > or fetching (we don't even reach the host yet). What could it be?
> > I've tried encoding it into
> > http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> > but that didn't help.
> >
> > STEPS TO REPRODUCE:
> >
> > wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> > unzip apache-nutch-1.3-src.zip
> > ant
> > cat > urls << __EOF__
> > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> > __EOF__
> > runtime/local/bin/nutch inject crawl urls
> > runtime/local/bin/nutch generate crawl crawl/segs -topN 1 #even w/o -topN you will get the same
> > # Generator: 0 records selected for fetching, exiting ...

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).

