Hello, I've noticed that some URLs don't make it into my index. To debug this, I created a seed file containing only one of them (http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997) and tried to crawl it starting from an empty crawldb. However, already at the bin/nutch generate stage the script exits, reporting that there are no URLs to fetch. So it has nothing to do with parsing or fetching (we don't even reach the host yet). What could it be? I've also tried encoding the URL as http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997, but that didn't help.
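One thing that may be worth ruling out is URL filtering at inject time, since the URL contains "?" and "=". The sketch below is only a stand-alone check, not Nutch code: it assumes the active regex-urlfilter.txt carries an exclusion rule of the form -[?*!@=] (that rule is an assumption here and should be confirmed against the copy under runtime/local/conf), and check_url is a hypothetical helper name.

```shell
#!/bin/sh
# Assumption: the active regex-urlfilter.txt contains an exclusion rule like
#   -[?*!@=]
# i.e. "skip URLs containing characters typical of query strings".
# Confirm this against runtime/local/conf/regex-urlfilter.txt before relying on it.

url='http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997'

# Hypothetical helper: prints "rejected" if the URL contains any character
# from the assumed exclusion class, "accepted" otherwise.
check_url() {
  if printf '%s\n' "$1" | grep -q '[?*!@=]'; then
    echo "rejected"
  else
    echo "accepted"
  fi
}

check_url "$url"                                # the seed URL from this report
check_url 'http://abcnews.go.com/Technology/'   # same host, no query string
```

If the seed URL comes back as "rejected" and such a rule is indeed present, the URL would probably be dropped before it ever reaches the crawldb, which would be consistent with generate selecting 0 records.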
STEPS TO REPRODUCE:

wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
unzip apache-nutch-1.3-src.zip
ant
cat > urls << __EOF__
http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
__EOF__
runtime/local/bin/nutch inject crawl urls
runtime/local/bin/nutch generate crawl crawl/segs -topN 1
#even w/o -topN you will get the same
# Generator: 0 records selected for fetching, exiting ...

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).

