Hello,

I've noticed that some URLs don't make it into my index. To debug this, I
created a seed file that contains only one of them (
http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
and tried to crawl it starting from an empty crawldb. However, already at
the bin/nutch generate stage the script exits, reporting that there are
no URLs to fetch. So it has nothing to do with parsing or fetching (we
don't even reach the host yet). What could it be?
I've tried percent-encoding the URL into
http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
but that didn't help.

STEPS TO REPRODUCE:

wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
unzip apache-nutch-1.3-src.zip
ant
cat > urls << __EOF__
http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
__EOF__
runtime/local/bin/nutch inject crawl urls
runtime/local/bin/nutch generate crawl crawl/segs -topN 1 # even w/o -topN you will get the same
# Generator: 0 records selected for fetching, exiting ...
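
Since the fetcher is never reached, one thing I can check outside Nutch is whether the URL filters would drop this URL before it is stored. The sketch below is only a rough approximation: it assumes the default conf/regex-urlfilter.txt contains the exclusion rule "-[?*!@=]" (skip URLs that look like queries), and url_hits_query_rule is a hypothetical helper of mine, not a Nutch command.

```shell
#!/bin/sh
# Assumption: the default conf/regex-urlfilter.txt contains the exclusion
# rule "-[?*!@=]". Check your own copy; this script does not read it.

# Hypothetical helper (not part of Nutch): succeeds if the URL contains
# any character from that rule's character class.
url_hits_query_rule() {
  printf '%s\n' "$1" | grep -q '[?*!@=]'
}

url='http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997'

if url_hits_query_rule "$url"; then
  echo "would be filtered: $url"
else
  echo "would pass: $url"
fi
```

If that rule is what rejects the URL, running runtime/local/bin/nutch readdb crawl -stats after the inject step should show whether the URL was stored in the crawldb at all.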

-- 
Regards,
K. Gabriele

