I tried that, but my code won't compile anymore. I was convinced the conf
dir was external.
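As a sanity check on the filter discussion below: the rejection can be reproduced outside Nutch with a plain grep stand-in for the default `-[?*!@=]` rule (this is a sketch of mine, not Nutch's actual filter code; the real rule lives in conf/regex-urlfilter.txt):

```shell
# Stand-in for the default rule in conf/regex-urlfilter.txt:
#   -[?*!@=]
# i.e. reject any URL containing ?, *, !, @ or =.
url='http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997'
if printf '%s\n' "$url" | grep -q '[?*!@=]'; then
  echo "rejected by -[?*!@=]"   # the '?' before id=13850997 matches
else
  echo "accepted"
fi
```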

On Tue, Jun 21, 2011 at 8:16 PM, Markus Jelsma
<[email protected]>wrote:

> Did you rebuild the Nutch job file with the updated configuration?
>
> > You were right: after fixing that, it now works locally. However, trying
> > it on the server, the configuration doesn't seem to update. I'm not sure
> > why! Where is that documented?
> >
> > On Mon, Jun 20, 2011 at 11:22 PM, Markus Jelsma
> >
> > <[email protected]>wrote:
> > > You're the victim of the default regex url filter:
> > >
> > >     # skip URLs containing certain characters as probable queries, etc.
> > >     -[?*!@=]
> > >
> > > The injector won't inject that URL. This can be tricky indeed, as the
> > > filters don't log rejected URLs.
> > >
> > > > Hello,
> > > >
> > > > I've noticed that some urls don't make it into my index. Debugging, I've
> > > > created a seed file that contains only one of them
> > > > (http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> > > > and tried to crawl for it on an empty crawldb. However, I notice that
> > > > already at the bin/nutch generate stage the script exits, reporting that
> > > > there are no urls to fetch. So it has nothing to do with parsing or
> > > > fetching (we don't even reach the host yet). What could it be?
> > > >
> > > > I've tried encoding it as
> > > > http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> > > > but that didn't help.
> > > >
> > > > STEPS TO REPRODUCE:
> > > >
> > > > wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> > > > unzip apache-nutch-1.3-src.zip
> > > > ant
> > > > cat > urls << __EOF__
> > > > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> > > > __EOF__
> > > > runtime/local/bin/nutch inject crawl urls
> > > > runtime/local/bin/nutch generate crawl crawl/segs -topN 1
> > > > # even without -topN you will get the same result
> > > > # Generator: 0 records selected for fetching, exiting ...
>
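For what it's worth, one way out of the filter trap, sketched here on a stand-in copy of the filter file: in a real checkout you would edit runtime/local/conf/regex-urlfilter.txt (and, for the deploy runtime, rebuild the job file with ant afterwards) and then re-run inject and generate. The file name below is just a local stand-in, not the real path:

```shell
# Sketch: disable the query-character rule so URLs containing '?' survive.
# Stand-in file; the real one is runtime/local/conf/regex-urlfilter.txt.
cat > regex-urlfilter.txt << '__EOF__'
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
__EOF__
# Comment the rule out rather than deleting it, so the default stays visible.
sed -i 's|^-\[?\*!@=\]|# -[?*!@=]|' regex-urlfilter.txt
grep -c '^#' regex-urlfilter.txt
```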



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
