Scheduler questions, 1.1 nightly build.

2010-04-22 Thread Phil Barnett

I'm having a problem where shouldFetch is rejecting everything. I have
deleted the crawl directory and started the entire crawl from scratch with:

rm -rf crawl
mkdir crawl
mkdir segments

I'm absolutely baffled by how this scheduler works.

Is there documentation?

Is the fetchtime saved somewhere other than the crawl database?

I have tried lowering

db.default.fetch.interval to 0
db.fetch.interval.default to many lower values
db.fetch.interval.max to different levels.

With those lowered, it crawls the top page over and over again. If I make
them a little larger, it rejects the top page.

I'd really like to see how this Tika parser works, but I can't get any web
pages into the crawl database.

The last thing I tried was to remove the entire /opt/nutch-1.1 directory and
start from scratch. It made no difference.

Is this a bug or am I doing something stupid?
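
The two symptoms described above (an interval of 0 refetches the top page forever, a slightly larger one rejects it) are what a plain "is the next fetch time in the past?" check would produce. The following is only a toy model of such a check, not Nutch's code. The function names next_fetch_time and should_fetch, the sample timestamps, and the use of seconds as the unit are all assumptions made for illustration.

```shell
#!/bin/sh
# Toy model of an interval-based fetch schedule. This is NOT Nutch's code;
# the function names and the use of seconds are assumptions for illustration.

# next_fetch_time LAST_FETCH INTERVAL -> prints when the page is next due
next_fetch_time() {
  echo $(( $1 + $2 ))
}

# should_fetch FETCH_TIME CUR_TIME -> prints "fetch" if due, else "reject"
should_fetch() {
  if [ "$1" -gt "$2" ]; then
    echo "reject"
  else
    echo "fetch"
  fi
}

now=1000                              # hypothetical current time, in seconds
due=$(next_fetch_time "$now" 0)       # interval 0: due again immediately
should_fetch "$due" "$now"            # prints "fetch"
due=$(next_fetch_time "$now" 3600)    # interval of one hour: due in the future
should_fetch "$due" "$now"            # prints "reject"
```

Under this model, a page fetched a moment ago is always rejected until its interval has elapsed, so a rejected top page would not by itself indicate a bug.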

Re: Scheduler questions, 1.1 nightly build.

2010-04-22 Thread Phil Barnett

I should add that what I really want to do is toss all previous crawl
information and reindex everything every night. It's just a few servers and
very low impact. My crawl on 1.0 takes about 10 minutes.
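
A nightly "toss everything and start over" step could be wrapped in a small helper like the one below. This is a minimal sketch under assumptions: the function name reset_crawl_dir is hypothetical, and the guard against empty or root paths is an extra precaution that is not part of the original commands.

```shell
#!/bin/sh
# Sketch of a nightly "start from scratch" helper. The function name and the
# safety guard are assumptions for illustration, not a tested setup.

# reset_crawl_dir DIR: delete DIR and everything in it, then recreate it empty
reset_crawl_dir() {
  dir=$1
  # Refuse an empty or root argument so a bad variable cannot wipe the wrong place.
  case "$dir" in
    ""|"/") echo "refusing to reset '$dir'" >&2; return 1 ;;
  esac
  rm -rf -- "$dir" && mkdir -p -- "$dir"
}
```

From a cron job, one would call reset_crawl_dir on the crawl directory and then run whatever crawl command is already in use.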

On Thu, Apr 22, 2010 at 4:59 AM, Phil Barnett ph...@philb.us wrote:

> I'm having a problem where shouldfetch is rejecting everything. I have
> deleted the crawl directory and started the entire crawl from scratch by
>
> rm -rf crawl
> mkdir crawl
> mkdir segments
>
> I'm absolutely baffled by how this scheduler works.
>
> Is there documentation?
>
> Is the fetchtime saved somewhere other than the crawl database?
>
> I have tried lowering
>
> db.default.fetch.interval to 0
> db.fetch.interval.default to many lower values
> db.fetch.interval.max to different levels.
>
> With those changed, it crawls the top page over and over again. I make them
> a little larger and it rejects the top page.
>
> I'd really like to see how this tika parser works, but I can't get any web
> pages into the crawl database.
>
> The last thing I tried was to remove the entire /opt/nutch-1.1 directory
> and start from scratch. It made no difference.
>
> Is this a bug or am I doing something stupid?