RE: [MASSMAIL]Nutch not indexing all seed URLs

Chip Calhoun Fri, 12 May 2017 07:39:04 -0700

Thank you. The problem was right below that; I had the default 
"timeLimitFetch=180", and it stopped after 3 hours. I'll bump that up to 
something ridiculous and try again.
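
For anyone reading this later, here is a minimal sketch of the arithmetic. It assumes the value of timeLimitFetch is in minutes, which matches the 3-hour cutoff I saw; the "hours" variable is only for illustration and is not part of bin/crawl.

```shell
# Value from bin/crawl as quoted in this thread; assumed to be in minutes.
timeLimitFetch=180

# Illustrative helper, not part of bin/crawl: convert the limit to hours.
hours=`expr $timeLimitFetch / 60`

echo "fetching stops after $hours hours"
```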

Chip

-----Original Message-----
From: Eyeris Rodriguez Rueda [mailto:[email protected]] 
Sent: Thursday, May 11, 2017 4:46 PM
To: [email protected]
Subject: Re: [MASSMAIL]Nutch not indexing all seed URLs

Hi.
One possible cause:
Have you looked at the topN (fetch list) parameter in the bin/crawl script 
(line 117)?
sizeFetchlist=`expr $numSlaves \* 50`
This number could limit your URL list.
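
As a rough sketch of how that line could cap a crawl round, assuming a single-node setup (numSlaves=1 is an assumption here) and using the 2522 seed URLs from the original question; seedCount is a hypothetical variable, not part of bin/crawl:

```shell
# Assumed single-node setup; check the value in your own copy of bin/crawl.
numSlaves=1

# Same expression as the line quoted above.
sizeFetchlist=`expr $numSlaves \* 50`

# Hypothetical variable: number of seed URLs reported in this thread.
seedCount=2522

if [ "$seedCount" -gt "$sizeFetchlist" ]; then
  echo "one round fetches at most $sizeFetchlist of $seedCount URLs"
fi
```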

Also check your filters.

Let me know if this solves the problem.

----- Original Message -----
From: "Chip Calhoun" <[email protected]>
To: [email protected]
Sent: Thursday, May 11, 2017 16:30:34
Subject: [MASSMAIL]Nutch not indexing all seed URLs

I'm using Nutch 1.12 to index a local site. To keep Nutch from indexing the 
uninteresting navigation pages on my site, I've made a seed list of all the 
URLs I want crawled; the current list is 2522 URLs. However, the indexer 
stopped after just 1077 of these URLs. My generate.max.count is set to -1. What 
would cause my URLs to be skipped?

Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD  20740-3840  USA
Tel: +1 301-209-3180
Email: [email protected]
https://www.aip.org/history-programs/niels-bohr-library

