Re: [MASSMAIL]Nutch not indexing all seed URLs

Eyeris Rodriguez Rueda Thu, 11 May 2017 13:46:21 -0700

Hi.
One possible cause:
Have you looked at the topN (fetch list size) parameter in the bin/crawl script (line 117)?
sizeFetchlist=`expr $numSlaves \* 50`
This number caps how many URLs are selected for fetching in each round, so it could be limiting your URL list.
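
To make the arithmetic concrete, here is a rough sketch of how many generate/fetch rounds such a cap would imply for a seed list of your size. It assumes a single node (numSlaves=1) and takes the multiplier as quoted above; multiplier, seedCount and rounds are names I made up for illustration, and your copy of bin/crawl may use different values.

```sh
#!/bin/sh
# Sketch only: estimate rounds needed for a seed list under a topN cap.
numSlaves=1      # assumption: single-node setup
multiplier=50    # as quoted above; check your own copy of bin/crawl
seedCount=2522   # size of the seed list from the original question

# Same expression style as the quoted bin/crawl line
sizeFetchlist=`expr $numSlaves \* $multiplier`

# Ceiling division: rounds needed if each round selects at most sizeFetchlist URLs
rounds=`expr \( $seedCount + $sizeFetchlist - 1 \) / $sizeFetchlist`

echo "topN per round: $sizeFetchlist"
echo "rounds needed for $seedCount URLs: $rounds"
```

If the number of rounds you pass to bin/crawl is smaller than this estimate, some seeds would never be selected, regardless of generate.max.count.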


Also check your URL filters.


Let me know whether this solves the problem.





----- Original message -----
From: "Chip Calhoun" <[email protected]>
To: [email protected]
Sent: Thursday, May 11, 2017 16:30:34
Subject: [MASSMAIL]Nutch not indexing all seed URLs

I'm using Nutch 1.12 to index a local site. To keep Nutch from indexing the 
uninteresting navigation pages on my site, I've made a URLs list of all the 
URLs I want crawled; the current list is 2522 URLs. However, the indexer 
stopped after just 1077 of these URLs. My generate.max.count is set to -1. What 
would cause my URLs to be skipped?

Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD  20740-3840  USA
Tel: +1 301-209-3180
Email: [email protected]
https://www.aip.org/history-programs/niels-bohr-library

