Hi Chip,

Another possible reason is that some websites declare in their robots.txt that crawlers are not allowed to access them. I had the same problem before.
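A quick way to check is to pull the site's robots.txt and look for Disallow rules that apply to your crawler. This is only a rough sketch: the host and agent name below are placeholders, and the agent should match whatever you set as http.agent.name in nutch-site.xml.

  # Placeholders: substitute your own site and crawler agent name.
  HOST="https://www.example.org"
  AGENT="my-nutch-crawler"

  # Dump the whole robots.txt first, then do a crude filter that prints the
  # Disallow rules near any User-agent section matching "*" or your agent.
  curl -s "$HOST/robots.txt"
  curl -s "$HOST/robots.txt" | grep -i -A 10 -E "User-agent: *(\*|$AGENT)" | grep -i "Disallow"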
Yongyao

On Fri, May 12, 2017 at 10:28 AM, Chip Calhoun <[email protected]> wrote:
> Thank you. The problem was right below that; I had the default
> "timeLimitFetch=180", and it stopped after 3 hours. I'll bump that up to
> something ridiculous and try again.
>
> Chip
>
> -----Original Message-----
> From: Eyeris Rodriguez Rueda [mailto:[email protected]]
> Sent: Thursday, May 11, 2017 4:46 PM
> To: [email protected]
> Subject: Re: [MASSMAIL]Nutch not indexing all seed URLs
>
> Hi.
> Maybe one cause: have you seen the topN (fetchlist) parameter inside the
> bin/crawl script (line 117)? The line sizeFetchlist=`expr $numSlaves \* 50`
> could be limiting your URL list.
>
> Also check your filters.
>
> Tell me if you have solved the problem.
>
> ----- Original Message -----
> From: "Chip Calhoun" <[email protected]>
> To: [email protected]
> Sent: Thursday, May 11, 2017 16:30:34
> Subject: [MASSMAIL]Nutch not indexing all seed URLs
>
> I'm using Nutch 1.12 to index a local site. To keep Nutch from indexing
> the uninteresting navigation pages on my site, I've made a list of all the
> URLs I want crawled; the current list is 2522 URLs. However, the indexer
> stopped after just 1077 of these URLs. My generate.max.count is set to -1.
> What would cause my URLs to be skipped?
>
> Chip Calhoun
> Digital Archivist
> Niels Bohr Library & Archives
> American Institute of Physics
> One Physics Ellipse
> College Park, MD 20740-3840 USA
> Tel: +1 301-209-3180
> Email: [email protected]
> https://www.aip.org/history-programs/niels-bohr-library

--
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University
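For anyone hitting this thread later: both knobs discussed above live in the stock Nutch 1.x bin/crawl script. The sketch below is illustrative only; the exact line numbers, defaults, and the fetchlist multiplier vary between releases, so check your own copy of the script.

  # From bin/crawl (Nutch 1.x); values shown are illustrative, not canonical.

  # Per-cycle fetch time limit in minutes. The stock default of 180 (3 hours)
  # is what cut the crawl short in this thread; raise it, or set it to -1 to
  # disable the limit (mirroring fetcher.timelimit.mins in nutch-default.xml).
  timeLimitFetch=-1

  # Cap on how many URLs the generator puts into each fetch list, derived from
  # the number of slaves. The multiplier differs between releases (the thread
  # quotes 50), so verify the value in your script.
  sizeFetchlist=`expr $numSlaves \* 50000`

  # These variables are then fed into the generate and fetch steps, roughly as:
  #   bin/nutch generate ... -topN $sizeFetchlist ...
  #   bin/nutch fetch -D fetcher.timelimit.mins=$timeLimitFetch ...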

