Hi Chip,

Another possible reason is that some websites declare in their robots.txt that crawlers are not allowed to access them. I had the same problem before.
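A quick way to check is to pull the site's robots.txt and look for Disallow rules that apply to your crawler. This is only a rough sketch: the host and agent name below are placeholders, and the agent should match whatever you set as http.agent.name in nutch-site.xml.

  # Placeholders: substitute your own site and crawler agent name.
  HOST="https://www.example.org"
  AGENT="my-nutch-crawler"

  # Dump the whole robots.txt first, then do a crude filter that prints the
  # Disallow rules near any User-agent section matching "*" or your agent.
  curl -s "$HOST/robots.txt"
  curl -s "$HOST/robots.txt" | grep -i -A 10 -E "User-agent: *(\*|$AGENT)" | grep -i "Disallow"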
Yongyao

On Fri, May 12, 2017 at 10:28 AM, Chip Calhoun <[email protected]> wrote:
> Thank you. The problem was right below that; I had the default
> "timeLimitFetch=180", and it stopped after 3 hours. I'll bump that up to
> something ridiculous and try again.
>
> Chip
>
> -----Original Message-----
> From: Eyeris Rodriguez Rueda [mailto:[email protected]]
> Sent: Thursday, May 11, 2017 4:46 PM
> To: [email protected]
> Subject: Re: [MASSMAIL]Nutch not indexing all seed URLs
>
> Hi.
> Maybe one cause: have you seen the topN (fetchlist) parameter inside the
> bin/crawl script (line 117)? The line sizeFetchlist=`expr $numSlaves \* 50`
> could be limiting your URL list.
>
> Also check your filters.
>
> Tell me if you have solved the problem.
>
> ----- Original Message -----
> From: "Chip Calhoun" <[email protected]>
> To: [email protected]
> Sent: Thursday, May 11, 2017 16:30:34
> Subject: [MASSMAIL]Nutch not indexing all seed URLs
>
> I'm using Nutch 1.12 to index a local site. To keep Nutch from indexing
> the uninteresting navigation pages on my site, I've made a list of all the
> URLs I want crawled; the current list is 2522 URLs. However, the indexer
> stopped after just 1077 of these URLs. My generate.max.count is set to -1.
> What would cause my URLs to be skipped?
>
> Chip Calhoun
> Digital Archivist
> Niels Bohr Library & Archives
> American Institute of Physics
> One Physics Ellipse
> College Park, MD 20740-3840 USA
> Tel: +1 301-209-3180
> Email: [email protected]
> https://www.aip.org/history-programs/niels-bohr-library

--
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University
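For anyone hitting this thread later: both knobs discussed above live in the stock Nutch 1.x bin/crawl script. The sketch below is illustrative only; the exact line numbers, defaults, and the fetchlist multiplier vary between releases, so check your own copy of the script.

  # From bin/crawl (Nutch 1.x); values shown are illustrative, not canonical.

  # Per-cycle fetch time limit in minutes. The stock default of 180 (3 hours)
  # is what cut the crawl short in this thread; raise it, or set it to -1 to
  # disable the limit (mirroring fetcher.timelimit.mins in nutch-default.xml).
  timeLimitFetch=-1

  # Cap on how many URLs the generator puts into each fetch list, derived from
  # the number of slaves. The multiplier differs between releases (the thread
  # quotes 50), so verify the value in your script.
  sizeFetchlist=`expr $numSlaves \* 50000`

  # These variables are then fed into the generate and fetch steps, roughly as:
  #   bin/nutch generate ... -topN $sizeFetchlist ...
  #   bin/nutch fetch -D fetcher.timelimit.mins=$timeLimitFetch ...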

