Hi Roannel,

After reviewing the URL filter configuration and the log, I found the
following evidence in the log file:

crawl.Injector - Injector: Total number of urls rejected by filters: 1413
crawl.Injector - Injector: Total number of urls after normalization: 25146

crawl.Generator - Generator: topN: 26559

So with these values it is not possible to conclude that the problem is
related to the URL filters: only 1413 URLs were rejected, and 25146 remained
after normalization.
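
As a quick sanity check of those numbers, here is a small Python sketch (plain
arithmetic, not Nutch code; the variable names are my own) that recomputes the
totals from the log lines above and the CrawlDb statistics quoted below:

```python
# Figures copied from the Injector/Generator log lines and the CrawlDb stats.
rejected_by_filters = 1413
after_normalization = 25146
generator_top_n = 26559

crawldb_total = 30914
status_counts = {
    "db_unfetched": 23912,
    "db_fetched": 6727,
    "db_gone": 8,
    "db_redir_temp": 266,
    "db_redir_perm": 1,
}

# URLs rejected by the filters plus URLs kept after normalization.
# This happens to match the topN value in the log; that is only an
# observation, not a claim about how topN was derived.
urls_seen_by_injector = rejected_by_filters + after_normalization

# The per-status counts should add up to "TOTAL urls".
status_sum = sum(status_counts.values())

rejected_share = rejected_by_filters / urls_seen_by_injector
fetched_share_of_injected = status_counts["db_fetched"] / after_normalization

print(f"rejected + kept = {urls_seen_by_injector} (topN = {generator_top_n})")
print(f"sum of statuses = {status_sum} (TOTAL urls = {crawldb_total})")
print(f"rejected by filters: {rejected_share:.1%}")
print(f"db_fetched / injected: {fetched_share_of_injected:.1%}")
```

The filters rejected only about 5% of the URLs, while db_fetched is roughly 27%
of the injected URLs, so the filters alone do not account for the gap.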

Do you have any other suggestion for solving this problem?

Thanks for your help.



2015-10-09 9:34 GMT-05:00 Roannel Fernández Hernández <[email protected]>:

> Hi Andres,
>
> Check your rules in the URL filters.
>
> Roannel
>
> ----- Original message -----
> > From: "Andrés Rincón Pacheco" <[email protected]>
> > To: [email protected]
> > Sent: Thursday, October 8, 2015 9:26:11
> > Subject: [MASSMAIL]Nutch only fetch and parse the third part of urls
> >
> > Hi,
> >
> > I am using Nutch 1.9. After reviewing the URLs added by the Injector, the
> > total number of URLs is 25146.
> > (Log evidence)
> > crawl.Injector - Injector: Total number of urls after normalization: 25146
> >
> > When I checked the log file, only 7003 URLs were fetched and 6727 URLs
> > were parsed.
> >
> > And these are the statistics:
> >
> > CrawlDb statistics start: ../crawlInfo/crawldb
> > Statistics for CrawlDb: ../crawlInfo/crawldb
> > TOTAL urls:     30914
> > retry 0:        30913
> > retry 1:        1
> > min score:      0.0
> > avg score:      0.4359605
> > max score:      100.002
> > status 1 (db_unfetched):        23912
> > status 2 (db_fetched):  6727
> > status 3 (db_gone):     8
> > status 4 (db_redir_temp):       266
> > status 5 (db_redir_perm):       1
> > CrawlDb statistics: done
> >
> > Why is only about a third of the URLs fetched and parsed?
> >
> > Thanks.
> >
