You can also set the expression via this configuration directive: https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L78
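Conceptually, the `-expr` option makes the Generator keep only those CrawlDb entries for which the expression evaluates to true. A minimal plain-Java sketch of that idea, without the real JEXL engine or Nutch classes (the maps, field names, and URLs below are illustrative stand-ins for a CrawlDatum's fields):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ExprFilterSketch {
    public static void main(String[] args) {
        // Each map plays the role of one CrawlDb entry's fields; the "status"
        // key mirrors the field referenced in the thread.
        List<Map<String, Object>> crawlDb = List.of(
                Map.of("url", "http://a.example/", "status", "db_unfetched"),
                Map.of("url", "http://b.example/", "status", "db_fetched"),
                Map.of("url", "http://c.example/", "status", "db_unfetched"));

        // Stand-in for the JEXL expression: status == "db_unfetched"
        Predicate<Map<String, Object>> expr =
                d -> "db_unfetched".equals(d.get("status"));

        // The Generator would only consider entries passing the expression.
        List<String> selected = crawlDb.stream()
                .filter(expr)
                .map(d -> (String) d.get("url"))
                .collect(Collectors.toList());

        System.out.println(selected); // [http://a.example/, http://c.example/]
    }
}
```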
Spider traps are not mitigated by topN. You could use scoring-depth to control the problem, but it has other drawbacks, such as being unable to go deep, e.g. following deep page overviews to find articles or forum threads.

-----Original message-----
> From: Eyeris Rodriguez Rueda <[email protected]>
> Sent: Tuesday 25th October 2016 20:25
> To: [email protected]
> Subject: Re: ***UNCHECKED*** [MASSMAIL]RE: generator conditional by crawldb status
>
> Thanks a lot, Markus, for your answer.
>
> For now I may need to use a JEXL expression, because I have too many
> unfetched documents and it is important for me to crawl them first.
> I have used the command (bin/crawl urls/ crawl/ 5).
> Can you tell me how to use the JEXL parameter? An example using the
> command would be appreciated.
>
> Later I will use my own custom scoring, perhaps dedicating a percentage of
> the topN parameter to CrawlDb status (unfetched) and the other percentage
> to normal scoring. This is to avoid traps.
> Thanks a lot.
>
> ----- Original message -----
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Tuesday, 25th October 2016 12:48:06
> Subject: ***UNCHECKED*** [MASSMAIL]RE: generator conditional by crawldb status
>
> Yes, you can, using -expr with a JEXL expression, e.g. -expr '(status ==
> "db_fetched")'
>
> Fields are here:
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
>
> But you can also achieve this with a custom scoring filter, which is a much
> more elegant solution. Take care of spider traps: if you prioritize unfetched
> pages unconditionally, you can easily fall into such a trap and never come
> out of it.
>
> -----Original message-----
> > From: Eyeris Rodriguez Rueda <[email protected]>
> > Sent: Tuesday 25th October 2016 18:34
> > To: [email protected]
> > Subject: generator conditional by crawldb status
> >
> > Hi all.
> > I am using Nutch 1.12 and Solr 4.10.3 with Linux Mint 18.
> > I want to crawl pages from the CrawlDb in this order:
> >
> > 1-unfetched
> > 2-modified
> > 3-gone
> > and others
> >
> > I know that the generator process is what decides which pages are
> > selected from the CrawlDb.
> > Any help or advice on crawling pages in that order would be appreciated.
> >
> > Greetings.
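The percent-split idea mentioned in the thread (dedicate part of topN to unfetched pages and the rest to normal scoring, so traps cannot monopolize the fetch list) could be sketched in plain Java like this. `Entry`, `select`, and the field names are hypothetical stand-ins, not Nutch APIs; a real implementation would live in a custom scoring filter:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SplitTopN {
    // Minimal stand-in for a CrawlDb entry.
    record Entry(String url, String status, float score) {}

    /** Pick topN urls: reserve `unfetchedShare` of the slots for the
     *  best-scoring unfetched pages, fill the rest purely by score. */
    static List<String> select(List<Entry> db, int topN, double unfetchedShare) {
        int reserved = (int) Math.round(topN * unfetchedShare);
        Comparator<Entry> byScore =
                Comparator.comparingDouble(Entry::score).reversed();

        Set<String> picked = new LinkedHashSet<>();
        // 1) fill the reserved slots with the best-scoring unfetched pages
        db.stream().filter(e -> "db_unfetched".equals(e.status()))
                .sorted(byScore).limit(reserved)
                .forEach(e -> picked.add(e.url()));
        // 2) fill the remaining slots by score alone, regardless of status
        for (Entry e : db.stream().sorted(byScore).toList()) {
            if (picked.size() >= topN) break;
            picked.add(e.url());
        }
        return new ArrayList<>(picked);
    }

    public static void main(String[] args) {
        List<Entry> db = List.of(
                new Entry("u1", "db_unfetched", 0.1f),
                new Entry("u2", "db_unfetched", 0.3f),
                new Entry("f1", "db_fetched", 0.9f),
                new Entry("f2", "db_fetched", 0.8f),
                new Entry("f3", "db_fetched", 0.2f));
        // topN = 4, half the slots reserved for unfetched pages
        System.out.println(select(db, 4, 0.5)); // [u2, u1, f1, f2]
    }
}
```

Unlike prioritizing unfetched pages unconditionally, this caps how many trap URLs (which stay perpetually unfetched) can enter each generated segment.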

