You can use this configuration directive to set the expression:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L78
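For example, a minimal sketch of the command-line route, using hypothetical crawldb/segment paths and assuming the JEXL context exposes the status name as a string (see the CrawlDatum link further down in this thread); the same expression can also be set through the property defined at that line instead of on the command line:

  # one generate cycle restricted to unfetched entries (sketch; verify flag names for your version)
  bin/nutch generate crawl/crawldb crawl/segments \
    -topN 50000 -expr 'status == "db_unfetched"'

As far as I can tell, the bin/crawl script does not pass -expr through, so for the cycles where the filter should apply you would run bin/nutch generate (and the fetch/parse/updatedb steps) yourself.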

Spider traps are not mitigated by topN. You could use the scoring-depth plugin 
to control the problem, but it has other drawbacks, such as being unable to go 
deep, e.g. to follow deep overview pages to find articles, forum threads, etc.
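If you still want to experiment with it, a rough nutch-site.xml sketch (plugin and property names as I remember them from the scoring-depth plugin; verify them against your 1.12 tree) could look like:

  <property>
    <name>plugin.includes</name>
    <!-- keep your existing list and append scoring-depth -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <property>
    <name>scoring.depth.max</name>
    <!-- outlinks deeper than this from the seeds are dropped, which is exactly the drawback mentioned above -->
    <value>3</value>
  </property>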



-----Original message-----
> From:Eyeris Rodriguez Rueda <[email protected]>
> Sent: Tuesday 25th October 2016 20:25
> To: [email protected]
> Subject: Re: ***UNCHECKED*** [MASSMAIL]RE: generator conditional by crawldb 
> status
> 
> Thanks a lot, Markus, for your answer.
> 
> For now I may need to use a JEXL expression, because I have too many 
> unfetched documents and it is important for me to crawl them first.
> I have used the command (bin/crawl urls/ crawl/ 5).
> Can you tell me how to use the JEXL parameter, please? An example using 
> the command would be appreciated.
> 
> Later I will use my own custom scoring, perhaps dedicating one percentage 
> of the topN parameter to crawldb status (unfetched) and the other 
> percentage to normal scoring. This is to avoid traps.
> Thanks a lot.
> 
> 
> ----- Original Message -----
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Tuesday, 25 October 2016 12:48:06
> Subject: ***UNCHECKED*** [MASSMAIL]RE: generator conditional by crawldb status
> 
> Yes, you can, using -expr with a JEXL expression, e.g. -expr '(status == 
> "db_fetched")'
> 
> Fields are here: 
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
> 
> But you can also achieve this using a custom scoring filter, which is a much 
> more elegant solution. Take care with spider traps: if you prioritize 
> unfetched pages unconditionally, you can easily fall into such a trap and not 
> come out of it.
>  
> -----Original message-----
> > From:Eyeris Rodriguez Rueda <[email protected]>
> > Sent: Tuesday 25th October 2016 18:34
> > To: [email protected]
> > Subject: generator conditional by crawldb status
> > 
> > Hi all.
> > I am using Nutch 1.12 and Solr 4.10.3 on Linux Mint 18.
> > I want to crawl pages from the crawldb in this order:
> > 
> > 1-unfetched 
> > 2-modified
> > 3-gone
> > and others
> > 
> > I know that the generator process is what decides which pages are selected 
> > from the crawldb.
> > Any help or advice on crawling pages in that order would be appreciated.
> > 
> > Greetings.
> > 
> 
