Re: UNCHECKED [MASSMAIL]RE: generator conditional by crawldb status

Eyeris Rodriguez Rueda Tue, 25 Oct 2016 11:25:55 -0700

Thanks a lot, Markus, for your answer.

For now I probably need to use a JEXL expression, because I have too many 
unfetched documents and it is important for me to crawl them first.
I have been using the command (bin/crawl urls/ crawl/ 5).
Can you tell me how to use the JEXL parameter? An example using that command 
would be appreciated.


Later I will use my own custom scoring, perhaps dedicating one percentage of 
the topN parameter to crawldb status (unfetched) and the remaining percentage 
to normal scoring. This is to avoid traps.
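
The split described here can be sketched outside Nutch as plain selection logic. This is an illustration only, not a Nutch scoring filter: the record layout (url, status, score), the function name select_top_n, and the share values are all assumptions made for the example.

```python
def select_top_n(records, top_n, unfetched_share=0.7):
    """Pick top_n URLs, reserving a share of the slots for db_unfetched entries.

    records is a list of (url, status, score) tuples (hypothetical layout).
    """
    by_score = sorted(records, key=lambda r: r[2], reverse=True)
    quota = int(top_n * unfetched_share)
    # First fill the reserved slots with the best-scoring unfetched entries.
    reserved = [r for r in by_score if r[1] == "db_unfetched"][:quota]
    chosen = set(r[0] for r in reserved)
    # Fill the remaining slots by score alone, regardless of status.
    rest = [r for r in by_score if r[0] not in chosen][:top_n - len(reserved)]
    return reserved + rest


# Made-up sample data for illustration.
records = [
    ("http://a.example/", "db_unfetched", 0.2),
    ("http://b.example/", "db_fetched", 0.9),
    ("http://c.example/", "db_unfetched", 0.5),
    ("http://d.example/", "db_notmodified", 0.7),
    ("http://e.example/", "db_unfetched", 0.1),
]

for url, status, score in select_top_n(records, top_n=4, unfetched_share=0.5):
    print(url, status, score)
```

Because the second pass ignores status, a run of low-value unfetched URLs can take at most the reserved share of each batch, which is the point of the split.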
Thanks a lot.








----- Original message -----
From: "Markus Jelsma" <[email protected]>
To: [email protected]
Sent: Tuesday, 25 October 2016 12:48:06
Subject: ***UNCHECKED*** [MASSMAIL]RE: generator conditional by crawldb status

Yes, you can, using -expr with a JEXL expression, e.g. -expr '(status == 
"db_fetched")'

Fields are here: 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
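
As for the example requested in this thread, a single generate step with such a filter might be assembled as in the minimal sketch below. It assumes the `bin/nutch generate <crawldb> <segments_dir>` form with `-topN` and the `-expr` option mentioned above, and that `status` holds the status name (for example `db_unfetched`). The paths `crawl/crawldb` and `crawl/segments`, the topN value, and the helper name `build_generate_command` are placeholders. It calls `bin/nutch generate` directly because the thread does not say whether `bin/crawl` passes the option through. The command is only built and printed here, not executed.

```python
import shlex


def build_generate_command(crawldb, segments_dir, status, top_n):
    """Assemble a `bin/nutch generate` call restricted to one CrawlDb status."""
    # JEXL compares with `==`; a single `=` would be an assignment.
    expr = '(status == "{}")'.format(status)
    return ["bin/nutch", "generate", crawldb, segments_dir,
            "-topN", str(top_n), "-expr", expr]


# Placeholder paths and topN value; adjust to the local crawl directory.
cmd = build_generate_command("crawl/crawldb", "crawl/segments",
                             "db_unfetched", 1000)

# Print a shell-quoted version that can be pasted into a terminal.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing `cmd` to `subprocess.run` from the Nutch runtime directory would run the step instead of printing it.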

But you can also achieve this using a custom scoring filter, which is a much 
more elegant solution. Beware of spider traps: if you prioritize unfetched 
pages unconditionally, you can easily fall into such a trap and not come out 
of it.
 
-----Original message-----
> From:Eyeris Rodriguez Rueda <[email protected]>
> Sent: Tuesday 25th October 2016 18:34
> To: [email protected]
> Subject: generator conditional by crawldb status
> 
> Hi all.
> I am using Nutch 1.12 and Solr 4.10.3 with Linux Mint 18.
> I want to crawl pages from the crawldb in this order:
> 
> 1-unfetched 
> 2-modified
> 3-gone
> and others
> 
> I know that the generator process is what decides which pages are selected 
> from the crawldb.
> Any help or advice on crawling pages in that order will be appreciated.
> 
> Greetings.
>
