I want to extract a random sample of URLs from my large crawldb. I think I 
should be able to do this using readdb -dump with a Jexl expression, but I 
haven't been able to get it to work.

I have tried several variations of the following command:
$NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/pop2/data/crawldb \
  -dump /crawls/pop2/data/crawldb/pruned/current \
  -format crawldb \
  -expr "((Math.random())>=0.1)"


Typically, it produces zero records. I know the expression is getting through 
to the CrawlDbReader (without quotes) because I get this message:
18/05/01 13:22:48 INFO crawl.CrawlDbReader: CrawlDb db: expr: ((Math.random())>=0.1)

Even when I use the expression "((Math.random())>=0.0)" I get zero output 
records.

If I use the expression "((Math.random())>=.99)" it lets all records pass 
through to the output. I guess it has something to do with the lack of a 
leading zero on the numeric constant.
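
As a sanity check on what these thresholds ought to do, here is a small Python 
sketch (not Jexl, and not Nutch code). It assumes the random call returns a 
uniform value in [0, 1) and that the comparison is evaluated once per record. 
The helper name pass_fraction is made up for illustration. Under those 
assumptions ">=0.0" should pass every record, ">=0.1" roughly 90%, and ">=0.99" 
roughly 1%, which is the opposite of what I am seeing.

```python
import random


def pass_fraction(threshold, trials=100_000, seed=0):
    """Estimate the fraction of records for which random() >= threshold holds.

    Hypothetical helper: simulates one uniform draw in [0, 1) per record.
    """
    rng = random.Random(seed)
    passed = sum(1 for _ in range(trials) if rng.random() >= threshold)
    return passed / trials


if __name__ == "__main__":
    for t in (0.0, 0.1, 0.99):
        print(f"random() >= {t}: about {pass_fraction(t):.2%} of records pass")
```

Note that to keep about 10% of the records rather than 90%, the comparison 
would need to be "< 0.1" (or ">= 0.9").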

Does anyone know a good way to extract a random sample of records from a 
crawlDb?
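
One fallback that avoids the expression filter entirely would be to sample 
outside Nutch. The sketch below is plain Python, not a Nutch feature. It 
assumes the records can be streamed one at a time, for example as lines of a 
text dump with one record per line (producing such a dump is not shown here). 
The function name reservoir_sample and the example.com URLs are hypothetical. 
It uses reservoir sampling, which returns a fixed-size uniform sample in a 
single pass without knowing the total record count in advance.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable.

    Only k records are held in memory at any time, so the input can be
    much larger than RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # evicting a uniformly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


if __name__ == "__main__":
    # Hypothetical input: stand-in URLs instead of a real crawldb dump.
    urls = (f"http://example.com/page{i}" for i in range(1000))
    for url in reservoir_sample(urls, 5, seed=1):
        print(url)
```

Unlike a per-record probability test, this yields exactly k records whenever 
the input has at least k.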
