Hello Michael, I would expect this to work as well. But since you mention that .99 works fine, did you try .1 too? Note that with >= a threshold of .1 should pass roughly 90% of the records; for a ~10% sample you would want >= .9 (or < .1). It seems the expression itself does work at some level, and since this is a Jexl-specific thing, you might want to ask on the Jexl list as well. I could not find an online Jexl parser to test this; one would be really helpful!
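On the sampling rule itself, here is a minimal sketch in plain Java of what such an expression is meant to do: keep each record independently when a uniform draw falls below a probability p. This is not Nutch or Jexl code; the class and method names (BernoulliSample, keep, countKept) are made up for illustration, and it says nothing about how Jexl parses a constant like .1.

```java
import java.util.Random;

// Hypothetical illustration only: not part of Nutch or Jexl.
final class BernoulliSample {

    /** Keeps a record when the uniform draw in [0, 1) is below p, so about a fraction p passes. */
    static boolean keep(Random rng, double p) {
        return rng.nextDouble() < p;
    }

    /** Counts how many of n simulated records pass the filter for a fixed seed. */
    static int countKept(int n, double p, long seed) {
        Random rng = new Random(seed);
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (keep(rng, p)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int n = 100_000;
        int below = countKept(n, 0.1, 42L);
        // "draw < 0.1" keeps about a tenth of the records.
        System.out.printf("draw <  0.1 keeps %d of %d%n", below, n);
        // "draw >= 0.1" is the complement, so it keeps everything else.
        System.out.printf("draw >= 0.1 keeps %d of %d%n", n - below, n);
    }
}
```

The point of the sketch is only the direction of the comparison: a test of the form "draw >= 0.1" is the complement of "draw < 0.1", so it keeps the large share rather than the small one.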
Regards,
Markus

-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Tuesday 1st May 2018 22:47
> To: User <[email protected]>
> Subject: random sampling of crawlDb urls
>
> I want to extract a random sample of URLS from my big crawldb. I think I
> should be able to do this using readdb -dump with a Jexl expression, but I
> haven't been able to get it to work.
>
> I have tried several variations of the following command.
> $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/pop2/data/crawldb -dump
> /crawls/pop2/data/crawldb/pruned/current -format crawldb -expr
> "((Math.random())>=0.1)"
>
>
> Typically, it produces zero records. I know the expression is getting through
> to the CrawlDbReader (without quotes) because I get this message:
> 18/05/01 13:22:48 INFO crawl.CrawlDbReader: CrawlDb db: expr:
> ((Math.random())>=0.1)
>
> Even when I use the expression "((Math.random())>=0.0)" I get zero output
> records.
>
> If I use the expression "((Math.random())>=.99)" it lets all records pass
> through to the output. I guess it has something to do with the lack of
> leading zero on the numeric constant.
>
> Does anyone know a good way to extract a random sample of records from a
> crawlDb?
>

