Ah crap, I got it wrong: >0.1 should return not 10% but 90% of the records.
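
To make the arithmetic concrete, here is a minimal sketch in plain Java (not Jexl, and not taken from Nutch) of what a filter like "random >= threshold" should do if the random value is drawn once per record. The class and method names are made up for illustration, and a seeded java.util.Random stands in for Math.random() so the run is repeatable.

```java
import java.util.Random;

class SampleFraction {
    // Fraction of n simulated records for which (random >= threshold) holds.
    // Assumption: one fresh random draw per record, uniform in [0, 1).
    static double keptFraction(double threshold, int n, long seed) {
        Random rng = new Random(seed); // seeded stand-in for Math.random()
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() >= threshold) {
                kept++;
            }
        }
        return (double) kept / n;
    }

    public static void main(String[] args) {
        // Print the observed pass rate for a few thresholds from this thread.
        for (double t : new double[] {0.0, 0.1, 0.9, 0.99}) {
            System.out.printf("threshold %.2f -> kept fraction %.4f%n",
                    t, keptFraction(t, 1_000_000, 42L));
        }
    }
}
```

With a draw per record, the kept fraction should come out close to 1 - threshold, so 0.1 keeps roughly 90% and .99 keeps roughly 1%. If the real output is all or nothing, that would be consistent with the random value being evaluated only once, though that is only a guess.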
If you could add debugging lines that emit the direct output of Math.random() and the result of the expression as well, we might learn more. Maybe Math.random() is evaluated just once; I have no idea how Jexl works under the hood.

Again, you might have more luck on the Jexl list, we just implemented it. And there could be a bug somewhere.

Hope you find some answers. Sorry to be of so little help.
Markus

-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Tuesday 1st May 2018 23:18
> To: [email protected]
> Subject: Re: RE: random sampling of crawlDb urls
>
> Just to clarify: .99 does NOT work fine. It should have rejected most of the
> records when I specified "((Math.random())>=.99)".
>
> I have used expressions not involving Math.random. For example, I can extract
> records above a specific score with "score>1.0". But the random thing doesn't
> work even though I have tried various thresholds.
>
> On Tuesday, May 1, 2018, 2:00:48 PM PDT, Markus Jelsma
> <[email protected]> wrote:
>
> Hello Michael,
>
> I would think this should work as well. But since you mention .99 works fine,
> did you try .1 as well to get ~10% output? It seems the expressions itself do
> work at some level, and since this is a Jexl specific thing, you might want
> to try the Jexl list as well. I could not find an online Jexl parser to test
> this question, it would be really helpful!
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Michael Coffey <[email protected]>
> > Sent: Tuesday 1st May 2018 22:47
> > To: User <[email protected]>
> > Subject: random sampling of crawlDb urls
> >
> > I want to extract a random sample of URLS from my big crawldb. I think I
> > should be able to do this using readdb -dump with a Jexl expression, but I
> > haven't been able to get it to work.
> >
> > I have tried several variations of the following command.
> > $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/pop2/data/crawldb -dump
> > /crawls/pop2/data/crawldb/pruned/current -format crawldb -expr
> > "((Math.random())>=0.1)"
> >
> >
> > Typically, it produces zero records. I know the expression is getting
> > through to the CrawlDbReader (without quotes) because I get this message:
> > 18/05/01 13:22:48 INFO crawl.CrawlDbReader: CrawlDb db: expr:
> > ((Math.random())>=0.1)
> >
> > Even when I use the expression "((Math.random())>=0.0)" I get zero output
> > records.
> >
> > If I use the expression "((Math.random())>=.99)" it lets all records pass
> > through to the output. I guess it has something to do with the lack of
> > leading zero on the numeric constant.
> >
> > Does anyone know a good way to extract a random sample of records from a
> > crawlDb?
> >

