Ah crap, I got it wrong: >0.1 should pass about 90% of the records, not 10%.
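
To double-check the arithmetic, here is a minimal sketch in plain Java (not Jexl and not Nutch code; the class and method names are made up for illustration). It assumes the random value is uniform on [0, 1) and drawn fresh for each record, and it uses a seeded java.util.Random in place of Math.random() so the run is repeatable:

```java
import java.util.Random;

public class ThresholdDemo {
    // Fraction of n simulated records for which "random >= threshold" holds.
    // Assumption: one fresh uniform draw in [0, 1) per record.
    public static double fractionKept(double threshold, int n, long seed) {
        Random rng = new Random(seed);
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() >= threshold) {
                kept++;
            }
        }
        return (double) kept / n;
    }

    public static void main(String[] args) {
        // The three thresholds tried in this thread.
        for (double t : new double[] {0.0, 0.1, 0.99}) {
            System.out.printf("threshold %.2f keeps %.3f of records%n",
                    t, fractionKept(t, 100_000, 42L));
        }
    }
}
```

If the expression behaved like this inside readdb, >=0.0 would keep everything, >=0.1 about 90%, and >=.99 about 1%.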

If you could add debugging lines that emit the direct output of Math.random()
and the result of the whole expression as well, we might learn more. Maybe
Math.random() is evaluated just once; I have no idea how Jexl works under the
hood.
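
To illustrate what I mean by "evaluated just once", here is a rough sketch in plain Java, again with hypothetical names and no Jexl involved. I do not know whether Jexl actually does this; it only shows how the output would look under each assumption:

```java
import java.util.Random;

public class EvalOnceDemo {
    // Assumption A: the random value is drawn again for every record.
    public static int keptPerRecord(double threshold, int n, Random rng) {
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() >= threshold) {
                kept++;
            }
        }
        return kept;
    }

    // Assumption B: the random value is drawn once and the same
    // true/false result is then applied to every record.
    public static int keptEvaluatedOnce(double threshold, int n, Random rng) {
        boolean pass = rng.nextDouble() >= threshold;
        return pass ? n : 0;
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("per record:     " + keptPerRecord(0.1, n, new Random()));
        System.out.println("evaluated once: " + keptEvaluatedOnce(0.1, n, new Random()));
    }
}
```

With a single evaluation the output is all or nothing, which resembles what you see, although it would not explain >=0.0 producing zero records, since that comparison is always true.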

Again, you might have more luck on the Jexl list; we just implemented it, and
there could be a bug somewhere.

Hope you find some answers. Sorry to be of so little help.
Markus

-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Tuesday 1st May 2018 23:18
> To: [email protected]
> Subject: Re: RE: random sampling of crawlDb urls
> 
> Just to clarify: .99 does NOT work fine. It should have rejected most of the 
> records when I specified "((Math.random())>=.99)".
>  
> I have used expressions not involving Math.random. For example, I can extract 
> records above a specific score with "score>1.0". But the random thing doesn't 
> work even though I have tried various thresholds.
> 
>     On Tuesday, May 1, 2018, 2:00:48 PM PDT, Markus Jelsma 
> <[email protected]> wrote:  
>  
>  Hello Michael,
> 
> I would think this should work as well. But since you mention .99 works fine, 
> did you try .1 as well to get ~10% output? It seems the expressions 
> themselves do work at some level, and since this is a Jexl-specific thing, 
> you might want to try the Jexl list as well. I could not find an online Jexl 
> parser to test this; one would be really helpful! 
> 
> Regards,
> Markus
> 
> -----Original message-----
> > From:Michael Coffey <[email protected]>
> > Sent: Tuesday 1st May 2018 22:47
> > To: User <[email protected]>
> > Subject: random sampling of crawlDb urls
> > 
> > I want to extract a random sample of URLs from my big crawldb. I think I 
> > should be able to do this using readdb -dump with a Jexl expression, but I 
> > haven't been able to get it to work.
> > 
> > I have tried several variations of the following command.
> > $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/pop2/data/crawldb -dump 
> > /crawls/pop2/data/crawldb/pruned/current -format crawldb -expr 
> > "((Math.random())>=0.1)"
> > 
> > 
> > Typically, it produces zero records. I know the expression is getting 
> > through to the CrawlDbReader (without quotes) because I get this message:
> > 18/05/01 13:22:48 INFO crawl.CrawlDbReader: CrawlDb db: expr: 
> > ((Math.random())>=0.1)
> > 
> > Even when I use the expression "((Math.random())>=0.0)" I get zero output 
> > records.
> > 
> > If I use the expression "((Math.random())>=.99)" it lets all records pass 
> > through to the output. I guess it has something to do with the lack of a 
> > leading zero on the numeric constant.
> > 
> > Does anyone know a good way to extract a random sample of records from a 
> > crawlDb?
> >   
