RE: random sampling of crawlDb urls

2018-05-01 Thread Yossi Tamari
Hi Michael, if you are using Nutch 1.14, there is a -sample parameter that lets you request a random sample. See https://issues.apache.org/jira/browse/NUTCH-2463. Yossi. > -----Original Message----- > From: Michael Coffey > Sent: 01 May 2018 23:47 > To: User

RE: RE: random sampling of crawlDb urls

2018-05-01 Thread Markus Jelsma
3:18 > To: user@nutch.apache.org > Subject: Re: RE: random sampling of crawlDb urls > > Just to clarify: .99 does NOT work fine. It should have rejected most of the > records when I specified "((Math.random())>=.99)". > > I have used expressions not involving M

Re: RE: random sampling of crawlDb urls

2018-05-01 Thread Michael Coffey
Just to clarify: .99 does NOT work fine. It should have rejected most of the records when I specified "((Math.random())>=.99)". I have used expressions not involving Math.random. For example, I can extract records above a specific score with "score>1.0". But the random thing doesn't work even
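
For reference, here is a minimal sketch of what a filter of the form "random >= threshold" would be expected to do if the random call is evaluated once per record. It does not use Nutch or Jexl at all. The class name SamplingSketch and the method keepFraction are hypothetical, and java.util.Random with a fixed seed stands in for Math.random only so the run is repeatable. It says nothing about why the expression fails inside the crawlDb reader.

```java
import java.util.Random;

public class SamplingSketch {
    // Fraction of n simulated records that pass the test "random >= threshold".
    // Hypothetical helper for illustration; not part of Nutch or Jexl.
    public static double keepFraction(double threshold, int n, long seed) {
        Random rng = new Random(seed); // stand-in for Math.random, seeded for repeatability
        int kept = 0;
        for (int i = 0; i < n; i++) {
            // nextDouble() is uniform in [0, 1), like Math.random()
            if (rng.nextDouble() >= threshold) {
                kept++;
            }
        }
        return (double) kept / n;
    }

    public static void main(String[] args) {
        // With threshold 0.99, roughly 1% of records should be kept.
        System.out.println(keepFraction(0.99, 100_000, 42L));
        // With threshold 0.10, roughly 90% should be kept (not 10%).
        System.out.println(keepFraction(0.10, 100_000, 42L));
    }
}
```

Under that assumption, a threshold of t keeps about (1 - t) of the records, so seeing most records survive at .99 suggests the expression is not being evaluated per record the way plain Java would evaluate it.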

RE: random sampling of crawlDb urls

2018-05-01 Thread Markus Jelsma
Hello Michael, I would think this should work as well. But since you mention .99 works fine, did you try .1 as well to get ~10% output? It seems the expressions themselves do work at some level, and since this is a Jexl-specific thing, you might want to try the Jexl list as well. I could not find