Re: Nutch fetching times out at 3 hours, not sure why.

2018-05-01 Thread Chip Calhoun
Hi Sebastian, Yes, that explains it! Now I wish I'd pasted my crawl command in the first place. I'll leave it alone for now, but if it becomes an issue again I know where to check. Thank you. Chip

RE: random sampling of crawlDb urls

2018-05-01 Thread Markus Jelsma
Hello Michael, I would think this should work as well. But since you mention .99 works fine, did you try .1 as well to get ~10% output? It seems the expressions themselves do work at some level, and since this is a Jexl-specific thing, you might want to try the Jexl list as well. I could not find

RE: random sampling of crawlDb urls

2018-05-01 Thread Yossi Tamari
Hi Michael, If you are using 1.14, there is a parameter -sample that allows you to request a random sample. See https://issues.apache.org/jira/browse/NUTCH-2463. Yossi.

Re: RE: random sampling of crawlDb urls

2018-05-01 Thread Michael Coffey
Just to clarify: .99 does NOT work fine. It should have rejected most of the records when I specified "((Math.random())>=.99)". I have used expressions not involving Math.random. For example, I can extract records above a specific score with "score>1.0". But the random thing doesn't work even

RE: RE: random sampling of crawlDb urls

2018-05-01 Thread Markus Jelsma
Ah crap, I got it wrong: >0.1 should get not 10% but 90% of the records. If you could add debugging lines that emit the direct output of Math.random() and the equation as well, we might learn more. Maybe Math.random() is evaluated just once; I have no idea how Jexl works under the hood. Again,
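To make the threshold arithmetic in this message concrete, here is a minimal Python sketch (not Jexl, and not Nutch code). It assumes a fresh uniform random number is drawn for every record, which is exactly the behaviour the thread is unsure Jexl provides. The function name kept_fraction and its parameters are hypothetical, chosen only for illustration.

```python
import random


def kept_fraction(threshold, n=100_000, seed=42):
    """Return the fraction of n simulated records that pass `random() >= threshold`.

    Assumption: one independent uniform draw in [0, 1) per record.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    kept = sum(1 for _ in range(n) if rng.random() >= threshold)
    return kept / n


if __name__ == "__main__":
    # A filter of the form random() >= t keeps roughly (1 - t) of the records.
    for t in (0.1, 0.9, 0.99):
        print(f"random() >= {t}: keeps about {kept_fraction(t):.1%}")
```

Under that assumption, a threshold of 0.1 keeps about 90% of records and 0.99 keeps about 1%. If the random value were instead evaluated only once for the whole dump, every record would be kept or every record rejected, so the output size alone can help tell the two behaviours apart.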

random sampling of crawlDb urls

2018-05-01 Thread Michael Coffey
I want to extract a random sample of URLs from my big crawldb. I think I should be able to do this using readdb -dump with a Jexl expression, but I haven't been able to get it to work. I have tried several variations of the following command. $NUTCH_HOME/runtime/deploy/bin/nutch readdb

ApacheCon North America 2018 schedule is now live.

2018-05-01 Thread Rich Bowen
Dear Apache Enthusiast, We are pleased to announce our schedule for ApacheCon North America 2018. ApacheCon will be held September 23-27 at the Montreal Marriott Chateau Champlain in Montreal, Canada. Registration is open! The early bird rate of $575 lasts until July 21, at which time it