querying crawldb

Michael Coffey Tue, 12 Sep 2017 18:45:50 -0700

Hello Nutchians,
I need to query a Nutch 1.x crawldb, read-only, to search, sort, and 
summarize entries by combinations of status, fetch_time, score, and similar 
fields. What is a good tool or process for this?
Up until now, I've been running readdb -dump and then processing the output 
with Python code that I wrote. But this is slow and clunky, and my code 
probably has bugs. Would Hive be a good tool for this?
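
To make the kind of query I mean concrete, here is a minimal sketch that loads dumped records into an in-memory SQLite table and then filters, sorts, and summarizes with SQL. It assumes the dump is available as CSV (some 1.x releases list a CSV format option for readdb -dump; check the usage message of your version). The column names url, status_name, fetch_time, and score, the sample rows, and the helper load_crawldb_csv are all hypothetical, not the actual Nutch dump layout.

```python
import csv
import io
import sqlite3

# Hypothetical sample data. The real column names and order depend on the
# Nutch version and dump format, so adjust the mapping below accordingly.
SAMPLE_CSV = """url,status_name,fetch_time,score
http://a.example/,db_fetched,2017-09-10T12:00:00,1.5
http://b.example/,db_unfetched,2017-09-12T08:30:00,0.2
http://c.example/,db_fetched,2017-09-11T09:15:00,0.7
"""


def load_crawldb_csv(text):
    """Load CSV text into an in-memory SQLite table named crawldb."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crawldb ("
        "url TEXT PRIMARY KEY, status TEXT, fetch_time TEXT, score REAL)"
    )
    rows = [
        (r["url"], r["status_name"], r["fetch_time"], float(r["score"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany("INSERT INTO crawldb VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_crawldb_csv(SAMPLE_CSV)

# Summarize: number of URLs per status.
counts = conn.execute(
    "SELECT status, COUNT(*) FROM crawldb GROUP BY status ORDER BY status"
).fetchall()

# Search and sort: fetched URLs, highest score first.
top = conn.execute(
    "SELECT url, score FROM crawldb "
    "WHERE status = 'db_fetched' ORDER BY score DESC"
).fetchall()

print(counts)
print(top)
```

For a real crawldb, the same table could be written to a file-backed SQLite database instead of ":memory:" so the dump only has to be parsed once. Whether this scales to a very large crawldb is an open question, which is partly why I am asking about Hive.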
