Post-filtering reduced results?

Calle Dybedahl Mon, 19 Sep 2011 04:28:25 -0700

Hello.

I have a pretty simple pair of map and reduce functions. The map function 
just emits a key and a 1, and the reduce is the built-in _sum function. This 
works fine and tells me how many times each key has been seen.
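
For concreteness, here is a rough local model of that view in Python. It is only a sketch of the behaviour, not the actual design document: the "tag" field and the sample documents are made up for illustration, and the real map function would be JavaScript running inside CouchDB.

```python
from collections import Counter

def map_doc(doc):
    # Stand-in for the view's map function: emit (key, 1) for each document.
    # The "tag" field is a hypothetical example, not the real document layout.
    yield (doc["tag"], 1)

def sum_reduce(pairs):
    # Stand-in for the built-in _sum reduce when queried per key:
    # add up the emitted values for every distinct key.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Made-up sample documents.
docs = [{"tag": "a"}, {"tag": "b"}, {"tag": "a"}, {"tag": "c"}, {"tag": "a"}]
counts = sum_reduce(pair for doc in docs for pair in map_doc(doc))
```

The result has one entry per distinct key, so the number of rows grows with the number of keys, not with how often they occur.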


Now, the problem is that I'm actually only interested in the handful of keys 
that have been seen the most often. The data fits a power-law distribution, 
which means that there is a long tail that I'm not at all interested in. And by 
"long" here I'm talking about tens of thousands of rows. At the moment, my 
client-side code spends more than 99.9% of its runtime receiving and parsing 
JSON from the CouchDB server, very nearly all of which it will promptly throw 
away as soon as it's been parsed. This is annoying and silly.
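
Roughly, the client-side step looks like the Python sketch below. It assumes the usual view response shape, a JSON object with a "rows" list of key/value objects. The function name and the sample numbers are invented for illustration.

```python
import heapq
import json

def top_keys(response_text, n):
    # Parse the complete view response, then keep only the n rows with the
    # largest counts. Every other row is parsed and immediately discarded.
    rows = json.loads(response_text)["rows"]
    return heapq.nlargest(n, rows, key=lambda row: row["value"])

# Made-up response body standing in for tens of thousands of rows.
sample = json.dumps({"rows": [
    {"key": "a", "value": 950},
    {"key": "b", "value": 1},
    {"key": "c", "value": 40},
    {"key": "d", "value": 1},
]})
top = top_keys(sample, 2)
```

The selection itself is cheap. The cost is in transferring and parsing all the rows that never make it into the result.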

Is there any way at all to filter the results of a reduced query on the CouchDB 
end? Alternatively, is there a way for a reduce function to know that it's the 
final stage in the re-reduce chain (if I could drop all keys with a final value 
of 1, I'd save an order of magnitude of runtime)?

I can't be the first one ever to run into a problem like this, but I've failed 
to find any solutions on the net.
-- 
Calle Dybedahl
[email protected] -*- +46 703 - 970 612
