Why I think view generation should be done concurrent.

Julian Moritz Sun, 04 Jul 2010 02:38:06 -0700

Hi,

A few days ago I tweeted a wish to have view generation done
concurrently. I'll tell you why (because @janl doesn't think so).


I've got some documents in the form of:

_id: 1,
_rev: 3-abc, 
url: http://www.abc.com,
hrefs: [http://www.xyz.com, 
        http://www.nbc.com,
        ...,
        ...,
        ...]

As you can imagine, since I'm crawling the web, I've got plenty of them,
with thousands more arriving every second. I've got a view; map.py is:

def fun(doc):
    # Emit one row per outgoing link, keyed by (hash, url) so that the
    # rows sort in a pseudo-random order.
    h = hash
    if "hrefs" in doc:
        for href in doc["hrefs"]:
            yield (h(href), href), None

reduce.py is:

def fun(key, value, rereduce):
    # The reduced value is irrelevant; only the keys matter.
    return True

If you can't read Python code: it generates a large list of unique,
pseudo-randomly ordered URLs. I query this view quite often (to get new
URLs to crawl).
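
For anyone who'd rather see it run, here is a rough sketch of what that
view boils down to, outside of CouchDB. It only simulates the idea: the
names map_fun, reduce_fun, query_grouped and DOCS are made up for this
example, query_grouped is a hypothetical stand-in for querying the view
with group=true, and the second and third documents are invented test
data. It is not how CouchDB builds its index internally.

```python
from itertools import groupby


def map_fun(doc):
    # Same logic as map.py: one row per outgoing link,
    # keyed by (hash, url) so rows sort pseudo-randomly.
    if "hrefs" in doc:
        for href in doc["hrefs"]:
            yield (hash(href), href), None


def reduce_fun(keys, values, rereduce):
    # Same as reduce.py: the reduced value is irrelevant.
    return True


def query_grouped(docs):
    # Hypothetical stand-in for a grouped view query:
    # sort all map rows by key, then reduce each run of equal keys.
    rows = sorted(
        (row for doc in docs for row in map_fun(doc)),
        key=lambda row: row[0],
    )
    result = []
    for key, group in groupby(rows, key=lambda row: row[0]):
        values = [value for _, value in group]
        result.append((key, reduce_fun([key] * len(values), values, False)))
    return result


# Invented sample data in the shape of the documents above.
DOCS = [
    {"_id": "1", "url": "http://www.abc.com",
     "hrefs": ["http://www.xyz.com", "http://www.nbc.com"]},
    {"_id": "2", "url": "http://www.xyz.com",
     "hrefs": ["http://www.nbc.com", "http://www.abc.com"]},
    {"_id": "3", "url": "http://www.nbc.com"},  # no hrefs: emits nothing
]

if __name__ == "__main__":
    # Each URL appears once, ordered by its hash.
    for (href_hash, href), _ in query_grouped(DOCS):
        print(href)
```

One caveat with this sketch: in Python 3 the built-in hash of a string is
randomized per process unless PYTHONHASHSEED is set, so the order differs
from run to run. Every URL still shows up exactly once.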

What is my problem now? My CouchDB process is at 100% CPU and the view
sometimes takes quite long to generate (even though I only have about
5-10 GB of test data). I've got 4 cores and 3 of them are sleeping. I
think it could be much faster if every core were used. What does
CouchDB do on a very large system, let's say 64 Atom cores (which
would sit idle in energy-saving mode) and 20 TB of data? Use 1 core
at, let's say, 1 GHz to munch through 20 TB? Oh please.

Why doesn't CouchDB use all cores to generate views?

Regards
Julian

P.S.: Maybe I'm totally wrong and the way you do it is right, but ATM it
makes me mad to see one core out of four working while the rest are idle.
