On Jul 4, 2010, at 2:36 AM, Julian Moritz wrote:

> Hi,
>
> a few days ago I've tweeted a wish to have view generation done
> concurrent. I'll tell you why (because @janl doesn't think so).
>
> I've got some documents in the form of:
>
> _id: 1,
> _rev: 3-abc,
> url: http://www.abc.com,
> hrefs: [http://www.xyz.com,
> http://www.nbc.com,
> ...,
> ...,
> ...]
>
> As you can imagine me crawling the web, I got plenty of them. And every
> second thousands more. I've got a view, map.py is:
>
> def fun(doc):
>     h = hash
>     if doc.has_key("hrefs"):
>         for href in doc["hrefs"]:
>             yield (h(href), href), None
>
> reduce.py is:
>
> def fun(key, value, rereduce):
>     return True
>

You should remove this reduce function. It's not doing you any good and it's burning up your CPU. Things will be much faster without it.

Chris

> If you're not able to read python code: it's generating a large list of
> unique pseudo-randomly ordered urls. I'm calling this view quite often
> (to get new urls to be crawled).
>
> What is my problem now? My couchdb process is at 100%cpu and the view
> needs sometimes quite long to be generated (even if I got only testing
> data about 5-10 GB). I've got 4 cores and 3 of them are sleeping. I
> think it could be way more faster if every core was used. What does
> couchdb do with a very large system, let's say 64 atom cores (which
> would be in an idle mode energy saving) and 20TB of data? Using 1 core
> with let's say 1ghz to munch down 20TB? Oh please.
>
> Why doesn't couchdb use all cores to generate views?
>
> Regards
> Julian
>
> P.S.: Maybe I'm totally wrong and the way you do it is right, but ATM it
> makes me mad to see one core out of four working and the rest is idle.
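
For illustration, here is a minimal sketch of what the quoted map function emits once the reduce is gone. It is written for Python 3 and runs outside CouchDB. The names `map_fun` and `unique_urls` and the sample document are made up for this example and are not part of the original view. Dropping adjacent duplicates on the reading side is also only an assumption about how a client could keep the URL list unique without a reduce.

```python
def map_fun(doc):
    """Standalone copy of the quoted map function, in Python 3 syntax."""
    # `in` replaces dict.has_key(), which only exists in Python 2.
    if "hrefs" in doc:
        for href in doc["hrefs"]:
            # The key is (hash, url); ordering rows by hash first gives
            # the pseudo-random ordering described in the mail.
            yield (hash(href), href), None


def unique_urls(rows):
    """Hypothetical client-side helper: skip repeated keys in sorted rows."""
    previous = None
    for key, _value in rows:
        if key != previous:
            yield key[1]
        previous = key


# Sample document (illustrative data only), with one duplicated link.
doc = {
    "_id": "1",
    "url": "http://www.abc.com",
    "hrefs": [
        "http://www.xyz.com",
        "http://www.nbc.com",
        "http://www.xyz.com",
    ],
}

# A view returns rows sorted by key, so equal keys end up next to each other.
# Note: Python 3 randomizes str hashes per process unless PYTHONHASHSEED is
# set, so the order of the rows differs between runs.
rows = sorted(map_fun(doc), key=lambda row: row[0])
urls = list(unique_urls(rows))
print(urls)
```

Because rows with equal keys are adjacent in sorted order, a single pass is enough to drop the repeats and no reduce step is involved.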
