On Jul 4, 2010, at 10:24 AM, Julian Moritz wrote:

> Hi,
>
> On Sunday, 04.07.2010, at 09:37 -0700, J Chris Anderson wrote:
>> On Jul 4, 2010, at 9:21 AM, Julian Moritz wrote:
>>
>>> On Sunday, 04.07.2010, at 07:10 -0700, J Chris Anderson wrote:
>>>
>>>>> reduce.py is:
>>>>>
>>>>> def fun(key, value, rereduce):
>>>>>     return True
>>>>>
>>>>
>>>> You should remove this reduce function. It's not doing you any good and
>>>> it's burning up your CPU. Things will be much faster without it.
>>>>
>>>
>>> But does the view then still do what I want it to? I need the keys to be
>>> unique.
>>>
>>
>> if you just need unique keys, you can replace the text of the python reduce
>> function with "_count" and you will avoid the python overhead for reduce,
>> which will help a lot.
>>
>
> ok, thanks.
>
>> also, if what you are really saying is that you only want each URL in your
>> database once, it might make sense to consider using URLs (or URL hashes) as
>> your docids, to prevent duplicates.
>>
>
> nope. I'm yield'ing the _outgoing_ urls of each url. Having one document
> per url is another topic (and I do that already).
>

one thing you can do that is kinda neat: in the map, emit(fetched_url, 1) for each URL that has been fetched, and emit(linked_url, 0) for any URL that is linked. then you can use _sum instead of _count, and you will know to fetch any URLs where the reduce value is 0, because they haven't been fetched yet.

> Regards
> Julian
>
>>
>>> Regards
>>> Julian
>>>
>>>> Chris
>>>>
>>>>> If you're not able to read python code: it's generating a large list of
>>>>> unique pseudo-randomly ordered urls. I'm calling this view quite often
>>>>> (to get new urls to be crawled).
>>>>>
>>>>> What is my problem now? My couchdb process is at 100% CPU and the view
>>>>> sometimes takes quite long to be generated (even though I've only got
>>>>> about 5-10 GB of test data). I've got 4 cores and 3 of them are sleeping.
>>>>> I think it could be much faster if every core was used. What does
>>>>> couchdb do with a very large system, let's say 64 Atom cores (which
>>>>> would be in an idle energy-saving mode) and 20 TB of data? Using 1 core
>>>>> with let's say 1 GHz to munch down 20 TB? Oh please.
>>>>>
>>>>> Why doesn't couchdb use all cores to generate views?
>>>>>
>>>>> Regards
>>>>> Julian
>>>>>
>>>>> P.S.: Maybe I'm totally wrong and the way you do it is right, but ATM it
>>>>> makes me mad to see one core out of four working while the rest are idle.
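
A rough sketch of the emit(fetched_url, 1) / emit(linked_url, 0) idea, in case it helps. The map function is written in the generator style of the Python view server (yield instead of emit). The document fields "url", "fetched" and "links" are assumptions for illustration only, not your actual schema, and grouped_sum is just a local stand-in for what the built-in _sum reduce returns when the view is queried with group=true, so the example runs without a CouchDB server.

```python
from collections import defaultdict


def fun(doc):
    # Map function. Assumed (hypothetical) document layout:
    #   {"url": "...", "fetched": True/False, "links": ["...", ...]}
    if doc.get("fetched"):
        # This URL has been fetched.
        yield doc["url"], 1
    for linked_url in doc.get("links", []):
        # Every outgoing link contributes 0 to its key's sum.
        yield linked_url, 0


def grouped_sum(docs):
    # Local stand-in for the built-in "_sum" reduce with group=true:
    # one total per unique key.
    totals = defaultdict(int)
    for doc in docs:
        for key, value in fun(doc):
            totals[key] += value
    return dict(totals)


def unfetched(docs):
    # URLs whose reduce value is 0 are linked somewhere but never fetched.
    return sorted(url for url, total in grouped_sum(docs).items() if total == 0)


docs = [
    {"url": "http://a.example/", "fetched": True,
     "links": ["http://b.example/", "http://c.example/"]},
    {"url": "http://b.example/", "fetched": True,
     "links": ["http://a.example/"]},
]

print(unfetched(docs))
```

In the real view the reduce would simply be the string "_sum", so no Python runs on the reduce side; the crawler then picks the rows whose value is 0.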
