Hi,

On Jul 4, 2010, at 10:28 -0700, J Chris Anderson wrote:

> On Jul 4, 2010, at 10:24 AM, Julian Moritz wrote:
>
> > Hi,
> >
> > On Jul 4, 2010, at 09:37 -0700, J Chris Anderson wrote:
> >
> >> On Jul 4, 2010, at 9:21 AM, Julian Moritz wrote:
> >>
> >>> On Jul 4, 2010, at 07:10 -0700, J Chris Anderson wrote:
> >>>
> >>>>> reduce.py is:
> >>>>>
> >>>>> def fun(key, value, rereduce):
> >>>>>     return True
> >>>>>
> >>>> You should remove this reduce function. It's not doing you any good and
> >>>> it's burning up your CPU. Things will be much faster without it.
> >>>>
> >>> But does the view then still do what I want? I need the keys to be
> >>> unique.
> >>>
> >> If you just need unique keys, you can replace the text of the python
> >> reduce function with "_count" and you will avoid the python overhead for
> >> reduce, which will help a lot.
> >>
> > Ok, thanks.
> >
> >> Also, if what you are really saying is that you only want each URL in
> >> your database once, it might make sense to consider using URLs (or URL
> >> hashes) as your docids, to prevent duplicates.
> >>
> > Nope. I'm yield'ing the _outgoing_ urls of each url. Having one document
> > per url is another topic (and I do that already).
> >
> One thing you can do that is kinda neat is, in the map, emit(fetched_url, 1)
> for each URL that has been fetched, and emit(linked_url, 0) for any URL
> that is linked.
>
> Then you can use _sum instead of _count, and you will know to fetch any
> urls where the reduce value is 0, because they haven't been fetched yet.
>
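A minimal sketch of what that map function could look like for the
couchdb-python view server (the document fields "url" and "outgoing" are
assumptions here; the actual map.py is not shown in this thread):

    # map.py -- sketch only; assumes docs shaped like
    # {"url": "http://...", "outgoing": ["http://...", ...]}
    def fun(doc):
        # the page itself has been fetched: contribute 1 under its URL
        yield doc["url"], 1
        # outgoing links may not have been fetched yet: contribute 0
        for linked_url in doc.get("outgoing", []):
            yield linked_url, 0

With the built-in "_sum" as the reduce in the design document, a grouped
reduce value of 0 for a key means that URL has only ever appeared as a
link, so it still needs fetching.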
cooly-dooly, thank you a lot!

Regards
Julian

> >>>>> If you're not able to read python code: it's generating a large list
> >>>>> of unique pseudo-randomly ordered urls. I'm calling this view quite
> >>>>> often (to get new urls to be crawled).
> >>>>>
> >>>>> What is my problem now? My couchdb process is at 100% CPU and the
> >>>>> view sometimes takes quite long to generate (even though I only have
> >>>>> about 5-10 GB of test data). I've got 4 cores and 3 of them are
> >>>>> sleeping. I think it could be much faster if every core were used.
> >>>>> What does couchdb do on a very large system, say 64 Atom cores
> >>>>> (which would be idling in energy-saving mode) and 20 TB of data?
> >>>>> Using 1 core at, say, 1 GHz to munch through 20 TB? Oh please.
> >>>>>
> >>>>> Why doesn't couchdb use all cores to generate views?
> >>>>>
> >>>>> Regards
> >>>>> Julian
> >>>>>
> >>>>> P.S.: Maybe I'm totally wrong and the way you do it is right, but
> >>>>> ATM it makes me mad to see one core out of four working while the
> >>>>> rest are idle.
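A sketch of how the crawler might then pick up the unfetched URLs from
such a view with couchdb-python (the server URL, database name, and
design doc/view names are made up for illustration):

    import couchdb

    # connect to a local CouchDB (URL/db/view names are assumptions)
    server = couchdb.Server('http://localhost:5984/')
    db = server['crawler']

    # group=True reduces per distinct key (URL); with a _sum reduce,
    # a value of 0 means the URL was only seen as a link, never fetched
    to_fetch = [row.key for row in db.view('crawler/urls', group=True)
                if row.value == 0]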
