Hi,

On Sunday, 04.07.2010 at 10:28 -0700, J Chris Anderson wrote:
> On Jul 4, 2010, at 10:24 AM, Julian Moritz wrote:
> 
> > Hi,
> > 
> > On Sunday, 04.07.2010 at 09:37 -0700, J Chris Anderson wrote:
> >> On Jul 4, 2010, at 9:21 AM, Julian Moritz wrote:
> >> 
> >>> On Sunday, 04.07.2010 at 07:10 -0700, J Chris Anderson wrote:
> >>> 
> >>>>> reduce.py is:
> >>>>> 
> >>>>> # constant reduce, used only to collapse rows to unique keys
> >>>>> def fun(key, value, rereduce):
> >>>>>     return True
> >>>>> 
> >>>> 
> >>>> You should remove this reduce function. It's not doing you any good and 
> >>>> it's burning up your CPU. Things will be much faster without it.
> >>>> 
> >>> 
> >>> But does the view then still do what I want? I need the keys to be
> >>> unique.
> >>> 
> >> 
> >> If you just need unique keys, you can replace the text of the Python
> >> reduce function with "_count"; that avoids the Python overhead for the
> >> reduce, which will help a lot.
> >> 
> > 
> > ok, thanks.
> > 
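For reference, that would make the design document something like this
(untested sketch using the couchdb-python view server; the database,
docid and view names are made up):

import couchdb

server = couchdb.Server('http://localhost:5984/')
db = server['crawler']

# the map emits each outgoing URL as a key; the builtin "_count"
# reduce runs inside Erlang and replaces the Python reduce entirely
db.save({
    '_id': '_design/urls',
    'language': 'python',
    'views': {
        'outgoing': {
            'map': "def fun(doc):\n"
                   "    for url in doc.get('outgoing', []):\n"
                   "        yield url, 1\n",
            'reduce': '_count',
        }
    }
})

# group=True collapses the rows to one per unique key
for row in db.view('urls/outgoing', group=True):
    print row.key, row.value
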
> >> also, if what you are really saying is that you only want each URL in your 
> >> database once, it might make sense to consider using URLs (or URL hashes) 
> >> as your docids, to prevent duplicates.
> >> 
> > 
> > Nope. I'm yielding the _outgoing_ URLs of each URL. Having one document
> > per URL is another topic (and I do that already).
> > 
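(For anyone curious about the docid approach: a deterministic _id derived
from the URL would look roughly like this untested sketch, so saving the
same URL twice conflicts instead of creating a duplicate document:)

import hashlib
import couchdb

db = couchdb.Server('http://localhost:5984/')['crawler']

def doc_id_for(url):
    # same URL -> same _id; CouchDB rejects the second save
    # with a conflict instead of storing a duplicate
    return hashlib.sha1(url).hexdigest()

doc = {'_id': doc_id_for('http://example.com/'), 'url': 'http://example.com/'}
db.save(doc)  # a second save of the same URL raises a conflict error
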
> 
> One thing you can do that is kinda neat: in the map, emit(fetched_url, 1)
> for each URL that has been fetched, and emit(linked_url, 0) for any URL
> that is linked.
> 
> Then you can use _sum instead of _count, and you will know to fetch any
> URLs where the reduce value is 0, because they haven't been fetched yet.
> 

cooly-dooly, thank you a lot!
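
If I understand the trick right, the map would be something like this
(untested sketch; the 'url' and 'outgoing' field names are made up):

def fun(doc):
    # a fetched page contributes 1 for its own URL ...
    yield doc['url'], 1
    # ... and 0 for every URL it links to
    for linked in doc.get('outgoing', []):
        yield linked, 0

With "_sum" as the reduce and group=True, any row whose value is 0 is a
URL that is linked somewhere but not yet fetched:

for row in db.view('urls/outgoing', group=True):
    if row.value == 0:
        crawl(row.key)  # hypothetical crawl function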

Regards
Julian

> > Regards
> > Julian
> > 
> >> 
> >>> Regards
> >>> Julian
> >>> 
> >>>> Chris
> >>>> 
> >>>>> If you're not able to read Python code: it generates a large list of
> >>>>> unique, pseudo-randomly ordered URLs. I'm calling this view quite
> >>>>> often (to get new URLs to crawl).
> >>>>> 
> >>>>> So what is my problem? My CouchDB process is at 100% CPU and the view
> >>>>> sometimes takes quite long to build (even though I only have about
> >>>>> 5-10 GB of test data). I've got 4 cores and 3 of them are sleeping. I
> >>>>> think it could be much faster if every core were used. What does
> >>>>> CouchDB do on a very large system, say 64 Atom cores (which would
> >>>>> otherwise idle in an energy-saving mode) and 20 TB of data? Use 1
> >>>>> core at, say, 1 GHz to munch through 20 TB? Oh please.
> >>>>> 
> >>>>> Why doesn't CouchDB use all cores to generate views?
> >>>>> 
> >>>>> Regards
> >>>>> Julian
> >>>>> 
> >>>>> P.S.: Maybe I'm totally wrong and the way you do it is right, but at
> >>>>> the moment it makes me mad to see one core out of four working while
> >>>>> the rest are idle.
> >>>>> 
> >>>> 
> >>> 
> >>> 
> >> 
> > 
> > 
> 

