Re: Why I think view generation should be done concurrent.

J Chris Anderson Sun, 04 Jul 2010 10:29:34 -0700

On Jul 4, 2010, at 10:24 AM, Julian Moritz wrote:

> Hi,
> 
> On Sunday, 04.07.2010, 09:37 -0700, J Chris Anderson wrote:
>> On Jul 4, 2010, at 9:21 AM, Julian Moritz wrote:
>> 
>>> On Sunday, 04.07.2010, 07:10 -0700, J Chris Anderson wrote:
>>> 
>>>>> reduce.py is:
>>>>> 
>>>>> def fun(key, value, rereduce):
>>>>>  return True
>>>>> 
>>>> 
>>>> You should remove this reduce function. It's not doing you any good and 
>>>> it's burning up your CPU. Things will be much faster without it.
>>>> 
>>> 
>>> But does the view then still do what I want? I need the keys to be
>>> unique.
>>> 
>> 
>> if you just need unique keys, you can replace the text of the python reduce 
>> function with "_count" and you will avoid the python overhead for reduce, 
>> which will help a lot.
>> 
> 
> ok, thanks.
> 
>> also, if what you are really saying is that you only want each URL in your 
>> database once, it might make sense to consider using URLs (or URL hashes) as 
>> your docids, to prevent duplicates.
>> 
> 
> nope. I'm yield'ing the _outgoing_ urls of each url. Having one document
> per url is another topic (and I do that already).
>


one thing you can do that is kinda neat: in the map, emit(fetched_url, 1) for 
each URL that has been fetched, and emit(linked_url, 0) for any URL that is 
linked to.

then you can use _sum instead of _count (querying with group=true), and you 
will know to fetch any URLs where the reduce value is 0, because they haven't 
been fetched yet.
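
A rough sketch of that idea in plain Python, runnable without CouchDB. The
document fields "url" and "links" and the helpers sum_by_key and unfetched are
made up for illustration. sum_by_key only imitates what a _sum reduce queried
with group=true would return; it is not the view server itself.

```python
from collections import defaultdict


def map_fun(doc):
    # Hypothetical document layout: "url" is the fetched page,
    # "links" is the list of its outgoing URLs.
    yield doc["url"], 1              # this URL has been fetched
    for linked_url in doc.get("links", []):
        yield linked_url, 0          # this URL is only linked to


def sum_by_key(docs):
    # Local stand-in for a "_sum" reduce queried with group=true:
    # add up the emitted values per key.
    totals = defaultdict(int)
    for doc in docs:
        for key, value in map_fun(doc):
            totals[key] += value
    return dict(totals)


def unfetched(docs):
    # A sum of 0 means the URL was linked to but never fetched.
    return sorted(url for url, total in sum_by_key(docs).items() if total == 0)


docs = [
    {"url": "http://a.example/", "links": ["http://b.example/", "http://c.example/"]},
    {"url": "http://b.example/", "links": ["http://a.example/"]},
]
print(unfetched(docs))
```

With the two sample documents, only http://c.example/ is linked to without
having a fetched document of its own, so it is the only URL left to crawl.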

> Regards
> Julian
> 
>> 
>>> Regards
>>> Julian
>>> 
>>>> Chris
>>>> 
>>>>> If you're not able to read python code: it's generating a large list of
>>>>> unique pseudo-randomly ordered urls. I'm calling this view quite often
>>>>> (to get new urls to be crawled). 
>>>>> 
>>>>> What is my problem now? My couchdb process is at 100% CPU and the view
>>>>> sometimes takes quite long to be generated (even though I only have
>>>>> about 5-10 GB of test data). I've got 4 cores and 3 of them are
>>>>> sleeping. I think it could be much faster if every core was used. What
>>>>> does couchdb do with a very large system, let's say 64 Atom cores
>>>>> (which would save energy when idle) and 20 TB of data? Use 1 core at,
>>>>> let's say, 1 GHz to munch through 20 TB? Oh please. 
>>>>> 
>>>>> Why doesn't couchdb use all cores to generate views?
>>>>> 
>>>>> Regards
>>>>> Julian
>>>>> 
>>>>> P.S.: Maybe I'm totally wrong and the way you do it is right, but ATM it
>>>>> makes me mad to see one core out of four working while the rest are idle.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
>
