Just wanted to add the wiki page on built-in reduce functions:
http://wiki.apache.org/couchdb/Built-In_Reduce_Functions

:)

On 28.05.2010, at 23:57, J Chris Anderson wrote:

> 
> On May 28, 2010, at 10:02 AM, Aurélien Bénel wrote:
> 
>> Thanks for your answer,
>> 
>>> It seems that you're using a _list function to filter your view results, 
>>> right? 
>>> Be aware that even though you're not sending that data to the client, the 
>>> database still has to iterate through all the view rows and send them to 
>>> the _list function, just to get filtered there. So the time it takes to 
>>> query your view/list will increase in proportion to the number of rows 
>>> returned from the view query.
>> 
>> Yes. This is indeed why I am sceptical about this way of selecting reduce 
>> values.
>> 
>> In our project, we are trying to move our open-source text analysis 
>> software from PHP/PostgreSQL to CouchDB.
>> The current issue is about getting repeated phrases (sequences of 3 words) 
>> in forums. 
>> 
>> Each forum thread is stored as a CouchDB "document".
>> 
>> A view emits every sequence that matches several constraints:
>> 
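>> function(doc) {
>>   // Tokens alternate between word runs and separator runs.
>>   var ALPHA = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;
>>   var posts = doc.posts || []; // assumes doc.posts is an array of posts
>>   for (var j = 0; j < posts.length; j++) {
>>     var words = (posts[j].text || '').match(ALPHA) || [];
>>     for (var i = 0; i < words.length - 4; i += 2) {
>>       if (
>>         // at least one of the three words is longer than 3 characters
>>         (words[i].length > 3 || words[i+2].length > 3 || words[i+4].length > 3)
>>         // and the words are separated by single characters
>>         && words[i+1].length == 1
>>         && words[i+3].length == 1
>>       ) {
>>         emit([
>>           words[i].toLowerCase(),
>>           words[i+2].toLowerCase(),
>>           words[i+4].toLowerCase()
>>         ], null);
>>       }
>>     }
>>   }
>> }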
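
To make the emitted keys concrete, here is a minimal sketch that runs the same token pattern outside CouchDB. `phrasesOf` is a hypothetical helper of mine (not a CouchDB function), and the sample sentences are made up; it only collects the keys the map function would pass to emit() for one post.

```javascript
// Same token pattern as the map function above: runs of word characters
// alternate with runs of everything else (spaces, punctuation).
var ALPHA = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;

// Hypothetical helper (not part of CouchDB): returns the keys that the
// map function would emit for the text of a single post.
function phrasesOf(text) {
  var words = text.match(ALPHA) || [];
  var keys = [];
  for (var i = 0; i < words.length - 4; i += 2) {
    if (
      (words[i].length > 3 || words[i + 2].length > 3 || words[i + 4].length > 3)
      && words[i + 1].length == 1
      && words[i + 3].length == 1
    ) {
      keys.push([
        words[i].toLowerCase(),
        words[i + 2].toLowerCase(),
        words[i + 4].toLowerCase()
      ]);
    }
  }
  return keys;
}

var sampleKeys = phrasesOf("The quick brown fox");
// → [["the","quick","brown"],["quick","brown","fox"]]

var shortWords = phrasesOf("a b c");
// → [] (no word is longer than 3 characters)

var leadingSpace = phrasesOf(" The quick brown fox");
// → [] (the leading space shifts the words to the odd indexes)
```

The last case may be worth checking against real posts: in this sketch, a text that begins with whitespace or punctuation puts the separators at the even indexes, so no phrase passes the test.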
>> 
>> A reduce then counts occurrences across the whole corpus:
>> 
>> function(keys, values, combine) {
>> if (combine) {
>>  return sum(values);
>> } else {
>>  return values.length;
>> }
>> }
>> 
> 
> Try replacing the reduce function with the single-word string "_count" 
> (without the quotes).
> 
> This runs the reduce in Erlang and should speed things up a lot. Please let 
> us know what kind of difference this makes.
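
In case it helps, a rough sketch of what that change looks like in the design document, plus a local check of what the counting reduce computes. The names `_design/phrases` and `by_phrase` are made up for illustration, the map source is abbreviated, and `sum` is defined here only because the view server normally provides it.

```javascript
// Hypothetical design document: "_design/phrases" and "by_phrase" are
// made-up names. In the stored JSON the reduce is just the string "_count".
var designDoc = {
  _id: "_design/phrases",
  views: {
    by_phrase: {
      map: "function(doc) { /* map function from the thread */ }",
      reduce: "_count"
    }
  }
};

// Local stand-in for the sum() helper that the CouchDB view server provides.
function sum(values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// The JavaScript reduce from the thread, which "_count" would replace.
function jsReduce(keys, values, combine) {
  if (combine) {
    return sum(values);   // rereduce: add up partial counts
  } else {
    return values.length; // first pass: count the rows
  }
}

// Two batches of rows for the same phrase (the map emits null values),
// counted separately and then combined.
var partA = jsReduce(null, [null, null, null], false); // → 3
var partB = jsReduce(null, [null, null], false);       // → 2
var total = jsReduce(null, [partA, partB], true);      // → 5
```

As far as I understand it, `_count` gives the same numbers: it counts rows on the first pass and adds the partial counts on rereduce, without calling into the JavaScript view server.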
> 
>> Then a list function filters out unrepeated phrases:
>> 
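>> function(head, req) {
>>   var phrase;
>>   var first = true;
>>   send('{"rows":[\n');
>>   while (phrase = getRow()) {
>>     if (phrase.value > 1) { // is repeated
>>       // Send the separator before every row except the first, so the
>>       // output has no trailing comma and stays valid JSON.
>>       if (!first) {
>>         send(',\n');
>>       }
>>       send(JSON.stringify(phrase));
>>       first = false;
>>     }
>>   }
>>   send('\n]}');
>> }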
>> 
>> 
>> I know that the view could be done differently and probably more efficiently 
>> with regular expressions, but my worry is not the cost of generating the 
>> view the first time (that was what I meant by "cached"), but the cost paid 
>> every time I query the list. 
>> 
>> 
>> Regards,
>> 
>> Aurélien
> 
