just wanted to add http://wiki.apache.org/couchdb/Built-In_Reduce_Functions
:)
On 28.05.2010, at 23:57, J Chris Anderson wrote:
>
> On May 28, 2010, at 10:02 AM, Aurélien Bénel wrote:
>
>> Thanks for your answer,
>>
>>> It seems that you're using a _list function to filter your view results,
>>> right?
>>> Be aware that even though you're not sending that data to the client, the
>>> database still has to iterate through all the view rows and send them to the
>>> _list function, just to get filtered there. So the time it takes to query
>>> your view/list will increase in proportion to the number of rows returned
>>> from the view query.
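To make that cost concrete, here is a rough sketch in plain JavaScript, run outside CouchDB. The runList helper and the sample rows are hypothetical stand-ins for the getRow()/send() loop; the only point is that every view row is visited even when few are sent.

```javascript
// Hypothetical stand-in for CouchDB's list machinery: feeds every view row
// to the filter and records how many rows were visited vs. sent.
function runList(rows, keep) {
  var visited = 0;
  var sent = [];
  for (var i = 0; i < rows.length; i++) {
    visited++;                 // the row is delivered whether or not it is kept
    if (keep(rows[i])) {
      sent.push(rows[i]);
    }
  }
  return { visited: visited, sent: sent };
}

// Made-up sample rows, shaped like reduced view rows (key, value).
var rows = [
  { key: ["a", "b", "c"], value: 1 },
  { key: ["d", "e", "f"], value: 3 },
  { key: ["g", "h", "i"], value: 1 }
];

var result = runList(rows, function (row) { return row.value > 1; });
```

Here result.visited equals the total number of rows, while result.sent holds only the rows that passed the filter.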
>>
>> Yes. This is indeed why I am sceptical about this way of selecting reduce
>> values.
>>
>> In our project, we are trying to move our open-source text analysis software
>> from PHP/PostgreSQL to CouchDB.
>> The current issue is about finding repeated phrases (sequences of 3 words)
>> in forums.
>>
>> Each forum thread is stored as a CouchDB "document".
>>
>> A view emits every sequence that matches a set of constraints:
>>
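>> function(doc) {
>>   // Split text into alternating runs of word and non-word characters.
>>   var ALPHA = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;
>>   for (var k in doc.posts) {
>>     var p = doc.posts[k];
>>     // match() returns null when nothing matches, so fall back to [].
>>     var words = (p.text || '').match(ALPHA) || [];
>>     // Step by 2: this assumes the text starts with a word, so even
>>     // indexes hold words and odd indexes hold separators.
>>     for (var i = 0; i < words.length - 4; i += 2) {
>>       if (
>>         (words[i].length > 3 || words[i+2].length > 3 || words[i+4].length > 3)
>>         && words[i+1].length == 1
>>         && words[i+3].length == 1
>>       ) {
>>         emit([
>>           words[i].toLowerCase(),
>>           words[i+2].toLowerCase(),
>>           words[i+4].toLowerCase()
>>         ], null);
>>       }
>>     }
>>   }
>> }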
>>
>> Then a reduce counts occurrences across the whole corpus:
>>
>> function(keys, values, rereduce) {
>>   if (rereduce) {
>>     return sum(values);    // combine partial counts
>>   } else {
>>     return values.length;  // count the emitted rows
>>   }
>> }
>>
>
> Try replacing the reduce function with the single-word string "_count"
> (without the quotes).
>
> This runs the reduce in Erlang and should speed things up a lot. Please let us
> know what kind of difference this makes.
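As a rough illustration of why the swap should not change the results, the sketch below runs the JavaScript reduce from the message above in plain JavaScript and compares it with a simple row count, which is what the built-in _count returns. The sum helper mimics the one CouchDB gives to reduce functions; countReduce, countRows and the sample data are hypothetical names for this sketch only.

```javascript
// Mimics the sum() helper available to CouchDB reduce functions.
function sum(values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// The reduce from the thread; the third argument is the rereduce flag.
function countReduce(keys, values, rereduce) {
  if (rereduce) {
    return sum(values);      // combine partial counts
  } else {
    return values.length;    // count emitted rows
  }
}

// Hypothetical reference: counting rows directly, as _count does.
function countRows(rows) {
  return rows.length;
}

// Five emitted rows (value null), reduced in two chunks, then rereduced.
var emitted = [null, null, null, null, null];
var partialA = countReduce(null, emitted.slice(0, 2), false);
var partialB = countReduce(null, emitted.slice(2), false);
var total = countReduce(null, [partialA, partialB], true);
```

Both paths give the same count. In the design document's JSON, the view would then carry "reduce": "_count" in place of the function source.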
>
>> Then a list function filters out unrepeated phrases:
>>
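>> function(head, req) {
>>   var phrase;
>>   var first = true;
>>   send('{"rows":[\n');
>>   while (phrase = getRow()) {
>>     if (phrase.value > 1) { // is repeated
>>       // Write the comma before each row except the first, so that no
>>       // comma trails the last row and the output stays valid JSON.
>>       if (!first) send(',\n');
>>       send(JSON.stringify(phrase));
>>       first = false;
>>     }
>>   }
>>   send('\n]}');
>> }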
>>
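A list function like this can be checked outside CouchDB with a small harness. The sketch below is only that: runListFunction is a hypothetical helper, and getRow and send are passed in as arguments here, whereas in CouchDB they are provided by the view server and the function takes (head, req). The filter writes the comma before each kept row except the first, since a comma after the last row would make the body invalid JSON.

```javascript
// Hypothetical harness: stubs for the getRow() and send() functions that
// CouchDB provides to list functions, so the logic can run outside CouchDB.
function runListFunction(listFn, rows) {
  var output = [];
  var index = 0;
  var getRow = function () {
    return index < rows.length ? rows[index++] : null;
  };
  var send = function (chunk) { output.push(chunk); };
  listFn(getRow, send);
  return output.join("");
}

// Same filter as in the thread, with the separator written before every
// kept row except the first, so no comma trails the last one.
function repeatedPhrases(getRow, send) {
  var phrase;
  var first = true;
  send('{"rows":[\n');
  while ((phrase = getRow())) {
    if (phrase.value > 1) {          // is repeated
      if (!first) {
        send(',\n');
      }
      send(JSON.stringify(phrase));
      first = false;
    }
  }
  send('\n]}');
}

// Made-up sample rows, shaped like the reduced view rows.
var body = runListFunction(repeatedPhrases, [
  { key: ["il", "y", "a"], value: 4 },
  { key: ["a", "b", "c"], value: 1 },
  { key: ["je", "ne", "sais"], value: 2 }
]);
var parsed = JSON.parse(body);
```

If the body were not valid JSON, JSON.parse would throw, so this doubles as a quick sanity check on the output format.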
>>
>> I know that the view could be written differently, and probably more
>> efficiently, with regular expressions, but my worry is not the cost of
>> building the view the first time (that was what I meant by "cached"), but
>> the cost paid every time I query the list.
>>
>>
>> Regards,
>>
>> Aurélien
>