Thanks for your answer,
> It seems that you're using a _list function to filter your view results,
> right?
> Be aware that even though you're not sending that data to the client, the
> database still has to iterate thru all the view rows and send them to the
> _list function, just to get filtered there. So the amount of time it takes to
> query your view/list will increase proportionally with the number rows
> returned from the view query.
Yes. This is indeed why I am sceptical about this way of selecting reduce values.
In our project, we try to move our open-source text analysis software from
PHP/PostgreSQL to CouchDB.
The current issue is about getting repeated phrases (sequences of 3 words) in
forums.
Each forum thread is stored as a CouchDB "document".
A view emits every sequence that matches the following constraints:
function(doc) {
  // Tokenize into alternating runs of letters/digits and of separators.
  const ALPHA = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;
  doc.posts.forEach(function(p) {
    var words = p.text.match(ALPHA);
    if (!words) return;
    // Even indices are word tokens, odd indices are separators.
    for (var i = 0; i < words.length - 4; i += 2) {
      if (
        // at least one of the three words is longer than 3 characters
        (words[i].length > 3 || words[i+2].length > 3 || words[i+4].length > 3)
        // and both separators are single characters (e.g. a space)
        && words[i+1].length == 1
        && words[i+3].length == 1
      ) {
        emit([
          words[i].toLowerCase(),
          words[i+2].toLowerCase(),
          words[i+4].toLowerCase()
        ], null);
      }
    }
  });
}
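For instance (the sentence is made up), a post whose text is "le chat dort sur
le tapis" would produce the following rows:
emit(["le", "chat", "dort"], null)
emit(["chat", "dort", "sur"], null)
emit(["dort", "sur", "le"], null)
emit(["sur", "le", "tapis"], null)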
Then a reduce counts occurrences over the whole corpus:
function(keys, values, rereduce) {
  if (rereduce) {
    // combine partial counts coming from previous reduce calls
    return sum(values);
  } else {
    // one emitted row per occurrence, so the count is the number of values
    return values.length;
  }
}
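As a side note, if the CouchDB version allows it, I suppose this hand-written
reduce could be replaced by the built-in "_count" reduce, which should give the
same result. A sketch of the design document (all names are made up):
{
  "_id": "_design/phrases",
  "views": {
    "sequences": {
      "map": "function(doc) { ... }",
      "reduce": "_count"
    }
  }
}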
Then a list function filters out unrepeated phrases:
function(head, req) {
  var phrase, first = true;
  send('{"rows":[\n');
  while (phrase = getRow()) {
    if (phrase.value > 1) { // keep only repeated phrases
      // comma before every row except the first, to keep the JSON valid
      send((first ? '' : ',\n') + JSON.stringify(phrase));
      first = false;
    }
  }
  send('\n]}');
}
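For reference, I query the list roughly like this (design document, list and
view names are placeholders):
GET /mydb/_design/phrases/_list/repeated/sequences?group=true
group=true is what produces one reduce value per distinct phrase, so that
phrase.value is the count for that phrase.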
I know that the view could be written differently and probably more efficiently
with regular expressions, but my worry is not about the performance of the
initial view generation (that was what I meant by "cached"); it is about the
cost incurred every time I query the list.
Regards,
Aurélien