On Sat, Apr 25, 2009 at 3:40 PM, Brian Candler <[email protected]> wrote: > On Sat, Apr 25, 2009 at 10:27:33AM -0700, Adam Wolff wrote: >> Hi list, >> I have a little query problem that would be easy to solve in SQL, but >> seems hard in CouchDB. I have data that looks like this: >> doc1: { >> keys : [1,2,3], >> data : .... >> } >> doc2: { >> keys : [1,3,5], >> data : .... >> } >> ... >> >> I'd like a view of my documents that lets me filter on multiple keys, >> so a query of keys=[1] yields doc1,doc2, as does a query of >> keys=[1,3]. > > What do you mean by keys=[1,3]? Does it mean all documents which contains > either key 1 or 3, or all documents which contains both keys 1 and 3? > > In the first case, you can POST to the view with {"keys":[1,3]} > > In the latter case, the simplest answer is to fetch the views for key=1 and > key=3, and intersect them on the client. Actually you can be smarter than > that: fetch whichever result set is smaller (keys=1 or key=3), and then > filter the result set on the client to discard those entries which don't > have the other key. > > (This is exactly what a SQL database would do, except it hides it from you > by doing it all server-side) > >> In couch, the only way I can think of to truly index all the >> combinations of the keys is to emit the n! combinations of the keys, >> which seems like a bad idea. (Maybe it's not though?) > > I think it would be (2^n)-1 rather than n!, since you can sort them first. > > e.g. for 5 keys there would be 31 combinations: > 1 x [] -- useless, ignore > 5 x [a] > 10 x [a b] (a <= b) > 10 x [a b c] (a <= b <= c) > 5 x [a b c d] > 1 x [a b c d e] > > Actually you can halve it again, by realising that if you have emitted > [1,2,3,4,5] then you don't need to emit any prefixes of this, e.g. [1,2,3,4] > or [1,2,3] or [1,2] or [1], because a startkey-endkey query can be used to > find them. >
I've contemplated something like this, but I'm pretty sure it breaks when you start contemplating that a key may end up being [a, e] in which case your querying needs to account for all possible values between a and e. In other news, if someone wants to write an inverted index implementation that could solve this class of queries I'm sure there'd probably be a couple beers in it. HTH, Paul Davis > Mind you, exponential growth still ain't good :-) It might be acceptable if > you have a limited number of tags, and you really want an instantaneous > answer to your multi-tag query across millions of documents. > > Other tradeoffs are possible. For example, it might make sense just to emit > all single tags plus all combinations of two tags; that would be (n^2+n)/2 > combinations. Then your query can instantly find all docs matching two of > the tags requested, and if the user has asked for three or more, filter > those at the client. But you still have to choose which 2 tags to query on, > and it may well not be worth it in practice. > > Regards, > > Brian. >
