Hi! Background: I'm working on a web app that uses a single CouchDB database for storing data belong to 400000+ users. Each user has an average of about 40 documents that need to be fetched in one go when the frontend is launched. I have accomplished this by querying a simple view with ?key=ownerID (with a fallback to /_alldocs?startkey=<ownerID>_...&endkey=<ownerID>~ if the view isn't built). Since the data for each user rarely changes, there's a potential to save resources by supporting conditional GET with If-None-Match, which would amount having the web app backend copy the CouchDB-generated ETag into the response sent to the browser.
However, I just learned that CouchDB only maintains a single ETag for the entire view, so every time one of my users changes something, the ETag for everyone else's query result also changes. This makes conditional GETs useless with this usage pattern. I asked about this on #couchdb and had a brief talk with rnewson, who was sympathetic to the idea. Unfortunately we weren't able to come up with an idea that didn't involve traversing all docs in the result just for computing the ETag (my suggestion was a hash of the _revs of all docs contributing to the result). That would be a bad default, but might still work as an opt-in thing per request, eg. slowetag=true. Newson said I should try raising the discussion here in case someone else had an idea for a cheaper way to calculate a good ETag. So what does everyone else think about this? Is my use case too rare, or would it be worthwhile to implement it? Best regards, Andreas Lind Petersen (papandreou) Here's our chat transcript: [11:46] <papandreou> Does anyone know if there are plans for issuing even more granular etags for view lookups? When you only look up a small range or a specific key it would be really great if the ETag only changed when that subset changes rather than the entire view. [11:47] <papandreou> In the application I'm working on I'll hardly ever be able to get a 304 response because of this. [...] [13:51] <+rnewson> papandreou: unlikely. [13:52] <papandreou> rnewson: So the best thing I can do is to fetch the data and compute a better etag myself? (My use case is a backend for a web app) [13:53] <+rnewson> papandreou: You might be able to set ETag in a list function? If you can't, I'll gladly change CouchDB so you can. [13:54] <papandreou> rnewson: I thought about that, too, but that would cause a big overhead for every request, right? [13:55] <papandreou> rnewson: (Last time I tried views were slooow) [13:55] <papandreou> I mean lists [13:55] <+rnewson> papandreou: slower, yes, because couch needs to evaluate the javascript in an external process. [13:55] <+rnewson> how will you calculate the fine-grained ETag? [13:56] <+rnewson> Also we did recently make it slightly finer, before it was view group scope and now it's the view itself (I think) [13:56] <papandreou> rnewson: Maybe something like a hash of the _revs of all the documents contributing to the result? [13:56] <+rnewson> hm, that makes no sense actually. but we did refine it recently. [13:57] <+rnewson> papandreou: that doesn't sound cheap at all, and it would need to be cheaper than doing the view query itself to make sense. [13:58] <papandreou> rnewson: There's still the bandwidth thing [13:58] <+rnewson> oh, you're working with restricted bandwidth and/or have huge view responses? [13:59] <papandreou> rnewson: And it would be really nice to have something like this completely handled by the database instead of inventing a bunch of workarounds. [14:01] <+rnewson> If there's a correct and efficient algorithm for doing it, I'm sure it would be applied. [14:02] <papandreou> rnewson: I guess it depends on the use case. If the database is rarely updated I suppose the current tradeoff is better. [14:03] <+rnewson> I'm sure the only reason we have ETags at the current granularity is because it's very quick to calculate. A finer-grain would be committed if a viable approach was proposed. [14:04] <papandreou> rnewson: I have a huge database with data belonging to 400000+ different users, and I'm using a view to enable a lookup-by-owner thing. But every time a single piece of data is inserted, the ETag for the view changes [14:04] == case_ [[email protected]] has quit [Read error: Connection reset by peer] [14:04] <+rnewson> yes, I've completely understood the problem you stated earlier. [14:05] <+rnewson> I can't think of a way to improve this right now but I would spend the time to implement it if you had one. [14:06] <papandreou> rnewson: So right now the code path that sends a 304 only needs to look at a single piece of metadata for the view to make its decision? That'll be hard to beat :) [14:07] <+rnewson> doesn't need to beat it, it just needs to be fast. [14:07] <+rnewson> but I don't see any current possible solutions, let alone fast ones. [14:07] <papandreou> rnewson: Well, thanks anyway for considering my suggestion. I'll let you know of I get an idea :) [14:08] <+rnewson> and it is now per-view and not per-viewgroup. so it's what I said first before I thought it was silly [14:08] <+benoitc> query + last seq returned maybe .... [14:08] <+rnewson> but obviously a change could affect one view in a group but not others [14:09] <papandreou> benoitc: The query is already sort of included since it's in the url. [14:09] <+rnewson> benoitc: ? [14:10] <+benoitc> i was meaning last committed seq,but it won't change anything ... [14:10] <papandreou> benoitc: I guess you'd also need to make sure that the ETag changes if a document is deleted? [14:10] <papandreou> ah [14:10] <+rnewson> benoitc: we already use the update_seq of the #view, which is finer-grained that db's last committed seq [14:11] <+benoitc> rnewson: commited seq in the view group but anyway it won't work [14:12] <+rnewson> benoitc: right, that would be the pre-1.1.0 behavior, I think. [14:12] <+rnewson> which is coarser [14:12] <+rnewson> we simply don't record the info that papandreou's suggestion would need to work. [14:12] <+benoitc> papandreou: easier solution would be to request each time on on stale view [14:13] <papandreou> rnewson: Another reason why my suggestion sucks is that it would require two traversals of the range, right? I'm guessing it starts streaming as soon as it has found the first doc now? [14:13] <+benoitc> and update after, think it would work. except if you want something strict [14:13] <+rnewson> papandreou: yes, we stream the results as we read them, we don't buffer. [14:14] <papandreou> benoitc: Hmm, so the theory is that stale=ok would increase the percentage of 304 responses? [14:14] <papandreou> rnewson: Right, yes, then it would take a serious hit. [14:14] <+rnewson> papandreou: but we could add an option that reads the thing, builds an etag, and then streams the result. it would be slower, but for the times that we can send 304 we'd save bandwidth. It sounds a bit too niche to me, but you could raise it on user@ [14:15] == Frippe [~Frippe@unaffiliated/frippe] has quit [Ping timeout: 240 seconds] [14:15] <papandreou> rnewson: Would be awesome to have that as a configuration option [14:15] <+rnewson> papandreou: the view would not change, so neither would the ETag (with stale=ok) [14:15] <+rnewson> papandreou: I think it would be a runtime option ?slow_etag=true [14:15] <papandreou> rnewson: That would also be fine [14:16] <+rnewson> a better solution would not require two passes, though. [14:16] <+benoitc> papandreou: i would use stale=ok, then query the view async, save new etag & ... [14:16] <papandreou> rnewson: I really don't think it's that niche :). But maybe ETag-nerds are rarer than I think, hehe [14:16] <+benoitc> rnewson: that could encourage pretty dangerous things [14:16] <+rnewson> benoitc: ? [14:17] <+benoitc> rnewson: cpu intensives tasks eacht time the call is done, [14:17] <+benoitc> rather than encouraging something async [14:18] <+benoitc> rahh I hate osx, it introduce be bad unicode chars in vim :@ [14:23] == Frippe_ has changed nick to Frippe [14:23] <papandreou> benoitc: I'm not sure exactly how that would work? I'm working on the backend for a web app, so the requests will be coming from multiple machines [14:24] <+benoitc> papandreou: call with stale==ok and have a process asking your deb for refresh from time to time [14:24] <+benoitc> s/deb/view [14:25] <+rnewson> benoitc: not sure I follow. doubling the number of view requests to achieve a finer etag is an ok solution, but shouldn't be the default, but I do think we'd need a better solution than that. [14:25] <+rnewson> benoitc: and you might be forgetting all the md5 verification we do all the time. [14:27] <+benoitc> rnewson: you don't need to call each views though [14:27] <+benoitc> I don't see the arg about last one [14:27] <papandreou> benoitc: Ah, ok, I understand now. Won't work very well for me, though, the web app is a single page thing that only asks for this particular chunk of data once per session, so the ETag will probably have changed anyway unless we accept day-old data. [14:27] <+benoitc> anyway enotime to discuss about that , i'm on anotherthing [14:32] <papandreou> rnewson: But next step would be for me to raise the issue on the user mailing list? [14:33] <+rnewson> papandreou: on reflection, it's more a dev@ thing, but yes. [14:33] <+rnewson> post the suggestion about calculating an etag over the results and then streaming them, with the caveat that a better solution should be found. [14:34] <papandreou> rnewson: Ok, I will, thanks :). Btw. do you think there's a chance that this will be easier for key=... queries than arbitrary startkey=...&endkey=... ones? [14:35] <+rnewson> papandreou: yes. for key= we could use a bloom filter. [14:38] <papandreou> rnewson: Man, I've got some reading up to do :). Thanks! So dev@ it is? [14:39] <+rnewson> papandreou: yes. [14:40] <+rnewson> papandreou: 'bloom filter' is just how we handwave solutions these days, it just sounds vaguely plausible to for the keys= variant [14:40] <+rnewson> but doesn't make sense at all for startkey/endkey [14:40] <+jan____> haha, I'm sitting in an ""HTTP Architecture" session, and all the two speakers do is tell the audience how CouchDB gets it all right. [14:41] <+rnewson> at base, we'd want some cheap way to invalidate a range of keys in memory. [14:49] <+jan____> the answer must include bloom filters.
