Even more fine-grained ETag support when querying views?

Andreas Lind Petersen Mon, 12 Sep 2011 10:28:35 -0700

Hi!

Background: I'm working on a web app that uses a single CouchDB database for
storing data belonging to 400000+ users. Each user has an average of about 40
documents that need to be fetched in one go when the frontend is launched. I
have accomplished this by querying a simple view with ?key=ownerID (with a
fallback to /_all_docs?startkey=<ownerID>_...&endkey=<ownerID>~ if the view
isn't built). Since the data for each user rarely changes, there's a
potential to save resources by supporting conditional GET with
If-None-Match, which would amount to having the web app backend copy the
CouchDB-generated ETag into the response sent to the browser.
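
To make the intended flow concrete, here is a minimal sketch of the relay
step in the web app backend. It is an illustration only: the function name
relay_conditional_get and the fetch_view callable (which stands in for the
actual HTTP request to the CouchDB view) are hypothetical, and header names
are assumed to arrive with exactly the capitalisation shown.

```python
from typing import Callable, Dict, Tuple

# (status code, response headers, body)
Response = Tuple[int, Dict[str, str], bytes]


def relay_conditional_get(
    browser_headers: Dict[str, str],
    fetch_view: Callable[[Dict[str, str]], Response],
) -> Response:
    """Forward the browser's If-None-Match to CouchDB and relay the ETag back.

    fetch_view is a hypothetical stand-in for the HTTP call to the view; it
    receives the headers to send upstream and returns CouchDB's response.
    """
    upstream_headers: Dict[str, str] = {}
    etag = browser_headers.get("If-None-Match")
    if etag is not None:
        # Let CouchDB decide whether the cached copy is still valid.
        upstream_headers["If-None-Match"] = etag

    status, headers, body = fetch_view(upstream_headers)

    # Copy only the CouchDB-generated ETag into the response to the browser.
    out_headers: Dict[str, str] = {}
    if "ETag" in headers:
        out_headers["ETag"] = headers["ETag"]

    if status == 304:
        # Nothing changed: no body needs to travel to the browser.
        return 304, out_headers, b""
    return status, out_headers, body
```

With this shape the backend never inspects the view result itself; whether a
304 is possible depends entirely on how often the ETag from CouchDB changes.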


However, I just learned that CouchDB only maintains a single ETag for the
entire view, so every time one of my users changes something, the ETag for
everyone else's query result also changes. This makes conditional GETs
useless with this usage pattern.

I asked about this on #couchdb and had a brief talk with rnewson, who was
sympathetic to the idea. Unfortunately we weren't able to come up with an
idea that didn't involve traversing all docs in the result just for
computing the ETag (my suggestion was a hash of the _revs of all docs
contributing to the result). That would be a bad default, but might still
work as an opt-in thing per request, e.g. slowetag=true.
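
For reference, the "hash of the _revs" idea could look roughly like the
sketch below. This is not anything CouchDB does today; the function name
etag_from_revs is made up, SHA-1 is an arbitrary choice of hash, and the
input is assumed to be the (doc id, _rev) pairs of the documents in the
result, in view order. Note that it has to walk the whole result before the
first byte can be sent, which is exactly the cost discussed above.

```python
import hashlib
from typing import Iterable, Tuple


def etag_from_revs(id_rev_pairs: Iterable[Tuple[str, str]]) -> str:
    """Build a quoted ETag from the ids and _revs of the contributing docs.

    Assumption: id_rev_pairs yields (doc_id, rev) for every document in the
    query result, in the order the view returns them.
    """
    digest = hashlib.sha1()
    for doc_id, rev in id_rev_pairs:
        # Separators keep ("ab", "c") and ("a", "bc") from hashing alike.
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\0")
        digest.update(rev.encode("utf-8"))
        digest.update(b"\n")
    return '"' + digest.hexdigest() + '"'
```

Because the doc ids are part of the hash, the value also changes when a
document is added to or removed from the range, not only when a _rev changes.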

Newson said I should try raising the discussion here in case someone else
had an idea for a cheaper way to calculate a good ETag. So what does
everyone else think about this? Is my use case too rare, or would it be
worthwhile to implement it?

Best regards,
Andreas Lind Petersen (papandreou)

Here's our chat transcript:

[11:46] <papandreou> Does anyone know if there are plans for issuing even
more granular etags for view lookups? When you only look up a small range or
a specific key it would be really great if the ETag only changed when that
subset changes rather than the entire view.
[11:47] <papandreou> In the application I'm working on I'll hardly ever be
able to get a 304 response because of this.
[...]
[13:51] <+rnewson> papandreou: unlikely.
[13:52] <papandreou> rnewson: So the best thing I can do is to fetch the
data and compute a better etag myself? (My use case is a backend for a web
app)
[13:53] <+rnewson> papandreou: You might be able to set ETag in a list
function? If you can't, I'll gladly change CouchDB so you can.
[13:54] <papandreou> rnewson: I thought about that, too, but that would
cause a big overhead for every request, right?
[13:55] <papandreou> rnewson: (Last time I tried views were slooow)
[13:55] <papandreou> I mean lists
[13:55] <+rnewson> papandreou: slower, yes, because couch needs to evaluate
the javascript in an external process.
[13:55] <+rnewson> how will you calculate the fine-grained ETag?
[13:56] <+rnewson> Also we did recently make it slightly finer, before it
was view group scope and now it's the view itself (I think)
[13:56] <papandreou> rnewson: Maybe something like a hash of the _revs of
all the documents contributing to the result?
[13:56] <+rnewson> hm, that makes no sense actually. but we did refine it
recently.
[13:57] <+rnewson> papandreou: that doesn't sound cheap at all, and it would
need to be cheaper than doing the view query itself to make sense.
[13:58] <papandreou> rnewson: There's still the bandwidth thing
[13:58] <+rnewson> oh, you're working with restricted bandwidth and/or have
huge view responses?
[13:59] <papandreou> rnewson: And it would be really nice to have something
like this completely handled by the database instead of inventing a bunch of
workarounds.
[14:01] <+rnewson> If there's a correct and efficient algorithm for doing
it, I'm sure it would be applied.
[14:02] <papandreou> rnewson: I guess it depends on the use case. If the
database is rarely updated I suppose the current tradeoff is better.
[14:03] <+rnewson> I'm sure the only reason we have ETags at the current
granularity is because it's very quick to calculate. A finer-grain would be
committed if a viable approach was proposed.
[14:04] <papandreou> rnewson: I have a huge database with data belonging to
400000+ different users, and I'm using a view to enable a lookup-by-owner
thing. But every time a single piece of data is inserted, the ETag for the
view changes
[14:04] == case_ [[email protected]]
has quit [Read error: Connection reset by peer]
[14:04] <+rnewson> yes, I've completely understood the problem you stated
earlier.
[14:05] <+rnewson> I can't think of a way to improve this right now but I
would spend the time to implement it if you had one.
[14:06] <papandreou> rnewson: So right now the code path that sends a 304
only needs to look at a single piece of metadata for the view to make its
decision? That'll be hard to beat :)
[14:07] <+rnewson> doesn't need to beat it, it just needs to be fast.
[14:07] <+rnewson> but I don't see any current possible solutions, let alone
fast ones.
[14:07] <papandreou> rnewson: Well, thanks anyway for considering my
suggestion. I'll let you know if I get an idea :)
[14:08] <+rnewson> and it is now per-view and not per-viewgroup. so it's
what I said first before I thought it was silly
[14:08] <+benoitc> query + last seq returned maybe ....
[14:08] <+rnewson> but obviously a change could affect one view in a group
but not others
[14:09] <papandreou> benoitc: The query is already sort of included since
it's in the url.
[14:09] <+rnewson> benoitc: ?
[14:10] <+benoitc> i was meaning last committed seq,but it won't change
anything ...
[14:10] <papandreou> benoitc: I guess you'd also need to make sure that the
ETag changes if a document is deleted?
[14:10] <papandreou> ah
[14:10] <+rnewson> benoitc: we already use the update_seq of the #view,
which is finer-grained than db's last committed seq
[14:11] <+benoitc> rnewson: commited seq in the view group but anyway it
won't work
[14:12] <+rnewson> benoitc: right, that would be the pre-1.1.0 behavior, I
think.
[14:12] <+rnewson> which is coarser
[14:12] <+rnewson> we simply don't record the info that papandreou's
suggestion would need to work.
[14:12] <+benoitc> papandreou: easier solution would be to request each time
on on stale view
[14:13] <papandreou> rnewson: Another reason why my suggestion sucks is that
it would require two traversals of the range, right? I'm guessing it starts
streaming as soon as it has found the first doc now?
[14:13] <+benoitc> and update after, think it would work. except if you want
something strict
[14:13] <+rnewson> papandreou: yes, we stream the results as we read them,
we don't buffer.
[14:14] <papandreou> benoitc: Hmm, so the theory is that stale=ok would
increase the percentage of 304 responses?
[14:14] <papandreou> rnewson: Right, yes, then it would take a serious hit.
[14:14] <+rnewson> papandreou: but we could add an option that reads the
thing, builds an etag, and then streams the result. it would be slower, but
for the times that we can send 304 we'd save bandwidth. It sounds a bit too
niche to me, but you could raise it on user@
[14:15] == Frippe [~Frippe@unaffiliated/frippe] has quit [Ping timeout: 240
seconds]
[14:15] <papandreou> rnewson: Would be awesome to have that as a
configuration option
[14:15] <+rnewson> papandreou: the view would not change, so neither would
the ETag (with stale=ok)
[14:15] <+rnewson> papandreou: I think it would be a runtime option
?slow_etag=true
[14:15] <papandreou> rnewson: That would also be fine
[14:16] <+rnewson> a better solution would not require two passes, though.
[14:16] <+benoitc> papandreou: i would use stale=ok, then query the view
async, save new etag & ...
[14:16] <papandreou> rnewson: I really don't think it's that niche :). But
maybe ETag-nerds are rarer than I think, hehe
[14:16] <+benoitc> rnewson: that could encourage pretty dangerous things
[14:16] <+rnewson> benoitc: ?
[14:17] <+benoitc> rnewson: cpu intensive tasks each time the call is
done,
[14:17] <+benoitc> rather than encouraging something async
[14:18] <+benoitc> rahh I hate osx, it introduce be bad unicode chars in vim
:@
[14:23] == Frippe_ has changed nick to Frippe
[14:23] <papandreou> benoitc: I'm not sure exactly how that would work? I'm
working on the backend for a web app, so the requests will be coming from
multiple machines
[14:24] <+benoitc> papandreou: call with stale==ok and have a process asking
your deb for refresh from time to time
[14:24] <+benoitc> s/deb/view
[14:25] <+rnewson> benoitc: not sure I follow. doubling the number of view
requests to achieve a finer etag is an ok solution, but shouldn't be the
default, but I do think we'd need a better solution than that.
[14:25] <+rnewson> benoitc: and you might be forgetting all the md5
verification we do all the time.
[14:27] <+benoitc> rnewson: you don't need to call each views though
[14:27] <+benoitc> I don't see the arg about last one
[14:27] <papandreou> benoitc: Ah, ok, I understand now. Won't work very well
for me, though, the web app is a single page thing that only asks for this
particular chunk of data once per session, so the ETag will probably have
changed anyway unless we accept day-old data.
[14:27] <+benoitc> anyway enotime to discuss about that , i'm on
anotherthing
[14:32] <papandreou> rnewson: But next step would be for me to raise the
issue on the user mailing list?
[14:33] <+rnewson> papandreou: on reflection, it's more a dev@ thing, but
yes.
[14:33] <+rnewson> post the suggestion about calculating an etag over the
results and then streaming them, with the caveat that a better solution
should be found.
[14:34] <papandreou> rnewson: Ok, I will, thanks :). Btw. do you think
there's a chance that this will be easier for key=... queries than arbitrary
startkey=...&endkey=... ones?
[14:35] <+rnewson> papandreou: yes. for key= we could use a bloom filter.
[14:38] <papandreou> rnewson: Man, I've got some reading up to do :).
Thanks! So dev@ it is?
[14:39] <+rnewson> papandreou: yes.
[14:40] <+rnewson> papandreou: 'bloom filter' is just how we handwave
solutions these days, it just sounds vaguely plausible for the keys=
variant
[14:40] <+rnewson> but doesn't make sense at all for startkey/endkey
[14:40] <+jan____> haha, I'm sitting in an "HTTP Architecture" session, and
all the two speakers do is tell the audience how CouchDB gets it all right.
[14:41] <+rnewson> at base, we'd want some cheap way to invalidate a range
of keys in memory.
[14:49] <+jan____> the answer must include bloom filters.
