I said a bunch of this on IRC also, but Adam has it: further operations within the 'expired' txn just fail. We recognise that and start a new one. In the _changes case, we'd send a last_seq row and end the request, but this isn't going to be a great answer (at least, not a backward-compatible answer) for _view, _all_docs and _find.
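The expire-and-resume pattern described above (reads in the old transaction fail, we start a new one and carry on from the last sequence already emitted) can be sketched with a toy model. Everything here is an illustrative stand-in, not the real fdb client API: the read budget plays the role of the 5-second window, and `TransactionTooOld` stands in for FDB's `transaction_too_old` error.

```python
TXN_LIMIT = 3  # toy stand-in for FoundationDB's 5-second transaction window


class TransactionTooOld(Exception):
    """Stand-in for FDB's transaction_too_old error."""


class ToyTransaction:
    """Each transaction serves at most TXN_LIMIT reads, then 'expires'."""

    def __init__(self, snapshot):
        self.snapshot = snapshot      # sorted list of (seq, doc_id)
        self.reads_left = TXN_LIMIT

    def read_next(self, after_seq):
        if self.reads_left == 0:
            raise TransactionTooOld()
        self.reads_left -= 1
        for seq, doc_id in self.snapshot:
            if seq > after_seq:
                return seq, doc_id
        return None                   # no rows past after_seq


def stream_changes(db_snapshot):
    """Stream the whole index, transparently starting a fresh transaction
    whenever the current one expires and resuming from the last sequence
    already emitted."""
    emitted = []
    last_seq = 0
    txn = ToyTransaction(db_snapshot)
    while True:
        try:
            row = txn.read_next(last_seq)
        except TransactionTooOld:
            txn = ToyTransaction(db_snapshot)  # new txn, new read version
            continue
        if row is None:
            break
        last_seq, doc_id = row
        emitted.append((last_seq, doc_id))
    return emitted


index = [(1, "foo"), (2, "bar"), (3, "baz"), (4, "bif"), (5, "bar")]
print(stream_changes(index))
# all five rows are emitted even though no single transaction survives
```

The caller never sees the expiry; it only matters that we track the last sequence emitted so the next transaction knows where to resume, which is exactly what a last_seq-style bookmark gives us.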
-- 
Robert Samuel Newson
rnew...@apache.org

On Thu, 7 Mar 2019, at 12:37, Adam Kocoloski wrote:
> Bah, our “cue”, not our “queue” ;)
> 
> Adam
> 
> > On Mar 7, 2019, at 7:35 AM, Adam Kocoloski <kocol...@apache.org> wrote:
> > 
> > Hi Garren,
> > 
> > In general we wouldn’t know ahead of time whether we can complete in five
> > seconds. I believe the way it works is that we start a transaction, issue a
> > bunch of reads, and after 5 seconds any additional reads will start to fail
> > with something like “read version too old”. That’s our queue to start a new
> > transaction. All the reads that completed successfully are fine, and the
> > CouchDB API layer can certainly choose to start streaming as soon as the
> > first read completes (~2ms after the beginning of the transaction).
> > 
> > Agree with Bob that steering towards a larger number of short-lived
> > operations is the way to go in general. But I also want to balance that
> > with backwards-compatibility where it makes sense.
> > 
> > Adam
> > 
> >> On Mar 7, 2019, at 7:22 AM, Garren Smith <gar...@apache.org> wrote:
> >> 
> >> I agree that option A seems the most sensible. I just want to understand
> >> this comment:
> >> 
> >>>> A _changes request that cannot be satisfied within the 5 second limit
> >>>> will be implemented as multiple FoundationDB transactions under the covers
> >> 
> >> How will we know if a changes request cannot be completed in 5 seconds?
> >> Can we tell that beforehand? Or would we try to complete a changes
> >> request, have the transaction fail after 5 seconds, and then do multiple
> >> transactions to get the full changes? If that is the case, the response
> >> from CouchDB to the user will be really slow, as they have already waited
> >> 5 seconds and have still not received anything. Or, if we start streaming
> >> a result back to the user in the first transaction (is this even
> >> possible?), then we would somehow need to know how to continue the
> >> changes feed after the transaction has failed.
> >> 
> >> Then Bob, from your comment:
> >> 
> >>>> Forcing clients to do short (<5s) requests feels like a general good, as
> >>>> long as meaningful things can be done in that time-frame, which I strongly
> >>>> believe from what we've said elsewhere that they can.
> >> 
> >> That makes sense, but how would we do that? How do you help a user to make
> >> sure their request is under 5 seconds?
> >> 
> >> Cheers
> >> Garren
> >> 
> >> On Thu, Mar 7, 2019 at 11:15 AM Robert Newson <rnew...@apache.org> wrote:
> >> 
> >>> Hi,
> >>> 
> >>> Given that option A is the behaviour of feed=continuous today (barring the
> >>> initial whole-snapshot phase to catch up to "now") I think that's the right
> >>> move. I confess to not reading your option B too deeply but I was there on
> >>> IRC when the first spark was lit. We can build some sort of temporary
> >>> multi-index on FDB today, that's clear, but it's equally clear that we
> >>> should avoid doing so if at all possible.
> >>> 
> >>> Perhaps the future Redwood storage engine for FDB will, as you say,
> >>> significantly improve on this, but, even if it does, I'm not 100% convinced
> >>> we should expose it. Forcing clients to do short (<5s) requests feels like
> >>> a general good, as long as meaningful things can be done in that
> >>> time-frame, which I strongly believe from what we've said elsewhere that
> >>> they can.
> >>> 
> >>> CouchDB's API, as we both know from rich (heh, and sometimes poor)
> >>> experience in production, has a lot of endpoints of wildly varying
> >>> performance characteristics. It's right that we evolve away from that where
> >>> possible, and this seems a great candidate given the replicator in ~all
> >>> versions of CouchDB will handle the change without blinking.
> >>> 
> >>> We have the same issue for _all_docs and _view and _find, in that the user
> >>> might ask for more data back than can be sent within a single FDB
> >>> transaction.
> >>> I suggest that's a new thread, though.
> >>> 
> >>> -- 
> >>> Robert Samuel Newson
> >>> rnew...@apache.org
> >>> 
> >>> On Thu, 7 Mar 2019, at 01:24, Adam Kocoloski wrote:
> >>>> Hi all, as the project devs are working through the design for the
> >>>> _changes feed in FoundationDB we’ve come across a limitation that is
> >>>> worth discussing with the broader user community. FoundationDB
> >>>> currently imposes a 5 second limit on all transactions, and read
> >>>> versions from old transactions are inaccessible after that window. This
> >>>> means that, unlike a single CouchDB storage shard, it is not possible
> >>>> to grab a long-lived snapshot of the entire database.
> >>>> 
> >>>> In extant versions of CouchDB we rely on this long-lived snapshot
> >>>> behavior for a number of operations, some of which are user-facing. For
> >>>> example, it is possible to make a request to the _changes feed for a
> >>>> database of an arbitrary size and, if you’ve got the storage space and
> >>>> time to spare, you can pull down a snapshot of the entire database in a
> >>>> single request. That snapshot will contain exactly one entry for each
> >>>> document in the database. In CouchDB 1.x the documents appear in the
> >>>> order in which they were most recently updated. In CouchDB 2.x there is
> >>>> no guaranteed ordering, although in practice the documents are roughly
> >>>> ordered by most recent edit. Note that you really do have to complete
> >>>> the operation in a single HTTP request; if you chunk up the requests or
> >>>> have to retry because the connection was severed, then the exactly-once
> >>>> guarantees disappear.
> >>>> 
> >>>> We have a couple of different options for how we can implement _changes
> >>>> with FoundationDB as a backing store. I’ll describe them below and
> >>>> discuss the tradeoffs.
> >>>> 
> >>>> ## Option A: Single Version Index, long-running operations as multiple
> >>>> transactions
> >>>> 
> >>>> In this option the internal index has exactly one entry for each
> >>>> document at all times. A _changes request that cannot be satisfied
> >>>> within the 5 second limit will be implemented as multiple FoundationDB
> >>>> transactions under the covers. These transactions will have different
> >>>> read versions, and a document that gets updated in between those read
> >>>> versions will show up *multiple times* in the response body. The entire
> >>>> feed will be totally ordered, and later occurrences of a particular
> >>>> document are guaranteed to represent more recent edits than the
> >>>> earlier occurrences. In effect, it’s rather like the semantics of a
> >>>> feed=continuous request today, but with much better ordering and zero
> >>>> possibility of “rewinds”, where large portions of the ID space get
> >>>> replayed because of issues in the cluster.
> >>>> 
> >>>> This option is very efficient internally and does not require any
> >>>> background maintenance. A future enhancement in FoundationDB’s storage
> >>>> engine is designed to enable longer-running read-only transactions, so
> >>>> we will likely be able to improve the semantics of this option
> >>>> over time.
> >>>> 
> >>>> ## Option B: Multi-Version Index
> >>>> 
> >>>> In this design the internal index can contain multiple entries for a
> >>>> given document. Each entry includes the sequence at which the document
> >>>> edit was made, and may also include a sequence at which it was
> >>>> overwritten by a more recent edit.
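The write path for such a multi-version index can be modeled with a toy in-memory sketch (the `update` helper and dict-based entries are illustrative, not CouchDB or FDB code): each edit tombstones the document's previous entry with the superseding sequence, then appends a fresh entry. Replaying six edits reproduces the six-entry index Adam's example below shows.

```python
def update(index, by_id, doc_id, new_seq):
    """Record an edit in a multi-version index: mark the document's
    previous entry (if any) as overwritten at new_seq, then append
    a fresh entry for the new edit."""
    prev = by_id.get(doc_id)
    if prev is not None:
        prev["tombstone"] = new_seq   # old entry superseded at new_seq
    entry = {"seq": new_seq, "id": doc_id}
    index.append(entry)
    by_id[doc_id] = entry             # track latest entry per doc id


index, by_id = [], {}
edits = [(1, "foo"), (2, "bar"), (3, "baz"),
         (4, "bif"), (5, "bar"), (6, "bif")]
for seq, doc_id in edits:
    update(index, by_id, doc_id, seq)

for entry in index:
    print(entry)
# "bar" and "bif" each appear twice; their older entries carry the
# tombstone sequence at which they were overwritten
```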
> >>>> 
> >>>> The implementation of a _changes request would start by getting the
> >>>> current version of the datastore (call this the read version), and
> >>>> then, as it examines entries in the index, it would skip over any
> >>>> entries with a “tombstone” sequence less than or equal to the read
> >>>> version. Crucially, if the request needs to be implemented across
> >>>> multiple transactions, each transaction would use the same read version
> >>>> when deciding whether to include entries in the index in the _changes
> >>>> response. The readers would know to stop when and if they encounter an
> >>>> entry whose created version is greater than the read version.
> >>>> Perhaps a diagram helps to clarify; a simplified version of the
> >>>> internal index might look like
> >>>> 
> >>>> {"seq": 1, "id": "foo"}
> >>>> {"seq": 2, "id": "bar", "tombstone": 5}
> >>>> {"seq": 3, "id": "baz"}
> >>>> {"seq": 4, "id": "bif", "tombstone": 6}
> >>>> {"seq": 5, "id": "bar"}
> >>>> {"seq": 6, "id": "bif"}
> >>>> 
> >>>> A _changes request which happens to commence when the database is at
> >>>> sequence 5 would return (ignoring the format of "seq" for simplicity)
> >>>> 
> >>>> {"seq": 1, "id": "foo"}
> >>>> {"seq": 3, "id": "baz"}
> >>>> {"seq": 4, "id": "bif"}
> >>>> {"seq": 5, "id": "bar"}
> >>>> 
> >>>> i.e., the first instance of “bar” would be skipped over because a more
> >>>> recent version exists within the time horizon, but the first instance
> >>>> of “bif” would be included because "seq": 6 is outside our horizon.
> >>>> 
> >>>> The downside of this approach is that someone has to go in and clean up
> >>>> tombstoned index entries eventually (or else provision lots and lots of
> >>>> storage space).
> >>>> One way we could do this (inside CouchDB) would be to
> >>>> have each _changes session record its read version somewhere, and then
> >>>> have a background process go in and remove tombstoned entries where the
> >>>> tombstone is less than the earliest read version of any active request.
> >>>> It’s doable, but definitely more load on the server.
> >>>> 
> >>>> Also, note this approach does not guarantee that the older versions of
> >>>> the documents referenced in those tombstoned entries are actually
> >>>> accessible. Much like today, the changes feed would include a revision
> >>>> identifier which, upon closer inspection, has been superseded by a more
> >>>> recent version of the document. Unlike today, that older version would
> >>>> be expunged from the database immediately if a descendant revision
> >>>> exists.
> >>>> 
> >>>> —
> >>>> 
> >>>> OK, so those are the two basic options. I’d particularly like to hear
> >>>> whether the behavior described in Option A would prove problematic for
> >>>> certain use cases, as it’s the simpler and more efficient of the two
> >>>> options. Thanks!
> >>>> 
> >>>> Adam
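For reference, the Option B read path from Adam's worked example can be sketched as a toy in-memory filter (illustrative code, not the actual implementation; note that an entry whose tombstone equals the read version must also be skipped for the example output to come out right). Filtering the six-entry index at read version 5 reproduces exactly the four rows shown in the mail.

```python
def changes(index, read_version):
    """Return the _changes rows visible at a fixed read version:
    skip entries superseded at or before the read version, and stop
    at the first entry created after it."""
    rows = []
    for entry in index:               # index is ordered by "seq"
        if entry["seq"] > read_version:
            break                     # created after our horizon: stop
        tombstone = entry.get("tombstone")
        if tombstone is not None and tombstone <= read_version:
            continue                  # superseded within our horizon: skip
        rows.append(entry)
    return rows


index = [
    {"seq": 1, "id": "foo"},
    {"seq": 2, "id": "bar", "tombstone": 5},
    {"seq": 3, "id": "baz"},
    {"seq": 4, "id": "bif", "tombstone": 6},
    {"seq": 5, "id": "bar"},
    {"seq": 6, "id": "bif"},
]
for row in changes(index, read_version=5):
    print(row)
# prints the entries for seq 1 (foo), 3 (baz), 4 (bif) and 5 (bar)
```

Because the filter depends only on the fixed read version, each follow-up transaction in a multi-transaction request makes identical include/skip decisions, which is what gives Option B its snapshot-like semantics.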