:)

> On 5 Jun 2017, at 02:55, Geoffrey Cox <[email protected]> wrote:
> 
> Hi Robert,
> 
> This last email is killer at explaining how the sequence numbers work.
> Thanks!
> 
> I've modified my code to use the algorithm you mentioned, i.e. the seq
> number of the last row in the _changes feed is used in the next call to the
> _changes feed and it appears to be working well. I'm not sure where the
> breakdown was before, but I suspect I probably have a small bug somewhere
> else in my code. For now, I don't think there are any major issues with
> this replication protocol, but I'll be sure to provide more details if I
> find anything in the future.
> 
> Thanks again!
> 
> Geoff
> 
> On Sun, Jun 4, 2017, 10:37 Robert Samuel Newson <[email protected]> wrote:
> 
>> Nope, it won't. There is no single order to the changes feed, so even if
>> you could decode these values you still couldn't decide whether one seq
>> came before another. I'll illustrate with unencoded values.
>> 
>> Imagine an empty (N=2, Q=2) database, then we add two documents at the
>> same time that map to different shards. We make two requests to _changes
>> at the same time but, critically, one changes feed is fed by node2 for
>> the 81-ff range (which sees that update) and by node1 for the 00-80
>> range (which has not yet seen the other update). The other changes feed
>> sees the opposite.
>> 
>> 1. [{node1, 00-80, 0}, {node2, 81-ff, 1}]
>> 2. [{node1, 81-ff, 1}, {node2, 00-80, 0}]
>> 
>> Is seq 1 greater than seq 2? No. Is seq 2 greater than seq 1? No.
>> 
>> Fractionally later, of course, both nodes see both updates and so the last
>> seq of any changes feed would converge to any of the following;
>> 
>> [{node1, 00-80, 1}, {node1, 81-ff, 1}]
>> [{node1, 00-80, 1}, {node2, 81-ff, 1}]
>> [{node2, 00-80, 1}, {node1, 81-ff, 1}]
>> [{node2, 00-80, 1}, {node2, 81-ff, 1}]
>> 
>> Note also that all four of the seq values above will encode to different
>> string values (though the number on the front will be the same, 2).
>> 
>> The algorithm for correctly consuming the changes feed is described in the
>> quoted posts but I'll pull it out for clarity;
>> 
>> 1. read /dbname/_changes
>> 2. process each row idempotently
>> 3. periodically (every X seconds or every X rows) store the "seq" value of
>> the last row you processed
>> 
>> If you ever crash, or if you weren't using continuous=true, you can do
>> this same procedure again but modified in step 1;
>> 
>> revised 1. read /dbname/_changes?since=X
>> 
>> where X is the value you saved in step 3. If you're not using continuous
>> mode then you could just record the "last_seq" value at the end of
>> consuming the non-continuous response. You run the risk of reprocessing a
>> lot more items, though.
>> 
>> With this scheme (which the replicator and all indexers follow), you
>> don't care if the results come out of order, and you never need to
>> compare any two seq values.
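Robert's three steps can be sketched as a small simulation. This is not CouchDB or Spiegel code: `fetchChanges`, `processRow`, and `saveCheckpoint` are hypothetical stand-ins for the real HTTP call, the consumer's work, and stable storage.

```javascript
function makeConsumer(fetchChanges, processRow, saveCheckpoint, every) {
  return function run(since) {
    var batch = fetchChanges(since);      // stands in for GET /dbname/_changes?since=...
    var rowsSinceCheckpoint = 0;
    batch.results.forEach(function (row) {
      processRow(row);                    // must be idempotent: rows may repeat
      rowsSinceCheckpoint += 1;
      if (rowsSinceCheckpoint >= every) { // the "every X rows" variant of step 3
        saveCheckpoint(row.seq);          // store the seq, never compare seqs
        rowsSinceCheckpoint = 0;
      }
    });
    saveCheckpoint(batch.last_seq);       // non-continuous: record last_seq at the end
    return batch.last_seq;                // feed this back in as ?since= next time
  };
}
```

On restart, whatever value `saveCheckpoint` last persisted goes straight back into `run` as the `since` argument; nothing is ever decoded or ordered.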
>> 
>> You _do_ need to ensure you can correctly process the same change multiple
>> times. For an example of that, consider the replicator, when it sees a row
>> from a changes feed it asks the target database if it contains the _id and
>> _rev values from that row. If it does, the replicator moves on to the next
>> row. If it doesn't, it tries to write the document in that row to the
>> target database. In the event of a crash, and therefore a call to _changes
>> with a seq value from before processing that row, it will ask the target
>> database if it has the _id/_rev again, only this time the target will say
>> yes.
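The replicator's check described above can be sketched like this. A plain in-memory `target` object stands in for the target database; this is a toy illustration of the idempotence property, not the actual replicator code.

```javascript
function applyChange(target, row) {
  var existing = target[row.id];
  if (existing && existing._rev === row.rev) {
    return 'skipped';                       // target already has this _id/_rev
  }
  target[row.id] = { _rev: row.rev, doc: row.doc };
  return 'written';
}
```

Replaying the same row after a crash is a no-op, which is exactly the idempotence the algorithm relies on.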
>> 
>> We are very interested in any evidence that this algorithm does not work
>> as it is fundamental to CouchDB replication and internal data redundancy
>> too.
>> 
>> B.
>> 
>> 
>>> On 4 Jun 2017, at 16:06, Geoffrey Cox <[email protected]> wrote:
>>> 
>>> Wow, thanks for all the great information, Robert and Alexander! I
>>> think it would be very valuable for this information to be included in
>>> the CouchDB docs. I want to create a StackOverflow post that summarizes
>>> some of this stuff once I understand it just a little better.
>>> 
>>> I actually came up with an algorithm that appears to work correctly
>>> when comparing sequence numbers to generally determine which sequence
>>> number is the later.
>>> 
>>> var sequenceGreaterThan = function (s1, s2) {
>>>   if (typeof s1 === 'number') { // CouchDB 1?
>>>     return s1 > s2;
>>>   } else {
>>>     var parts1 = s1.split(/-/),
>>>         s1Int = parseInt(parts1[0], 10),
>>>         parts2 = s2.split(/-/),
>>>         s2Int = parseInt(parts2[0], 10);
>>>     return s1Int > s2Int || (s1Int === s2Int && parts1[1] > parts2[1]);
>>>   }
>>> };
>>> 
>>> 
>>> 
>>> Let me explain why it is necessary and hopefully someone can verify
>>> that it will work for my case. As part of implementing
>>> https://github.com/redgeoff/spiegel for CouchDB 2, I’ve had to port
>>> sequence number comparisons. There is a piece of Spiegel that listens
>>> to the changes feed on a database and then executes a function to
>>> process the change. As is mentioned, changes can be received out of
>>> order or even replayed and this is fine, but the listener needs to come
>>> back later and pick up from where it left off to process the next set
>>> of changes. To do this, there is code that checks for the latest
>>> sequence number so that it can be used in the next call to _changes.
>>> (Continuous listening on the _changes feed is not desired here as one
>>> of the goals of Spiegel is to listen to many databases without using
>>> many database connections.)
>>> 
>>> 
>>> 
>>> For my particular case, does sequenceGreaterThan appear to work as
>>> intended and guarantee that my listening on the _changes feed will
>>> generally move forward?
>>> 
>>> 
>>> 
>>> (In my testing, I’ve found that using the sequence number of the last
>>> item in a _changes feed in the next call to the _changes feed doesn’t
>>> necessarily return the next set of changes. It appears to be a fairly
>>> rare occurrence, but this is why I am using a sequence number
>>> comparison.)
>>> 
>>> 
>>> Thanks!
>>> 
>>> 
>>> Geoff
>>> 
>>> 
>>> On Sun, Jun 4, 2017 at 3:07 AM Alexander Harm <[email protected]> wrote:
>>> 
>>>> Hello Geoffrey,
>>>> 
>>>> some time ago Robert and Paul explained the ordering in the changes
>>>> feed. Maybe that helps:
>>>> 
>>>> Hey all,
>>>> 
>>>> Bob did a pretty good job explaining how the changes feed works most
>>>> of the time but I did want to call attention to an error condition
>>>> that might be of concern to the original question. There is a
>>>> situation where you can see old changes in the response depending on
>>>> some timing and error conditions.
>>>> 
>>>> For a bit of background on the since parameter, when we are looking at
>>>> a clustered since sequence like such:
>>>> 
>>>> 35-g1AAAAG1eJyNz0EKwjAQBdBoC2I3nkEPEGppSLOyV8k0U2pJE9C61pvpzWKaitBNyWYGhs9_jCaEpF2iyFFBY29YK-B0sNbcu6tB2mj7UNKM1OCofXQrCVycc32XyP3gDztExlkpYwqWTLHGQO0nPH_SJkh5i6IVTUzHUmJrkkn9JC-_PPaetCxoLYe8AhHTM2mnf-q8-tjMfWYuPHcIHIiyqDiPKuq_TDeP1A
>>>> 
>>>> What that actually contains is an encoded set of {shard, node,
>>>> update_seq} triples kinda like such:
>>>> 
>>>> [
>>>>   {"shards/00000000-3fffffff/davisp/test-db.1384769918", db7,  { 9,
>>>> <<"ee5754a">>, db7}},
>>>>   {"shards/40000000-7fffffff/davisp/test-db.1384769918", db2,  { 1,
>>>> <<"0fe9f9c">>, db2}},
>>>>   {"shards/80000000-bfffffff/davisp/test-db.1384769918", db5,  {10,
>>>> <<"f7b08b9">>, db5}},
>>>>   {"shards/c0000000-ffffffff/davisp/test-db.1384769918", db12, {15,
>>>> <<"b942877">>, db12}}
>>>> ]
>>>> 
>>>> What's happening here is that when you specify that parameter, CouchDB
>>>> will decode it and then try to read the changes from each of those
>>>> shards, resuming at the given update seq. As an aside, the extra uuid
>>>> prefix and node name are there so that we can try to skip some old
>>>> changes; an optimization, but not important for this discussion.
>>>> 
>>>> Now, the important bit is that if we specify this since seq and one of
>>>> the nodes db2, db5, db7, or db12 happens to be down (or just gone if
>>>> you stored that since seq for a long time and say the cluster changed
>>>> size) then CouchDB has to choose another shard to replace the missing
>>>> node. When this happens you will see "old" changes that have already
>>>> been processed. Your application should be able to handle this using
>>>> the approaches that Bob listed in his email.
>>>> 
>>>> However, (there's always a however in distributed systems) there
>>>> exists a timing situation where you may be reading the changes feed,
>>>> an update comes in to the shard you're reading from, and you see the
>>>> change. Then say that node goes down (which would terminate the
>>>> changes feed). The client would then reconnect with their latest
>>>> update seq and get a different node. This node may (depending on a
>>>> whole bunch of timing and optimization things) send a change for the
>>>> doc that is technically "before" the change that was already
>>>> processed. So you do have to be able to recognize that you already
>>>> processed a change for the doc. CouchDB does this internally by
>>>> keeping the revision tree and checking revisions against that.
>>>> 
>>>> I get the feeling that my description may not be super clear so I'll
>>>> try and restate it as a sequence of events:
>>>> 
>>>> 1. Client requests _changes with since=35-g1AAA...
>>>> 2. Someone makes an update to doc foo
>>>> 3. The first shard that handles the update is part of the changes feed
>>>> from #1
>>>> 4. Client reading _changes sees update appear in its feed
>>>> 5. The node containing the shard with an update dies (an operator
>>>> rebooted the wrong node perhaps)
>>>> 6. Client has its changes feed disconnect and reconnects with
>>>> since=35-g1AAA.. (or even a newer sequence)
>>>> 7. The shard that responds as a replacement for the shard on the dead
>>>> node does not (yet) have the update
>>>> 8. The doc exists in the update sequence after where it starts reading
>>>> its changes feed
>>>> 9. Client sees the "old" change and must know that it's old
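One way a client can satisfy step 9 is to remember what it has already processed per doc id. The sketch below compares only the integer prefix of the `_rev`, which is a simplification for illustration; CouchDB itself consults the full revision tree.

```javascript
function makeRevFilter() {
  var seenRevs = {};                               // doc id -> highest rev number processed
  return function isNew(row) {
    var revNum = parseInt(row.rev.split('-')[0], 10);
    if (seenRevs[row.id] >= revNum) {              // undefined >= n is false, so new ids pass
      return false;                                // replayed or "old" change: ignore it
    }
    seenRevs[row.id] = revNum;
    return true;
  };
}
```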
>>>> 
>>>> Note that this is all very unlikely and super sensitive to timing. For
>>>> instance, one node seeing the update, and then dying, and the other
>>>> two copies not receiving the update would require some very peculiar
>>>> network disconnections in the cluster while still being reachable from
>>>> the load balancer. But the possibility remains which means apps will
>>>> want to be able to handle it.
>>>> 
>>>> Paul
>>>> 
>>>> On Sat, Sep 3, 2016 at 12:09 PM, Robert Payne <[email protected]> wrote:
>>>>> Thanks for this,
>>>>> 
>>>>> Sorry it’s a bit light on details; it’s a very specific use case on
>>>>> iOS where cpu/disk speed is constrained and we have a large data set.
>>>>> The full replication protocol just requires too many reads/writes to
>>>>> be performant and we’ve optimised it in various ways. We’re idempotent
>>>>> so long as per-document changes are in order, which is what I was just
>>>>> checking.
>>>>> 
>>>>> I appreciate the more technical analysis and it certainly clears up
>>>>> what I was asking.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Robert
>>>>> 
>>>>>> On 4/09/2016, at 4:41 AM, Robert Samuel Newson <[email protected]> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> It is important to understand that the order of rows in the _changes
>>>>>> response is not important. In couchdb before 2.0 the response was
>>>>>> totally ordered, but this was never necessary for correctness. The
>>>>>> essential contract for _changes is that you are guaranteed to see all
>>>>>> changes made since the 'since' parameter you pass. The order of those
>>>>>> changes is not guaranteed and it is also not guaranteed that changes
>>>>>> from _before_ that 'since' value are _not_ also returned. The
>>>>>> consequence of this contract is that all consumers of the _changes
>>>>>> response must apply each row idempotently. This is true for the
>>>>>> replicator, of course.
>>>>>> 
>>>>>> The changes response in 2.0 is partially ordered. The changes from
>>>>>> any given shard will be in a consistent order, but we merge the
>>>>>> changes from each shard range of your database as they are collected
>>>>>> from the various contributing nodes; we don't apply a total ordering
>>>>>> over that. The reason is simple; it's expensive and unnecessary. It's
>>>>>> important to also remember that replication, even before 2.0, would
>>>>>> not replicate in strict source update order either, due to (valuable)
>>>>>> parallelism when reading changes and applying them.
>>>>>> 
>>>>>> Your question: "Is it possible for the changes feed to send older
>>>>>> changes before newer changes for the same document ID across multiple
>>>>>> calls?" requires a little further background knowledge before
>>>>>> answering.
>>>>>> 
>>>>>> While we call it a changes "feed" it's important to remember what it
>>>>>> really is, internally, first. Every database in couchdb, prior to
>>>>>> 2.0, is a single file with multiple b+trees recorded inside it that
>>>>>> are kept in absolute sync with each other. One b+tree allows you to
>>>>>> look up a document by the _id parameter. The other b+tree allows you
>>>>>> to look up a document by its update order. It is essential to note
>>>>>> that these two b+trees have the same number of key/value pairs in
>>>>>> them at all times.
>>>>>> 
>>>>>> To illustrate this more clearly, consider an empty database. We add
>>>>>> one document to it. It is retrievable by its _id and is also visible
>>>>>> in the _changes response as change number 1. Now, we update that
>>>>>> document. It is now change number 2. Change number 1 will never again
>>>>>> appear in the _changes response. That is, every document appears in
>>>>>> the _changes response at its most recent update number.
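The two b+trees described above can be modelled with a toy in-memory database to show why a document only ever appears at its latest update number. This is a sketch of the behaviour, not CouchDB's actual storage code.

```javascript
function makeDb() {
  var byId = {}, bySeq = {}, seq = 0;
  return {
    put: function (id, doc) {
      var old = byId[id];
      if (old) delete bySeq[old.seq];     // the old change number disappears forever
      seq++;                              // the doc is _moved_ to a new position
      byId[id] = { seq: seq, doc: doc };
      bySeq[seq] = id;
    },
    changes: function () {                // traverse the by-seq index in order
      return Object.keys(bySeq).map(Number)
        .sort(function (a, b) { return a - b; })
        .map(function (s) { return { seq: s, id: bySeq[s] }; });
    }
  };
}
```

Note that both indexes always hold the same number of entries, mirroring the "same number of key/value pairs at all times" invariant.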
>>>>>> 
>>>>>> When you call _changes without the continuous parameter, couchdb is
>>>>>> simply traversing that second b+tree and returning each row it finds.
>>>>>> It may do this from the beginning (which was 1 before our update and
>>>>>> 2 after) or it may do so from some update seq you supply with the
>>>>>> 'since' parameter.
>>>>>> 
>>>>>> With that now understood, we can look at what changes when we do
>>>>>> continuous=true, which is what makes it a "feed" (that is, a
>>>>>> potentially unending response of changes as they are made). This is
>>>>>> sent in two phases. The first is exactly as the previous paragraph.
>>>>>> Once all those changes have been sent, couchdb enters a loop where it
>>>>>> returns updates as they happen (or shortly after).
>>>>>> 
>>>>>> It is only in a continuous=true response in couchdb before 2.0 that
>>>>>> you would ever see more than one change for any given document.
>>>>>> 
>>>>>> So, to cut a long story short (too late), the answer to your
>>>>>> question is "no". The changes feed is not a permanent history of all
>>>>>> changes made to all documents. Once a document is updated, it is
>>>>>> _moved_ to a newer position and no longer appears in its old one (and
>>>>>> no record of that position is even preserved). Do note, though, that
>>>>>> couchdb might return 'Doc A change (seq: 2-XXXX)' even if your
>>>>>> 'since' parameter is _after_ the last change to doc A. We won't
>>>>>> return 'Doc A change (seq: 1-XXXX)' at all after it's updated to
>>>>>> 2-XXXX.
>>>>>> 
>>>>>> The algorithm for correctly processing the changes response is as
>>>>>> follows, and any variation on this is likely broken;
>>>>>> 
>>>>>> 1) call /_changes?since=0
>>>>>> 2) for each returned row, ensure the target has the change in
>>>>>> question (either use _id + _rev to prevent duplicate application of
>>>>>> the change or apply the change in a way that is idempotent)
>>>>>> 3) periodically store the update seq of the last processed row to
>>>>>> stable storage (a _local document is a good choice)
>>>>>> 
>>>>>> If you wish to resume applying changes after a shutdown, reboot, or
>>>>>> crash, repeat the above process but substitute your stored update
>>>>>> sequence in the ?since= parameter.
>>>>>> 
>>>>>> There are many things that use the changes feed in this way. Within
>>>>>> couchdb, there's database replication (obviously) but also couchdb
>>>>>> views. Outside of the core, software like pouchdb and couchdb-lucene
>>>>>> use the changes feed to replicate data or update search indexes.
>>>>>> 
>>>>>> I hope this was useful, and I think it might expose some problems in
>>>>>> your couchdb-to-sqlite synchronisation protocol. Your email is
>>>>>> obviously silent on many details there, but if you've predicated its
>>>>>> design on the total ordering properties of couchdb < 2.0, you likely
>>>>>> have some work to do.
>>>>>> 
>>>>>> B.
>>>>>> 
>>>>>> 
>>>>>>> On 3 Sep 2016, at 00:04, Robert Payne <[email protected]> wrote:
>>>>>>> 
>>>>>>> Hey Everyone,
>>>>>>> 
>>>>>>> Reading up on the CouchDB 2.0 migration guides and getting a bit
>>>>>>> antsy around the mentions of out-of-order changes feeds and sorts.
>>>>>>> Is it possible for the changes feed to send older changes before
>>>>>>> newer changes for the same document ID across multiple calls?
>>>>>>> 
>>>>>>> Assuming we start at ?since=“” and always pass in the “last_seq” on
>>>>>>> every additional call, could a situation like this occur in a single
>>>>>>> or multiple HTTP calls:
>>>>>>> 
>>>>>>> — Changes feed emits Doc A change (seq: 2-XXXX)
>>>>>>> — Changes feed emits Doc B change (seq: 3-XXXX)
>>>>>>> — Changes feed emits Doc A change (seq: 1-XXXX)
>>>>>>> 
>>>>>>> I’m really hoping the case is just that changes for different doc
>>>>>>> ids can be out of order. Our use case on mobile is a bit particular
>>>>>>> as we duplicate edits into a separate SQLite table and use the
>>>>>>> changes feed to keep the local database up to date with winning revs
>>>>>>> from the server; it just increases the performance of sync by a ton
>>>>>>> since there is only 1 check and set in SQLite per change that comes
>>>>>>> in.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Robert
>>>> 
>>>>> Geoffrey Cox <mailto:[email protected]>
>>>>> 4. June 2017 at 06:04
>>>>> I’m digging deeper into CouchDB 2 and I’m finding some unexpected
>>>>> ordering with sequence numbers. In one case, I found that an early
>>>>> change in a _changes feed has the sequence number
>>>>> 
>>>>> 99-g1AAAAI-eJyd0EsOgjAQBuAGiI-dN9C9LmrBwqzkJtrSNkgQV6z1JnoTvYneBEvbhA0aMU1mkj6-_NMSITTJfYFm2anOcsFT10mpTzyG-LxpmiL32eqoN8aEAcWE9dz_jPCFrnzrHGQchiFM4kSgaV0JqQ6VFF-AtAV2DggMgCEGxrNhQfatc3bOyDiKUalg2EBVoCu66KapazcUh41e69-GssjNIvcWWRokk2oNofwj0MNazy4QFURhGQ0J9LKI-SHPIBHEgiak51nxBhxnrRk
>>>>> 
>>>>> 
>>>>> 
>>>>> The last sequence number in my _changes feed, for the same DB, is
>>>>> 
>>>>> 228-g1AAAAJFeJyd0EkOgjAUBuAGTJCdN9AjlIKFruQm2jFAEFes9SZ6E72J3gQ7JW7QCGnyXtLhy-vfAgCWVSjAip96XglW-o5afRJQwNbDMDRVSOuj7ogQJRgiOnL_O8I2urKdd4B1KCRpkRcCxH0npKo7KX4ApQH2HogsAElOKOPTBjkY5-yd2DqKYqnItA91C13BRTdNXY0VWouRrV7JDOvmrLuxlLW4VAlJ5Qzr4aznJ2wskIIy-y9sh7wcYoMKLJKRXOACjTxr3uHcsBE
>>>>> 
>>>>> 
>>>>> 
>>>>> In a browser console, the following comparison is false:
>>>>> 
>>>>> '228-g1AAAAJFeJyd0EkOgjAUBuAGTJCdN9AjlIKFruQm2jFAEFes9SZ6E72J3gQ7JW7QCGnyXtLhy-vfAgCWVSjAip96XglW-o5afRJQwNbDMDRVSOuj7ogQJRgiOnL_O8I2urKdd4B1KCRpkRcCxH0npKo7KX4ApQH2HogsAElOKOPTBjkY5-yd2DqKYqnItA91C13BRTdNXY0VWouRrV7JDOvmrLuxlLW4VAlJ5Qzr4aznJ2wskIIy-y9sh7wcYoMKLJKRXOACjTxr3uHcsBE'
>>>>> >
>>>>> '99-g1AAAAI-eJyd0EsOgjAQBuAGiI-dN9C9LmrBwqzkJtrSNkgQV6z1JnoTvYneBEvbhA0aMU1mkj6-_NMSITTJfYFm2anOcsFT10mpTzyG-LxpmiL32eqoN8aEAcWE9dz_jPCFrnzrHGQchiFM4kSgaV0JqQ6VFF-AtAV2DggMgCEGxrNhQfatc3bOyDiKUalg2EBVoCu66KapazcUh41e69-GssjNIvcWWRokk2oNofwj0MNazy4QFURhGQ0J9LKI-SHPIBHEgiak51nxBhxnrRk'
>>>>> 
>>>>> 
>>>>> 
>>>>> Is this a bug or do I need to use some other method to compare sequence
>>>>> numbers?
>>>>> 
>>>>> 
>>>>> 
>>>>> In looking at the other sequence numbers in my _changes feed, it
>>>>> looks like they are generally ordered as I would expect, but in this
>>>>> case it appears that when the first number, e.g. 99, jumps from 2
>>>>> digits to 3 digits, the ordering breaks. If you boil this down to a
>>>>> simple string comparison example, you can see that '228' > '99' => false
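The trap here is that JavaScript's `>` compares strings lexicographically, character by character, so '228' < '99' because '2' sorts before '9'. Parsing the numeric prefix first gives the expected ordering, though, as explained earlier in the thread, even numeric prefixes of two seqs are not totally ordered in CouchDB 2. A minimal illustration:

```javascript
function numericPrefix(seq) {
  return parseInt(seq.split('-')[0], 10); // '228-g1AAA...' -> 228
}
```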
>>>>> 
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>>> Geoff
>>>>> 
>>>> 
>>>> 
>> 
>> 
