Thanks Sam - a couple of subtleties there that we missed in our review.
To check my understanding I've added a few rough sketches (one here, two
inline below) - corrections welcome.
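Here's how I'd summarise the replica arithmetic you describe. The names
are invented for illustration - the real logic lives in
o.a.c.service.AbstractReadExecutor and is more involved:

    // Illustrative sketch only - not the actual Cassandra code.
    public class ReplicaMathSketch {
        enum CL { ONE, QUORUM, ALL }

        // How many responses the coordinator blocks for at a given CL.
        static int blockFor(CL cl, int rf) {
            switch (cl) {
                case ONE:    return 1;
                case QUORUM: return rf / 2 + 1;
                default:     return rf; // ALL
            }
        }

        public static void main(String[] args) {
            int rf = 5;
            int initialBlockFor = blockFor(CL.QUORUM, rf); // 3
            int contacted = initialBlockFor + 1;           // 4 with speculative retry
            // Initial read: wait for 3 of the 4 contacted replicas.
            // On a digest mismatch, the follow-up (read repair) read goes
            // to all 4 *contacted* replicas and waits for all 4 - "CL ALL"
            // in that narrow sense - not to all 5 in the replica set.
            System.out.printf("contacted=%d, initial=%d, repair=%d%n",
                              contacted, initialBlockFor, contacted);
        }
    }

Cheers
Ben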
On Tue, 30 Aug 2016 at 19:42 Sam Tunnicliffe <s...@beobal.com> wrote:

> Just to clarify a little further: it's true that read repair queries
> are performed at CL ALL, but this is slightly different to a regular,
> user-initiated query at that CL.
>
> Say you have RF=5 and you issue a read at CL ALL. The coordinator will
> send requests to all 5 replicas and block until it receives a response
> from each (or a timeout occurs) before replying to the client. This is
> the straightforward and intuitive case.
>
> If instead you read at CL QUORUM, the number of replicas required for
> the CL is 3, so the coordinator only contacts 3 nodes. In the case
> where a speculative retry is activated, an additional replica is added
> to the initial set. The coordinator will still only wait for 3 of the
> 4 responses before proceeding, but if a digest mismatch occurs the
> read repair queries are sent to all 4. It's this follow-up query that
> the coordinator executes at CL ALL, i.e. it requires all 4 replicas to
> respond to the read repair query before merging their results to
> figure out the canonical, latest data.
>
> You can see that the number of replicas queried/required for read
> repair is different from the number when the client actually requests
> a read at CL ALL (here it's 4, not 5); it's the behaviour of waiting
> for all *contacted* replicas to respond which is significant here.
>
> There are additional considerations when constructing that initial
> replica set (which you can follow in
> o.a.c.service.AbstractReadExecutor::getReadExecutor), involving the
> table's read_repair_chance, dclocal_read_repair_chance and
> speculative_retry options. The main gotcha is global read repair (via
> read_repair_chance), which will trigger cross-DC repairs at CL ALL in
> the case of a digest mismatch, even if the requested CL is DC-local.
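To check I follow that last gotcha: the chance-based decision looks
roughly like this (an invented sketch with made-up names - the real
logic is reached from getReadExecutor and may differ in detail):

    // Illustrative sketch only - not the actual Cassandra implementation.
    public class ReadRepairDecisionSketch {
        enum Decision { NONE, DC_LOCAL, GLOBAL }

        static Decision decide(double readRepairChance,
                               double dclocalReadRepairChance) {
            double roll = java.util.concurrent.ThreadLocalRandom
                              .current().nextDouble();
            // GLOBAL: the extra read spans all DCs, so a digest mismatch
            // repairs at CL ALL across DCs - even for a LOCAL_QUORUM read.
            if (roll < readRepairChance)
                return Decision.GLOBAL;
            // DC_LOCAL: the extra read stays in the coordinator's DC.
            if (roll < readRepairChance + dclocalReadRepairChance)
                return Decision.DC_LOCAL;
            return Decision.NONE;
        }

        public static void main(String[] args) {
            // e.g. with option values of 0.0 / 0.1, roughly 10% of reads
            // get a DC-local extra replica added to the initial set.
            System.out.println(decide(0.0, 0.1));
        }
    }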
> On Sun, Aug 28, 2016 at 11:55 AM, Ben Slater
> <ben.sla...@instaclustr.com> wrote:
>
>> In case anyone else is interested - we figured this out. When C*
>> decides it needs to do a repair, based on a digest mismatch from the
>> initial reads for the consistency level, it does actually try a read
>> at CL=ALL in order to get the most up-to-date data to use for the
>> repair.
>>
>> This led to an interesting issue in our case, where we had one node
>> in an RF=3 cluster down for maintenance (to correct data that became
>> corrupted due to a severe write overload) and started getting
>> occasional "timeout during read query at consistency LOCAL_QUORUM"
>> failures. We believe this was due to the case where data for a read
>> was only available on one of the two up replicas, which then
>> triggered an attempt to repair and a failed read at CL=ALL. It seems
>> that CASSANDRA-7947 (a while ago) changed the behaviour so that C*
>> reports a failure at the originally requested level even when it was
>> actually the attempted repair read at CL=ALL which could not read
>> sufficient replicas - a bit confusing (although I can also see how
>> getting CL=ALL errors when you thought you were reading at QUORUM or
>> ONE would be confusing).
>>
>> Cheers
>> Ben
>>
>> On Sun, 28 Aug 2016 at 10:52 kurt Greaves <k...@instaclustr.com> wrote:
>>
>>> Looking at the wiki for the read path
>>> (http://wiki.apache.org/cassandra/ReadPathForUsers), in the bottom
>>> diagram for reading with a read repair, it states the following when
>>> "reading from all replica nodes" after there is a hash mismatch:
>>>
>>>> If hashes do not match, do conflict resolution. First step is to
>>>> read all data from all replica nodes excluding the fastest replica
>>>> (since CL=ALL)
>>>
>>> In the bottom left of the diagram it also states:
>>>
>>>> In this example:
>>>> RF>=2
>>>> CL=ALL
>>>
>>> The "(since CL=ALL)" implies that the CL for the read during the
>>> read repair is based off the CL of the query. However, I don't think
>>> that makes sense at other CLs. Anyway, I just want to clarify what
>>> CL the read for the read repair occurs at in cases where the overall
>>> query CL is not ALL.
>>>
>>> Thanks,
>>> Kurt.
>>>
>>> --
>>> Kurt Greaves
>>> k...@instaclustr.com
>>> www.instaclustr.com
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
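For completeness, here's the arithmetic behind the failure mode we hit
- a sketch with illustrative numbers, matching one plausible reading of
what we saw, not the actual code:

    // RF=3, one replica down, client reading at LOCAL_QUORUM.
    public class RepairTimeoutSketch {
        public static void main(String[] args) {
            int rf = 3, alive = 2;
            int initialBlockFor = rf / 2 + 1; // LOCAL_QUORUM = 2
            System.out.println("initial read ok: needs " + initialBlockFor
                               + " of " + alive + " live replicas");
            // A digest mismatch between the two live replicas triggers the
            // follow-up read at CL=ALL, which cannot be satisfied with only
            // 2 of 3 replicas up...
            int repairBlockFor = rf; // 3
            if (repairBlockFor > alive)
                // ...and, since CASSANDRA-7947, the timeout is reported at
                // the client's original CL rather than at ALL:
                System.out.println(
                    "timeout during read query at consistency LOCAL_QUORUM");
        }
    }

--
————————
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798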