Sorry for being late to this thread.

If I understand this correctly, the main topic is the "hole" that appears
when WQ > AQ.

> This leaves a "hole" as the entry is now replicated only to 2 bookies,

We do have one hole when ensemble change is enabled and WQ > AQ. That is a
known behavior, but the hole will be repaired by the ledger auditor, as JV
said. Did you see any issues with the ledger auditor?
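To make the hole concrete, here is a minimal Python sketch (illustrative
only, not the real auditor code; all names are mine) of what the auditor
repairs: any entry with fewer than WQ live copies gets re-replicated.

```python
WQ = 3  # target write quorum: every entry should end up on WQ bookies

def find_holes(entry_copies):
    """entry_copies maps entry id -> set of bookies holding a copy."""
    return {e for e, bookies in entry_copies.items() if len(bookies) < WQ}

def repair(entry_copies, spare_bookies):
    """Re-replicate each under-replicated entry onto spare bookies."""
    for e in find_holes(entry_copies):
        needed = WQ - len(entry_copies[e])
        entry_copies[e] |= set(spare_bookies[:needed])

# e100 was acked at AQ=2 and never replayed to a third bookie: a "hole".
ledger = {99: {"b1", "b2", "b3"}, 100: {"b1", "b2"}}
assert find_holes(ledger) == {100}
repair(ledger, ["b4"])           # an auditor-style pass closes the hole
assert find_holes(ledger) == set()
```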

> I'd think that we guarantee that an entry that is acknowledged is
> eventually written WQ ways and that it is observable by readers when the
> ledger is closed.

To Flavio's question: we don't guarantee (and can't guarantee) that the
active writer will eventually write the entries to WQ bookies. For the
active writer, we only guarantee that entries are written to AQ. The ledger
auditor ensures that all the entries are eventually written to WQ.

The active writer can't guarantee writing entries to WQ because it can
crash while retrying the adds to the remaining (WQ - AQ) bookies.
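As a toy illustration of that ack condition (the names and shapes here are
mine, not the real BookKeeper client API): the add callback fires at AQ
confirmations, so a crash before the remaining writes complete leaves the
entry at only AQ copies.

```python
WQ = 3  # write quorum: how many bookies each entry is sent to
AQ = 2  # ack quorum: confirmations needed before the add callback fires

def add_entry(bookie_results):
    """bookie_results: per-bookie outcome, True = write confirmed."""
    confirmed = sum(bookie_results)
    acked = confirmed >= AQ          # the client acks at AQ, not WQ
    fully_replicated = confirmed >= WQ
    return acked, fully_replicated

# e100: two bookies confirm, the third times out, then the writer crashes.
acked, full = add_entry([True, True, False])
assert acked and not full            # acknowledged, but only AQ copies exist
```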

> A single successful read is enough. However there is an odd behavior in
> that as soon as (WQ-AQ)+1 reads fail with explicit NoSuchEntry/Ledger, the
> read is considered failed and the ledger recovery process ends there. This
> means that given the responses b1->success, b2->NoSuchLedger,
> b3->NoSuchLedger, whether the read is considered successful is
> non-deterministic.

Regarding recovery reads: a recovery read doesn't need to be deterministic.
For an entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger), either
including it in or excluding it from the sealed ledger is correct behavior.
The bookkeeper client guarantees that once a ledger is sealed, the entries
in the sealed ledger can always be read, and read consistently.
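A small sketch of the termination rule Jack describes (hypothetical helper
names, not the actual client code): one positive response is enough to keep
the entry, but (WQ - AQ) + 1 explicit negatives end recovery first if they
happen to arrive earlier. Both outcomes yield a correctly sealed ledger.

```python
WQ, AQ = 3, 2
FAIL_THRESHOLD = (WQ - AQ) + 1       # = 2 explicit negative responses

def recovery_read(responses_in_arrival_order):
    """Decide the fate of one entry from responses as they arrive."""
    negatives = 0
    for r in responses_in_arrival_order:
        if r == "success":
            return "include entry"   # a single positive read is enough
        if r == "NoSuchEntry":
            negatives += 1
            if negatives >= FAIL_THRESHOLD:
                return "end recovery here"
    return "end recovery here"

# Same three responses, different arrival orders, different outcomes.
assert recovery_read(["success", "NoSuchEntry", "NoSuchEntry"]) == "include entry"
assert recovery_read(["NoSuchEntry", "NoSuchEntry", "success"]) == "end recovery here"
```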

I am not sure this is a problem, unless I have misunderstood it.

- Sijie

On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
<jvanligh...@splunk.com.invalid> wrote:

> Let's set up a call and create any issues from that. I have already created
> the patches in our (Splunk) fork and it might be easiest or not to wait
> until we re-sync up with the open source repo. We can include the fixes in
> the discussion.
>
> Jack
>
> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org> wrote:
>
> > [ External sender. Exercise caution. ]
> >
> > Hi Jack,
> >
> > Thanks for getting back.
> >
> > > What's the best way to share the TLA+ findings?
> >
> > Would you be able to share the spec? I'm ok with reading TLA+.
> >
> > As for sharing your specific findings, I'd suggest one of the following:
> >
> > 1- Create an email thread describing the scenarios that trigger a bug.
> > 2- Create issues, one for each problem you found.
> > 3- Create a discussion on the project Slack, perhaps a channel specific
> > for it.
> > 4- Set up a zoom call to present and discuss with the community.
> >
> > Option 2 is ideal from a community perspective, but we can also set up a
> > call inviting everyone and create issues out of that discussion. We can
> > in fact set up a call even if we create the issues ahead of time.
> >
> > Does it make sense?
> >
> > -Flavio
> >
> > > On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
> > >
> > > Hi Flavio,
> > >
> > >>> This is an example of a scenario corresponding to what we suspect is
> > >>> a bug introduced earlier, but Enrico is arguing that this is not the
> > >>> intended behavior, and at this point, I agree.
> > >>>
> > >>> By the time a successful callback is received, the client might only
> > >>> have replicated AQ ways, so the guarantee can only be at that point
> > >>> of being able to tolerate AQ - 1 crashes. The ledger configuration
> > >>> states that the application wants to have WQ copies of each entry,
> > >>> though. I'd expect a ledger to have WQ copies of each entry up to the
> > >>> final entry number when it is closed. Do you see it differently?
> > >
> > > I also agree and was pretty surprised when I discovered the behaviour.
> > > It is not something that users expect and I think we need to correct
> > > it. So I'm with you.
> > >
> > > What's the best way to share the TLA+ findings?
> > >
> > > Jack
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org> wrote:
> > >
> > >>
> > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > >>> the confirm callback to the client is called and the LAC is set to
> > >>> 100. Now the 3rd bookie times out. Ensemble change is executed and
> > >>> all pending adds that are above the LAC of 100 are replayed to
> > >>> another bookie, meaning that the entry e100 is not replayed to
> > >>> another bookie, causing this entry to meet the rep factor of only AQ.
> > >>
> > >> This is an example of a scenario corresponding to what we suspect is
> > >> a bug introduced earlier, but Enrico is arguing that this is not the
> > >> intended behavior, and at this point, I agree.
> > >>
> > >>> This is alluded to in the docs as they state
> > >>> that AQ is also the minimum guaranteed replication factor.
> > >>
> > >> By the time a successful callback is received, the client might only
> > >> have replicated AQ ways, so the guarantee can only be at that point
> > >> of being able to tolerate AQ - 1 crashes. The ledger configuration
> > >> states that the application wants to have WQ copies of each entry,
> > >> though. I'd expect a ledger to have WQ copies of each entry up to the
> > >> final entry number when it is closed. Do you see it differently?
> > >>
> > >>> I'd be happy to set up a meeting to discuss the spec and its findings.
> > >>
> > >>
> > >> That'd be great, I'm interested.
> > >>
> > >> -Flavio
> > >>
> > >>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
> > >>>
> > >>>> No, you cannot miss data; if the client is not able to find a
> > >>>> bookie that is able to answer with the entry, it receives an error.
> > >>>
> > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > >>> the confirm callback to the client is called and the LAC is set to
> > >>> 100. Now the 3rd bookie times out. Ensemble change is executed and
> > >>> all pending adds that are above the LAC of 100 are replayed to
> > >>> another bookie, meaning that the entry e100 is not replayed to
> > >>> another bookie, causing this entry to meet the rep factor of only
> > >>> AQ. This is alluded to in the docs as they state that AQ is also the
> > >>> minimum guaranteed replication factor.
> > >>>
> > >>>> The recovery read fails if it is not possible to read every entry
> > >>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
> > >>>> and it does not hazard to "repair" (truncate) the ledger if it does
> > >>>> not find enough bookies.
> > >>>
> > >>> This is not quite accurate. A single successful read is enough.
> > >>> However there is an odd behaviour in that as soon as (WQ-AQ)+1 reads
> > >>> fail with explicit NoSuchEntry/Ledger, the read is considered failed
> > >>> and the ledger recovery process ends there. This means that given the
> > >>> responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether
> > >>> the read is considered successful is non-deterministic. If the
> > >>> response from b1 is received last, then the read is already
> > >>> considered failed, otherwise the read succeeds.
> > >>>
> > >>> I have come to the above conclusions through my reverse engineering
> > >>> process for creating the TLA+ specification. I have yet to reproduce
> > >>> the AQ rep factor behaviour via tests, but I have verified via tests
> > >>> the conclusion about ledger recovery reads.
> > >>>
> > >>> Note that I have found two defects in the BookKeeper protocol, most
> > >>> notably data loss due to the fact that fencing does not prevent
> > >>> further successful adds. Currently the specification and associated
> > >>> documentation are on a private Splunk repo, but I'd be happy to set
> > >>> up a meeting to discuss the spec and its findings.
> > >>>
> > >>> Best
> > >>> Jack
> > >>>
> > >>>
> > >>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>>
> > >>>> Jonathan,
> > >>>>
> > >>>> On Thu, Jan 14, 2021 at 8:57 PM Jonathan Ellis <jbel...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> > >>>>>> confirm that once confirmed, an entry is not replayed to another
> > >>>>>> bookie. This leaves a "hole" as the entry is now replicated only
> > >>>>>> to 2 bookies; however, the new data integrity check that Ivan
> > >>>>>> worked on, when run periodically, will be able to repair that
> > >>>>>> hole.
> > >>>>>
> > >>>>> Can I read from the bookie with a hole in the meantime, and
> > >>>>> silently miss data that it doesn't know about?
> > >>>>>
> > >>>>
> > >>>> No, you cannot miss data; if the client is not able to find a
> > >>>> bookie that is able to answer with the entry, it receives an error.
> > >>>>
> > >>>> The ledger has a known tail (the LastAddConfirmed entry) and this
> > >>>> value is stored in the ledger metadata once the ledger is "closed".
> > >>>>
> > >>>> When the ledger is still open, that is, while the writer is writing
> > >>>> to it, the reader is allowed to read only up to the LastAddConfirmed
> > >>>> entry. This LAC value is returned to the reader using a piggyback
> > >>>> mechanism, without reading from metadata. The reader cannot read
> > >>>> beyond the latest position that has been confirmed to the writer by
> > >>>> AQ bookies.
> > >>>>
> > >>>> We have a third case, the 'recovery read'.
> > >>>> A reader starts a "recovery read" when it wants to recover a ledger
> > >>>> that has been abandoned by a dead writer, or when it is a new leader
> > >>>> (Pulsar Bundle Owner) and wants to fence out the old leader.
> > >>>> In this case the reader merges the current status of the ledger on
> > >>>> ZK with the result of a scan of the whole ledger.
> > >>>> Basically it reads the ledger from the beginning up to the tail, as
> > >>>> long as it is able to "read" entries; this way it sets the 'fenced'
> > >>>> flag on the ledger on every bookie, and it is also able to detect
> > >>>> the actual tail of the ledger (because the writer died and was not
> > >>>> able to flush metadata to ZK).
> > >>>>
> > >>>> The recovery read fails if it is not possible to read every entry
> > >>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
> > >>>> and it does not hazard to "repair" (truncate) the ledger if it does
> > >>>> not find enough bookies.
> > >>>>
> > >>>> I hope that helps
> > >>>> Enrico
> > >>>>
