Re: Using 100's of ZK Observers

2020-04-27 Thread Fangmin Lv
James, we treat observers as not being part of the ensemble's dynamic
config, and they all use -1 as the server id. That's fine for us since we
don't allow global sessions on observers.

If you don't need global sessions on observers, then you can probably adopt
a similar solution for now.

Thanks,
Fangmin

On Fri, Apr 10, 2020 at 2:58 PM James Arbo  wrote:

> Thanks Fangmin. That's an interesting feature - allowing followers to host
> observers.
> But I assume the entire collection of servers is still considered part of
> the ensemble.
> If so, isn't the upper limit still capped at 256 - by the lowest 8 bits of
> the server id?


Re: Using 100's of ZK Observers

2020-04-10 Thread Fangmin Lv
There is an ObserverMaster feature, contributed in ZOOKEEPER-3140
<https://issues.apache.org/jira/browse/ZOOKEEPER-3140>, that could be used
to scale the number of observers and the traffic a single ensemble can
support.

It allows followers to serve observers as well, which relieves the fan-out
load on the leader.

But as Michael mentioned, there is a server id limit: the lowest 8 bits of
the server id are used to guarantee session id uniqueness, so the max
number of servers is limited to 255.
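
To make the 8-bit limit concrete, here is a minimal Python sketch of how a
server id can end up in the top byte of a 64-bit session id. The function
name and the exact bit layout (a time-derived value in the lower 56 bits,
the server id shifted left by 56) are assumptions for illustration, not a
copy of the ZooKeeper code. The point is only that server ids sharing the
same low 8 bits produce the same session id prefix.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit long arithmetic


def initial_session_id(server_id: int, now_ms: int) -> int:
    """Hypothetical model: top byte = low 8 bits of server id, rest = time."""
    # Keep 56 bits of the time-derived value, leaving the top byte empty.
    time_part = ((now_ms << 24) & MASK64) >> 8
    # Only the low 8 bits of server_id survive the shift into the top byte.
    id_part = (server_id << 56) & MASK64
    return time_part | id_part


now = 1_586_500_000_000  # arbitrary fixed timestamp in milliseconds
sid_1 = initial_session_id(1, now)
sid_2 = initial_session_id(2, now)
sid_257 = initial_session_id(257, now)  # 257 & 0xFF == 1

print(hex(sid_1 >> 56), hex(sid_2 >> 56), hex(sid_257 >> 56))
print("1 vs 257 collide:", sid_1 == sid_257)
```

In this model, ids 1 and 257 started at the same instant would hand out the
same session ids, which is why the low 8 bits have to be unique.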

Internally, we use local sessions only on observers, so we use a dynamic
observer id (-1) for all observers, which are not part of the dynamic
config. It helps us scale to more observers, but this may not be a good
solution for the community given that limitation.

Thanks,
Fangmin

On Fri, Apr 10, 2020 at 1:43 PM Michael Han  wrote:

> If you have 100s of 1000s of ZK clients, then having an observer in each
> pod will presumably reduce traffic, as most of the fan-out traffic from
> server to clients is localized to each pod.
>
> An observer is not part of the quorum, and a quorum can't scale past a few
> servers (typically just 5 or 7). Observers can scale from 100s to 1000s
> (depending on whether only the leader hosts them, or followers can host
> them too), but the actual number depends on workload and hardware
> capacity. Although the recommended range for myid is [0,255], I vaguely
> remember we can go past this limit; we just need to make sure the lower
> 8 bits of each myid are always unique, as that's what is used to
> construct the session id.
>
> On Fri, Apr 10, 2020 at 12:09 PM James Arbo  wrote:
>
> > That was my instinct as well. I *think* any ZK writes would require a
> > quorum before the transaction is committed. Getting a quorum over a
> > several hundred/thousand node ensemble seems like a lot of traffic.
> > Plus, from what I've read - though not 100% certain, it seems the number
> > of ZK nodes is capped at 255.
> >
> > On Fri, Apr 10, 2020 at 2:52 PM Bram Van Dam wrote:
> >
> > > On 10/04/2020 20:13, James Arbo wrote:
> > > > When we proposed this, there was great concern from the software
> > > > architects that network traffic between the kubernetes pods and the
> > > > ZK ensemble must be minimized.
> > >
> > > > This means that, at a minimum, we would be running at least 1 ZK
> > > > ensemble member on every node of our K8S cluster.
> > >
> > > Sounds to me like this would *increase* network traffic, not decrease
> > > it. Instead of having communication between the pod and ZK whenever
> > > needed (which likely isn't very frequently?), you'll now be having
> > > constant communication between the ensemble and your hundreds of
> > > observers in order to keep the observers in sync.
> > >
> > > Maybe I'm missing something?
> > >
> > >  - Bram
> > >
> > >
> >
>


Re: String inconsistency issue when running ZK with OpenJDK 10 on SKL machines

2019-11-02 Thread Fangmin Lv
Enrico,

As Andor mentioned, the issue has been fixed in JDK 11 since b27, you
should be fine :)

Fangmin

On Mon, Oct 28, 2019 at 10:44 PM Andor Molnar  wrote:

> Here’s the JDK issue that Fangmin mentioned:
>
> https://bugs.openjdk.java.net/browse/JDK-8207746
>
> It’s a JDK 10 & 11 bug which has already been fixed since JDK11 b27.
>
> Andor
>
>
>
> > On 2019. Oct 28., at 8:00, Enrico Olivelli  wrote:
> >
> > Fangmin,
> >
> > Thank you for sharing this.
> > Do you have any pointer to the jdk11 bugs? Is it solved in 12+?
> >
> > I am running with jdk11-13 but without ssl, so never seen problems.
> >
> > Enrico
> >
> >>
>
>


String inconsistency issue when running ZK with OpenJDK 10 on SKL machines

2019-10-27 Thread Fangmin Lv
Hey everyone,

(Forgot to add subject in the previous email, resent with clear subject.)

I'd like to share some weird inconsistency bugs we saw recently in prod,
along with the root cause and potential fixes. It took us around a month to
investigate, reproduce, and find the root cause; hopefully the information
here will help people avoid hitting the same issue.

[Trigger conditions and behavior]

The inconsistency only happened when running ZK with OpenJDK 10 on SKL
(Skylake) machines, and it's not caused by bugs inside ZK but by a
macro-assembly bug inside the JDK.

The symptoms we saw were:

* NONODE returned by getData for a child that exists according to
getChildren, with no delete issued
* NONODE returned when trying to create a child under a parent node that
was just created successfully, with no delete issued
* No client able to acquire a lock even though the previous session that
held the lock was already dead

[Root cause]

The direct cause of the misbehavior above is that keys/values put into the
ZooKeeperServer.outstandingChangesForPath HashMap or the DataNode.children
HashSet were not visible to later get or remove calls. As a result,
outstanding changes were not visible when the leader prepared the following
txns, or a node was deleted but not removed from DataNode.children.

The 'bad' HashMap/HashSet behavior is not due to concurrency bugs inside
ZK, but to a bug in the macro-assembly used to generate the String.equals
intrinsic code in JDK 9 and 10. The bug was introduced in JDK-8144771,
which added AVX-512 instruction support to the JDK to speed up the
String.equals intrinsic with 512-bit vector ops. Because of the bug,
String.equals may return a wrong (false) result when the high bank of CPU
registers (xmm16 - xmm31) is used with a non-empty stack on SKL machines
where AVX-512 is available.
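
As a rough illustration of why a wrong equals result makes entries look
"invisible" in a hash-based collection, here is a small Python model.
FlakyKey and its broken switch are made up for this sketch and only stand
in for the faulty String.equals intrinsic; this is not ZK or JDK code.

```python
class FlakyKey:
    """Path key whose equality check can be forced to wrongly say 'not equal'."""

    broken = False  # hypothetical switch standing in for the faulty intrinsic

    def __init__(self, path):
        self.path = path

    def __hash__(self):
        # The hash is still correct, so lookups land in the right bucket.
        return hash(self.path)

    def __eq__(self, other):
        if not isinstance(other, FlakyKey):
            return NotImplemented
        if FlakyKey.broken:
            return False  # wrong answer even for identical paths
        return self.path == other.path


# The set plays the role of DataNode.children.
children = {FlakyKey("/locks/lock-0001")}

FlakyKey.broken = True
seen_while_broken = FlakyKey("/locks/lock-0001") in children
children.discard(FlakyKey("/locks/lock-0001"))  # silently removes nothing
size_after_remove = len(children)

FlakyKey.broken = False
seen_after_fix = FlakyKey("/locks/lock-0001") in children

print(seen_while_broken, size_after_remove, seen_after_fix)
```

The entry is never removed while equality is broken, which matches the
"node deleted but still listed as a child" symptom described above.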

The macro-assembly bug we hit is in vptest, which is used in the
string_compare macro assembly code
<http://hg.openjdk.java.net/jdk/jdk10/file/b09e56145e11/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4933>.
It uses add/sub instructions when temporarily saving/restoring register
values on the stack, which clobbers the ZF (zero flag) in the FLAGS
register set by the preceding test instruction.

In our case, if the key exists in the DataNode.children HashSet, the test
instruction result will be zero and the ZF bit will be set to 1. If the RSP
value is not 0 (i.e. the stack is not empty) after the addptr code here,
the ZF bit is changed to 0, so the String.equals comparison during
removeNode returns false and the key is not removed.

There is a bug reported in JDK-8207746. The behavior described there is
different, but we've confirmed the issue by adding assembly code to log it
in JDK 10.

[Solutions]

The possible mitigations are:

1. Disable AVX-512 with the JVM option -XX:UseAVX=2
2. Use an OpenJDK version higher than 10, where the issue has been fixed by
JDK-8207746

Upgrading to OpenJDK 11+ is the better option, since 10 is not well
supported and AVX-512 does help performance.

We use JDK 10 because of the SSL quorum socket close stall issue mentioned
in ZOOKEEPER-3384 <https://issues.apache.org/jira/browse/ZOOKEEPER-3384>,
and because the SO_LINGER option is not honored in JDK 11. We've unblocked
JDK 11 by asynchronously closing the quorum socket, and we're upstreaming
that in ZOOKEEPER-3574
<https://issues.apache.org/jira/browse/ZOOKEEPER-3574>.

Thanks,
Fangmin


Re: Please Register: ZooKeeper Meetup @ Facebook, Nov 8th 2018

2018-11-08 Thread Fangmin Lv
Hi guys,

Thanks for coming to the ZooKeeper Meetup today; it was nice to chat and
discuss technical topics with the community face to face!

For those who didn't have time to attend, here is the recording:
https://www.facebook.com/zkmeetup. The event was delayed about 20 minutes,
so please skip the first 20 minutes of the recording.

Let us know if you have any questions or feedback.

Fangmin

On Wed, Nov 7, 2018 at 4:53 AM Ivan Serdyuk wrote:

> Thanks a lot!


Re: Please Register: ZooKeeper Meetup @ Facebook, Nov 8th 2018

2018-11-06 Thread Fangmin Lv
Ivan, FB will record the streaming and might publish it. I’ll post the link
here when it’s published.

On Tue, Nov 6, 2018 at 3:11 PM Ivan Serdyuk wrote:

> Yes, but I wonder if anyone could record that streaming.
>
> Ivan
>
> On Wed, Nov 7, 2018 at 1:01 AM Kathryn Hogg  wrote:
>
> > It appears the stream will be at https://www.facebook.com/zkmeetup
> >
> > --
> > Kathryn Hogg
> > Senior Manager Product Development
> > Phone: 763.201.2000
> > Fax: 763.201.5333
> > Open Access Technology International, Inc.
> > 3660 Technology Drive NE, Minneapolis, MN 55418
> >
> > -Original Message-
> > From: Ivan Serdyuk [mailto:local.tourist.k...@gmail.com]
> > Sent: Tuesday, November 6, 2018 4:58 PM
> > To: user@zookeeper.apache.org
> > Subject: Re: Please Register: ZooKeeper Meetup @ Facebook, Nov 8th 2018
> >
> > Sorry, just mentioned that it would be streamed via FB streaming.
> >
> > Hope you are expecting to record one?
> >
> > Ivan
> >
> > On Wed, Nov 7, 2018 at 12:56 AM Ivan Serdyuk <local.tourist.k...@gmail.com> wrote:
> >
> > > And where is the streaming link?
> > >
> > > On Tue, Nov 6, 2018 at 11:18 PM Norbert Kalmar wrote:
> > >
> > >> Yes, 2 days from now.
> > >>
> > >> Regards,
> > >> Norbert
> > >>
> > >> On Tue, Nov 6, 2018 at 9:07 PM Jeff Widman wrote:
> > >>
> > >> > This is happening this week, correct?
> > >> >
> > >> > On Fri, Sep 14, 2018 at 8:54 AM Ivan Serdyuk <local.tourist.k...@gmail.com> wrote:
> > >> >
> > >> > > Awesome.
> > >> > >
> > >> > > I wonder if you are expecting to record your talk.
> > >> > >
> > >> > > Ivan
> > >> > >
> > >> > > On Fri, Sep 14, 2018 at 2:46 AM Mohamed Jeelani wrote:
> > >> > >
> > >> > > > Your ZooKeeper friends @ Facebook would like to invite you to
> > >> > > > share and learn what’s new with ZooKeeper.
> > >> > > >
> > >> > > > We will not only share what we at Facebook have been up to, but
> > >> > > > we have exciting talks from speakers from the ZooKeeper
> > >> > > > community lined up who are eager to share what they've been
> > >> > > > working on as well. And of course, we've got some cool swag for
> > >> > > > you :-)
> > >> > > >
> > >> > > > When: November 8th 2018, 5pm – 8pm (Talks: 5pm - 7pm;
> > >> > > > Networking & Happy Hour: 7pm - 8pm)
> > >> > > > Where: Facebook HQ - MPK 16, 1 Hacker Way, Menlo Park, CA
> > >> > > >
> > >> > > > We will have remote viewing locations in our Facebook Seattle
> > >> > > > office, and the event will also be live streamed. You can
> > >> > > > indicate how you'd like to attend on the registration page.
> > >> > > >
> > >> > > > Please register here - https://zookeeperatfb.splashthat.com/
> > >> > > >
> > >> > > > We look forward to seeing you soon!
> > >> > > >
> > >> > > > ZooKeeper Friends @ Facebook
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> >
> > >> > *Jeff Widman*
> > >> > jeffwidman.com  | 740-WIDMAN-J
> > >> > (943-6265) <><
> > >> >
> > >>
> > >
> >
>