2018-02-21 15:28:04 UTC - Karthik Palanivelu: Hi, I am trying pulsar in AWS.
But the problem I am facing is auto-discovery of the ZK node IPs. While
googling I saw an article suggesting ENIs for the ZK nodes, but I would like
to use the ASG feature. Please advise how I can solve this problem.
Appreciate any help.
2018-02-22 01:49:19 UTC - Sijie Guo: @Karthikeyan Palanivelu: not very familiar
with either ENI or ASG. does ASG provide any sort of DNS for its instances?
2018-02-22 07:24:03 UTC - SansWord Huang: @SansWord Huang has joined the channel
2018-02-22 07:26:25 UTC - SansWord Huang: Hi there, I’m trying to host Pulsar
on AWS, I’m following the doc here:
But I don’t know why / how I can use service discovery.
Should I run this daemon on the client side?
What URL should I use if I’m running the daemon?
2018-02-22 07:27:16 UTC - SansWord Huang: The alternative is to launch a
network load balancer (NLB) in front of all broker nodes and use the URL of the NLB.
2018-02-22 07:27:36 UTC - SansWord Huang: But then what’s the correct health
check rule when using NLB?
2018-02-22 07:56:16 UTC - Sijie Guo: @SansWord Huang:
the service discovery component is basically used by pulsar clients (or pulsar
tools) to look up where the brokers are and which topics are served by which
brokers. you don’t need an additional service discovery component if you have
DNS over all the broker nodes or a load balancer in front of them. if you
don’t have DNS or a load balancer, and you have multiple brokers, you can
start a service discovery daemon to serve broker discovery.
let me explain this with an example. suppose you have 3 brokers: broker1, broker2 and broker3.
1) you can use any broker hostname as the service url (e.g.
<pulsar://broker1:6650>) for the pulsar client. the drawback of this is that
it always points to one broker for service discovery first, so that broker can
become a single point of failure for the initial lookup.
2) you can set up a DNS name for all your brokers. for example, you set up a
DNS name “<http://pulsar.service.com|pulsar.service.com>” for your 3 brokers.
you can then just use <pulsar://pulsar.service.com:6650>. the client will
connect to a broker based on DNS, and that broker will serve the topic
discovery requests to figure out the actual brokers which are serving the
given topics.
3) similarly, you can set up any network load balancer for this purpose.
4) if you can’t do 2) or 3), you can start an additional service discovery
daemon (let’s assume the domain name of the machine(s) running the service
discovery daemon is “pulsar.service.discovery”). then you can use
<pulsar://pulsar.service.discovery:6650> for the pulsar client. the clients
will talk to that daemon for broker/topic discovery, and you can scale to as
many discovery daemons as you need.
hope this helps.
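The four options above differ only in which host the client’s service URL resolves to. A minimal sketch of the URL forms (all hostnames are the illustrative placeholders from the example above):

```python
from urllib.parse import urlparse

# Service-URL styles matching the options above; hostnames are placeholders.
urls = {
    "single-broker":    "pulsar://broker1:6650",                   # option 1
    "dns-name-or-nlb":  "pulsar://pulsar.service.com:6650",        # options 2 and 3
    "discovery-daemon": "pulsar://pulsar.service.discovery:6650",  # option 4
}

for name, url in urls.items():
    parsed = urlparse(url)
    # All styles use the pulsar:// scheme and the broker client port 6650.
    assert parsed.scheme == "pulsar" and parsed.port == 6650
    print(f"{name}: connect to {parsed.hostname}")
```

With the official Python client, any of these strings would be passed straight to the client constructor, e.g. `pulsar.Client(urls["dns-name-or-nlb"])` (assuming the `pulsar-client` package and a running cluster).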
2018-02-22 07:59:01 UTC - SansWord Huang: Thanks for the detailed explanation.
So I can run a service discovery daemon on my client server and use
localhost:6650 as the broker url, and it will dynamically route me to the
active broker?
2018-02-22 07:59:39 UTC - Sijie Guo: yeah. that works as well.
2018-02-22 08:02:26 UTC - SansWord Huang: But then the client machine would
need to be able to connect to ZooKeeper to look up live brokers.
2018-02-22 08:02:34 UTC - SansWord Huang: Thanks.
2018-02-22 08:03:21 UTC - SansWord Huang: I’ll use an NLB instead of the discovery daemon.
What health check should I run against a broker to make sure it is alive?
2018-02-22 08:03:32 UTC - SansWord Huang: is a port check on port 6650 good enough?
2018-02-22 08:16:41 UTC - Sijie Guo: the client doesn’t connect to zookeeper
at all. broker discovery is handled by the brokers themselves (or by the
service discovery daemon if you use one).
2018-02-22 08:16:54 UTC - Sijie Guo: port check for 6650 is good enough for NLB
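A TCP port check is exactly what the NLB performs: completing a handshake on the broker’s client port. A rough sketch of the same check in Python (host and port are placeholders, not a real deployment):

```python
import socket

def broker_port_open(host: str, port: int = 6650, timeout: float = 2.0) -> bool:
    """Return True if a TCP handshake to host:port succeeds within the
    timeout, roughly what an NLB TCP health check does."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

This only proves the broker process is accepting connections; a deeper check (e.g. an actual topic lookup) would catch a wedged broker, but for an NLB a port check is usually sufficient, as noted above.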
2018-02-22 08:17:02 UTC - Sijie Guo: (sorry for the late response)
2018-02-22 08:17:41 UTC - SansWord Huang: not at all. thanks!
2018-02-22 08:34:21 UTC - SansWord Huang: Another question, how to decide how
many bookies I need?
And can I dynamically add / remove bookies from BookKeeper cluster?
2018-02-22 08:39:58 UTC - Sijie Guo: a simple way to calculate it is based on
your network/disk bandwidth per bookie. for example, if a bookie machine has a
1Gb NIC or one HDD as the journal disk, its ingress bandwidth is around
100MB/second. if you are sending 500MB/second to brokers, and you configure
the brokers to store 3 replicas, the traffic from brokers to bookies will be
500MB/second * 3. dividing by a bookie’s bandwidth limit gives
500MB/second * 3 / 100MB/second ~= 15 bookies.
producer throughput * num_replicas / network bandwidth per bookie
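The sizing rule of thumb above can be written as a tiny helper; the 500 MB/s, 3 replicas and 100 MB/s figures are just the example numbers from the message:

```python
import math

def bookies_needed(producer_mb_per_s: float, replicas: int,
                   bookie_ingress_mb_per_s: float) -> int:
    """Sizing rule of thumb: producer throughput * num_replicas divided
    by per-bookie ingress bandwidth, rounded up to whole bookies."""
    return math.ceil(producer_mb_per_s * replicas / bookie_ingress_mb_per_s)

print(bookies_needed(500, 3, 100))  # 500 * 3 / 100 = 15
```

This is a floor estimate for steady-state ingress; real clusters would add headroom for read traffic, re-replication, and failures.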
2018-02-22 08:42:47 UTC - Sijie Guo: @SansWord Huang yes, you can dynamically
add and remove bookies. adding bookies is the easier case: you just add new
machines, the bookies are automatically discovered by the brokers, and the
brokers start sending traffic to them without any data rebalancing. removing
bookies is also easy: you can simply stop a bookie machine (or even turn a
bookie into readonly mode so it only serves read traffic). however, you need
to pay attention when removing bookies, because you don’t want to remove all
replicas of some data at the same time, which would make that data
unavailable. so in general, when removing bookies, you stop one bookie, wait
for its data to be re-replicated, and only after that stop the next one.
2018-02-22 08:42:55 UTC - Sijie Guo: hope this helps
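The one-bookie-at-a-time removal procedure above can be sketched as a small wait loop. `stop_bookie` and `underreplicated_count` are hypothetical hooks (in practice they might wrap an SSH command and a BookKeeper admin query); they are not real Pulsar or BookKeeper APIs:

```python
import time
from typing import Callable

def decommission_one(stop_bookie: Callable[[], None],
                     underreplicated_count: Callable[[], int],
                     poll_secs: float = 30.0) -> None:
    """Stop a single bookie, then block until re-replication has drained
    (no under-replicated ledgers remain), so the caller can safely move
    on to the next bookie."""
    stop_bookie()
    while underreplicated_count() > 0:
        time.sleep(poll_secs)
```

Calling this once per bookie, in sequence, implements the “stop one, wait for the data to be replicated, then stop the next” advice above.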
2018-02-22 08:55:29 UTC - SansWord Huang: it helps a lot!
it triggers more questions related to BookKeeper.
I’m wondering, is this know-how in the docs?
What size should I estimate for the journal disk and the ledger storage device?
Is data first written to the journal disk and then flushed to ledger storage?
When will data be rebalanced? how do I know my data has been replicated?