2018-02-21 15:28:04 UTC - Karthik Palanivelu: Hi, I am trying Pulsar on AWS. 
The problem I am facing is the auto discovery of ZK node IPs. While googling I 
found an article that suggests using ENIs for ZK nodes, but I would like to use 
the ASG feature. Please advise how I can solve this problem. Appreciate any help.
----
2018-02-22 01:49:19 UTC - Sijie Guo: @Karthikeyan Palanivelu: not very familiar 
with either ENI or ASG. does ASG provide any sort of DNS for its instances?
----
2018-02-22 07:24:03 UTC - SansWord Huang: @SansWord Huang has joined the channel
----
2018-02-22 07:26:25 UTC - SansWord Huang: Hi there, I’m trying to host Pulsar 
on AWS, I’m following the doc here:
<https://pulsar.apache.org/docs/latest/deployment/instance/#Servicediscovery>

But I don’t know why / how I should use service discovery.
should I run this daemon on the client side?
what url should I use if I’m using the daemon?
----
2018-02-22 07:27:16 UTC - SansWord Huang: The alternative is to launch a 
network load balancer in front of all broker nodes and use the url of the NLB.
----
2018-02-22 07:27:36 UTC - SansWord Huang: But then what’s the correct health 
check rule when using an NLB?
----
2018-02-22 07:56:16 UTC - Sijie Guo: @SansWord Huang: 

the service discovery is basically used by pulsar clients (or pulsar tools) to 
look up where the brokers are and which topics are served by which brokers. you 
don’t need an additional service discovery component if you can have DNS over all 
the broker nodes or have a load balancer in front of them. If you don’t have 
DNS or a load balancer, and you have multiple brokers, you can start a 
service discovery daemon to serve broker discovery. 

let me explain this with an example. suppose you have 3 brokers: broker1, broker2 
and broker3.

1) you can use any broker hostname as the service url (e.g. 
<pulsar://broker1:6650>) for the pulsar client. the drawback is that the client 
always goes to that one broker first for service discovery, so that broker can 
become the bottleneck.

2) if you can set up DNS for all your brokers, for example a DNS name 
“pulsar.service.com” for your 3 brokers, you can just use 
<pulsar://pulsar.service.com:6650>. the client will connect to a broker based 
on DNS, and that broker will serve the topic discovery requests to figure out 
the actual brokers which are serving the given topics.

3) similarly, you can set up a network load balancer for this purpose.

4) if you can’t do 2) or 3), you can start an additional service discovery 
daemon (let’s assume the domain name of the machine(s) running the service 
discovery daemon is “pulsar.service.discovery”). then you can use 
<pulsar://pulsar.service.discovery:6650> for the pulsar client. the clients will 
talk to that daemon for broker/topic discovery, and you can run as many 
discovery daemons as you need.
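
a minimal sketch of how the four options differ only in the service url handed to the client. `service_url`, `OPTIONS` and `connect` are illustrative names, `my-nlb.example.com` is a made-up load balancer hostname, and `connect` assumes the `pulsar-client` Python package is installed:

```python
def service_url(host: str, port: int = 6650) -> str:
    """Build a pulsar:// service url for the given host and port."""
    return f"pulsar://{host}:{port}"


# One service url per option discussed above; hostnames are examples only.
OPTIONS = {
    "single broker": service_url("broker1"),
    "dns": service_url("pulsar.service.com"),
    "load balancer": service_url("my-nlb.example.com"),  # hypothetical NLB name
    "discovery daemon": service_url("pulsar.service.discovery"),
}


def connect(url: str):
    """Create a client for the chosen url (needs the pulsar-client package)."""
    import pulsar  # imported lazily so the rest of the sketch runs without it

    return pulsar.Client(url)
```

for example, `connect(OPTIONS["dns"])` would hand the client <pulsar://pulsar.service.com:6650>, and topic lookup then happens through whichever broker DNS resolves to.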

hope this helps.
----
2018-02-22 07:59:01 UTC - SansWord Huang: Thanks for the detailed explanation.
So I can run the service discovery daemon on my client server and use 
localhost:6650 as the broker url, and this will dynamically send me to the active 
broker?
----
2018-02-22 07:59:39 UTC - Sijie Guo: yeah. that works as well.
----
2018-02-22 08:02:26 UTC - SansWord Huang: But then the client would need to be 
able to connect to ZooKeeper to look up live brokers.
----
2018-02-22 08:02:34 UTC - SansWord Huang: Thanks.
----
2018-02-22 08:03:21 UTC - SansWord Huang: I’ll use an NLB instead of the discovery 
daemon.
What health check should I run against a broker to make sure that broker is alive?
----
2018-02-22 08:03:32 UTC - SansWord Huang: is a port check on port 6650 good 
enough?
----
2018-02-22 08:16:41 UTC - Sijie Guo: the client doesn’t connect to zookeeper at 
all. that is handled by the brokers (or by the service discovery daemon, if you 
use one)
----
2018-02-22 08:16:54 UTC - Sijie Guo: port check for 6650 is good enough for NLB
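
for reference, a rough sketch of what such a port check amounts to: try to open a TCP connection to the broker port and treat success as healthy. the function name and the 2 second timeout are assumptions for illustration, not anything Pulsar or the NLB prescribes:

```python
import socket


def broker_port_is_open(host: str, port: int = 6650, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

this only shows that something is accepting connections on the port; it says nothing about whether the broker can actually serve topics.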
----
2018-02-22 08:17:02 UTC - Sijie Guo: (sorry for the late response)
----
2018-02-22 08:17:41 UTC - SansWord Huang: not at all. thanks!
----
2018-02-22 08:34:21 UTC - SansWord Huang: Another question: how do I decide how 
many bookies I need?
And can I dynamically add / remove bookies from a BookKeeper cluster?
----
2018-02-22 08:39:58 UTC - Sijie Guo: a simple way to calculate is based on your 
network/disk bandwidth per bookie. for example, assume a bookie machine has a 
1Gb NIC or 1 HDD as the journal disk, which means the ingress bandwidth is 
around 100MB/second. If you are sending 500MB/second to brokers, and you 
configure the broker to store 3 replicas, that means the traffic from brokers to 
bookies will be 500MB/second * 3. then you divide by a bookie’s bandwidth limit 
=> 500MB/second * 3 / 100MB/second ~= 15 bookies.

number of bookies ~= producer throughput * num_replicas / bandwidth per bookie
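
the same rule of thumb as a small sketch. `estimate_bookies` and its parameter names are made up for illustration, and rounding up to a whole bookie is an added assumption:

```python
import math


def estimate_bookies(producer_mb_per_s: float,
                     num_replicas: int,
                     bookie_mb_per_s: float) -> int:
    """Estimate bookie count: producer throughput * replicas / per-bookie bandwidth."""
    if bookie_mb_per_s <= 0:
        raise ValueError("per-bookie bandwidth must be positive")
    # Each message is written num_replicas times, so bookie ingress is multiplied.
    total_ingress = producer_mb_per_s * num_replicas
    # Round up, since a fraction of a bookie still needs a whole machine.
    return math.ceil(total_ingress / bookie_mb_per_s)


# The example from the message above: 500MB/second, 3 replicas, 100MB/second per bookie.
print(estimate_bookies(500, 3, 100))  # 15
```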
----
2018-02-22 08:42:47 UTC - Sijie Guo: @SansWord Huang yes you can dynamically 
add and remove bookies. adding bookies is much easier: you just add new 
machines, and the bookies will be automatically discovered by brokers, so 
brokers will start sending traffic there without data rebalancing. removing 
bookies is also easy: you can just stop a bookie machine (or you can even turn 
a bookie into readonly so it only serves read traffic). However you need 
to pay attention when removing bookies, because you don’t want to remove all 
replicas at the same time, which would make your data unavailable. so in 
general, when you remove bookies, you stop one bookie, wait for its data to be 
re-replicated, and only then stop the next one.
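
the "stop one, wait, stop the next" procedure could be sketched like this. `stop_bookie` and `is_fully_replicated` are hypothetical callables you would have to supply yourself (they are not BookKeeper APIs), and the 30 second polling interval is an arbitrary choice:

```python
import time
from typing import Callable, Iterable, List


def remove_bookies_one_at_a_time(
    bookies: Iterable[str],
    stop_bookie: Callable[[str], None],        # hypothetical: stops one bookie
    is_fully_replicated: Callable[[], bool],   # hypothetical: True when no data is under-replicated
    poll_seconds: float = 30.0,
) -> List[str]:
    """Stop bookies sequentially, waiting for re-replication after each stop."""
    removed = []
    for bookie in bookies:
        stop_bookie(bookie)
        # Do not touch the next bookie until the data from this one is replicated again.
        while not is_fully_replicated():
            time.sleep(poll_seconds)
        removed.append(bookie)
    return removed
```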
----
2018-02-22 08:42:55 UTC - Sijie Guo: hope this helps
----
2018-02-22 08:55:29 UTC - SansWord Huang: it helps a lot!
it triggers more questions related to BookKeeper.

I’m wondering whether this know-how is in the docs?

What size should I estimate for the journal disk and the ledger storage device?

is data first written to the journal disk and then “flushed” into ledger storage?

when will data be rebalanced? how do I know my data is already replicated?
----
