Slack digest for #general - 2018-06-22

Apache Pulsar Slack Fri, 22 Jun 2018 02:11:21 -0700

2018-06-21 09:25:03 UTC - Karthik Palanivelu: Thanks Ivan, then what is the 
best possible way to make pulsar work k8s using TLS, I am stuck here for a 
week. Any help is appreciated. 
----
2018-06-21 10:05:56 UTC - Ivan Kelly: do you need TLS between the proxy and the 
broker?
----
2018-06-21 10:06:12 UTC - Ivan Kelly: or just to the proxy?
----
2018-06-21 10:12:48 UTC - Karthik Palanivelu: I need TLS between Client and 
Proxy for authentication. Then I need the authorization happen between client 
and broker for multi tenant behavior. Please advise me how to get this going
----
2018-06-21 10:16:53 UTC - Ivan Kelly: I don't think this does work right now. 
as I said, we're going to look into it this week, as its quite a glaring flaw 
for multitenancy
----
2018-06-21 10:17:41 UTC - Karthik Palanivelu: Without this feature selling the 
product would be tough in my company
----
2018-06-21 10:18:02 UTC - Ivan Kelly: yup, noted.
----
2018-06-21 10:18:37 UTC - Karthik Palanivelu: I can do this in containers 
locally but with proxy in between is a challenge for me.
----
2018-06-21 10:19:34 UTC - Ivan Kelly: it works fine with broker directly
----
2018-06-21 10:19:39 UTC - Karthik Palanivelu: Is there an option I can expose 
the broker port externally using nodeport by excluding proxy...it should work 
for multitenancy right?
----
2018-06-21 10:19:54 UTC - Ivan Kelly: the problem is the proxy-broker link
----
2018-06-21 10:20:12 UTC - Ivan Kelly: yes, but you would have to expose all 
brokers
----
2018-06-21 10:20:38 UTC - Ivan Kelly: i think. I'm not 100% sure. I need to 
check how service discovery is being used
----
2018-06-21 10:22:02 UTC - Karthik Palanivelu: Sure can you post me some sample 
and limitations of this approach. Even proxy has a limitation that I can run 
only one proxy per node. If I have 3 workers then I can run only 3 proxy.
----
2018-06-21 10:23:31 UTC - Karthik Palanivelu: I tried that option but getting 
connection timed out exception, I posted the logs in a direct message to 
@Matteo Merli, let me forward you as well
----
2018-06-21 10:24:17 UTC - Ivan Kelly: I don't have any k8s samples I'm afraid, 
I've been trying this with full AWS instances
----
2018-06-21 10:33:28 UTC - Ivan Kelly: you should put a watch on that issue 
though, I'll likely get to it later today, and will be updating it with my 
findings
----
2018-06-21 11:46:26 UTC - Nikita Tomilov: @Nikita Tomilov has joined the channel
----
2018-06-21 12:04:25 UTC - Nikita Tomilov: Hello everyone
We've encountered a problem with receiving messages from non-persistent topics
The issue is here, please, take a look
<https://github.com/apache/incubator-pulsar/issues/2009>
Thanks
----
2018-06-21 13:25:09 UTC - Ryan Rose: Hi all! Is the golang client in the master 
branch operational? I'm having trouble with missing C dependencies. Thanks!
----
2018-06-21 14:59:05 UTC - William Fry: Has anyone run into a 
`BrokenPersistenceError`? Looks like I can connect to the broker but then 
receive the following response `11 -- req_id: 0`
----
2018-06-21 15:00:15 UTC - Sijie Guo: @Nikita Tomilov regex subscription only 
works for persistence. It has been implemented for non persistent. Due to the 
way how non pesistence is architected, I am not sure it is easy to support, 
will have to take a deeper look.
----
2018-06-21 15:03:59 UTC - Sijie Guo: @Ryan Rose I believe the go client should 
be operational. Bunch of the people have tried it out. What is your 
environment? It is cgo wrapped client, and the documentation is in progress on 
being merged. That would provide some instructions, if you want to check out 
the documentation, go to pulsar github pull requests and find the PR called go 
client docs. It should have all the instructions about how to build and use it
----
2018-06-21 15:04:51 UTC - Sijie Guo: @William Fry is that the only output? Or 
do you have more output or logging?
----
2018-06-21 15:18:45 UTC - Ryan Rose: Excellent, that's where I'll head! Thanks!
----
2018-06-21 15:22:31 UTC - Sijie Guo: 
<https://github.com/apache/incubator-pulsar/pull/1838> this is PR for docs
----
2018-06-21 16:17:42 UTC - Karthik Palanivelu: @Matteo Merli I followed your 
instruction to configure client cert to proxy, enabled ssl between proxy-broker 
and forward the authorization token to broker. As @Ivan Kelly  mentioned the I 
got the Connection Already Closed exception. Log wise broker received the 
request, authorized the call but closed the connection.
----
2018-06-21 16:46:54 UTC - Karthik Palanivelu: @Matteo Merli Follow up finding 
is I made the proxy to communicate with broker on nonssl by turning off 
`tlsEnabledWithBroker=false` and `forwardAuthorizationCredentials=true`. This 
time call did not even reach broker. On client I get Connection already Closed 
exception.
----
2018-06-21 16:57:22 UTC - Daniel Ferreira Jorge: I'm still a little confused 
about the inner workings of message de-duplication (I read many times the 
cookbook, concepts and architecture, and articles at streamlio regarding that 
topic). Upon reading the cookbook, it gives the impression that broker 
configuration is the only thing necessary to ensure that a message is *accepted 
by the broker* only once. So, if I have a *non-idempotent* producer that 
publishes the same message twice, and I take the steps outlined in the 
cookbook, the broker will reject the second message. How exactly will the 
broker know that the second message is a duplicate? Some sort of checksum?
----
2018-06-21 17:07:52 UTC - William Fry: @Sijie Guo essentially it fails with 
these four lines whenever I try to create a producer or consumer:
```
17:08:49.514 [0x7f17fb90c700] INFO  
/pulsar/pulsar-client-cpp/lib/HandlerBase.cc(53) - 
[<persistent://sample/foo/default/bars>, ] Getting connection from pool
17:08:49.519 [0x7f17fb90c700] WARN  
/pulsar/pulsar-client-cpp/lib/ClientConnection.cc(846) - [10.0.0.136:34296 
-&gt; 10.0.1.248:6650] Received error response from server: 11 -- req_id: 2
17:08:49.519 [0x7f17fb90c700] ERROR 
/pulsar/pulsar-client-cpp/lib/ProducerImpl.cc(212) - 
[<persistent://sample/foo/default/bar>, ] Failed to create producer: 
BrokerPersistenceError
17:08:49.519 [0x7f17fb90c700] INFO  
/pulsar/pulsar-client-cpp/lib/ProducerImpl.cc(415) - Producer - 
[<persistent://sample/foo/default/bar>, ] , [batching  = off]
```
----
2018-06-21 17:11:36 UTC - William Fry: It looks like on the pulsar instance, 
this is the stacktrace: `Caused by: java.util.concurrent.CompletionException: 
org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: 
org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty 
bookies available`
----
2018-06-21 17:13:46 UTC - William Fry: On the zookeeper instance, this is the 
stacktrace `INFO  org.apache.zookeeper.server.PrepRequestProcessor - Got 
user-level KeeperException when processing sessionid:0x164230a14550001 
type:create cxid:0x1 zxid:0x10 txntype:-1 reqpath:n/a Error Path:/admin 
Error:KeeperErrorCode = NoNode for /admin`
----
2018-06-21 17:25:08 UTC - Ivan Kelly: dedupe requires that the producer sends 
some sort of monotonically increasing sequence number
----
2018-06-21 17:26:41 UTC - Ivan Kelly: 
<http://pulsar.apache.org/api/client/org/apache/pulsar/client/api/TypedMessageBuilder.html#sequenceId-long->
----
2018-06-21 17:26:49 UTC - Karthik Palanivelu: @Karthik Palanivelu uploaded a 
file: 
<https://apache-pulsar.slack.com/files/U7VRE0Q1G/FBCT07XGF/-.txt|Untitled> and 
commented: Team, how do I get my pulsar working in k8s till proxy is fixed. I 
created a nodeport for broker to expose. But I am getting above exception. This 
is non-ssl. Can you please help me here?
----
2018-06-21 17:26:55 UTC - Ivan Kelly: the idea is that this message id is 
coming from a  log offset or something
----
2018-06-21 17:27:23 UTC - Ivan Kelly: this is a http request hitting a binary 
port
----
2018-06-21 17:28:31 UTC - Sijie Guo: This means there is no enough bookies 
available in the cluster
----
2018-06-21 17:28:52 UTC - Sijie Guo: How many bookies do you have?
----
2018-06-21 17:30:00 UTC - Ivan Kelly: the simplest way to get it to work, is 
disable tls between the proxy and the broker. you lose authorization, but you 
still have authentication
----
2018-06-21 17:36:34 UTC - Daniel Ferreira Jorge: Ok so for the de-duping to 
work I need always to work with the sequence ids. Thank you @Ivan Kelly
----
2018-06-21 17:36:46 UTC - William Fry: 1
----
2018-06-21 17:36:56 UTC - William Fry: just added a 2nd, is that now required 
in 2.0.0?
----
2018-06-21 17:37:00 UTC - William Fry: previously it worked with 1?
----
2018-06-21 17:38:30 UTC - Karthik Palanivelu: I am not using Proxy here. I 
disable SSL and using non-ssl trying to hit broker port directly through 
nodeport.
----
2018-06-21 17:38:51 UTC - Karthik Palanivelu: My NodePort points to broker now
----
2018-06-21 17:38:56 UTC - Sijie Guo: are you deploying using standalone or a 
full cluster?
----
2018-06-21 17:40:02 UTC - Sijie Guo: it seems that you are sending protobuf 
requests to http endpoints, which is your serviceUrl for your client?
----
2018-06-21 17:41:23 UTC - William Fry: full cluster w/ ansible+terraform
----
2018-06-21 17:42:36 UTC - Sijie Guo: oh did you configure broker settings, like 
ensembleSize, writeQuorumSize and such
----
2018-06-21 17:42:49 UTC - Karthik Palanivelu: I am using 
<pulsar://ELB-NAME:9151(TCP)> -&gt; Node Port 30003 -&gt; Head Less Service 
6650 --&gt; broker 6650
----
2018-06-21 17:43:07 UTC - Karthik Palanivelu: Broker disabled SSL
----
2018-06-21 17:45:05 UTC - William Fry: no kept the standards
----
2018-06-21 17:45:15 UTC - William Fry: I see this message in the Zookeepr logs 
```WARN  org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open 
channel to 0 at election address /10.0.1.241:3888
Jun 21 17:44:26 ip-10-0-1-52 pulsar: java.net.SocketTimeoutException: connect 
timed out```
----
2018-06-21 17:45:22 UTC - Sijie Guo: i see. that does hit the protobuf port.
----
2018-06-21 17:46:09 UTC - Karthik Palanivelu: which is protobuf port 6651?
----
2018-06-21 17:46:37 UTC - Sijie Guo: port 6650, it is speaking protobuf protocol
----
2018-06-21 17:46:39 UTC - Karthik Palanivelu: Even I tried with hostPort 
mapping but noi success
----
2018-06-21 17:46:41 UTC - Sijie Guo: so it is not http port
----
2018-06-21 17:47:07 UTC - Karthik Palanivelu: Yes I am trying to reach 6650
----
2018-06-21 17:47:26 UTC - Sijie Guo: actually can you try
----
2018-06-21 17:48:08 UTC - Sijie Guo: try go get to a broker pod and try locally?
----
2018-06-21 17:48:31 UTC - Sijie Guo: sorry I have to step to a meeting. maybe 
@Matteo Merli can you help you
----
2018-06-21 17:51:39 UTC - Karthik Palanivelu: You mean on http port?
----
2018-06-21 18:25:59 UTC - Ivan Kelly: :thumbsup:
----
2018-06-21 18:29:56 UTC - Sijie Guo: I mean - can you ssh to a broker pod? run 
`bin/pulsar-client produce -m "test" test-topic` to test if you can publish 
messages within a k8s cluster. I am trying to isolate whether the problem is 
from pulsar cluster or from the network out side of k8s
----
2018-06-21 18:32:38 UTC - Sijie Guo: sounds like a connection issue.
----
2018-06-21 18:33:02 UTC - Sijie Guo: do you have any connectivities issue 
between broker and bookies?
----
2018-06-21 18:34:58 UTC - Matteo Merli: @Daniel Ferreira Jorge If you don’t do 
anything special, just with the setting in the configuration, the dedup will 
work within a single producer session.


If your application crashes and comes back, it will be seen as a different 
producer, so brokers won’t be able to detect eventual dups at that point. 

If you want to have dedup across producer sessions, you need to : 
 1. Set producer name (instead of auto-generated name) 
 2. Use sequence ids tied to application domain
----
2018-06-21 18:37:16 UTC - Matteo Merli: @Ryan Rose the docs are live at 
<http://pulsar.apache.org/docs/latest/clients/go/> 

With 2.1 release we’ll also be able to publish rpm/deb binary directly.
----
2018-06-21 18:37:45 UTC - Ryan Rose: Awesome, thanks!
----
2018-06-21 18:42:47 UTC - William Fry: I used to not
----
2018-06-21 18:42:55 UTC - William Fry: Right now I’m looking into the Zookeeper 
setup
----
2018-06-21 18:43:02 UTC - William Fry: if I try to configure everything from 
scratch
----
2018-06-21 18:43:08 UTC - William Fry: I find this error on the initial 
Zookeeper instance
----
2018-06-21 18:43:09 UTC - William Fry: ```Jun 21 18:42:08 ip-10-0-1-186 pulsar: 
18:42:08.946 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN  
org.apache.zookeeper.server.NIOServerCnxn - Exception causing close of session 
0x0 due to java.io.IOException: ZooKeeperServer not running
Jun 21 18:42:08 ip-10-0-1-186 pulsar: 18:42:08.946 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  
org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client 
/127.0.0.1:54340 (no session established for client)```
----
2018-06-21 18:44:40 UTC - Sijie Guo: you redeploy with ansible?
----
2018-06-21 18:45:20 UTC - William Fry: Yep
----
2018-06-21 18:45:27 UTC - William Fry: I’m doing it manually now
----
2018-06-21 18:45:29 UTC - William Fry: To test each step
----
2018-06-21 18:45:36 UTC - William Fry: It looks like after the zookeeper 
service is started
----
2018-06-21 18:45:40 UTC - William Fry: This is the first set of warnings
----
2018-06-21 18:45:43 UTC - William Fry: ```
Jun 21 18:44:46 ip-10-0-1-186 pulsar: 18:44:46.799 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  
org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection 
from /127.0.0.1:54542
Jun 21 18:44:46 ip-10-0-1-186 pulsar: 18:44:46.807 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN  
org.apache.zookeeper.server.NIOServerCnxn - Exception causing close of session 
0x0 due to java.io.IOException: ZooKeeperServer not running
Jun 21 18:44:46 ip-10-0-1-186 pulsar: 18:44:46.807 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  
org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client 
/127.0.0.1:54542 (no session established for client)
Jun 21 18:44:48 ip-10-0-1-186 pulsar: 18:44:48.561 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  
org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection 
from /127.0.0.1:54544
Jun 21 18:44:48 ip-10-0-1-186 pulsar: 18:44:48.562 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN  
org.apache.zookeeper.server.NIOServerCnxn - Exception causing close of session 
0x0 due to java.io.IOException: ZooKeeperServer not running
Jun 21 18:44:48 ip-10-0-1-186 pulsar: 18:44:48.562 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  
org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client 
/127.0.0.1:54544 (no session established for client)
Jun 21 18:44:49 ip-10-0-1-186 pulsar: 18:44:49.910 [WorkerSender[myid=0]] WARN  
org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 1 
at election address /10.0.1.230:3888
```
----
2018-06-21 18:46:07 UTC - Sijie Guo: it seems that it can’t connect to ip 
10.0.1.230
----
2018-06-21 18:46:12 UTC - Sijie Guo: is your ip changed?
----
2018-06-21 18:46:21 UTC - William Fry: That’s the other zookeeper instance
----
2018-06-21 18:46:29 UTC - William Fry: Do they need to be able to talk to one 
another?
----
2018-06-21 18:47:03 UTC - Sijie Guo: yes they do. zookeeper uses port 3888 for 
its internal communication (e.g. leader election)
----
2018-06-21 18:48:46 UTC - William Fry: is the same true of the pulsar instances?
----
2018-06-21 18:49:54 UTC - Sijie Guo: for pulsar instances, they don’t need to 
inter instance communication.
----
2018-06-21 18:50:26 UTC - Sijie Guo: but pulsar brokers need to be able to talk 
to bookies at 3181, both brokers and bookie need to talk to zookeeper at 2181
----
2018-06-21 18:51:06 UTC - Sijie Guo: but I think terraform sets a vpc for the 
cluster, so those ports should be accessible between instances. are you using a 
customized terraform script?
----
2018-06-21 18:53:53 UTC - Ryan Rose: FWIW I fixed it -- was something fishy 
with the CPP lib install. I'd built it before from source, and reinstalling via 
the brew script in the new docs fixed it! Much appreciated.
+1 : Sijie Guo
----
2018-06-21 18:53:54 UTC - William Fry: yep
----
2018-06-21 18:54:02 UTC - William Fry: i wanted to lock it down more than the 
example code did
----
2018-06-21 18:54:39 UTC - William Fry: found out I mixed using 
`aws_security_group_rule` and the inline `egress` which causes failures
----
2018-06-21 18:58:45 UTC - Sijie Guo: gotcha!
----
2018-06-21 18:59:13 UTC - Sijie Guo: hope adding that back will address your 
problem
----
2018-06-21 19:06:11 UTC - William Fry: trying it out now
+1 : Sijie Guo
----
2018-06-21 19:10:31 UTC - William Fry: ok, different error but past that 
problem :sweat_smile:
----
2018-06-21 19:11:12 UTC - William Fry: when creating a producer, I get this:
```
19:09:55.902 [0x7fcf15a05740] INFO  
/pulsar/pulsar-client-cpp/lib/ConnectionPool.cc(53) - Deleting stale connection 
from pool for 
<pulsar://internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com:6650> 
use_count: -1 @ 0
19:09:55.902 [0x7fcf15a05740] INFO  
/pulsar/pulsar-client-cpp/lib/ConnectionPool.cc(63) - Created connection for 
<pulsar://internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com:6650>
19:09:55.903 [0x7fcf0c513700] INFO  
/pulsar/pulsar-client-cpp/lib/ClientConnection.cc(279) - [10.0.0.136:54916 
-&gt; 10.0.1.25:6650] Connected to broker
19:09:55.904 [0x7fcf0c513700] INFO  
/pulsar/pulsar-client-cpp/lib/ClientConnection.cc(1132) - [10.0.0.136:54916 
-&gt; 10.0.1.25:6650] Connection closed
19:09:55.904 [0x7fcf0c513700] ERROR 
/pulsar/pulsar-client-cpp/lib/ClientImpl.cc(138) - Error Checking/Getting 
Partition Metadata while creating producer on 
<persistent://sample/foo/default/bar-previews> -- 5
Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in &lt;module&gt;
  File "/usr/local/lib64/python2.7/site-packages/pulsar/__init__.py", line 351, 
in create_producer
    p._producer = self._client.create_producer(topic, conf)
Exception: Pulsar error: ConnectError
```
----
2018-06-21 19:15:45 UTC - William Fry: Looks like it may be from ansible
----
2018-06-21 19:16:00 UTC - William Fry: What triggers ansible to “skip” a task?
----
2018-06-21 19:16:24 UTC - Karthik Palanivelu: @Sijie Guo It successfully 
published the message as non-ssl
----
2018-06-21 19:17:35 UTC - Sijie Guo: which task was skipped? can you telnet 
<http://internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com|internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com>
 6650  to see if you can reach 6650?
----
2018-06-21 19:17:49 UTC - William Fry: sure
----
2018-06-21 19:18:04 UTC - Sijie Guo: okay out side k8s cluster?
----
2018-06-21 19:18:33 UTC - William Fry: and two tasks: `Initialize cluster 
metadata` skipped 1 of 2 Zookeeper instances and `Create default property and 
namespace` skipped 1 of 1 Pulsar instances
----
2018-06-21 19:19:16 UTC - Sijie Guo: initialize should only be done once, if 
you already initialized, you don’t need initialize it again. otherwise it will 
erase all the metadata.
----
2018-06-21 19:19:27 UTC - Sijie Guo: same for “create default property and 
namespace”
----
2018-06-21 19:21:09 UTC - William Fry: for telnet, the following was the trace:
```
Trying 10.0.1.25...
Connected to 
<http://internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com|internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com>.
Escape character is '^]'.
Connection closed by foreign host.
```
----
2018-06-21 19:21:19 UTC - William Fry: (it was immediately closed, but that 
seems correct?)
----
2018-06-21 19:22:35 UTC - Sijie Guo: this sounds like some issues
----
2018-06-21 19:22:50 UTC - Sijie Guo: I am not sure it is network permissions or 
on the broker. can you check the broker logs?
----
2018-06-21 19:23:59 UTC - William Fry: It looks like it’s the broker. All the 
network perms look fine now.
----
2018-06-21 19:24:52 UTC - William Fry: Broker is on pulsar instance right?
----
2018-06-21 19:25:02 UTC - Sijie Guo: yes
----
2018-06-21 19:25:24 UTC - Sijie Guo: can you check the logs
----
2018-06-21 19:25:24 UTC - William Fry: on pulsar there are two errors:
```
 19:18:14 ip-10-0-1-10 pulsar: 19:18:14.845 [main-SendThread(10.0.1.146:2181)] 
INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server 
10.0.1.146/10.0.1.146:2181. Will not attempt to authenticate using SASL 
(unknown error)
Jun 21 19:18:14 ip-10-0-1-10 pulsar: 19:18:14.846 
[main-SendThread(10.0.1.146:2181)] INFO  org.apache.zookeeper.ClientCnxn - 
Socket connection established to 10.0.1.146/10.0.1.146:2181, initiating session
Jun 21 19:18:14 ip-10-0-1-10 pulsar: 19:18:14.848 
[main-SendThread(10.0.1.146:2181)] INFO  org.apache.zookeeper.ClientCnxn - 
Unable to read additional data from server sessionid 0x0, likely server has 
closed socket, closing socket connection and attempting reconnect
Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.381 
[main-SendThread(10.0.1.230:2181)] INFO  org.apache.zookeeper.ClientCnxn - 
Opening socket connection to server 10.0.1.230/10.0.1.230:2181. Will not 
attempt to authenticate using SASL (unknown error)
Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.482 [main-EventThread] INFO  
org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x0
Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.483 [main] INFO  
org.apache.zookeeper.ZooKeeper - Session: 0x0 closed
Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.484 [main] ERROR 
org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase - Failed to create zookeeper 
client to 10.0.1.230:2181,10.0.1.146:2181
Jun 21 19:18:16 ip-10-0-1-10 pulsar: 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
```
----
2018-06-21 19:25:42 UTC - William Fry: and ```
Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.497 [main] ERROR 
org.apache.pulsar.PulsarBrokerStarter - Uncaught exception in thread main: 
Failed to initialize metadata bookie driver
Jun 21 19:18:16 ip-10-0-1-10 pulsar: 
org.apache.bookkeeper.bookie.BookieException$MetadataStoreException: Failed to 
initialize metadata bookie driver
Jun 21 19:18:16 ip-10-0-1-10 pulsar: at 
org.apache.bookkeeper.bookie.Bookie.instantiateMetadataDriver(Bookie.java:945) 
~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0]
Jun 21 19:18:16 ip-10-0-1-10 pulsar: at 
org.apache.bookkeeper.bookie.Bookie.&lt;init&gt;(Bookie.java:626) 
~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0]
Jun 21 19:18:16 ip-10-0-1-10 pulsar: at 
org.apache.bookkeeper.proto.BookieServer.newBookie(BookieServer.java:115) 
~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0]
Jun 21 19:18:16 ip-10-0-1-10 pulsar: at 
org.apache.bookkeeper.proto.BookieServer.&lt;init&gt;(BookieServer.java:96) 
~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0]
Jun 21 19:18:16 ip-10-0-1-10 pulsar: at 
org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.&lt;init&gt;(PulsarBrokerStarter.java:198)
 
~[org.apache.pulsar-pulsar-broker-2.0.0-rc1-incubating.jar:2.0.0-rc1-incubating]
Jun 21 19:18:16 ip-10-0-1-10 pulsar: at 
org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:274) 
~[org.apache.pulsar-pulsar-broker-2.0.0-rc1-incubating.jar:2.0.0-rc1-incubating]
Jun 21 19:18:16 ip-10-0-1-10 pulsar: Caused by: 
org.apache.bookkeeper.meta.exceptions.MetadataException: Failed to create 
zookeeper client to 10.0.1.230:2181,10.0.1.146:2181
Jun 21 19:18:16 ip-10-0-1-10 pulsar: at 
org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase.initialize(ZKMetadataDriverBase.java:216)
 ~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0]
Jun 21 19:18:16 ip-10-0-1-10 pulsar: at 
org.apache.bookkeeper.meta.zk.ZKMetadataBookieDriver.initialize(ZKMetadataBookieDriver.java:60)
 ~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0]
Jun 21 19:18:16 ip-10-0-1-10 pulsar: at 
org.apache.bookkeeper.bookie.Bookie.instantiateMetadataDriver(Bookie.java:935) 
~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0]
Jun 21 19:18:16 ip-10-0-1-10 pulsar: ... 5 more
Jun 21 19:18:16 ip-10-0-1-10 pulsar: Caused by: 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
```
----
2018-06-21 19:25:54 UTC - William Fry: they may be related by connection loss?
----
2018-06-21 19:25:54 UTC - Karthik Palanivelu: Which port you want me to check, 
abive code used client.conf which used webServiceUrl=<http://broker:9101/>..
----
2018-06-21 19:26:03 UTC - Sijie Guo: it seems that broker cannot connect to 
those two machines
----
2018-06-21 19:26:31 UTC - Sijie Guo: can you do following:

on the broker machine, try telnet 10.0.1.230 2181
----
2018-06-21 19:26:42 UTC - Sijie Guo: and telnet 10.0.1.146 2181
----
2018-06-21 19:26:46 UTC - Sijie Guo: to see if you can telnet
----
2018-06-21 19:28:16 UTC - Sijie Guo: <pulsar://broker:&lt;broker-port&gt;> 
(e.g. <pulsar://broker:6650/>)
----
2018-06-21 19:29:05 UTC - William Fry: 10.0.1.230 has crashed
----
2018-06-21 19:29:13 UTC - William Fry: but i connect to 10.0.1.146
----
2018-06-21 19:29:19 UTC - William Fry: without it being immediately closed
----
2018-06-21 19:29:28 UTC - Sijie Guo: cool
----
2018-06-21 19:29:56 UTC - Sijie Guo: how many zookeeper do you have? can you 
check their logs? especially the crahsed one
----
2018-06-21 19:29:57 UTC - William Fry: 2
----
2018-06-21 19:30:08 UTC - William Fry: it looks like 10.0.1.146 keeps trying to 
connect to 10.0.1.230
----
2018-06-21 19:30:23 UTC - William Fry: which yields ```Jun 21 19:28:19 
ip-10-0-1-146 pulsar: 19:28:19.782 [QuorumPeer[myid=1]/0.0.0.0:2181] WARN  
org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 0 
at election address /10.0.1.230:3888
```
----
2018-06-21 19:30:51 UTC - William Fry: I don’t know what happened to 10.0.1.230
----
2018-06-21 19:30:54 UTC - William Fry: I can’t ssh into it
----
2018-06-21 19:31:03 UTC - William Fry: something strange happened to the box
----
2018-06-21 19:31:32 UTC - William Fry: will 10.0.1.146 eventually give up 
trying to connect and elect itself?
----
2018-06-21 19:31:47 UTC - Sijie Guo: oh so you need odd number of zookeeper 
machines since zookeeper requires majoritity for consensus
----
2018-06-21 19:31:50 UTC - Joe Francis: <!channel> We are organizing a Pulsar 
1/2 day conference on the afternoon of July 10th, at Oath/Yahoo campus in 
Sunnyvale, CA. If anyone using Pulsar is interested in talking about their use 
case or experience in using Pulsar, we will be quite happy to host your 
presentation along with other tech and deep dive  talks.
+1 : Matteo Merli, Sijie Guo, Sanjeev Kulkarni, Ali Ahmed, Lewis Kaneshiro
bananadance : Sijie Guo, Ali Ahmed, Lewis Kaneshiro
----
2018-06-21 19:31:55 UTC - William Fry: …lol that makes sense
----
2018-06-21 19:32:12 UTC - Sijie Guo: so I think what you can do here
----
2018-06-21 19:32:21 UTC - Sijie Guo: since 10.0.1.230 is down,
----
2018-06-21 19:32:51 UTC - Sijie Guo: you can update the zookeeper conf on 
10.0.1.146 to use one zookeeper server or just enable standalone mode
----
2018-06-21 19:33:06 UTC - Sijie Guo: so in that way you can function as one 
zookeeper server
----
2018-06-21 19:33:21 UTC - Sijie Guo: however the downside is if zookeeper is 
down, you will become unavailable.
----
2018-06-21 19:38:49 UTC - Karthik Palanivelu: @Sijie Guo It is Connection Time 
Out on the Producer Client. But I can see from the logs of Broker, connection 
is open. Once I disconnect the client, I see the below log - `15:35:15.024 
[pulsar-io-21-1] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed 
connection from /IP-CLIENT:40989`
----
2018-06-21 19:39:25 UTC - William Fry: Any plans for an NYC conf :smile:
+1 : John Crawford, Ali Ahmed, Matteo Merli, Sijie Guo
----
2018-06-21 19:39:54 UTC - Karthik Palanivelu: Which means there is some issue 
in k8s.
----
2018-06-21 19:40:13 UTC - Karthik Palanivelu: I am new to k8s, seeking your help
----
2018-06-21 19:41:22 UTC - Karthik Palanivelu: It seems like I need proxy and 
nodeport on Broker will not work. Please correct my understanding
----
2018-06-21 19:45:08 UTC - William Fry: gotcha
----
2018-06-21 19:45:13 UTC - William Fry: ok I have three up
----
2018-06-21 19:45:18 UTC - William Fry: I’m still getting the `Error 
Checking/Getting Partition Metadata while creating producer` error
----
2018-06-21 19:45:33 UTC - William Fry: but there are no warnings or errors in 
the bookie logs
----
2018-06-21 19:45:33 UTC - William Fry: I’ll try telnet’ing?
----
2018-06-21 19:46:47 UTC - Sijie Guo: what version of k8s are you using? and 
type of network are you using? (I know there was some issues about nodePort 
support on some type of network. didn’t have the names right on my head).
----
2018-06-21 19:46:49 UTC - William Fry: ok, I can telnet to all of the Zookeeper 
instances
----
2018-06-21 19:47:02 UTC - Sijie Guo: I mean k8s network
----
2018-06-21 19:49:34 UTC - William Fry: Now when I connect, the producer hangs 
at `Created connection for ...`
----
2018-06-21 19:49:43 UTC - William Fry: And on the bookie logs, I see the below:
----
2018-06-21 19:50:39 UTC - William Fry: ```
Jun 21 19:48:28 ip-10-0-1-109 pulsar: 19:48:28.358 [pulsar-io-49-1] INFO  
org.apache.pulsar.broker.service.ServerCnx - New connection from 
/10.0.1.24:47058
Jun 21 19:48:45 ip-10-0-1-109 pulsar: 19:48:45.433 [pulsar-load-manager-4-1] 
INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Only 1 
broker available: no load shedding will be performed
Jun 21 19:49:03 ip-10-0-1-109 pulsar: 19:49:03.020 
[pulsar-ordered-OrderedExecutor-0-0] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger 
sample/foo/default/persistent/bar
```
----
2018-06-21 19:50:47 UTC - Sijie Guo: okay so it seems that you have the cluster 
up running
----
2018-06-21 19:51:14 UTC - William Fry: Any idea why the producer would just 
hang?
----
2018-06-21 19:51:38 UTC - Sijie Guo: are you using 2.0?
----
2018-06-21 19:52:01 UTC - William Fry: on the pulsar instance but not the client
----
2018-06-21 19:52:17 UTC - William Fry: scratch that, also on the client
----
2018-06-21 19:52:57 UTC - Sijie Guo: okay so you can change the topic name to 
<persistent://sample/foo/default/bar_previews> or just plain “bar_previews”
----
2018-06-21 19:53:12 UTC - Sijie Guo: so this plain “bar-previews” will go to a 
public default namespace
----
2018-06-21 19:53:39 UTC - William Fry: yep
----
2018-06-21 19:53:54 UTC - William Fry: did that and get the following:
```
19:53:32.086 [0x7f8df8b50700] INFO  
/pulsar/pulsar-client-cpp/lib/HandlerBase.cc(53) - 
[<persistent://public/default/profile-previews>, ] Getting connection from pool
19:53:32.114 [0x7f8df8b50700] INFO  
/pulsar/pulsar-client-cpp/lib/ConnectionPool.cc(53) - Deleting stale connection 
from pool for <pulsar://10.0.1.109:6650> use_count: -1 @ 0
19:53:32.114 [0x7f8df8b50700] INFO  
/pulsar/pulsar-client-cpp/lib/ConnectionPool.cc(63) - Created connection for 
<pulsar://10.0.1.109:6650>
```
----
2018-06-21 19:54:40 UTC - William Fry: On the bookie instance, I see the 
following as a result:
```
Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.097 
[pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO  
org.apache.pulsar.broker.cache.LocalZooKeeperCacheService - Successfully 
copyied bundles data to local zk at /admin/local-policies/public/default
Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.097 
[pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO  
org.apache.pulsar.broker.cache.LocalZooKeeperCacheService - Successfully 
created local policies for /admin/local-policies/public/default -- 
Optional[org.apache.pulsar.common.policies.data.LocalPolicies@1e7ab195]
Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.102 [pulsar-1-15] INFO  
org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers 
being considered for assignment of public/default/0xc0000000_0xd0000000
Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.103 [pulsar-1-15] INFO  
org.apache.pulsar.broker.namespace.OwnershipCache - Trying to acquire ownership 
of public/default/0xc0000000_0xd0000000
Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.113 
[pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO  
org.apache.pulsar.broker.namespace.OwnershipCache - Successfully acquired 
ownership of /namespace/public/default/0xc0000000_0xd0000000
Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.114 [pulsar-1-1] INFO  
org.apache.pulsar.broker.PulsarService - Loading all topics on bundle: 
public/default/0xc0000000_0xd0000000
Jun 21 19:53:45 ip-10-0-1-109 pulsar: 19:53:45.433 [pulsar-load-manager-4-1] 
INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Only 1 
broker available: no load shedding will be performed
```
----
2018-06-21 19:56:23 UTC - William Fry: The command also doesn’t return, it 
hangs and I can’t cancel it with `^c`. Just shows those log messages below.
----
2018-06-21 19:58:08 UTC - Sijie Guo: can you do me a favor?
----
2018-06-21 19:58:17 UTC - Sijie Guo: can you go to the broker machine
----
2018-06-21 19:58:39 UTC - Sijie Guo: run `bin/pulsar-client produce -m "test 
message" test-topic` in the pulsar directory
----
2018-06-21 20:00:05 UTC - Karthik Palanivelu: Mine uses Calico
----
2018-06-21 20:00:59 UTC - William Fry: yep
----
2018-06-21 20:01:21 UTC - Sijie Guo: it succeed?
----
2018-06-21 20:01:24 UTC - William Fry: btw, after hanging for a long time, this 
is the final response of creating a producer with the python client 
`19:57:53.354 [0x7f8df8b50700] ERROR 
/pulsar/pulsar-client-cpp/lib/ClientConnection.cc(323) - [&lt;none&gt; -&gt; 
<pulsar://10.0.1.109:6650>] Failed to establish connection: Connection timed out
`
----
2018-06-21 20:01:32 UTC - William Fry: running now, one sec
----
2018-06-21 20:01:58 UTC - William Fry: failed
----
2018-06-21 20:02:23 UTC - William Fry: bunch of errors
----
2018-06-21 20:03:08 UTC - William Fry: ` ERROR 
org.apache.pulsar.client.impl.ProducerImpl - [test-topic] [null] Failed to 
create producer: org.apache.bookkeeper.mledger.ManagedLedgerException: Not 
enough non-faulty bookies available`
----
2018-06-21 20:03:22 UTC - Sijie Guo: okay
----
2018-06-21 20:03:29 UTC - Sijie Guo: so you have 1 bookie right?
----
2018-06-21 20:03:37 UTC - William Fry: 1 bookie, 3 zookeepers
----
2018-06-21 20:03:52 UTC - Sijie Guo: # Number of bookies to use when creating a 
ledger
managedLedgerDefaultEnsembleSize=2

# Number of copies to store for each message
managedLedgerDefaultWriteQuorum=2

# Number of guaranteed copies (acks to wait before write is complete)
managedLedgerDefaultAckQuorum=2
----
2018-06-21 20:04:16 UTC - William Fry: do those need to be changed to 1?
----
2018-06-21 20:04:18 UTC - Sijie Guo: on the broker node, can you update 
broker.conf to set the above settings to 1? or can you add another bookie
----
2018-06-21 20:05:14 UTC - Sijie Guo: yeah. by default it is 2
----
2018-06-21 20:06:26 UTC - William Fry: yep, updated to 1
----
2018-06-21 20:06:30 UTC - William Fry: should i restart pulsar?
----
2018-06-21 20:06:34 UTC - William Fry: or does it watch the conf files?
----
2018-06-21 20:07:37 UTC - William Fry: yayyyy
----
2018-06-21 20:07:44 UTC - William Fry: the `bin/pulsar` command worked
----
2018-06-21 20:08:02 UTC - William Fry: the python client still hangs
----
2018-06-21 20:14:51 UTC - Sijie Guo: @Karthik Palanivelu I think there are some 
issues reported around Calico with nodePort in k8s. 
<https://github.com/kubernetes/kubernetes/issues/58908> I am not sure if that 
is the exact problem, but it might be worth trying out this suggestions in the 
PR.
----
2018-06-21 20:15:18 UTC - Sijie Guo: yeah you need to restart pulsar.
----
2018-06-21 20:15:31 UTC - Sijie Guo: if bin/pulsar command worker, that mean 
pulsar service is fine.
----
2018-06-21 20:15:42 UTC - Robin Bartholdson: Will be recorded I hope? :relaxed:
----
2018-06-21 20:15:50 UTC - William Fry: yeah I restarted it with `systemctl` but 
the python client still is failing
----
2018-06-21 20:15:59 UTC - Sijie Guo: so the problem would be the connection 
between your python client and pulsar broker.
----
2018-06-21 20:16:32 UTC - Sijie Guo: can you do one more thing, can you copy 
your python application to the broker server and run the python application 
there
----
2018-06-21 20:16:54 UTC - Sijie Guo: I am try to isolate the problem whether it 
is a problem of accessing network out side of vpc
----
2018-06-21 20:16:59 UTC - William Fry: sure
----
2018-06-21 20:18:51 UTC - William Fry: and use localhost or still use the load 
balancer?
----
2018-06-21 20:19:33 UTC - Sijie Guo: Use localhost 
----
2018-06-21 20:20:24 UTC - William Fry: worked
----
2018-06-21 20:20:55 UTC - Sijie Guo: So it seems that problem comes from load 
balancer 
----
2018-06-21 20:21:12 UTC - William Fry: yeah, because it also works using the 
load balancer from the pulsar instance…
----
2018-06-21 20:21:28 UTC - Sijie Guo: I see.
----
2018-06-21 20:21:54 UTC - Sijie Guo: you might want to check the aws 
loadbalancer configuration
----
2018-06-21 20:22:00 UTC - Sijie Guo: it might show some health status for you
----
2018-06-21 20:22:30 UTC - William Fry: yeah it’s healthy
----
2018-06-21 20:22:47 UTC - Sijie Guo: okay
----
2018-06-21 20:22:49 UTC - William Fry: I thought if you used a load balancer, 
you didn’t also need access to the other instance as long as the load balancer 
had access to the other instance
----
2018-06-21 20:23:14 UTC - Sijie Guo: yes that should be the case
----
2018-06-21 20:23:41 UTC - William Fry: ok, now it’s working from the other 
instance
----
2018-06-21 20:23:45 UTC - William Fry: despite my doing nothing…
----
2018-06-21 20:23:57 UTC - Sijie Guo: interesting
----
2018-06-21 20:24:01 UTC - William Fry: testing out a local client shouldn’t 
have changed anything
----
2018-06-21 20:24:04 UTC - William Fry: right?
----
2018-06-21 20:24:08 UTC - Sijie Guo: so it should like a loadbalancer issue
----
2018-06-21 20:24:18 UTC - Sijie Guo: yeah it shouldn’t change
----
2018-06-21 20:24:43 UTC - Sijie Guo: not very familiar with aws loadbalancer
----
2018-06-21 20:24:52 UTC - William Fry: yeah I didn’t touch anything there
----
2018-06-21 20:24:59 UTC - William Fry: re-running terraform to see if I did 
without meaning to
----
2018-06-21 20:25:13 UTC - Sijie Guo: sgtm
----
2018-06-21 20:26:51 UTC - William Fry: ok, got it. The other instance needs 
access not only to the load balancer but also to pulsar
----
2018-06-21 20:26:52 UTC - William Fry: bizarre
----
2018-06-21 20:27:03 UTC - William Fry: that seems counter-intuitive
----
2018-06-21 20:27:31 UTC - Sijie Guo: okay
----
2018-06-21 20:27:36 UTC - Sijie Guo: so everything worked out?
----
2018-06-21 20:28:28 UTC - William Fry: yeah, I’m just stumped why we would need 
access to the pulsar instance
----
2018-06-21 20:28:36 UTC - William Fry: I thought that the load balancer funnels 
traffic to and from
----
2018-06-21 20:28:58 UTC - William Fry: maybe it just selects the ideal machine 
to use and then connection takes place directly between the machines?
----
2018-06-21 20:29:12 UTC - Sijie Guo: oh i see
----
2018-06-21 20:29:32 UTC - Sijie Guo: let me explain
----
2018-06-21 20:33:25 UTC - Sijie Guo: so broker is running in such a mode - 
client talks to any broker thru a load balancer, and ask the broker I want to 
publish/consume messages to this topic. if the broker is the owner of this 
topic, it serves the connection; if it isn’t, the broker tell the client who is 
the right broker and the client redirect and talk to that broker directly.  if 
this happens, the client talks to broker directly, which bypass the load 
balancer. so you need pulsar instance access for this case.

however there is another service instance, called “proxy”, what is does is a 
“proxy” that proxy the connections to the real broker. in that case, you just 
need use loadbalancer.

so in a production deployment, you might consider put up proxies in front of 
brokers and let loadbalancer load balancing among proxies.

hope this clears your question
----
2018-06-21 20:34:08 UTC - William Fry: ah that makes perfect sense
----
2018-06-21 20:34:15 UTC - William Fry: are there issues with the “proxy” right 
now?
----
2018-06-21 20:34:21 UTC - William Fry: I saw chatter about it in 
<#C5Z4T36F7|general>
----
2018-06-21 20:35:56 UTC - Sijie Guo: I think there are some issues around tls 
authentication enabled in proxy. we are fixing it and improving the 
documentation. without TLS or only TLS on proxy, everything should work fine
----
2018-06-21 20:41:41 UTC - William Fry: ok perfect
----
2018-06-21 20:41:55 UTC - William Fry: one last question, why do I need to 
bypass the load balancer if there is only one broker?
----
2018-06-21 20:42:17 UTC - William Fry: In that case, there’s no chance that I 
connect to a broker that doesn’t own a topic. Right?
----
2018-06-21 20:42:37 UTC - William Fry: Or is it that the load balancer balances 
requests to different brokers but any subsequent connection is _always_ direct?
----
2018-06-21 20:42:59 UTC - William Fry: What’s the purpose of the load balancer 
when not using the “proxy” service?
----
2018-06-21 20:44:32 UTC - William Fry: It looks like this is never a problem 
with HTTP requests to the API
----
2018-06-21 20:47:05 UTC - Sijie Guo: oh loadbalancer in the broker mode is more 
used as DNS for finding a broker
----
2018-06-21 20:47:09 UTC - Sijie Guo: if you only have one broker
----
2018-06-21 20:47:13 UTC - Sijie Guo: you don’t need loadbalancer
----
2018-06-21 21:53:22 UTC - William Fry: Can there be an even number of pulsar 
instances?
----
2018-06-21 21:53:28 UTC - William Fry: Or do you also need an odd number?
----
2018-06-21 21:55:25 UTC - Sijie Guo: pulsar instances can be any number
----
2018-06-21 21:55:32 UTC - William Fry: nice
----
2018-06-21 22:56:24 UTC - Joe Francis: Yes, will be recorded.
----
2018-06-22 00:10:02 UTC - Matteo Merli: @Matteo Merli pinned a message to this 
channel.
----
2018-06-22 06:43:17 UTC - Vasily Yanov: Hi!
----
2018-06-22 06:43:26 UTC - Vasily Yanov: One more downtime....
----
2018-06-22 06:43:44 UTC - Vasily Yanov: tried to restart broker and got this in 
logs:
----
2018-06-22 06:43:50 UTC - Vasily Yanov: ```
2018-06-22 06:38:58,641 - WARN  
[BookKeeperClientWorker-24-1:LedgerHandle$NoopCloseCallback@1189] - Close 
failed: Error while using ZooKeeper
2018-06-22 06:38:58,648 - ERROR 
[BookKeeperClientWorker-24-1:LedgerHandle$2$1CloseCb@405] - Error update ledger 
metadata for ledger 10752446 : -9
2018-06-22 06:38:58,648 - WARN  
[BookKeeperClientWorker-24-1:LedgerHandle$NoopCloseCallback@1189] - Close 
failed: Error while using ZooKeeper
2018-06-22 06:38:58,648 - ERROR 
[BookKeeperClientWorker-18-1:LedgerHandle$2$1CloseCb@405] - Error update ledger 
metadata for ledger 10752488 : -9
2018-06-22 06:38:58,648 - WARN  
[BookKeeperClientWorker-18-1:LedgerHandle$NoopCloseCallback@1189] - Close 
failed: Error while using ZooKeeper
2018-06-22 06:38:58,648 - INFO  
[BookKeeperClientWorker-21-1:LedgerHandle$ReReadLedgerMetadataCb@1100] - 
Resolve ledger metadata conflict while changing ensemble to: 
[144.76.185.49:3181, 148.251.181.171:3181, 136.243.34.224:3181], old meta data 
is
 (meta:BookieMetadataFormatVersion      2
quorumSize: 3
ensembleSize: 3
length: 0
lastEntryId: -1
state: OPEN
segment {
  ensembleMember: "144.76.185.49:3181"
  ensembleMember: "148.251.181.171:3181"
  ensembleMember: "136.243.34.224:3181"
  firstEntryId: 0
}
digestType: CRC32
password: ""
ackQuorumSize: 2
, version:1)
, new meta data is
 (meta:BookieMetadataFormatVersion      2
quorumSize: 3
ensembleSize: 3
length: 0
lastEntryId: -1
state: OPEN
segment {
  ensembleMember: "144.76.185.49:3181"
  ensembleMember: "188.40.124.119:3181"
  ensembleMember: "136.243.34.224:3181"
  firstEntryId: 0
}
digestType: CRC32
password: ""
ackQuorumSize: 2
, version:1).
2018-06-22 06:38:58,649 - WARN  
[main-EventThread:AbstractZkLedgerManager$4@407] - Conditional update ledger 
metadata failed: NONODE
2018-06-22 06:38:58,649 - ERROR 
[BookKeeperClientWorker-23-1:LedgerHandle$2$1CloseCb@405] - Error update ledger 
metadata for ledger 10752445 : -9
2018-06-22 06:38:58,649 - WARN  
[BookKeeperClientWorker-23-1:LedgerHandle$NoopCloseCallback@1189] - Close 
failed: Error while using ZooKeeper
```
----
2018-06-22 06:44:29 UTC - Vasily Yanov: in ZK logs a lot of messages with INFO 
level:
```
Got user-level KeeperException when processing sessionid:0x300b72d5dc10005 
type:setData cxid:0x121b1 zxid:0x20008dd4c txntype:-1 reqpath:n/a Error 
Path:/ledgers/00/1075/L5594 Error:KeeperErrorCode = BadVersion for 
/ledgers/00/1075/L5594
```
----
2018-06-22 06:50:15 UTC - Ali Ahmed: @Vasily Yanov did the cluster topology 
change ?
----
2018-06-22 06:57:56 UTC - Sijie Guo: @Vasily Yanov BadVersion means there are 
concurrent metadata updates. NONODE / -9 sounds like topic is delete while some 
other brokers attempts to update metadata. but these logging/warnings should 
only happen temporily, everything should be back to normal after restarts, does 
it?
----
2018-06-22 08:43:08 UTC - Vasily Yanov: @Ali Ahmed sorry was on meeting. No 
topology changes
----
2018-06-22 08:44:59 UTC - Vasily Yanov: @Sijie Guo I performed full pulsar 
reset because restart broker or zookeeper or bookie didn't fix the issue
----

Slack digest for #general - 2018-06-22

Reply via email to