2018-06-21 09:25:03 UTC - Karthik Palanivelu: Thanks Ivan, then what is the best possible way to make pulsar work k8s using TLS, I am stuck here for a week. Any help is appreciated. ---- 2018-06-21 10:05:56 UTC - Ivan Kelly: do you need TLS between the proxy and the broker? ---- 2018-06-21 10:06:12 UTC - Ivan Kelly: or just to the proxy? ---- 2018-06-21 10:12:48 UTC - Karthik Palanivelu: I need TLS between Client and Proxy for authentication. Then I need the authorization happen between client and broker for multi tenant behavior. Please advise me how to get this going ---- 2018-06-21 10:16:53 UTC - Ivan Kelly: I don't think this does work right now. as I said, we're going to look into it this week, as its quite a glaring flaw for multitenancy ---- 2018-06-21 10:17:41 UTC - Karthik Palanivelu: Without this feature selling the product would be tough in my company ---- 2018-06-21 10:18:02 UTC - Ivan Kelly: yup, noted. ---- 2018-06-21 10:18:37 UTC - Karthik Palanivelu: I can do this in containers locally but with proxy in between is a challenge for me. ---- 2018-06-21 10:19:34 UTC - Ivan Kelly: it works fine with broker directly ---- 2018-06-21 10:19:39 UTC - Karthik Palanivelu: Is there an option I can expose the broker port externally using nodeport by excluding proxy...it should work for multitenancy right? ---- 2018-06-21 10:19:54 UTC - Ivan Kelly: the problem is the proxy-broker link ---- 2018-06-21 10:20:12 UTC - Ivan Kelly: yes, but you would have to expose all brokers ---- 2018-06-21 10:20:38 UTC - Ivan Kelly: i think. I'm not 100% sure. I need to check how service discovery is being used ---- 2018-06-21 10:22:02 UTC - Karthik Palanivelu: Sure can you post me some sample and limitations of this approach. Even proxy has a limitation that I can run only one proxy per node. If I have 3 workers then I can run only 3 proxy. ---- 2018-06-21 10:23:31 UTC - Karthik Palanivelu: I tried that option but getting connection timed out exception, I posted the logs in a direct message to @Matteo Merli, let me forward you as well ---- 2018-06-21 10:24:17 UTC - Ivan Kelly: I don't have any k8s samples I'm afraid, I've been trying this with full AWS instances ---- 2018-06-21 10:33:28 UTC - Ivan Kelly: you should put a watch on that issue though, I'll likely get to it later today, and will be updating it with my findings ---- 2018-06-21 11:46:26 UTC - Nikita Tomilov: @Nikita Tomilov has joined the channel ---- 2018-06-21 12:04:25 UTC - Nikita Tomilov: Hello everyone We've encountered a problem with receiving messages from non-persistent topics The issue is here, please, take a look <https://github.com/apache/incubator-pulsar/issues/2009> Thanks ---- 2018-06-21 13:25:09 UTC - Ryan Rose: Hi all! Is the golang client in the master branch operational? I'm having trouble with missing C dependencies. Thanks! ---- 2018-06-21 14:59:05 UTC - William Fry: Has anyone run into a `BrokenPersistenceError`? Looks like I can connect to the broker but then receive the following response `11 -- req_id: 0` ---- 2018-06-21 15:00:15 UTC - Sijie Guo: @Nikita Tomilov regex subscription only works for persistence. It has been implemented for non persistent. Due to the way how non pesistence is architected, I am not sure it is easy to support, will have to take a deeper look. ---- 2018-06-21 15:03:59 UTC - Sijie Guo: @Ryan Rose I believe the go client should be operational. Bunch of the people have tried it out. What is your environment? It is cgo wrapped client, and the documentation is in progress on being merged. That would provide some instructions, if you want to check out the documentation, go to pulsar github pull requests and find the PR called go client docs. It should have all the instructions about how to build and use it ---- 2018-06-21 15:04:51 UTC - Sijie Guo: @William Fry is that the only output? Or do you have more output or logging? ---- 2018-06-21 15:18:45 UTC - Ryan Rose: Excellent, that's where I'll head! Thanks! ---- 2018-06-21 15:22:31 UTC - Sijie Guo: <https://github.com/apache/incubator-pulsar/pull/1838> this is PR for docs ---- 2018-06-21 16:17:42 UTC - Karthik Palanivelu: @Matteo Merli I followed your instruction to configure client cert to proxy, enabled ssl between proxy-broker and forward the authorization token to broker. As @Ivan Kelly mentioned the I got the Connection Already Closed exception. Log wise broker received the request, authorized the call but closed the connection. ---- 2018-06-21 16:46:54 UTC - Karthik Palanivelu: @Matteo Merli Follow up finding is I made the proxy to communicate with broker on nonssl by turning off `tlsEnabledWithBroker=false` and `forwardAuthorizationCredentials=true`. This time call did not even reach broker. On client I get Connection already Closed exception. ---- 2018-06-21 16:57:22 UTC - Daniel Ferreira Jorge: I'm still a little confused about the inner workings of message de-duplication (I read many times the cookbook, concepts and architecture, and articles at streamlio regarding that topic). Upon reading the cookbook, it gives the impression that broker configuration is the only thing necessary to ensure that a message is *accepted by the broker* only once. So, if I have a *non-idempotent* producer that publishes the same message twice, and I take the steps outlined in the cookbook, the broker will reject the second message. How exactly will the broker know that the second message is a duplicate? Some sort of checksum? ---- 2018-06-21 17:07:52 UTC - William Fry: @Sijie Guo essentially it fails with these four lines whenever I try to create a producer or consumer: ``` 17:08:49.514 [0x7f17fb90c700] INFO /pulsar/pulsar-client-cpp/lib/HandlerBase.cc(53) - [<persistent://sample/foo/default/bars>, ] Getting connection from pool 17:08:49.519 [0x7f17fb90c700] WARN /pulsar/pulsar-client-cpp/lib/ClientConnection.cc(846) - [10.0.0.136:34296 -> 10.0.1.248:6650] Received error response from server: 11 -- req_id: 2 17:08:49.519 [0x7f17fb90c700] ERROR /pulsar/pulsar-client-cpp/lib/ProducerImpl.cc(212) - [<persistent://sample/foo/default/bar>, ] Failed to create producer: BrokerPersistenceError 17:08:49.519 [0x7f17fb90c700] INFO /pulsar/pulsar-client-cpp/lib/ProducerImpl.cc(415) - Producer - [<persistent://sample/foo/default/bar>, ] , [batching = off] ``` ---- 2018-06-21 17:11:36 UTC - William Fry: It looks like on the pulsar instance, this is the stacktrace: `Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available` ---- 2018-06-21 17:13:46 UTC - William Fry: On the zookeeper instance, this is the stacktrace `INFO org.apache.zookeeper.server.PrepRequestProcessor - Got user-level KeeperException when processing sessionid:0x164230a14550001 type:create cxid:0x1 zxid:0x10 txntype:-1 reqpath:n/a Error Path:/admin Error:KeeperErrorCode = NoNode for /admin` ---- 2018-06-21 17:25:08 UTC - Ivan Kelly: dedupe requires that the producer sends some sort of monotonically increasing sequence number ---- 2018-06-21 17:26:41 UTC - Ivan Kelly: <http://pulsar.apache.org/api/client/org/apache/pulsar/client/api/TypedMessageBuilder.html#sequenceId-long-> ---- 2018-06-21 17:26:49 UTC - Karthik Palanivelu: @Karthik Palanivelu uploaded a file: <https://apache-pulsar.slack.com/files/U7VRE0Q1G/FBCT07XGF/-.txt|Untitled> and commented: Team, how do I get my pulsar working in k8s till proxy is fixed. I created a nodeport for broker to expose. But I am getting above exception. This is non-ssl. Can you please help me here? ---- 2018-06-21 17:26:55 UTC - Ivan Kelly: the idea is that this message id is coming from a log offset or something ---- 2018-06-21 17:27:23 UTC - Ivan Kelly: this is a http request hitting a binary port ---- 2018-06-21 17:28:31 UTC - Sijie Guo: This means there is no enough bookies available in the cluster ---- 2018-06-21 17:28:52 UTC - Sijie Guo: How many bookies do you have? ---- 2018-06-21 17:30:00 UTC - Ivan Kelly: the simplest way to get it to work, is disable tls between the proxy and the broker. you lose authorization, but you still have authentication ---- 2018-06-21 17:36:34 UTC - Daniel Ferreira Jorge: Ok so for the de-duping to work I need always to work with the sequence ids. Thank you @Ivan Kelly ---- 2018-06-21 17:36:46 UTC - William Fry: 1 ---- 2018-06-21 17:36:56 UTC - William Fry: just added a 2nd, is that now required in 2.0.0? ---- 2018-06-21 17:37:00 UTC - William Fry: previously it worked with 1? ---- 2018-06-21 17:38:30 UTC - Karthik Palanivelu: I am not using Proxy here. I disable SSL and using non-ssl trying to hit broker port directly through nodeport. ---- 2018-06-21 17:38:51 UTC - Karthik Palanivelu: My NodePort points to broker now ---- 2018-06-21 17:38:56 UTC - Sijie Guo: are you deploying using standalone or a full cluster? ---- 2018-06-21 17:40:02 UTC - Sijie Guo: it seems that you are sending protobuf requests to http endpoints, which is your serviceUrl for your client? ---- 2018-06-21 17:41:23 UTC - William Fry: full cluster w/ ansible+terraform ---- 2018-06-21 17:42:36 UTC - Sijie Guo: oh did you configure broker settings, like ensembleSize, writeQuorumSize and such ---- 2018-06-21 17:42:49 UTC - Karthik Palanivelu: I am using <pulsar://ELB-NAME:9151(TCP)> -> Node Port 30003 -> Head Less Service 6650 --> broker 6650 ---- 2018-06-21 17:43:07 UTC - Karthik Palanivelu: Broker disabled SSL ---- 2018-06-21 17:45:05 UTC - William Fry: no kept the standards ---- 2018-06-21 17:45:15 UTC - William Fry: I see this message in the Zookeepr logs ```WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 0 at election address /10.0.1.241:3888 Jun 21 17:44:26 ip-10-0-1-52 pulsar: java.net.SocketTimeoutException: connect timed out``` ---- 2018-06-21 17:45:22 UTC - Sijie Guo: i see. that does hit the protobuf port. ---- 2018-06-21 17:46:09 UTC - Karthik Palanivelu: which is protobuf port 6651? ---- 2018-06-21 17:46:37 UTC - Sijie Guo: port 6650, it is speaking protobuf protocol ---- 2018-06-21 17:46:39 UTC - Karthik Palanivelu: Even I tried with hostPort mapping but noi success ---- 2018-06-21 17:46:41 UTC - Sijie Guo: so it is not http port ---- 2018-06-21 17:47:07 UTC - Karthik Palanivelu: Yes I am trying to reach 6650 ---- 2018-06-21 17:47:26 UTC - Sijie Guo: actually can you try ---- 2018-06-21 17:48:08 UTC - Sijie Guo: try go get to a broker pod and try locally? ---- 2018-06-21 17:48:31 UTC - Sijie Guo: sorry I have to step to a meeting. maybe @Matteo Merli can you help you ---- 2018-06-21 17:51:39 UTC - Karthik Palanivelu: You mean on http port? ---- 2018-06-21 18:25:59 UTC - Ivan Kelly: :thumbsup: ---- 2018-06-21 18:29:56 UTC - Sijie Guo: I mean - can you ssh to a broker pod? run `bin/pulsar-client produce -m "test" test-topic` to test if you can publish messages within a k8s cluster. I am trying to isolate whether the problem is from pulsar cluster or from the network out side of k8s ---- 2018-06-21 18:32:38 UTC - Sijie Guo: sounds like a connection issue. ---- 2018-06-21 18:33:02 UTC - Sijie Guo: do you have any connectivities issue between broker and bookies? ---- 2018-06-21 18:34:58 UTC - Matteo Merli: @Daniel Ferreira Jorge If you don’t do anything special, just with the setting in the configuration, the dedup will work within a single producer session.
If your application crashes and comes back, it will be seen as a different producer, so brokers won’t be able to detect eventual dups at that point. If you want to have dedup across producer sessions, you need to : 1. Set producer name (instead of auto-generated name) 2. Use sequence ids tied to application domain ---- 2018-06-21 18:37:16 UTC - Matteo Merli: @Ryan Rose the docs are live at <http://pulsar.apache.org/docs/latest/clients/go/> With 2.1 release we’ll also be able to publish rpm/deb binary directly. ---- 2018-06-21 18:37:45 UTC - Ryan Rose: Awesome, thanks! ---- 2018-06-21 18:42:47 UTC - William Fry: I used to not ---- 2018-06-21 18:42:55 UTC - William Fry: Right now I’m looking into the Zookeeper setup ---- 2018-06-21 18:43:02 UTC - William Fry: if I try to configure everything from scratch ---- 2018-06-21 18:43:08 UTC - William Fry: I find this error on the initial Zookeeper instance ---- 2018-06-21 18:43:09 UTC - William Fry: ```Jun 21 18:42:08 ip-10-0-1-186 pulsar: 18:42:08.946 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN org.apache.zookeeper.server.NIOServerCnxn - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running Jun 21 18:42:08 ip-10-0-1-186 pulsar: 18:42:08.946 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /127.0.0.1:54340 (no session established for client)``` ---- 2018-06-21 18:44:40 UTC - Sijie Guo: you redeploy with ansible? ---- 2018-06-21 18:45:20 UTC - William Fry: Yep ---- 2018-06-21 18:45:27 UTC - William Fry: I’m doing it manually now ---- 2018-06-21 18:45:29 UTC - William Fry: To test each step ---- 2018-06-21 18:45:36 UTC - William Fry: It looks like after the zookeeper service is started ---- 2018-06-21 18:45:40 UTC - William Fry: This is the first set of warnings ---- 2018-06-21 18:45:43 UTC - William Fry: ``` Jun 21 18:44:46 ip-10-0-1-186 pulsar: 18:44:46.799 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /127.0.0.1:54542 Jun 21 18:44:46 ip-10-0-1-186 pulsar: 18:44:46.807 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN org.apache.zookeeper.server.NIOServerCnxn - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running Jun 21 18:44:46 ip-10-0-1-186 pulsar: 18:44:46.807 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /127.0.0.1:54542 (no session established for client) Jun 21 18:44:48 ip-10-0-1-186 pulsar: 18:44:48.561 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /127.0.0.1:54544 Jun 21 18:44:48 ip-10-0-1-186 pulsar: 18:44:48.562 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN org.apache.zookeeper.server.NIOServerCnxn - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running Jun 21 18:44:48 ip-10-0-1-186 pulsar: 18:44:48.562 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /127.0.0.1:54544 (no session established for client) Jun 21 18:44:49 ip-10-0-1-186 pulsar: 18:44:49.910 [WorkerSender[myid=0]] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 1 at election address /10.0.1.230:3888 ``` ---- 2018-06-21 18:46:07 UTC - Sijie Guo: it seems that it can’t connect to ip 10.0.1.230 ---- 2018-06-21 18:46:12 UTC - Sijie Guo: is your ip changed? ---- 2018-06-21 18:46:21 UTC - William Fry: That’s the other zookeeper instance ---- 2018-06-21 18:46:29 UTC - William Fry: Do they need to be able to talk to one another? ---- 2018-06-21 18:47:03 UTC - Sijie Guo: yes they do. zookeeper uses port 3888 for its internal communication (e.g. leader election) ---- 2018-06-21 18:48:46 UTC - William Fry: is the same true of the pulsar instances? ---- 2018-06-21 18:49:54 UTC - Sijie Guo: for pulsar instances, they don’t need to inter instance communication. ---- 2018-06-21 18:50:26 UTC - Sijie Guo: but pulsar brokers need to be able to talk to bookies at 3181, both brokers and bookie need to talk to zookeeper at 2181 ---- 2018-06-21 18:51:06 UTC - Sijie Guo: but I think terraform sets a vpc for the cluster, so those ports should be accessible between instances. are you using a customized terraform script? ---- 2018-06-21 18:53:53 UTC - Ryan Rose: FWIW I fixed it -- was something fishy with the CPP lib install. I'd built it before from source, and reinstalling via the brew script in the new docs fixed it! Much appreciated. +1 : Sijie Guo ---- 2018-06-21 18:53:54 UTC - William Fry: yep ---- 2018-06-21 18:54:02 UTC - William Fry: i wanted to lock it down more than the example code did ---- 2018-06-21 18:54:39 UTC - William Fry: found out I mixed using `aws_security_group_rule` and the inline `egress` which causes failures ---- 2018-06-21 18:58:45 UTC - Sijie Guo: gotcha! ---- 2018-06-21 18:59:13 UTC - Sijie Guo: hope adding that back will address your problem ---- 2018-06-21 19:06:11 UTC - William Fry: trying it out now +1 : Sijie Guo ---- 2018-06-21 19:10:31 UTC - William Fry: ok, different error but past that problem :sweat_smile: ---- 2018-06-21 19:11:12 UTC - William Fry: when creating a producer, I get this: ``` 19:09:55.902 [0x7fcf15a05740] INFO /pulsar/pulsar-client-cpp/lib/ConnectionPool.cc(53) - Deleting stale connection from pool for <pulsar://internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com:6650> use_count: -1 @ 0 19:09:55.902 [0x7fcf15a05740] INFO /pulsar/pulsar-client-cpp/lib/ConnectionPool.cc(63) - Created connection for <pulsar://internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com:6650> 19:09:55.903 [0x7fcf0c513700] INFO /pulsar/pulsar-client-cpp/lib/ClientConnection.cc(279) - [10.0.0.136:54916 -> 10.0.1.25:6650] Connected to broker 19:09:55.904 [0x7fcf0c513700] INFO /pulsar/pulsar-client-cpp/lib/ClientConnection.cc(1132) - [10.0.0.136:54916 -> 10.0.1.25:6650] Connection closed 19:09:55.904 [0x7fcf0c513700] ERROR /pulsar/pulsar-client-cpp/lib/ClientImpl.cc(138) - Error Checking/Getting Partition Metadata while creating producer on <persistent://sample/foo/default/bar-previews> -- 5 Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib64/python2.7/site-packages/pulsar/__init__.py", line 351, in create_producer p._producer = self._client.create_producer(topic, conf) Exception: Pulsar error: ConnectError ``` ---- 2018-06-21 19:15:45 UTC - William Fry: Looks like it may be from ansible ---- 2018-06-21 19:16:00 UTC - William Fry: What triggers ansible to “skip” a task? ---- 2018-06-21 19:16:24 UTC - Karthik Palanivelu: @Sijie Guo It successfully published the message as non-ssl ---- 2018-06-21 19:17:35 UTC - Sijie Guo: which task was skipped? can you telnet <http://internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com|internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com> 6650 to see if you can reach 6650? ---- 2018-06-21 19:17:49 UTC - William Fry: sure ---- 2018-06-21 19:18:04 UTC - Sijie Guo: okay out side k8s cluster? ---- 2018-06-21 19:18:33 UTC - William Fry: and two tasks: `Initialize cluster metadata` skipped 1 of 2 Zookeeper instances and `Create default property and namespace` skipped 1 of 1 Pulsar instances ---- 2018-06-21 19:19:16 UTC - Sijie Guo: initialize should only be done once, if you already initialized, you don’t need initialize it again. otherwise it will erase all the metadata. ---- 2018-06-21 19:19:27 UTC - Sijie Guo: same for “create default property and namespace” ---- 2018-06-21 19:21:09 UTC - William Fry: for telnet, the following was the trace: ``` Trying 10.0.1.25... Connected to <http://internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com|internal-pulsar-elb-123456789.us-east-2.elb.amazonaws.com>. Escape character is '^]'. Connection closed by foreign host. ``` ---- 2018-06-21 19:21:19 UTC - William Fry: (it was immediately closed, but that seems correct?) ---- 2018-06-21 19:22:35 UTC - Sijie Guo: this sounds like some issues ---- 2018-06-21 19:22:50 UTC - Sijie Guo: I am not sure it is network permissions or on the broker. can you check the broker logs? ---- 2018-06-21 19:23:59 UTC - William Fry: It looks like it’s the broker. All the network perms look fine now. ---- 2018-06-21 19:24:52 UTC - William Fry: Broker is on pulsar instance right? ---- 2018-06-21 19:25:02 UTC - Sijie Guo: yes ---- 2018-06-21 19:25:24 UTC - Sijie Guo: can you check the logs ---- 2018-06-21 19:25:24 UTC - William Fry: on pulsar there are two errors: ``` 19:18:14 ip-10-0-1-10 pulsar: 19:18:14.845 [main-SendThread(10.0.1.146:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.0.1.146/10.0.1.146:2181. Will not attempt to authenticate using SASL (unknown error) Jun 21 19:18:14 ip-10-0-1-10 pulsar: 19:18:14.846 [main-SendThread(10.0.1.146:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.0.1.146/10.0.1.146:2181, initiating session Jun 21 19:18:14 ip-10-0-1-10 pulsar: 19:18:14.848 [main-SendThread(10.0.1.146:2181)] INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.381 [main-SendThread(10.0.1.230:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.0.1.230/10.0.1.230:2181. Will not attempt to authenticate using SASL (unknown error) Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.482 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x0 Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.483 [main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.484 [main] ERROR org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase - Failed to create zookeeper client to 10.0.1.230:2181,10.0.1.146:2181 Jun 21 19:18:16 ip-10-0-1-10 pulsar: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss ``` ---- 2018-06-21 19:25:42 UTC - William Fry: and ``` Jun 21 19:18:16 ip-10-0-1-10 pulsar: 19:18:16.497 [main] ERROR org.apache.pulsar.PulsarBrokerStarter - Uncaught exception in thread main: Failed to initialize metadata bookie driver Jun 21 19:18:16 ip-10-0-1-10 pulsar: org.apache.bookkeeper.bookie.BookieException$MetadataStoreException: Failed to initialize metadata bookie driver Jun 21 19:18:16 ip-10-0-1-10 pulsar: at org.apache.bookkeeper.bookie.Bookie.instantiateMetadataDriver(Bookie.java:945) ~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0] Jun 21 19:18:16 ip-10-0-1-10 pulsar: at org.apache.bookkeeper.bookie.Bookie.<init>(Bookie.java:626) ~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0] Jun 21 19:18:16 ip-10-0-1-10 pulsar: at org.apache.bookkeeper.proto.BookieServer.newBookie(BookieServer.java:115) ~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0] Jun 21 19:18:16 ip-10-0-1-10 pulsar: at org.apache.bookkeeper.proto.BookieServer.<init>(BookieServer.java:96) ~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0] Jun 21 19:18:16 ip-10-0-1-10 pulsar: at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.<init>(PulsarBrokerStarter.java:198) ~[org.apache.pulsar-pulsar-broker-2.0.0-rc1-incubating.jar:2.0.0-rc1-incubating] Jun 21 19:18:16 ip-10-0-1-10 pulsar: at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:274) ~[org.apache.pulsar-pulsar-broker-2.0.0-rc1-incubating.jar:2.0.0-rc1-incubating] Jun 21 19:18:16 ip-10-0-1-10 pulsar: Caused by: org.apache.bookkeeper.meta.exceptions.MetadataException: Failed to create zookeeper client to 10.0.1.230:2181,10.0.1.146:2181 Jun 21 19:18:16 ip-10-0-1-10 pulsar: at org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase.initialize(ZKMetadataDriverBase.java:216) ~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0] Jun 21 19:18:16 ip-10-0-1-10 pulsar: at org.apache.bookkeeper.meta.zk.ZKMetadataBookieDriver.initialize(ZKMetadataBookieDriver.java:60) ~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0] Jun 21 19:18:16 ip-10-0-1-10 pulsar: at org.apache.bookkeeper.bookie.Bookie.instantiateMetadataDriver(Bookie.java:935) ~[org.apache.bookkeeper-bookkeeper-server-shaded-4.7.0.jar:4.7.0] Jun 21 19:18:16 ip-10-0-1-10 pulsar: ... 5 more Jun 21 19:18:16 ip-10-0-1-10 pulsar: Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss ``` ---- 2018-06-21 19:25:54 UTC - William Fry: they may be related by connection loss? ---- 2018-06-21 19:25:54 UTC - Karthik Palanivelu: Which port you want me to check, abive code used client.conf which used webServiceUrl=<http://broker:9101/>.. ---- 2018-06-21 19:26:03 UTC - Sijie Guo: it seems that broker cannot connect to those two machines ---- 2018-06-21 19:26:31 UTC - Sijie Guo: can you do following: on the broker machine, try telnet 10.0.1.230 2181 ---- 2018-06-21 19:26:42 UTC - Sijie Guo: and telnet 10.0.1.146 2181 ---- 2018-06-21 19:26:46 UTC - Sijie Guo: to see if you can telnet ---- 2018-06-21 19:28:16 UTC - Sijie Guo: <pulsar://broker:<broker-port>> (e.g. <pulsar://broker:6650/>) ---- 2018-06-21 19:29:05 UTC - William Fry: 10.0.1.230 has crashed ---- 2018-06-21 19:29:13 UTC - William Fry: but i connect to 10.0.1.146 ---- 2018-06-21 19:29:19 UTC - William Fry: without it being immediately closed ---- 2018-06-21 19:29:28 UTC - Sijie Guo: cool ---- 2018-06-21 19:29:56 UTC - Sijie Guo: how many zookeeper do you have? can you check their logs? especially the crahsed one ---- 2018-06-21 19:29:57 UTC - William Fry: 2 ---- 2018-06-21 19:30:08 UTC - William Fry: it looks like 10.0.1.146 keeps trying to connect to 10.0.1.230 ---- 2018-06-21 19:30:23 UTC - William Fry: which yields ```Jun 21 19:28:19 ip-10-0-1-146 pulsar: 19:28:19.782 [QuorumPeer[myid=1]/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 0 at election address /10.0.1.230:3888 ``` ---- 2018-06-21 19:30:51 UTC - William Fry: I don’t know what happened to 10.0.1.230 ---- 2018-06-21 19:30:54 UTC - William Fry: I can’t ssh into it ---- 2018-06-21 19:31:03 UTC - William Fry: something strange happened to the box ---- 2018-06-21 19:31:32 UTC - William Fry: will 10.0.1.146 eventually give up trying to connect and elect itself? ---- 2018-06-21 19:31:47 UTC - Sijie Guo: oh so you need odd number of zookeeper machines since zookeeper requires majoritity for consensus ---- 2018-06-21 19:31:50 UTC - Joe Francis: <!channel> We are organizing a Pulsar 1/2 day conference on the afternoon of July 10th, at Oath/Yahoo campus in Sunnyvale, CA. If anyone using Pulsar is interested in talking about their use case or experience in using Pulsar, we will be quite happy to host your presentation along with other tech and deep dive talks. +1 : Matteo Merli, Sijie Guo, Sanjeev Kulkarni, Ali Ahmed, Lewis Kaneshiro bananadance : Sijie Guo, Ali Ahmed, Lewis Kaneshiro ---- 2018-06-21 19:31:55 UTC - William Fry: …lol that makes sense ---- 2018-06-21 19:32:12 UTC - Sijie Guo: so I think what you can do here ---- 2018-06-21 19:32:21 UTC - Sijie Guo: since 10.0.1.230 is down, ---- 2018-06-21 19:32:51 UTC - Sijie Guo: you can update the zookeeper conf on 10.0.1.146 to use one zookeeper server or just enable standalone mode ---- 2018-06-21 19:33:06 UTC - Sijie Guo: so in that way you can function as one zookeeper server ---- 2018-06-21 19:33:21 UTC - Sijie Guo: however the downside is if zookeeper is down, you will become unavailable. ---- 2018-06-21 19:38:49 UTC - Karthik Palanivelu: @Sijie Guo It is Connection Time Out on the Producer Client. But I can see from the logs of Broker, connection is open. Once I disconnect the client, I see the below log - `15:35:15.024 [pulsar-io-21-1] INFO org.apache.pulsar.broker.service.ServerCnx - Closed connection from /IP-CLIENT:40989` ---- 2018-06-21 19:39:25 UTC - William Fry: Any plans for an NYC conf :smile: +1 : John Crawford, Ali Ahmed, Matteo Merli, Sijie Guo ---- 2018-06-21 19:39:54 UTC - Karthik Palanivelu: Which means there is some issue in k8s. ---- 2018-06-21 19:40:13 UTC - Karthik Palanivelu: I am new to k8s, seeking your help ---- 2018-06-21 19:41:22 UTC - Karthik Palanivelu: It seems like I need proxy and nodeport on Broker will not work. Please correct my understanding ---- 2018-06-21 19:45:08 UTC - William Fry: gotcha ---- 2018-06-21 19:45:13 UTC - William Fry: ok I have three up ---- 2018-06-21 19:45:18 UTC - William Fry: I’m still getting the `Error Checking/Getting Partition Metadata while creating producer` error ---- 2018-06-21 19:45:33 UTC - William Fry: but there are no warnings or errors in the bookie logs ---- 2018-06-21 19:45:33 UTC - William Fry: I’ll try telnet’ing? ---- 2018-06-21 19:46:47 UTC - Sijie Guo: what version of k8s are you using? and type of network are you using? (I know there was some issues about nodePort support on some type of network. didn’t have the names right on my head). ---- 2018-06-21 19:46:49 UTC - William Fry: ok, I can telnet to all of the Zookeeper instances ---- 2018-06-21 19:47:02 UTC - Sijie Guo: I mean k8s network ---- 2018-06-21 19:49:34 UTC - William Fry: Now when I connect, the producer hangs at `Created connection for ...` ---- 2018-06-21 19:49:43 UTC - William Fry: And on the bookie logs, I see the below: ---- 2018-06-21 19:50:39 UTC - William Fry: ``` Jun 21 19:48:28 ip-10-0-1-109 pulsar: 19:48:28.358 [pulsar-io-49-1] INFO org.apache.pulsar.broker.service.ServerCnx - New connection from /10.0.1.24:47058 Jun 21 19:48:45 ip-10-0-1-109 pulsar: 19:48:45.433 [pulsar-load-manager-4-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Only 1 broker available: no load shedding will be performed Jun 21 19:49:03 ip-10-0-1-109 pulsar: 19:49:03.020 [pulsar-ordered-OrderedExecutor-0-0] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger sample/foo/default/persistent/bar ``` ---- 2018-06-21 19:50:47 UTC - Sijie Guo: okay so it seems that you have the cluster up running ---- 2018-06-21 19:51:14 UTC - William Fry: Any idea why the producer would just hang? ---- 2018-06-21 19:51:38 UTC - Sijie Guo: are you using 2.0? ---- 2018-06-21 19:52:01 UTC - William Fry: on the pulsar instance but not the client ---- 2018-06-21 19:52:17 UTC - William Fry: scratch that, also on the client ---- 2018-06-21 19:52:57 UTC - Sijie Guo: okay so you can change the topic name to <persistent://sample/foo/default/bar_previews> or just plain “bar_previews” ---- 2018-06-21 19:53:12 UTC - Sijie Guo: so this plain “bar-previews” will go to a public default namespace ---- 2018-06-21 19:53:39 UTC - William Fry: yep ---- 2018-06-21 19:53:54 UTC - William Fry: did that and get the following: ``` 19:53:32.086 [0x7f8df8b50700] INFO /pulsar/pulsar-client-cpp/lib/HandlerBase.cc(53) - [<persistent://public/default/profile-previews>, ] Getting connection from pool 19:53:32.114 [0x7f8df8b50700] INFO /pulsar/pulsar-client-cpp/lib/ConnectionPool.cc(53) - Deleting stale connection from pool for <pulsar://10.0.1.109:6650> use_count: -1 @ 0 19:53:32.114 [0x7f8df8b50700] INFO /pulsar/pulsar-client-cpp/lib/ConnectionPool.cc(63) - Created connection for <pulsar://10.0.1.109:6650> ``` ---- 2018-06-21 19:54:40 UTC - William Fry: On the bookie instance, I see the following as a result: ``` Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.097 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO org.apache.pulsar.broker.cache.LocalZooKeeperCacheService - Successfully copyied bundles data to local zk at /admin/local-policies/public/default Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.097 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO org.apache.pulsar.broker.cache.LocalZooKeeperCacheService - Successfully created local policies for /admin/local-policies/public/default -- Optional[org.apache.pulsar.common.policies.data.LocalPolicies@1e7ab195] Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.102 [pulsar-1-15] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of public/default/0xc0000000_0xd0000000 Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.103 [pulsar-1-15] INFO org.apache.pulsar.broker.namespace.OwnershipCache - Trying to acquire ownership of public/default/0xc0000000_0xd0000000 Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.113 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO org.apache.pulsar.broker.namespace.OwnershipCache - Successfully acquired ownership of /namespace/public/default/0xc0000000_0xd0000000 Jun 21 19:53:32 ip-10-0-1-109 pulsar: 19:53:32.114 [pulsar-1-1] INFO org.apache.pulsar.broker.PulsarService - Loading all topics on bundle: public/default/0xc0000000_0xd0000000 Jun 21 19:53:45 ip-10-0-1-109 pulsar: 19:53:45.433 [pulsar-load-manager-4-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Only 1 broker available: no load shedding will be performed ``` ---- 2018-06-21 19:56:23 UTC - William Fry: The command also doesn’t return, it hangs and I can’t cancel it with `^c`. Just shows those log messages below. ---- 2018-06-21 19:58:08 UTC - Sijie Guo: can you do me a favor? ---- 2018-06-21 19:58:17 UTC - Sijie Guo: can you go to the broker machine ---- 2018-06-21 19:58:39 UTC - Sijie Guo: run `bin/pulsar-client produce -m "test message" test-topic` in the pulsar directory ---- 2018-06-21 20:00:05 UTC - Karthik Palanivelu: Mine uses Calico ---- 2018-06-21 20:00:59 UTC - William Fry: yep ---- 2018-06-21 20:01:21 UTC - Sijie Guo: it succeed? ---- 2018-06-21 20:01:24 UTC - William Fry: btw, after hanging for a long time, this is the final response of creating a producer with the python client `19:57:53.354 [0x7f8df8b50700] ERROR /pulsar/pulsar-client-cpp/lib/ClientConnection.cc(323) - [<none> -> <pulsar://10.0.1.109:6650>] Failed to establish connection: Connection timed out ` ---- 2018-06-21 20:01:32 UTC - William Fry: running now, one sec ---- 2018-06-21 20:01:58 UTC - William Fry: failed ---- 2018-06-21 20:02:23 UTC - William Fry: bunch of errors ---- 2018-06-21 20:03:08 UTC - William Fry: ` ERROR org.apache.pulsar.client.impl.ProducerImpl - [test-topic] [null] Failed to create producer: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available` ---- 2018-06-21 20:03:22 UTC - Sijie Guo: okay ---- 2018-06-21 20:03:29 UTC - Sijie Guo: so you have 1 bookie right? ---- 2018-06-21 20:03:37 UTC - William Fry: 1 bookie, 3 zookeepers ---- 2018-06-21 20:03:52 UTC - Sijie Guo: # Number of bookies to use when creating a ledger managedLedgerDefaultEnsembleSize=2 # Number of copies to store for each message managedLedgerDefaultWriteQuorum=2 # Number of guaranteed copies (acks to wait before write is complete) managedLedgerDefaultAckQuorum=2 ---- 2018-06-21 20:04:16 UTC - William Fry: do those need to be changed to 1? ---- 2018-06-21 20:04:18 UTC - Sijie Guo: on the broker node, can you update broker.conf to set the above settings to 1? or can you add another bookie ---- 2018-06-21 20:05:14 UTC - Sijie Guo: yeah. by default it is 2 ---- 2018-06-21 20:06:26 UTC - William Fry: yep, updated to 1 ---- 2018-06-21 20:06:30 UTC - William Fry: should i restart pulsar? ---- 2018-06-21 20:06:34 UTC - William Fry: or does it watch the conf files? ---- 2018-06-21 20:07:37 UTC - William Fry: yayyyy ---- 2018-06-21 20:07:44 UTC - William Fry: the `bin/pulsar` command worked ---- 2018-06-21 20:08:02 UTC - William Fry: the python client still hangs ---- 2018-06-21 20:14:51 UTC - Sijie Guo: @Karthik Palanivelu I think there are some issues reported around Calico with nodePort in k8s. <https://github.com/kubernetes/kubernetes/issues/58908> I am not sure if that is the exact problem, but it might be worth trying out this suggestions in the PR. ---- 2018-06-21 20:15:18 UTC - Sijie Guo: yeah you need to restart pulsar. ---- 2018-06-21 20:15:31 UTC - Sijie Guo: if bin/pulsar command worker, that mean pulsar service is fine. ---- 2018-06-21 20:15:42 UTC - Robin Bartholdson: Will be recorded I hope? :relaxed: ---- 2018-06-21 20:15:50 UTC - William Fry: yeah I restarted it with `systemctl` but the python client still is failing ---- 2018-06-21 20:15:59 UTC - Sijie Guo: so the problem would be the connection between your python client and pulsar broker. ---- 2018-06-21 20:16:32 UTC - Sijie Guo: can you do one more thing, can you copy your python application to the broker server and run the python application there ---- 2018-06-21 20:16:54 UTC - Sijie Guo: I am try to isolate the problem whether it is a problem of accessing network out side of vpc ---- 2018-06-21 20:16:59 UTC - William Fry: sure ---- 2018-06-21 20:18:51 UTC - William Fry: and use localhost or still use the load balancer? ---- 2018-06-21 20:19:33 UTC - Sijie Guo: Use localhost ---- 2018-06-21 20:20:24 UTC - William Fry: worked ---- 2018-06-21 20:20:55 UTC - Sijie Guo: So it seems that problem comes from load balancer ---- 2018-06-21 20:21:12 UTC - William Fry: yeah, because it also works using the load balancer from the pulsar instance… ---- 2018-06-21 20:21:28 UTC - Sijie Guo: I see. ---- 2018-06-21 20:21:54 UTC - Sijie Guo: you might want to check the aws loadbalancer configuration ---- 2018-06-21 20:22:00 UTC - Sijie Guo: it might show some health status for you ---- 2018-06-21 20:22:30 UTC - William Fry: yeah it’s healthy ---- 2018-06-21 20:22:47 UTC - Sijie Guo: okay ---- 2018-06-21 20:22:49 UTC - William Fry: I thought if you used a load balancer, you didn’t also need access to the other instance as long as the load balancer had access to the other instance ---- 2018-06-21 20:23:14 UTC - Sijie Guo: yes that should be the case ---- 2018-06-21 20:23:41 UTC - William Fry: ok, now it’s working from the other instance ---- 2018-06-21 20:23:45 UTC - William Fry: despite my doing nothing… ---- 2018-06-21 20:23:57 UTC - Sijie Guo: interesting ---- 2018-06-21 20:24:01 UTC - William Fry: testing out a local client shouldn’t have changed anything ---- 2018-06-21 20:24:04 UTC - William Fry: right? ---- 2018-06-21 20:24:08 UTC - Sijie Guo: so it should like a loadbalancer issue ---- 2018-06-21 20:24:18 UTC - Sijie Guo: yeah it shouldn’t change ---- 2018-06-21 20:24:43 UTC - Sijie Guo: not very familiar with aws loadbalancer ---- 2018-06-21 20:24:52 UTC - William Fry: yeah I didn’t touch anything there ---- 2018-06-21 20:24:59 UTC - William Fry: re-running terraform to see if I did without meaning to ---- 2018-06-21 20:25:13 UTC - Sijie Guo: sgtm ---- 2018-06-21 20:26:51 UTC - William Fry: ok, got it. The other instance needs access not only to the load balancer but also to pulsar ---- 2018-06-21 20:26:52 UTC - William Fry: bizarre ---- 2018-06-21 20:27:03 UTC - William Fry: that seems counter-intuitive ---- 2018-06-21 20:27:31 UTC - Sijie Guo: okay ---- 2018-06-21 20:27:36 UTC - Sijie Guo: so everything worked out? ---- 2018-06-21 20:28:28 UTC - William Fry: yeah, I’m just stumped why we would need access to the pulsar instance ---- 2018-06-21 20:28:36 UTC - William Fry: I thought that the load balancer funnels traffic to and from ---- 2018-06-21 20:28:58 UTC - William Fry: maybe it just selects the ideal machine to use and then connection takes place directly between the machines? ---- 2018-06-21 20:29:12 UTC - Sijie Guo: oh i see ---- 2018-06-21 20:29:32 UTC - Sijie Guo: let me explain ---- 2018-06-21 20:33:25 UTC - Sijie Guo: so broker is running in such a mode - client talks to any broker thru a load balancer, and ask the broker I want to publish/consume messages to this topic. if the broker is the owner of this topic, it serves the connection; if it isn’t, the broker tell the client who is the right broker and the client redirect and talk to that broker directly. if this happens, the client talks to broker directly, which bypass the load balancer. so you need pulsar instance access for this case. however there is another service instance, called “proxy”, what is does is a “proxy” that proxy the connections to the real broker. in that case, you just need use loadbalancer. so in a production deployment, you might consider put up proxies in front of brokers and let loadbalancer load balancing among proxies. hope this clears your question ---- 2018-06-21 20:34:08 UTC - William Fry: ah that makes perfect sense ---- 2018-06-21 20:34:15 UTC - William Fry: are there issues with the “proxy” right now? ---- 2018-06-21 20:34:21 UTC - William Fry: I saw chatter about it in <#C5Z4T36F7|general> ---- 2018-06-21 20:35:56 UTC - Sijie Guo: I think there are some issues around tls authentication enabled in proxy. we are fixing it and improving the documentation. without TLS or only TLS on proxy, everything should work fine ---- 2018-06-21 20:41:41 UTC - William Fry: ok perfect ---- 2018-06-21 20:41:55 UTC - William Fry: one last question, why do I need to bypass the load balancer if there is only one broker? ---- 2018-06-21 20:42:17 UTC - William Fry: In that case, there’s no chance that I connect to a broker that doesn’t own a topic. Right? ---- 2018-06-21 20:42:37 UTC - William Fry: Or is it that the load balancer balances requests to different brokers but any subsequent connection is _always_ direct? ---- 2018-06-21 20:42:59 UTC - William Fry: What’s the purpose of the load balancer when not using the “proxy” service? ---- 2018-06-21 20:44:32 UTC - William Fry: It looks like this is never a problem with HTTP requests to the API ---- 2018-06-21 20:47:05 UTC - Sijie Guo: oh loadbalancer in the broker mode is more used as DNS for finding a broker ---- 2018-06-21 20:47:09 UTC - Sijie Guo: if you only have one broker ---- 2018-06-21 20:47:13 UTC - Sijie Guo: you don’t need loadbalancer ---- 2018-06-21 21:53:22 UTC - William Fry: Can there be an even number of pulsar instances? ---- 2018-06-21 21:53:28 UTC - William Fry: Or do you also need an odd number? ---- 2018-06-21 21:55:25 UTC - Sijie Guo: pulsar instances can be any number ---- 2018-06-21 21:55:32 UTC - William Fry: nice ---- 2018-06-21 22:56:24 UTC - Joe Francis: Yes, will be recorded. ---- 2018-06-22 00:10:02 UTC - Matteo Merli: @Matteo Merli pinned a message to this channel. ---- 2018-06-22 06:43:17 UTC - Vasily Yanov: Hi! ---- 2018-06-22 06:43:26 UTC - Vasily Yanov: One more downtime.... ---- 2018-06-22 06:43:44 UTC - Vasily Yanov: tried to restart broker and got this in logs: ---- 2018-06-22 06:43:50 UTC - Vasily Yanov: ``` 2018-06-22 06:38:58,641 - WARN [BookKeeperClientWorker-24-1:LedgerHandle$NoopCloseCallback@1189] - Close failed: Error while using ZooKeeper 2018-06-22 06:38:58,648 - ERROR [BookKeeperClientWorker-24-1:LedgerHandle$2$1CloseCb@405] - Error update ledger metadata for ledger 10752446 : -9 2018-06-22 06:38:58,648 - WARN [BookKeeperClientWorker-24-1:LedgerHandle$NoopCloseCallback@1189] - Close failed: Error while using ZooKeeper 2018-06-22 06:38:58,648 - ERROR [BookKeeperClientWorker-18-1:LedgerHandle$2$1CloseCb@405] - Error update ledger metadata for ledger 10752488 : -9 2018-06-22 06:38:58,648 - WARN [BookKeeperClientWorker-18-1:LedgerHandle$NoopCloseCallback@1189] - Close failed: Error while using ZooKeeper 2018-06-22 06:38:58,648 - INFO [BookKeeperClientWorker-21-1:LedgerHandle$ReReadLedgerMetadataCb@1100] - Resolve ledger metadata conflict while changing ensemble to: [144.76.185.49:3181, 148.251.181.171:3181, 136.243.34.224:3181], old meta data is (meta:BookieMetadataFormatVersion 2 quorumSize: 3 ensembleSize: 3 length: 0 lastEntryId: -1 state: OPEN segment { ensembleMember: "144.76.185.49:3181" ensembleMember: "148.251.181.171:3181" ensembleMember: "136.243.34.224:3181" firstEntryId: 0 } digestType: CRC32 password: "" ackQuorumSize: 2 , version:1) , new meta data is (meta:BookieMetadataFormatVersion 2 quorumSize: 3 ensembleSize: 3 length: 0 lastEntryId: -1 state: OPEN segment { ensembleMember: "144.76.185.49:3181" ensembleMember: "188.40.124.119:3181" ensembleMember: "136.243.34.224:3181" firstEntryId: 0 } digestType: CRC32 password: "" ackQuorumSize: 2 , version:1). 2018-06-22 06:38:58,649 - WARN [main-EventThread:AbstractZkLedgerManager$4@407] - Conditional update ledger metadata failed: NONODE 2018-06-22 06:38:58,649 - ERROR [BookKeeperClientWorker-23-1:LedgerHandle$2$1CloseCb@405] - Error update ledger metadata for ledger 10752445 : -9 2018-06-22 06:38:58,649 - WARN [BookKeeperClientWorker-23-1:LedgerHandle$NoopCloseCallback@1189] - Close failed: Error while using ZooKeeper ``` ---- 2018-06-22 06:44:29 UTC - Vasily Yanov: in ZK logs a lot of messages with INFO level: ``` Got user-level KeeperException when processing sessionid:0x300b72d5dc10005 type:setData cxid:0x121b1 zxid:0x20008dd4c txntype:-1 reqpath:n/a Error Path:/ledgers/00/1075/L5594 Error:KeeperErrorCode = BadVersion for /ledgers/00/1075/L5594 ``` ---- 2018-06-22 06:50:15 UTC - Ali Ahmed: @Vasily Yanov did the cluster topology change ? ---- 2018-06-22 06:57:56 UTC - Sijie Guo: @Vasily Yanov BadVersion means there are concurrent metadata updates. NONODE / -9 sounds like topic is delete while some other brokers attempts to update metadata. but these logging/warnings should only happen temporily, everything should be back to normal after restarts, does it? ---- 2018-06-22 08:43:08 UTC - Vasily Yanov: @Ali Ahmed sorry was on meeting. No topology changes ---- 2018-06-22 08:44:59 UTC - Vasily Yanov: @Sijie Guo I performed full pulsar reset because restart broker or zookeeper or bookie didn't fix the issue ----