2019-08-17 13:21:36 UTC - Chris Bartholomew: @Kendall Magesh-Davis Did you also 
delete the PersistentVolume that stores the bookie data?
----
2019-08-17 13:28:33 UTC - Kendall Magesh-Davis: No. Here’s a summary of 
commands:
```
kubectl cordon <nodeX>
kubectl drain <nodeX> --ignore-daemonsets --delete-local-data 
aws autoscaling \       
  terminate-instance-in-auto-scaling-group \
  --no-should-decrement-desired-capacity --instance-id=<nodeX instanceid>
```
----
2019-08-17 13:29:04 UTC - Kendall Magesh-Davis: Essentially that K8s cluster 
was running low on resources, and I bumped up the instance type and was forcing 
it to replace one node with the new bigger instance.
----
2019-08-17 13:40:21 UTC - Chris Bartholomew: In your values.yaml file that you 
used with helm, is persistence set to "yes"?
----
2019-08-17 13:56:56 UTC - Ming Fang: This issue sounds similar to 
<https://github.com/apache/pulsar/issues/3121>
----
2019-08-17 13:59:22 UTC - Kendall Magesh-Davis: No, it doesn’t appear so.
```bookkeeper:
  component: bookkeeper
  replicaCount: 3
  updateStrategy:
    type: OnDelete
  podManagementPolicy: OrderedReady
  default-pool
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
  tolerations: []
  gracePeriod: 0
  resources:
    requests:
      memory: 128Mi
      cpu: 0.2
  volumes:
    journal:
      name: journal
      size: 5Gi
    ledgers:
      name: ledgers
      size: 5Gi
  configData:
    PULSAR_MEM: "\"-Xms128m -Xmx256m -XX:MaxDirectMemorySize=128m 
-Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.linkCapacity=1024 
-XX:+UseG1GC -XX:MaxGCPauseMillis=10 -XX:+ParallelRefProcEnabled 
-XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis 
-XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 
-XX:+DisableExplicitGC -XX:-ResizePLAB -XX:+ExitOnOutOfMemoryError 
-XX:+PerfDisableSharedMem -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -verbosegc 
-XX:G1LogLevel=finest\""
    dbStorage_writeCacheMaxSizeMb: "32"
    dbStorage_readAheadCacheMaxSizeMb: "32"
    journalMaxSizeMB: "2048"
    statsProviderClass: 
org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider
    useHostNameAsBookieID: "true"
  service:
    annotations:
      publishNotReadyAddresses: "true"
    ports:
    - name: server
      port: 3181
  pdb:
    usePolicy: yes
    maxUnavailable: 1```
----
2019-08-17 14:00:01 UTC - Ming Fang: It looks like when the Bookie starts up, 
it checks zk and compares the “cookie” to local storage. If they don’t match or 
locla is empty then errors out.  My workaround is to delete the node in zk.
----
2019-08-17 14:11:03 UTC - Chris Bartholomew: In the public helm chart, this is 
a global setting, so its not in the bookkeeper section. Right up near the top, 
you will see: ```## If persistence is enabled, components that have state will
## be deployed with PersistentVolumeClaims, otherwise, for test
## purposes, they will be deployed with emptyDir
persistence: no
```
----
2019-08-17 14:14:09 UTC - Chris Bartholomew: Right @Ming Fang. @Kendall 
Magesh-Davis, if persistence is not enabled in the helm config, the bookie data 
is lost if the node is reset. This means that Zookeeper knows about the bookie, 
but the bookie is missing all its data, so the sanity check fails. The error 
you are seeing is consistent with the data directories being empty. In fact, I 
have reproduced this exact error by wiping the data dirs and restarting the 
bookkeeper pod.  Deleting the node in zk is probably the only way to recover 
this. In production you would want to run with persistence enabled. That way 
PersistentVolume claims are mounted into /pulsar/data/bookkeeper directory so 
it will recover in the event of node failure.
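For example, a minimal sketch of the relevant values.yaml settings (the volume names/sizes just mirror the config pasted above; adjust for your cluster):
```
## Global setting near the top of values.yaml: back stateful components
## with PersistentVolumeClaims instead of emptyDir
persistence: yes

bookkeeper:
  volumes:
    journal:
      name: journal
      size: 5Gi
    ledgers:
      name: ledgers
      size: 5Gi
```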
----
2019-08-17 14:17:33 UTC - Kendall Magesh-Davis: Thanks for the help @Ming Fang 
and @Chris Bartholomew :thumbsup: I will test it out and let you know if I run 
into anything else.
----
2019-08-17 14:22:48 UTC - Way Dev: @Way Dev has joined the channel
----
2019-08-17 14:39:51 UTC - Ming Fang: There’s a similar problem on the Broker 
side too. <https://github.com/apache/pulsar/issues/4964>
----
2019-08-17 15:23:46 UTC - Chris Bartholomew: For reference, you can delete the 
node in zookeeper logging into one of your pods, starting the zookeeper shell, 
finding the cookie, and then deleting it. This set of steps worked for me: 
```kubectl exec -it &lt;pod&gt; /bin/bash
bin/pulsar zookeeper-shell
ls /ledgers/cookies
delete /ledgers/cookies/<cookie-for-bad-bookie>
```
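Once the bookie pod restarts and reformats, you can sanity-check that it re-registered (a sketch; run from inside a bookie pod, flag names per the BookKeeper 4.9 shell):
```
bin/bookkeeper shell listbookies -rw
```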
----
2019-08-17 15:25:10 UTC - Ming Fang: I wonder if there’s a downside for the 
Bookie startup to do this automatically
----
2019-08-17 15:34:15 UTC - Chris Bartholomew: This is a bad state to be in. It 
means you've lost all your data for this particular bookie, so it all need to 
be re-replicated. Ideally, the data would have been preserved and the sanity 
check passes.
----
2019-08-17 15:35:38 UTC - Ming Fang: I agree this is a bad state. But I think 
have to manually delete zk nodes makes the situation even worse
----
2019-08-17 15:38:17 UTC - Chris Bartholomew: For development/test purpose, I 
agree. In production, you would want to make sure the data directories are not 
ephemeral.
----
2019-08-17 15:43:53 UTC - Ming Fang: Even if the volume is not ephemeral there 
is a still a chance we can loose the volume for whatever reason. If the volume 
is lost, then yes it’s bad.  But if I have to then manually recover by editing 
zk, then that makes it worse
+1 : Vladimir Shchur
----
2019-08-17 16:07:03 UTC - Chris Bartholomew: You can just increase the replica 
count by 1. That one will be considered a new bookie. It will initialize 
without any intervention and will restore quorum for your cluster. If you can 
recover the failed volume, great. If not, you can delete the bookie from zk. 
BTW, there is some info on this in the Bookkeeper admin docs 
(<https://bookkeeper.apache.org/docs/4.9.2/admin/bookies/>) under the heading 
"Missing disks or directories".
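In the values.yaml pasted above, that's a one-line change, e.g. (a sketch):
```
bookkeeper:
  replicaCount: 4  # was 3; the new bookie initializes fresh and restores quorum
```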
----
2019-08-17 21:49:38 UTC - Kendall Magesh-Davis: Ahh, right on. The answer 
remains the same :smile:
```
namespaceCreate: yes
persistence: no```
----
2019-08-17 22:32:44 UTC - Ming Fang: I was able get Bookkeeper to restart by 
adding `--force` to the `metaformat` command, 
<https://github.com/mingfang/terraform-provider-k8s/blob/d91a9cef710c21fbdf346629709d829222992041/modules/pulsar/bookkeeper/main.tf#L40>
----
2019-08-18 02:20:48 UTC - bright: @bright has joined the channel
----
2019-08-18 02:21:17 UTC - GaoHang: @GaoHang has joined the channel
----
2019-08-18 02:26:47 UTC - 18505565928m0: @18505565928m0 has joined the channel
----
2019-08-18 02:27:48 UTC - 308027245: @308027245 has joined the channel
----
2019-08-18 02:29:54 UTC - liangliliang: @liangliliang has joined the channel
----
2019-08-18 02:30:12 UTC - anonymitaet: @anonymitaet has joined the channel
----
2019-08-18 02:39:10 UTC - lingya: @lingya has joined the channel
----
2019-08-18 02:41:26 UTC - pxhssg: @pxhssg has joined the channel
----
2019-08-18 03:32:56 UTC - chengyanan1008: @chengyanan1008 has joined the channel
----
2019-08-18 03:42:15 UTC - Zurich: @Zurich has joined the channel
----
2019-08-18 03:43:02 UTC - Ming Fang: I’m unable to localrun a source outside of 
a kubernetes cluster via the ingress controller to the pulsar proxy.
`./bin/pulsar-admin sources localrun --name generator --destinationTopicName 
generator_test -a ./connectors/pulsar-io-data-generator-2.4.0.nar 
--broker-service-url <pulsar://192.168.2.249:6650>`

It prints this error and continues to run, but is not doing any work:

localrun log

03:45:26.029 [pulsar-client-io-1-1] INFO  
org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x8ca52491, 
L:/250.2.43.2:43526 - R:192.168.2.249/192.168.2.249:6650]] Connected to server
03:45:26.061 [pulsar-client-io-1-1] WARN  
org.apache.pulsar.client.impl.ClientCnx - [id: 0x8ca52491, L:/250.2.43.2:43526 
- R:192.168.2.249/192.168.2.249:6650] Received error from server: 
org.apache.pulsar.client.api.PulsarClientException: Disconnected from server at 
pulsar-0.pulsar.example.svc.cluster.local/250.2.216.6:6650
03:45:26.061 [pulsar-client-io-1-1] WARN  
org.apache.pulsar.client.impl.ClientCnx - [id: 0x8ca52491, L:/250.2.43.2:43526 
- R:192.168.2.249/192.168.2.249:6650] Received unknown request id from server: 0

proxy log

03:45:26.033 [pulsar-discovery-io-2-1] INFO  
org.apache.pulsar.proxy.server.ProxyConnection - [/250.2.146.2:37384] New 
connection opened
03:45:26.044 [pulsar-discovery-io-2-1] INFO  
org.apache.pulsar.proxy.server.ProxyConnection - [/250.2.146.2:37384] complete 
connection, init proxy handler. authenticated with none role null, 
hasProxyToBrokerUrl: false
03:45:26.058 [pulsar-discovery-io-2-1] INFO  
org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x9ca51b7c, 
L:/250.2.216.5:48600 - 
R:pulsar-0.pulsar.example.svc.cluster.local/250.2.216.6:6650]] Connected to 
server
03:45:26.061 [pulsar-discovery-io-2-1] INFO  
org.apache.pulsar.client.impl.ClientCnx - [id: 0x9ca51b7c, L:/250.2.216.5:48600 
! R:pulsar-0.pulsar.example.svc.cluster.local/250.2.216.6:6650] Disconnected
03:45:26.062 [pulsar-discovery-io-2-1] WARN  
org.apache.pulsar.proxy.server.LookupProxyHandler - [/250.2.146.2:37384] Failed 
to get schema : org.apache.pulsar.client.api.PulsarClientException: 
Disconnected from server at 
pulsar-0.pulsar.example.svc.cluster.local/250.2.216.6:6650


Before I open an issue, can someone confirm if this setup should work?
----
2019-08-18 05:03:38 UTC - Ming Fang: I’m guessing proxy_to_broker_url needs to 
be set
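i.e. something like this in conf/proxy.conf (a sketch; the broker hostname below is just taken from the logs above and may differ in your setup):
```
# Point the proxy directly at the brokers instead of using service discovery
brokerServiceURL=pulsar://pulsar-0.pulsar.example.svc.cluster.local:6650
brokerWebServiceURL=http://pulsar-0.pulsar.example.svc.cluster.local:8080
```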
----
2019-08-18 05:26:31 UTC - Sijie Guo: are you able to produce or consume 
messages from <pulsar://192.168.2.249:6650> ?

a localrun pulsar function is no different from a consumer and a producer, so it is better to check that the pulsar setup is good first.
----
2019-08-18 05:41:36 UTC - Ming Fang: Yes I’m able to produce and consume

`./bin/pulsar-client --url <pulsar://192.168.2.249:6650> produce my-topic 
--messages "hello-pulsar"`
05:40:54.612 [main] INFO  org.apache.pulsar.client.impl.PulsarClientImpl - 
Client closing. URL: <pulsar://192.168.2.249:6650>
05:40:54.624 [pulsar-client-io-1-1] INFO  
org.apache.pulsar.client.impl.ProducerImpl - [my-topic] [local-0-26] Closed 
Producer
05:40:54.628 [main] INFO  org.apache.pulsar.client.cli.PulsarClientTool - 1 
messages successfully produced
----
2019-08-18 05:42:56 UTC - Ming Fang: When I run produce, this is on the proxy 
log
`05:42:06.340 [pulsar-discovery-io-2-1] INFO  
org.apache.pulsar.proxy.server.ProxyConnection - [/250.2.89.5:41190] complete 
connection, init proxy handler. authenticated with none role null, 
hasProxyToBrokerUrl: true `
----
2019-08-18 05:43:50 UTC - Ming Fang: Compare to localrun, hasProxyToBrokerUrl: 
true and not false
----
2019-08-18 05:49:49 UTC - Sijie Guo: hasProxyToBrokerUrl just means the 
connection is in different state. hasProxyToBrokerUrl == false means it is 
still looking up the topic metadata, hasProxyToBrokerUrl == true means it 
already knows the topic metadata and know which target broker to connect to.
----
2019-08-18 05:50:01 UTC - Sijie Guo: I don’t think it is the cause for localrun 
fails to start.
----
2019-08-18 05:50:13 UTC - Sijie Guo: `Failed to get schema : 
org.apache.pulsar.client.api.PulsarClientException: `
----
2019-08-18 05:50:29 UTC - Sijie Guo: this error message from localrun is the 
cause, I guess.
----
2019-08-18 05:50:38 UTC - Sijie Guo: Which function are your running?
----
2019-08-18 05:51:21 UTC - Ming Fang: I’m following the sql tutorial and trying 
to produce sample data using connectors/pulsar-io-data-generator-2.4.0.nar
----
2019-08-18 05:53:28 UTC - Sijie Guo: ok. let me check
pray : Ming Fang
----
2019-08-18 06:12:05 UTC - Sijie Guo: Oh I see. It seems that there is a bug 
when pulsar proxy forwarding the get schema request
----
2019-08-18 06:12:57 UTC - Ming Fang: Thanks for taking the time to debug this
----
2019-08-18 06:15:28 UTC - Sijie Guo: I am creating a pull request for it.
100 : Ming Fang
----
2019-08-18 06:19:57 UTC - Ming Fang: Amazing turnaround time. Thanks!
----
2019-08-18 06:24:33 UTC - Ming Fang: I have a related question about localrun. 
Instead of the tcp proxy, can I run it thru websocket?
`./bin/pulsar-admin sources localrun --name generator --destinationTopicName 
generator_test -a ./connectors/pulsar-io-data-generator-2.4.0.nar 
--broker-service-url <ws://pulsar-websocket.192.168.2.249.nip.io:80>`
----
2019-08-18 06:26:24 UTC - Sijie Guo: @Ming Fang unfortunately no. the function 
is using the java client which talks to the binary protocol port.
----
2019-08-18 06:27:21 UTC - Sijie Guo: Can you share your thoughts behind 
websocket? We can discuss to see if this is a good feature to add the support 
to pulsar functions.
----
2019-08-18 06:28:36 UTC - Ming Fang: Good to know. I’m using Pulsar to build 
solution that is almost like IoT. I want to have the datasources coming from 
the internet via websocket
----
2019-08-18 06:31:59 UTC - Ming Fang: I think getting functions to work over 
websocket is very powerful. Imagine a javascript function runtime where many 
browsers run the functions.  That’ll be serverless at internet scale
----
2019-08-18 06:35:38 UTC - Sijie Guo: yeah it is making sense to support 
websocket for javascript functions.
----
2019-08-18 06:36:05 UTC - Sijie Guo: I thought you were talking about the 
java/python functions :slightly_smiling_face:
----
2019-08-18 06:38:28 UTC - Ming Fang: I can see java/python functions can 
benefit also.  Imagine a Raspberry PI running java/python to collect and 
process data from the field and then send results or even raw data back to a 
Pulsar cluster in the cloud.  While TCP will work most of the time, there are 
some restricted networks that only allows HTTP
----
2019-08-18 06:40:59 UTC - Sijie Guo: make sense. in this case, we might just 
need to have a java/python client that talks to websocket. that java function 
and python function can work seamlessly
----
2019-08-18 06:41:27 UTC - Sijie Guo: Do you want to create feature requests for 
them? We can lookup into them in future releases.
----
2019-08-18 06:43:14 UTC - Ming Fang: Yes I’ll submit a feature request. Do you 
think just one request will do for java/python?
----
2019-08-18 06:43:56 UTC - Ming Fang: And do you think the javascript function 
runtime + websocket is an entirely different feature request?
----
2019-08-18 06:44:01 UTC - Sijie Guo: One request is fine. We can use that as a 
master issue for tracking the sub tasks.
----
2019-08-18 06:44:21 UTC - Sijie Guo: yea javascript function support should be 
an entirely different feature request :slightly_smiling_face:
----
2019-08-18 06:44:40 UTC - Sijie Guo: it is a new language runtime.
----
2019-08-18 06:45:03 UTC - Ming Fang: ok I’ll do it in the morning. It’s almost 
3am in NYC.  Thanks for your help. Now I can sleep :slightly_smiling_face:
----
2019-08-18 06:45:19 UTC - Sijie Guo: ah. good night :slightly_smiling_face:
----
2019-08-18 06:45:23 UTC - Sijie Guo: thank you
----
2019-08-18 06:45:33 UTC - Ming Fang: :+1:
----
2019-08-18 06:46:17 UTC - Ming Fang: Btw what time zone are you in?
----
2019-08-18 06:48:55 UTC - Sijie Guo: San Francisco :slightly_smiling_face:
----
2019-08-18 06:50:43 UTC - Ming Fang: Ic good night 
night_with_stars : Sijie Guo
----
