2019-02-08 09:18:33 UTC - Niketh Sabbineni: @Sijie Guo Thank you :ok_hand:
----
2019-02-08 13:46:04 UTC - Stephen Dotz: @Stephen Dotz has joined the channel
----
2019-02-08 13:51:31 UTC - Stephen Dotz: Hi! We're using Apache Kafka for our project and there are a number of features Pulsar offers which we really need: geo-replication, protobuf schema registry, tiered storage. One thing I gathered from reading through the docs is that this is a very similar project to Kafka, but Kafka is seldom mentioned. Lots of features map 1:1 between them under different names. I am wondering if there is some reading I can do, or if somebody can elaborate on the primary differences, or why a separate project was created. Not trying to create a this vs. that, just trying to understand more about the two projects, the primary differences, and the motivations behind Pulsar. Thanks!
----
2019-02-08 13:54:52 UTC - Sijie Guo: Yahoo’s blog post is a good overview of the motivation: <https://yahooeng.tumblr.com/post/150078336821/open-sourcing-pulsar-pub-sub-messaging-at-scale>
For the differences, you can check out:
- <https://streaml.io/blog/pulsar-streaming-queuing>
- <https://streaml.io/blog/pulsar-segment-based-architecture>
- <https://streaml.io/blog/apache-pulsar-architecture-designing-for-streaming-performance-and-scalability>
----
2019-02-08 14:00:03 UTC - Stephen Dotz: Perfect! Thanks! Will read these and get back with any questions
----
2019-02-08 14:06:59 UTC - Sijie Guo: @Stephen Dotz feel free to ping us any time
----
2019-02-08 14:29:49 UTC - Matan Shalit: @Matan Shalit has joined the channel
----
2019-02-08 14:38:38 UTC - Matan Shalit: hey :slightly_smiling_face: I need to create a "statistics" service of sorts. This service needs to read all messages that accumulated over the last week for a topic. I have found a way to start a reader at a specific message ID, but the thing is that I have no way of knowing what the message ID was 7 days back. Is there a way to create a reader that starts consuming messages by timespan, or by timestamp? Any help appreciated! :innocent:
----
2019-02-08 14:44:55 UTC - Stephen Dotz: This is a long shot, but one of the biggest things we struggle with using Kafka is the difficulty of swapping out topics from underneath clients. E.g. if we want to re-write a topic to remove some data for a GDPR request, we could stream the topic into another and filter it, but we also have to coordinate with consumers to switch to the new topic and find their new offset. There are other scenarios where we want to re-write history too, just to clean the data up. Does Pulsar offer anything that could potentially make that easier?
----
2019-02-08 15:20:13 UTC - Sijie Guo: Currently I don’t think there is a direct time-based “seek” API in either the consumer or the reader. We might consider adding one. As a workaround, you can use the consumer API to subscribe at latest, then use the pulsar admin tool to reset-cursor by time: <http://pulsar.apache.org/docs/en/admin-api-persistent-topics/#reset-cursor> or use the resetCursor() API in PulsarAdmin
----
2019-02-08 15:21:18 UTC - Stephen Dotz: Does anyone know if PulsarSQL can be used to query against protobuf schemas?
----
2019-02-08 15:24:24 UTC - Sijie Guo: Current PulsarSQL only supports JSON and Avro, but Pulsar supports protobuf schemas, so it should be easy to add protobuf support in PulsarSQL. /cc @Jerry Peng (in case he has more to add)
----
2019-02-08 15:48:23 UTC - Stephen Dotz: I haven't tried to make this work, but we did look into it briefly w/ KSQL. I thought that since PBs map to and from JSON deterministically it would be a matter of plugging in a Ser/De, but it wasn't that simple (nothing ever is)
----
2019-02-08 15:52:54 UTC - Stephen Dotz: However :crossed_fingers: that it _is_ that simple w/ this
----
2019-02-08 15:59:10 UTC - Matan Shalit: Thx, I'll give the reset-cursor command a shot :slightly_smiling_face: Having this as part of the consumer/reader interface would indeed be great. Should I open a suggestion/ticket somewhere for it?
----
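As a concrete sketch of the reset-cursor-by-time workaround Sijie describes above, using the Java admin client (the admin URL, topic, and subscription names are placeholders; the CLI equivalent is `bin/pulsar-admin persistent reset-cursor`):
```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class ResetCursorByTime {
    public static void main(String[] args) throws Exception {
        // Placeholder admin URL; point it at your broker's HTTP port.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            String topic = "persistent://public/default/stats";   // hypothetical topic
            String subscription = "stats-service";                 // hypothetical subscription

            // Rewind the subscription's cursor to (roughly) 7 days ago.
            long sevenDaysAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
            admin.topics().resetCursor(topic, subscription, sevenDaysAgo);
        }
    }
}
```
After the reset, a consumer attached to that subscription resumes from the first message published at or after the given timestamp.
----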
2019-02-08 16:29:05 UTC - Vincent Ngan: Is it possible to start the standalone Pulsar server programmatically? I need to run unit tests of my application with the Pulsar server running. Can I start an embedded Pulsar server from my unit test code and shut it down when the tests finish?
----
2019-02-08 16:30:15 UTC - Grant Wu: That sounds possible, yes. I actually do that.
----
2019-02-08 16:31:31 UTC - Vincent Ngan: Can you show me how to do it?
----
2019-02-08 16:31:43 UTC - Grant Wu: It really depends on your CI setup
----
2019-02-08 16:31:55 UTC - Grant Wu: We use GitLab CI
----
2019-02-08 16:32:13 UTC - Grant Wu: We start the image using
```
services:
  - name: ${PULSAR_TESTING_IMAGE}
    alias: pulsar-broker
```
(which is actually a deprecated legacy feature…)
----
2019-02-08 16:32:33 UTC - Grant Wu: ```
ARG DEV_REGISTRY
ARG BASE_IMAGE
FROM ${DEV_REGISTRY}/${BASE_IMAGE}
EXPOSE 6650 8080
RUN bin/apply-config-from-env.py conf/standalone.conf && bin/apply-config-from-env.py conf/pulsar_env.sh
CMD ["bin/pulsar", "standalone"]
```
----
2019-02-08 16:32:35 UTC - Grant Wu: This is our Dockerfile
----
2019-02-08 16:33:04 UTC - Grant Wu: ```
docker-ci:
	$(DOCKER) build --tag $(PACKAGE_REPOSITORY):$(GIT_HEAD_SHA1) --label wraps=$(PULSAR_IMAGE) --build-arg DEV_REGISTRY=$(DEV_REGISTRY) --build-arg BASE_IMAGE=$(PULSAR_IMAGE) ci
```
and the Makefile command we use to build it
----
2019-02-08 16:33:42 UTC - Grant Wu: DEV_REGISTRY is our internal Dockerhub, PACKAGE_REPOSITORY is just the path for the name of this image internally, PULSAR_IMAGE is… pulsar:2.1.0 :man-facepalming: (I really need to update this)
----
2019-02-08 16:34:05 UTC - Grant Wu: Basically this is more of a roll-your-own thing
----
2019-02-08 16:36:22 UTC - Matteo Merli: @Stephen Dotz we have it in the roadmap, though we haven’t got to it yet. The main difference with protobuf is that it’s based on having the generated code available; in the case of Pulsar SQL we won’t have the generated protobuf code in the engine. Though we have added preliminary support in our schema definition to include a mapping between the field names and the proto field tags. That will allow one to “manually” parse the serialized protobuf and convert it into a generic map object that can be fed into Presto.
----
2019-02-08 16:39:27 UTC - Vincent Ngan: Your suggestion is to configure the CI build scripts to start the server. I know that this should work, but I am looking for an even leaner approach: starting the Pulsar server as an embedded server from my Java unit test code during the setup stage of my test suite and shutting it down in the wrap-up stage.
----
2019-02-08 16:40:32 UTC - Grant Wu: oh, I see
----
2019-02-08 16:40:34 UTC - Grant Wu: I have no idea then
----
2019-02-08 16:41:07 UTC - Vincent Ngan: I am looking for something like this: <https://github.com/salesforce/kafka-junit>
----
2019-02-08 16:41:09 UTC - Grant Wu: Didn’t think about the implications of “from my unit test code”
----
2019-02-08 16:49:19 UTC - Vincent Ngan: For Kafka, there are quite a lot of embedded solutions, e.g.
<https://dzone.com/articles/unit-testing-of-kafka>
<https://github.com/tuplejump/embedded-kafka>
<https://github.com/Mayvenn/embedded-kafka>
<https://objectpartners.com/2017/10/25/functional-testing-with-embedded-kafka/>
----
2019-02-08 16:50:52 UTC - Vincent Ngan: Is there anyone doing this for Pulsar?
----
2019-02-08 16:58:04 UTC - Sijie Guo: try <https://github.com/apache/pulsar/blob/133331c15587400b8d7846ff1ca84cf3a8802565/pulsar-broker/src/main/java/org/apache/pulsar/PulsarStandaloneStarter.java>
----
2019-02-08 16:58:32 UTC - Sijie Guo: if you can file an issue for that, it would be really great!
----
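Building on Sijie's pointer, a rough sketch of what a JUnit harness around that class might look like. The constructor arguments and lifecycle methods here are assumptions based on the linked `PulsarStandaloneStarter` source, so verify them against your Pulsar version (the tutorial Ali links a bit further down is a fuller, working version of this idea):
```java
import org.apache.pulsar.PulsarStandaloneStarter;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class EmbeddedPulsarTest {

    private static PulsarStandaloneStarter standalone;

    @BeforeClass
    public static void startPulsar() throws Exception {
        // Assumption: the starter accepts the same flags as `bin/pulsar standalone`
        // and exposes start()/shutdown(); check the linked class for the exact API.
        standalone = new PulsarStandaloneStarter(new String[] {"--wipe-data"});
        standalone.start();
    }

    @AfterClass
    public static void stopPulsar() throws Exception {
        standalone.shutdown();
    }

    // Tests can then connect a PulsarClient to pulsar://localhost:6650
    // and a PulsarAdmin client to http://localhost:8080 as usual.
}
```
----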
2019-02-08 17:29:44 UTC - Chris DiGiovanni: I'm doing some testing with Pulsar and I am using StatefulSets. I bounced my two bookies in K8s and now things are just messed up, mostly due to identification via IP, which is a problem with StatefulSets vs. DaemonSets.
I deleted the journal and ledgers PVCs so they are now clean for both bookies. I also added the config option useHostNameAsBookieID: "true". Though I'm still getting issues around the old cluster, etc… How do I get Pulsar/ZooKeeper to forget about the other nodes and continue w/ the new ones?
----
2019-02-08 17:32:24 UTC - Emma Pollum: That is the host bookie
----
2019-02-08 17:40:13 UTC - Ali Ahmed: @Vincent Ngan here is an example <https://github.com/streamlio/pulsar-embedded-tutorial/blob/master/src/test/java/org/apache/pulsar/PulsarEmbeddedTest.java>
----
2019-02-08 17:41:05 UTC - Chris DiGiovanni: It also looks like useHostNameAsBookieID: "true" in my ConfigMap didn't have the desired effect after rebuilding the whole cluster.
```
org.apache.bookkeeper.bookie.BookieException$InvalidCookieException: Cookie [4
bookieHost: "10.250.10.238:3181"
journalDir: "data/bookkeeper/journal"
ledgerDirs: "1\tdata/bookkeeper/ledgers"
instanceId: "6c5247f4-da4c-4414-bc85-cd9cf0246093"
] is not matching with [4
bookieHost: "10.250.10.237:3181"
journalDir: "data/bookkeeper/journal"
ledgerDirs: "1\tdata/bookkeeper/ledgers"
instanceId: "6c5247f4-da4c-4414-bc85-cd9cf0246093"
```
----
2019-02-08 17:43:58 UTC - Matteo Merli: @Stephen Dotz Not 100% sure of the context, but the fact that subscriptions are fully managed by Pulsar makes it easier for consumers not to have to “think” about offsets: they just attach to a subscription on a topic and it will be positioned in the right place.
----
2019-02-08 17:59:20 UTC - Stephen Dotz: It does seem easy in the API to manipulate a consumer's offset, but is there any way to also manipulate their topic? Or the contents of the topic? Like being able to atomically rename a topic and have their subscription continue from a specified offset (we would handle translating the offset, probably)
----
2019-02-08 18:00:01 UTC - Stephen Dotz: In essence, once we say to a consumer "consume your data from topic xyz", we can never re-write or clean that topic without coordinating with the consumers to start consuming some different topic
----
2019-02-08 18:02:00 UTC - Matteo Merli: I see, so the filtering can always happen after-the-fact, right?
----
2019-02-08 18:03:30 UTC - Matteo Merli: I don’t think there’s a direct way to do exactly that right now, though some time back we were thinking of adding a way to force producers/consumers to reconnect to a different topic, or even a different serviceURL.
----
2019-02-08 18:03:51 UTC - Stephen Dotz: Yeah, that would be a really interesting feature to me
----
2019-02-08 18:04:05 UTC - Matteo Merli: Probably this kind of mechanism would allow for that operation
----
2019-02-08 18:07:51 UTC - Stephen Dotz: So far we've considered doing sort of a topic-name translation inside of a proxy layer that sits in front of Kafka, which would also handle mapping an offset in one topic to a sensible offset in the other. It's a super rough POC and I'm not sure how well it will pan out yet.
----
2019-02-08 18:14:58 UTC - Matteo Merli: Thinking out loud, one other way could be to have some sort of “custom” compaction filter, to filter out the messages in place.
----
2019-02-08 18:22:15 UTC - Matteo Merli: @Chris DiGiovanni Yes, setting up stateful services in K8s is a fun experience :slightly_smiling_face: Bookies need to advertise a specific “address” because the data they contain is already recorded in the metadata as being available at that specific address. That’s the reason for the cookie checks.
To wipe out a bookie, you can use:
```
bin/bookkeeper shell bookieformat -force -deleteCookie
```
----
2019-02-08 18:22:27 UTC - Matteo Merli: (From the bookie you want to wipe)
----
2019-02-08 18:23:54 UTC - Matteo Merli: You can also check an example of bookie deployment with stateful sets in <https://github.com/apache/pulsar/blob/master/deployment/kubernetes/google-kubernetes-engine/bookie.yaml>
----
2019-02-08 18:30:45 UTC - Stephen Dotz: Haha, we have a plan to do something like that for GDPR requests. Basically we uniquely key every single message, so if we need to remove one, we can re-use the key and have it compacted away
----
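A minimal sketch of that keying approach with the Java client (the topic and key are hypothetical). Pulsar's topic compaction keeps only the latest message per key, so republishing a key with a redacted or empty payload supersedes the original once compaction runs:
```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class CompactAwayExample {
    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build()) {
            Producer<byte[]> producer = client.newProducer()
                    .topic("persistent://public/default/user-events") // hypothetical topic
                    .create();

            // Every message carries a unique key.
            producer.newMessage()
                    .key("event-42")
                    .value("payload containing personal data".getBytes())
                    .send();

            // To erase it later, republish the same key; compaction keeps only
            // the latest value per key, and an empty payload acts as a tombstone.
            producer.newMessage()
                    .key("event-42")
                    .value(new byte[0])
                    .send();

            producer.close();
        }
        // Then trigger compaction, e.g.:
        //   bin/pulsar-admin topics compact persistent://public/default/user-events
        // Consumers see the compacted view when they set readCompacted(true).
    }
}
```
----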
2019-02-08 18:31:57 UTC - Stephen Dotz: What is a custom compaction filter though? That sounds a bit cooler
----
2019-02-08 18:32:35 UTC - Matteo Merli: Pass application-defined logic to decide whether to keep or drop a particular message
----
2019-02-08 18:32:52 UTC - Matteo Merli: So that you don’t have to keep a unique key to refer to these messages
----
2019-02-08 18:33:45 UTC - Matteo Merli: Notice: this is not something that’s available today, though it’s not a long stretch either
----
2019-02-08 18:34:03 UTC - Stephen Dotz: Ah okay, I was searching around for it. That sounds like a cool idea
----
2019-02-08 18:35:02 UTC - Stephen Dotz: I wonder if this mechanism could be used to edit the data in place, rather than tombstoning it?
----
2019-02-08 18:35:47 UTC - Matteo Merli: Since the data gets rewritten when compaction runs, I guess that would be possible
----
2019-02-08 18:36:57 UTC - Matteo Merli: Right now, the compaction in Pulsar can be run with a CLI tool even from outside a Pulsar broker. That’s why I’m thinking that passing a custom Filter/Modifier class would be easy to do.
----
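To make the idea concrete, a purely hypothetical sketch of what such a hook could look like. No such API exists in Pulsar today (as Matteo notes above); every name here is invented for illustration:
```java
// Hypothetical application-defined compaction filter: the compactor would call
// it for every message it reads, and the return value would decide whether the
// message survives into the compacted ledger.
public interface CompactionFilter {

    enum Action { KEEP, DROP }

    Action onMessage(String key, byte[] payload);
}

// Example: drop messages whose keys were flagged by GDPR erasure requests,
// without having to tombstone them via re-publishing.
class GdprErasureFilter implements CompactionFilter {

    private final java.util.Set<String> erasedKeys;

    GdprErasureFilter(java.util.Set<String> erasedKeys) {
        this.erasedKeys = erasedKeys;
    }

    @Override
    public Action onMessage(String key, byte[] payload) {
        return erasedKeys.contains(key) ? Action.DROP : Action.KEEP;
    }
}
```
A "Modifier" variant of the same hook, returning a rewritten payload instead of a keep/drop decision, would cover the edit-in-place case discussed above.
----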
2019-02-08 18:38:07 UTC - Stephen Dotz: Interesting. Does the compactor work directly with the BookKeeper API? Or perhaps the physical log format?
----
2019-02-08 18:47:32 UTC - Matteo Merli: Yes, it reads from the broker and stores the compacted data in BookKeeper, sending back a pointer to the compacted ledger
----
2019-02-08 19:07:50 UTC - Chris DiGiovanni: Thanks, I'll take a look.
----
2019-02-08 20:38:37 UTC - Stephen Dotz: That makes sense. Thanks. Is there any GH issue or something I can follow? I'd like to try it as soon as it is ready, whenever that is.
----
2019-02-08 20:52:35 UTC - Matteo Merli: It was only on internal roadmaps :slightly_smiling_face: Just created an issue: <https://github.com/apache/pulsar/issues/3554> +1 : Sijie Guo
----
2019-02-08 22:04:30 UTC - Grant Wu: Is there a way to disable logging from the C++ client (from within the Python client wrapper)?
----
2019-02-08 22:04:49 UTC - Grant Wu: Apparently its log format causes Apache Livy to misbehave :man-facepalming:
----
2019-02-08 22:05:45 UTC - Matteo Merli: Not from Python, unfortunately
----
2019-02-08 22:06:02 UTC - Matteo Merli: It’s possible to pass a logger function in the C++ and Go clients
----
2019-02-08 22:06:28 UTC - Matteo Merli: It would be easy to also add that for Python, but it’s not there at this moment
----
