2020-04-03 09:35:58 UTC - Foong Fook Won: @Foong Fook Won has joined the channel
----
2020-04-03 11:15:56 UTC - Poul Henriksen: I have a question about the message listener implementation in the Java client.
In <https://pulsar.apache.org/api/client/org/apache/pulsar/client/api/MessageListener.html#received-org.apache.pulsar.client.api.Consumer-org.apache.pulsar.client.api.Message-> it is stated that:
```This method is called whenever a new message is received. Messages are guaranteed to be delivered in order and from the same thread for a single consumer. This method will only be called once for each message, unless either application or broker crashes. Application is responsible of handling any exception that could be thrown while processing the message.```
Imagine a case of having 2 listener threads in the pool (A and B), and 3 consumers with the following characteristics:
- Consumer 1: 1 message every 20 seconds, but processing always takes 20 seconds
- Consumer 2: 1 message every hour, processing always takes 1 millisecond
- Consumer 3: 1000 messages every second, processing always takes 1 millisecond
Consumers 1 and 3 could then be assigned to the same thread (Thread A). When Consumer 1 is using Thread A, it will block it for 20 seconds, and Consumer 3 will not get any messages processed in that period, even though it could just have used the idling Thread B for processing. Is my understanding correct? If yes, then that seems like a potential performance issue...
----
2020-04-03 13:59:04 UTC - Dhaval Kotecha: @Dhaval Kotecha has joined the channel
----
2020-04-03 16:50:41 UTC - Matteo Merli: The listener is implemented using a thread pool, and you can define the size of that pool (the default is 1). Each consumer is assigned to one of the threads in the pool, in round-robin fashion. If you have 3 consumers and 3 threads, they won't impact each other.
----
2020-04-03 16:51:07 UTC - Matteo Merli: Also, you can always "jump" to other threads to do the processing.
----
2020-04-03 16:56:46 UTC - Poul Henriksen: The above was just an example. With 1000s of consumers I don’t want 1000s of threads just to avoid this; that would defeat the purpose of a thread pool.
----
2020-04-03 16:56:56 UTC - Poul Henriksen: What do you mean by jumping?
----
2020-04-03 16:59:48 UTC - Matteo Merli: Using your own thread pool
----
2020-04-03 17:00:32 UTC - Matteo Merli: At the end of the day, in the presence of blocking operations, it's not possible to offer a "fair" execution environment in a general form.
----
2020-04-03 17:01:08 UTC - Matteo Merli: If you have more information on the priorities of each consumer, you can process their messages in different sets of threads.
----
2020-04-03 17:01:36 UTC - Poul Henriksen: Yes, using my own thread pool would be my workaround. I just wanted to make sure I understood the behaviour.
----
2020-04-03 17:04:02 UTC - Poul Henriksen: I would want all threads to be active whenever there are messages to process. I don’t think this is an issue of fairness so much as one of utilising available resources.
----
2020-04-03 17:14:03 UTC - Poul Henriksen: But thanks for your answer. I'll look at implementing a workaround for it.
----
2020-04-03 17:30:55 UTC - Matteo Merli: The reason consumers are pinned to 1 thread in the listener is to maintain ordering. If you don't need ordering, you can just create an Executor with N threads and do your processing there for all the consumers. All the threads will be busy processing messages, in any order. Just be careful to put a max size on the Executor queue so that backpressure can be applied back on the consumers.
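[Editor's note: a minimal sketch of the listener-pool configuration Matteo describes above, using the Java client's `listenerThreads` builder setting; the service URL, topic, and subscription names are placeholders, not values from this conversation.]
```
import org.apache.pulsar.client.api.*;

public class ListenerPoolSketch {
    public static void main(String[] args) throws PulsarClientException {
        // Size the shared listener thread pool (the default is 1). Each
        // consumer is pinned to one of these threads, assigned round-robin.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder
                .listenerThreads(3)
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")          // placeholder
                .subscriptionName("my-sub") // placeholder
                .messageListener((c, msg) -> {
                    // Runs on the consumer's pinned listener thread; blocking
                    // here stalls every consumer sharing that thread, which is
                    // exactly the scenario Poul describes.
                    c.acknowledgeAsync(msg);
                })
                .subscribe();
    }
}
```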
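[Editor's note: and a sketch of the hand-off Matteo suggests in the last message above: a shared Executor with a bounded queue. CallerRunsPolicy is one assumed way to push back on the listener thread when the queue fills, which in turn applies backpressure to the consumers; per-consumer ordering is lost across the workers, which is why Poul objects below.]
```
import java.util.concurrent.*;

import org.apache.pulsar.client.api.*;

public class ExecutorHandOffSketch {
    public static void main(String[] args) throws PulsarClientException {
        // Bounded queue + CallerRunsPolicy: when the queue is full, the
        // listener thread runs the task itself, which stalls delivery to
        // its consumers and so pushes back on them.
        ExecutorService workers = new ThreadPoolExecutor(
                8, 8, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1_000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder
                .build();

        client.newConsumer()
                .topic("my-topic")          // placeholder
                .subscriptionName("my-sub") // placeholder
                .messageListener((c, msg) -> workers.execute(() -> {
                    try {
                        // Application-specific processing would go here; note
                        // that messages from the same consumer may now be
                        // handled out of order.
                        c.acknowledgeAsync(msg);
                    } catch (Exception e) {
                        c.negativeAcknowledge(msg);
                    }
                }))
                .subscribe();
    }
}
```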
----
2020-04-03 17:39:42 UTC - Poul Henriksen: I do want to ensure ordering though, so unfortunately it will not be that easy.
----
2020-04-03 17:43:59 UTC - Matteo Merli: Then I think using 2 (or more) sets of thread pools, based on processing time, could be a way to do it.
----
2020-04-03 17:48:08 UTC - Adime: @Adime has joined the channel
----
2020-04-03 19:32:02 UTC - Tobias Macey: Hello folks, I am the host of the Data Engineering Podcast and Podcast.__init__, and I am working with O'Reilly on a new book that showcases the work and challenges of data engineers. I am collecting short essays (~400 - 500 words) from industry veterans and experts such as yourself that will be compiled into the final work. This project is a collection of pearls of wisdom gathered from leading experts like you -- whether you are or have been a data engineer, or you work closely with data engineers in your daily work. The collection will include multiple and varied perspectives on key topics, ideas, skills, and future-focused thoughts that data engineers should be aware of.
We're looking for articles on a wide range of topics: building pipelines, batch and stream processing, data privacy, data security, data governance and lineage, data storage, career advice, data team makeup and culture -- and everything in between, from granular issues to big-picture concepts. What are the challenges? How did you fail? What have you learned? When did the light bulb go off? What do you wish you knew when you started?
Submissions are being accepted via the Google form found at <http://boundlessnotions.com/97things|boundlessnotions.com/97things>, which also contains additional details and instructions. Feel free to submit multiple articles, but please have them to me within the next 3 - 4 weeks so that we can be sure to hit our publishing deadline. You can find out more about the existing "97 Things Every Programmer Should Know" book at <https://www.safaribooksonline.com/library/view/97-things-every/9780596809515/>. If you know anyone else who might be interested in participating, please pass this invitation along to your friends and colleagues as well. Thank you for your time and interest, and we look forward to learning from you!
:tada: Matteo Merli, Karthik Ramasamy · :clap: Karthik Ramasamy
----
2020-04-03 23:19:11 UTC - Tim Corbett: I've finished the first draft of my C# performance-testing client. There are some feature gaps, but I hope it shows the issue pretty well. The next few posts will contain the output from that tool running with 200 consumers against a single topic in KeyShared mode.
----
2020-04-03 23:19:57 UTC - Tim Corbett: Ignore the "super-accurate" instance clocks that allow end-to-end latency to come back negative. This is a static load of 200 consumers.
```
=================================================
Receive latency:
Min: 0.0264 Max: 294.9339 Mean: 74.62139855072465
Out-of-bounds results: 0 / 69
0.001 0.01 0.1 1 10 100 1000 10000 100000 1000000
000000000010000000000000000000000001441332211920000000000000000000000000000000000000000000
-------------------------------------------------
End-to-end latency:
Min: -8.7073 Max: 1.4427 Mean: -4.261844927536234
Out-of-bounds results: 61 / 69
0.1 1 10 100 1000 10000 100000 1000000
002004020800000000000000000000000000000000000000000000000000000000000000
=================================================
```
----
2020-04-03 23:20:16 UTC - Tim Corbett: This is from adding a consumer:
```
=================================================
Receive latency:
Min: 0.0179 Max: 2706.6598 Mean: 67.08200945945943
Out-of-bounds results: 0 / 74
0.001 0.01 0.1 1 10 100 1000 10000 100000 1000000
000000000190000000000000100154310301951001100310000000000000000000000000000000000000000000
-------------------------------------------------
End-to-end latency:
Min: 4.3639 Max: 2607.2919 Mean: 1037.0302081081088
Out-of-bounds results: 0 / 74
0.1 1 10 100 1000 10000 100000 1000000
000000000000100000600000000002112001790000000000000000000000000000000000
=================================================
```
----
2020-04-03 23:20:37 UTC - Tim Corbett: And then removing that added consumer, which ended up even worse:
```
=================================================
Receive latency:
Min: 0.02 Max: 9150.0224 Mean: 78.59670518518516
Out-of-bounds results: 0 / 135
0.001 0.01 0.1 1 10 100 1000 10000 100000 1000000
000000000050000000000000000041121210931000000000000000000000000000000000000000000000000000
-------------------------------------------------
End-to-end latency:
Min: 5.8923 Max: 8831.887 Mean: 4458.832968888888
Out-of-bounds results: 0 / 135
0.1 1 10 100 1000 10000 100000 1000000
000000000000000000100000000000000100678695760000000000000000000000000000
=================================================
```
----
2020-04-03 23:21:02 UTC - Tim Corbett: However, the removal case may not have been a clean shutdown. That is something I will investigate as I improve my tool.
----
2020-04-03 23:22:12 UTC - Tim Corbett: The funny line of numbers at the bottom of the reports is a log-linear histogram, normalized such that 9 represents the largest bucket and the other numbers indicate sizes relative to that bucket.
----
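[Editor's note: a minimal Java sketch of the normalization Tim describes; his tool is C#, and `renderRow` and its bucket counts are hypothetical, not code from it.]
```
// Render one histogram row: each bucket becomes a single digit 0-9,
// where 9 marks the largest bucket and the other digits scale linearly
// against it (one plausible reading of Tim's description).
static String renderRow(long[] buckets) {
    long max = 0;
    for (long b : buckets) {
        max = Math.max(max, b);
    }
    StringBuilder row = new StringBuilder(buckets.length);
    for (long b : buckets) {
        // An empty histogram prints all zeros; otherwise scale to 0-9.
        row.append(max == 0 ? 0 : Math.round(9.0 * b / max));
    }
    return row.toString();
}
```
With this linear scaling a small but non-empty bucket can round down to 0, so a real implementation might clamp non-empty buckets to at least 1 to keep them visible.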
