Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Jeoffrey Lim
To clarify my earlier statement, I will continue working on Maelstrom as an alternative to the official Spark integration with Kafka and keep the KafkaRDDs + Consumers as they are, until I find the official Spark Kafka integration more stable and resilient to Kafka broker issues/failures (reason I have infinite retr

Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Jeoffrey Lim
Hi Cody, thank you for pointing out sub-millisecond processing; it is an "exaggerated" term :D I simply got excited releasing this project. It should be: "millisecond stream processing at the Spark level". I highly appreciate the info about the latest Kafka consumer. Would need to get up to speed about

Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Cody Koeninger
Yes, spark-streaming-kafka-0-10 uses the new consumer. Besides pre-fetching messages, the big reason for that is that security features are only available with the new consumer. The Kafka project is at release 0.10.0.1 now, and they think most of the issues with the new consumer have been ironed out

Re: Maelstrom: Kafka integration with Spark

2016-08-23 Thread Jeoffrey Lim
Apologies, I was not aware that Spark 2.0 has Kafka Consumer caching/pooling now. What I have checked is the latest Kafka Consumer, and I believe it is still in beta quality. https://kafka.apache.org/documentation.html#newconsumerconfigs > Since 0.9.0.0 we have been working on a replacement for o

Re: Maelstrom: Kafka integration with Spark

2016-08-23 Thread Cody Koeninger
Were you aware that the Spark 2.0 / Kafka 0.10 integration also reuses Kafka consumer instances on the executors? On Tue, Aug 23, 2016 at 3:19 PM, Jeoffrey Lim wrote: > Hi, > > I have released the first version of a new Kafka integration with Spark > that we use in the company I work for: open so
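
For readers following the thread, the idea of reusing consumer instances across batches can be sketched roughly as below. This is only an illustration of the general caching pattern, not the code used by Spark or by Maelstrom: `FakeConsumer`, `ConsumerCache`, and the (group id, topic, partition) cache key are all hypothetical names chosen for the example, and no real Kafka connection is opened.

```python
from typing import Callable, Dict, Tuple

# Assumed cache key for this sketch: (group id, topic, partition).
Key = Tuple[str, str, int]


class FakeConsumer:
    """Hypothetical stand-in for a Kafka consumer; opens no real connection."""

    def __init__(self, group_id: str, topic: str, partition: int) -> None:
        self.key: Key = (group_id, topic, partition)
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ConsumerCache:
    """Keeps one consumer per key so later batches reuse it instead of reconnecting."""

    def __init__(self, factory: Callable[[str, str, int], FakeConsumer]) -> None:
        self._factory = factory
        self._consumers: Dict[Key, FakeConsumer] = {}

    def get_or_create(self, group_id: str, topic: str, partition: int) -> FakeConsumer:
        key: Key = (group_id, topic, partition)
        consumer = self._consumers.get(key)
        # Build a new consumer only if none is cached or the cached one was closed.
        if consumer is None or consumer.closed:
            consumer = self._factory(group_id, topic, partition)
            self._consumers[key] = consumer
        return consumer

    def close_all(self) -> None:
        # Close every cached consumer and forget it.
        for consumer in self._consumers.values():
            consumer.close()
        self._consumers.clear()
```

In this sketch, two calls to `get_or_create` with the same group, topic, and partition return the same object, which is what avoids paying the connection setup cost on every batch.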

Maelstrom: Kafka integration with Spark

2016-08-23 Thread Jeoffrey Lim
Hi, I have released the first version of a new Kafka integration with Spark that we use in the company I work for: open sourced and named Maelstrom. It is unique compared to other solutions out there as it reuses the Kafka Consumer connection to achieve sub-millisecond latency. This library has