Why does groupByKey still shuffle if SQL does "Distribute By" on the same columns?

2018-01-30 Thread Dibyendu Bhattacharya
Hi, I am trying something like this.. val sesDS: Dataset[XXX] = hiveContext.sql(select).as[XXX] The select statement is something like this : "select * from sometable DISTRIBUTE by col1, col2, col3" Then comes groupByKey... val gpbyDS = sesDS .groupByKey(x => (x.col1, x.col2, x.col3))
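A minimal sketch of why the extra shuffle still appears (the case class and column names below are placeholders, not taken from the thread): Dataset.groupByKey takes an opaque lambda, so Catalyst cannot prove that the derived key matches the partitioning produced by DISTRIBUTE BY and plans another Exchange; grouping on the columns themselves lets the planner reuse the existing hash partitioning.

    // Hypothetical case class standing in for the table row type
    case class XXX(col1: String, col2: String, col3: String, value: Long)

    val sesDS: Dataset[XXX] =
      hiveContext.sql("select * from sometable DISTRIBUTE BY col1, col2, col3").as[XXX]

    // Typed API: the key is computed by an arbitrary function, so a new
    // Exchange (shuffle) is planned even though the data is already
    // distributed by the same columns.
    val typed = sesDS.groupByKey(x => (x.col1, x.col2, x.col3)).count()

    // Untyped API: the grouping columns are visible to the optimizer, so the
    // hash partitioning introduced by DISTRIBUTE BY can be reused.
    val untyped = sesDS.groupBy("col1", "col2", "col3").count()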

Re: question on Write Ahead Log (Spark Streaming )

2017-03-10 Thread Dibyendu Bhattacharya
Hi, You could also use this Receiver : https://github.com/dibbhatt/kafka-spark-consumer This is part of spark-packages also : https://spark-packages.org/package/dibbhatt/kafka-spark-consumer You do not need to enable WAL with this receiver and can still recover from Driver failure with no data loss. You can re

Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2017-02-15 Thread Dibyendu Bhattacharya
Hi , Released latest version of Receiver based Kafka Consumer for Spark Streaming . Available at Spark Packages : https://spark-packages.org/package/dibbhatt/ kafka-spark-consumer Also at github : https://github.com/dibbhatt/kafka-spark-consumer Some key features - Tuned for better perform

Re: Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread Dibyendu Bhattacharya
yendu On Thu, Aug 25, 2016 at 7:01 PM, wrote: > Hi Dibyendu, > > Looks like it is available in 2.0, we are using older version of spark 1.5 > . Could you please let me know how to use this with older versions. > > Thanks, > Asmath > > Sent from my iPhone >

Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread Dibyendu Bhattacharya
Hi , Released latest version of Receiver based Kafka Consumer for Spark Streaming. Receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and All Spark Versions Available at Spark Packages : https://spark-packages.org/package/dibbhatt/kafka-spark-consumer Also at github : https://g

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-13 Thread Dibyendu Bhattacharya
You can get some good pointers in this JIRA https://issues.apache.org/jira/browse/SPARK-15796 Dibyendu On Thu, Jul 14, 2016 at 12:53 AM, Sunita wrote: > I am facing the same issue. Upgrading to Spark 1.6 is causing huge > performance > loss. Could you solve this issue? I am also attempting mem

Any Idea about this error : IllegalArgumentException: File segment length cannot be negative ?

2016-07-12 Thread Dibyendu Bhattacharya
In a Spark Streaming job, I see a Batch failed with the following error. Haven't seen anything like this earlier. This happened during Shuffle for one Batch (it hasn't recurred after that).. Just curious to know what can cause this error. I am running Spark 1.5.1 Regards, Dibyendu Job aborted due

Re: [Spark 1.6] Spark Streaming - java.lang.AbstractMethodError

2016-01-07 Thread Dibyendu Bhattacharya
:39 AM, Dibyendu Bhattacharya > wrote: > > You are using low level spark kafka consumer . I am the author of the > same. > > If I may ask, what are the differences between this and the direct > version shipped with spark? I've just started toying with it, and > would

Re: [Spark 1.6] Spark Streaming - java.lang.AbstractMethodError

2016-01-07 Thread Dibyendu Bhattacharya
> On Thu, Jan 7, 2016 at 2:39 AM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: > >> You are using low level spark kafka consumer . I am the author of the >> same. >> >> Are you using the spark-packages version ? if yes which one ? >>

Re: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Dibyendu Bhattacharya
In direct stream the checkpoint location is not recoverable if you modify your driver code. So if you just rely on the checkpoint to commit offsets , you can possibly lose messages if you modify driver code and you select offsets from the "largest" offset. If you do not want to lose messages, you need to com
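A minimal sketch of committing offsets yourself with the direct stream, using the OffsetRange information Spark attaches to each batch (the process() and saveOffsets() helpers are hypothetical placeholders for your own processing and offset store):

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directStream.foreachRDD { rdd =>
      // Offset ranges for this batch, one per Kafka topic-partition
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Process the batch first ...
      rdd.foreachPartition(iter => iter.foreach(record => process(record)))

      // ... then persist (topic, partition, untilOffset) to your own store
      // (ZooKeeper, HBase, an RDBMS, ...) instead of relying on the checkpoint.
      ranges.foreach(r => saveOffsets(r.topic, r.partition, r.untilOffset))
    }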

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-02 Thread Dibyendu Bhattacharya
brary thinks that my approach of failing (so they KNOW > there was data loss and can adjust their system) is the right thing to do, > how do they do that? > > On Wed, Dec 2, 2015 at 9:49 PM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: > >

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-02 Thread Dibyendu Bhattacharya
ly addressing it". > > The only really correct way to deal with that situation is to recognize > why it's happening, and either increase your Kafka retention or increase > the speed at which you are consuming. > > On Wed, Dec 2, 2015 at 7:13 PM, Dibyendu Bhattacharya <

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-02 Thread Dibyendu Bhattacharya
data because your system > isn't working right is not the same as "recover from any Kafka leader > changes and offset out of ranges issue". > > > > On Tue, Dec 1, 2015 at 11:27 PM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: >

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread Dibyendu Bhattacharya
Hi, if you use the Receiver based consumer which is available in spark-packages ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) , this has all the built-in failure recovery and it can recover from any Kafka leader changes and offset out of range issues. Here is the package from github :

Re: Need more tasks in KafkaDirectStream

2015-10-28 Thread Dibyendu Bhattacharya
If you do not need one to one semantics and do not want a strict ordering guarantee , you can very well use the Receiver based approach, and this consumer from Spark-Packages ( https://github.com/dibbhatt/kafka-spark-consumer) can give much better alternatives in terms of performance and reliabilit

Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-24 Thread Dibyendu Bhattacharya
Hi, I have raised a JIRA ( https://issues.apache.org/jira/browse/SPARK-11045) to track the discussion but am also mailing the user group . This Kafka consumer has been around for a while in spark-packages ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ) and I see many have started using it , I a

Re: Spark Streaming over YARN

2015-10-04 Thread Dibyendu Bhattacharya
ng the kafka flow. > (I use spark 1.3.1) > > Tks > Nicolas > > > - Original Message - > From: "Dibyendu Bhattacharya" > To: nib...@free.fr > Cc: "Cody Koeninger" , "user" > Sent: Friday 2 October 2015 18:21:35 > Subject: Re: Spa

Re: Spark Streaming over YARN

2015-10-02 Thread Dibyendu Bhattacharya
RDD will be > distributed over the nodes/cores. > For example in my example I have 4 nodes (with 2 cores) > > Tks > Nicolas > > > - Original Message - > From: "Dibyendu Bhattacharya" > To: nib...@free.fr > Cc: "Cody Koeninger" , "user"

Re: Spark Streaming over YARN

2015-10-02 Thread Dibyendu Bhattacharya
Hi, If you need to use Receiver based approach , you can try this one : https://github.com/dibbhatt/kafka-spark-consumer This is also part of Spark packages : http://spark-packages.org/package/dibbhatt/kafka-spark-consumer You just need to specify the number of Receivers you want for desired par

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-26 Thread Dibyendu Bhattacharya
m not sure I understand completely. But are you suggesting that > currently there is no way to enable Checkpoint directory to be in Tachyon? > > Thanks > Nikunj > > > On Fri, Sep 25, 2015 at 11:49 PM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: >
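For what the quoted question is asking, a hedged sketch of pointing the Spark Streaming checkpoint directory at Tachyon; it assumes the Tachyon Hadoop-compatible filesystem client is on the classpath, and the master host, port and path are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // Any Hadoop-compatible filesystem URI can be used as the checkpoint
    // location; "tachyon://" resolves only if the tachyon-client jar is present.
    ssc.checkpoint("tachyon://tachyon-master:19998/spark/checkpoints")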

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-25 Thread Dibyendu Bhattacharya
ne go about configuring spark streaming to use tachyon as its > place for storing checkpoints? Also, can one do this with tachyon running > on a completely different node than where spark processes are running? > > Thanks > Nikunj > > > On Thu, May 21, 2015 at 8:35 PM, Dib

Re: Managing scheduling delay in Spark Streaming

2015-09-16 Thread Dibyendu Bhattacharya
Hi Michal, If you use https://github.com/dibbhatt/kafka-spark-consumer , it comes with its own built-in back pressure mechanism. By default this is disabled; you need to enable it to use this feature with this consumer. It controls the rate based on the Scheduling Delay at runtime.. Regards, Dib
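For comparison, a minimal sketch of the stock rate control that Spark Streaming itself offers (from Spark 1.5 onward); the application name and rate value are placeholders:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-rate-control")
      // Built-in back pressure: adjusts the ingestion rate at runtime based on
      // batch scheduling delay and processing time.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional hard cap per receiver (records per second).
      .set("spark.streaming.receiver.maxRate", "10000")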

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dibyendu Bhattacharya
Hi, Just to clarify one point which may not be clear to many. If someone decides to use the Receiver based approach , the best option at this point is to use https://github.com/dibbhatt/kafka-spark-consumer. This will also work with WAL like any other receiver based consumer. The major issue with K

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dibyendu Bhattacharya
Hi, This has been running in production in many organizations that have adopted this consumer as an alternative option. The consumer will run with Spark 1.3.1 . It has been running at Pearson for some time in production. This is part of spark packages and you can see how to include it in your mvn

Just Released V1.0.4 Low Level Receiver Based Kafka-Spark-Consumer in Spark Packages having built-in Back Pressure Controller

2015-08-26 Thread Dibyendu Bhattacharya
Dear All, Just now released the 1.0.4 version of Low Level Receiver based Kafka-Spark-Consumer in spark-packages.org . You can find the latest release here : http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Here is github location : https://github.com/dibbhatt/kafka-spark-consumer

Re: BlockNotFoundException when running spark word count on Tachyon

2015-08-26 Thread Dibyendu Bhattacharya
The URL seems to have changed .. here is the one .. http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html On Wed, Aug 26, 2015 at 12:32 PM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Sometime back I was playing with Spark and Tachyon and I a

Re: BlockNotFoundException when running spark word count on Tachyon

2015-08-26 Thread Dibyendu Bhattacharya
Sometime back I was playing with Spark and Tachyon and I also found this issue . The issue here is that TachyonBlockManager puts the blocks with the WriteType.TRY_CACHE configuration . And because of this, Blocks are evicted from the Tachyon Cache when Memory is full, and when Spark tries to find the block it throws

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Dibyendu Bhattacharya
I think you can also give a try to this consumer : http://spark-packages.org/package/dibbhatt/kafka-spark-consumer in your environment. This has been running fine for topics with a large number of Kafka partitions ( > 200 ) like yours without any issue.. no issue with connections as this consumer re-use

Re: Reliable Streaming Receiver

2015-08-05 Thread Dibyendu Bhattacharya
Hi, You can try This Kafka Consumer for Spark which is also part of Spark Packages . https://github.com/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Thu, Aug 6, 2015 at 6:48 AM, Sourabh Chandak wrote: > Thanks Tathagata. I tried that but BlockGenerator internally uses > SystemClock which

Re: spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Dibyendu Bhattacharya
If you want the offset of individual kafka messages , you can use this consumer from Spark Packages .. http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Tue, Jul 28, 2015 at 6:18 PM, Shushant Arora wrote: > Hi > > I am processing kafka messages using spark str

Some BlockManager Doubts

2015-07-09 Thread Dibyendu Bhattacharya
Hi , Just would like to clarify a few doubts I have about how BlockManager behaves . This is mostly in regard to the Spark Streaming Context . There are two possible cases where Blocks may get dropped / not stored in memory. Case 1. While writing the Block with MEMORY_ONLY settings , if the Node's BlockManager does no

Re: spark streaming with kafka reset offset

2015-06-27 Thread Dibyendu Bhattacharya
Hi, There is another option to try: the Receiver Based Low Level Kafka Consumer which is part of Spark-Packages ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) . This can be used with WAL as well for end to end zero data loss. This is also a Reliable Receiver and commits offsets to

Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread Dibyendu Bhattacharya
Hi, Can you please share a more detailed stack trace from your receiver logs and also the consumer settings you used ? I have never tested the consumer with Kafka 0.7.3 ..not sure if the Kafka version is the issue . Have you tried building the consumer using Kafka 0.7.3 ? Regards, Dibyendu On Wed, Jun 10, 2

Re: [Kafka-Spark-Consumer] Spark-Streaming Job Fails due to Futures timed out

2015-06-08 Thread Dibyendu Bhattacharya
Seems to be related to this JIRA : https://issues.apache.org/jira/browse/SPARK-3612 ? On Tue, Jun 9, 2015 at 7:39 AM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Hi Snehal > > Are you running the latest kafka consumer from github/spark-packages ? If > no

Re: [Kafka-Spark-Consumer] Spark-Streaming Job Fails due to Futures timed out

2015-06-08 Thread Dibyendu Bhattacharya
Hi Snehal Are you running the latest kafka consumer from github/spark-packages ? If not, can you take the latest changes? This low level receiver will keep retrying if the underlying BlockManager gives an error. Do you see those retry cycles in the log ? If yes, then there is an issue writing blocks

Re: Spark Streaming and Drools

2015-05-22 Thread Dibyendu Bhattacharya
Hi, Sometime back I played with Distributed Rule processing by integrating Drools with HBase Co-Processors ..and invoking Rules on any incoming data .. https://github.com/dibbhatt/hbase-rule-engine You can get some idea of how to use Drools rules if you see this RegionObserverCoprocessor .. https://g

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Dibyendu Bhattacharya
m interface, is returning zero. > > On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: > >> Just to follow up this thread further . >> >> I was doing some fault tolerant testing of Spark Streaming with Tachyon >>

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-20 Thread Dibyendu Bhattacharya
; https://github.com/apache/spark/pull/6307 > > > On Tue, May 19, 2015 at 12:51 AM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: > >> Thenka Sean . you are right. If driver program is running then I can >> handle shutdown in main exit path

Re: spark streaming doubt

2015-05-19 Thread Dibyendu Bhattacharya
Just to add, there is a Receiver based Kafka consumer which uses Kafka Low Level Consumer API. http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Tue, May 19, 2015 at 9:00 PM, Akhil Das wrote: > > On Tue, May 19, 2015 at 8:10 PM, Shushant Arora > wrote: > >>

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Dibyendu Bhattacharya
1:03 PM, Sean Owen wrote: > I don't think you should rely on a shutdown hook. Ideally you try to > stop it in the main exit path of your program, even in case of an > exception. > > On Tue, May 19, 2015 at 7:59 AM, Dibyendu Bhattacharya > wrote: > > You mea

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Dibyendu Bhattacharya
By the way this happens when I stopped the Driver process ... On Tue, May 19, 2015 at 12:29 PM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > You mean to say within Runtime.getRuntime().addShutdownHook I call > ssc.stop(stopSparkContext = true, stopGracef

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Dibyendu Bhattacharya
s called or not. > > TD > > On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: > >> Hi, >> >> Just figured out that if I want to perform graceful shutdown of Spark >> Streaming 1.4 ( from master ) , the

Spark Streaming graceful shutdown in Spark 1.4

2015-05-18 Thread Dibyendu Bhattacharya
Hi, Just figured out that if I want to perform a graceful shutdown of Spark Streaming 1.4 ( from master ) , Runtime.getRuntime().addShutdownHook no longer works . In Spark 1.4 there is Utils.addShutdownHook defined for Spark Core, which gets called anyway , and that leads to graceful shutdown fro
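A minimal sketch of the approach this thread converges on: stop the context explicitly on the main exit path instead of registering a JVM shutdown hook (the checkStopSignal() helper is hypothetical, standing in for whatever external stop marker the job watches):

    ssc.start()
    // Poll for an external stop signal rather than relying on
    // Runtime.getRuntime().addShutdownHook, which in Spark 1.4 races with
    // Spark Core's own Utils.addShutdownHook.
    var stopRequested = false
    while (!stopRequested && !ssc.awaitTerminationOrTimeout(10000L)) {
      stopRequested = checkStopSignal()   // e.g. look for an HDFS marker file
    }
    ssc.stop(stopSparkContext = true, stopGracefully = true)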

Re: force the kafka consumer process to different machines

2015-05-13 Thread Dibyendu Bhattacharya
or you can use this Receiver as well : http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Where you can specify how many Receivers you need for your topic and it will divide the partitions among the Receivers and return the joined stream for you . Say you specified 20 receivers , in

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Dibyendu Bhattacharya
esults > per partition on the executors, after a shuffle of some kind. You can avoid > this pretty easily by just using normal scala code to do your > transformation on the iterator inside a foreachPartition. Again, this > isn't a "con" of the direct stream api, this
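A minimal sketch of the per-partition pattern described in the quoted reply, assuming a stream variable that yields (key, value) string pairs; writeToSink() is a hypothetical placeholder for the external write:

    stream.foreachRDD { rdd =>                 // rdd: RDD[(String, String)]
      rdd.foreachPartition { iter =>
        // Plain Scala aggregation over the partition's iterator; no extra
        // shuffle stage is needed for a per-partition result.
        val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
        iter.foreach { case (k, _) => counts(k) += 1L }
        writeToSink(counts)
      }
    }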

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Dibyendu Bhattacharya
The low level consumer which Akhil mentioned , has been running in Pearson for last 4-5 months without any downtime. I think this one is the reliable "Receiver Based" Kafka consumer as of today for Spark .. if you say it that way .. Prior to Spark 1.3 other Receiver based consumers have used Kafka

Re: Some questions on Multiple Streams

2015-04-24 Thread Dibyendu Bhattacharya
You can probably try the Low Level Consumer from spark-packages ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) . How many partitions are there for your topics ? Let's say you have 10 topics , each having 3 partitions ; ideally you can create at most 30 parallel Receivers and 30 str
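A minimal sketch of the multiple-receivers-plus-union layout being described, shown here with the stock high-level receiver (KafkaUtils.createStream) since that API ships with Spark; the ZooKeeper address, group and topic names are placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // One receiver per stream; union them only if the downstream logic needs
    // all partitions in a single DStream.
    val numReceivers = 3
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group",
        Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
    }
    val unified = ssc.union(streams)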

Re: Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Dibyendu Bhattacharya
ith kafka-spark-consumer. > > Thanks! > -neelesh > > On Wed, Apr 1, 2015 at 9:49 AM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: > >> Hi, >> >> Just to let you know, I have made some enhancement in Low Level Reliable &

Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Dibyendu Bhattacharya
Hi, Just to let you know, I have made some enhancements in the Low Level Reliable Receiver based Kafka Consumer ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) . The earlier version used as many Receiver tasks as the number of partitions of your kafka topic . Now you can configure the desired

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Dibyendu Bhattacharya
omepage now. >>> >>> One quick question I care about is: >>> >>> If the receivers failed for some reasons (for example, killed brutally by >>> someone else), is there any mechanism for the receiver to fail over >>> automatically? >>>

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Dibyendu Bhattacharya
Which version of Spark are you running ? You can try this Low Level Consumer : http://spark-packages.org/package/dibbhatt/kafka-spark-consumer This is designed to recover from various failures and has a very good fault recovery mechanism built in. This is being used by many users and at present we

Re: Spark streaming app shutting down

2015-02-04 Thread Dibyendu Bhattacharya
Thanks Akhil for mentioning this Low Level Consumer ( https://github.com/dibbhatt/kafka-spark-consumer ) . Yes, it has a better fault tolerance mechanism than any existing Kafka consumer available . This has no data loss on receiver failure and has the ability to replay or restart itself in case of failure

Re: Error when Spark streaming consumes from Kafka

2015-02-02 Thread Dibyendu Bhattacharya
it's working nicely. I think there is merit in making it > the default Kafka Receiver for spark streaming. > > -neelesh > > On Mon, Feb 2, 2015 at 5:25 PM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: > >> Or you can use this Low Level Kafka Consu

Re: Error when Spark streaming consumes from Kafka

2015-02-02 Thread Dibyendu Bhattacharya
Or you can use this Low Level Kafka Consumer for Spark : https://github.com/dibbhatt/kafka-spark-consumer This is now part of http://spark-packages.org/ and has been running successfully for the past few months in Pearson's production environment . Being a Low Level consumer, it does not have this re-balancing

Re: Spark Streaming with Kafka

2015-01-21 Thread Dibyendu Bhattacharya
You can probably try the Low Level Consumer option with Spark 1.2 https://github.com/dibbhatt/kafka-spark-consumer This Consumer can recover from any underlying failure of the Spark Platform or Kafka and either retry or restart the receiver. This has been working nicely for us. Regards, Dibyendu O

Re: ReliableKafkaReceiver stopped receiving data after WriteAheadLogBasedBlockHandler throws TimeoutException

2015-01-18 Thread Dibyendu Bhattacharya
Hi Max, Is it possible for you to try the Kafka Low Level Consumer which I have written, which is also part of Spark-Packages ? This Consumer is also a Reliable Consumer with no data loss on Receiver failure. I have tested this with Spark 1.2 with spark.streaming.receiver.writeAheadLog.enable as "true"
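A minimal sketch of the WAL setting mentioned here (the application name and checkpoint path are placeholders; the write ahead log is stored under the checkpoint directory, so it should live on a reliable filesystem such as HDFS):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("receiver-with-wal")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    // WAL files are written under the checkpoint directory.
    ssc.checkpoint("hdfs:///user/spark/checkpoints/my-app")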

Re: Low Level Kafka Consumer for Spark

2015-01-16 Thread Dibyendu Bhattacharya
example >> <https://github.com/dibbhatt/kafka-spark-consumer/blob/master/examples/scala/LowLevelKafkaConsumer.scala#L45> >> which you can run after changing few lines of configurations. >> >> Thanks >> Best Regards >> >> On Fri, Jan 16, 2015 at 12:23

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread Dibyendu Bhattacharya
On Fri, Jan 16, 2015 at 11:50 AM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Hi Kidong, > > No , I have not tried yet with Spark 1.2 yet. I will try this out and let > you know how this goes. > > By the way, is there any change in Receiver Store method

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread Dibyendu Bhattacharya
Hi Kidong, No, I have not tried it with Spark 1.2 yet. I will try this out and let you know how it goes. By the way, is there any change in the Receiver Store method in Spark 1.2 ? Regards, Dibyendu On Fri, Jan 16, 2015 at 11:25 AM, mykidong wrote: > Hi Dibyendu, > > I am using k

Re: Low Level Kafka Consumer for Spark

2014-12-03 Thread Dibyendu Bhattacharya
Hi, Yes, as Jerry mentioned, SPARK-3129 ( https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature which solves the Driver failure problem. The way SPARK-3129 is designed , it solves the driver failure problem agnostic of the source of the stream ( like Kafka or Flume etc) But with

Re: Error when Spark streaming consumes from Kafka

2014-11-22 Thread Dibyendu Bhattacharya
I believe this is something to do with how the Kafka High Level API manages consumers within a Consumer group and how it re-balances during failures. You can find some mention in this Kafka wiki. https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design Due to various issues in Kafka

Spark - Apache Blur Connector : Index Kafka Messages into Blur using Spark Streaming

2014-09-22 Thread Dibyendu Bhattacharya
Hi, For the last few days I have been working on a Spark - Apache Blur Connector to index Kafka messages into Apache Blur using Spark Streaming. We have been working to build a distributed search platform for our NRT use cases and we have been playing with Spark Streaming and Apache Blur for the same. We are

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Dibyendu Bhattacharya
> Right? > > Thanks, > > Tim > > > On Mon, Sep 15, 2014 at 4:33 AM, Dibyendu Bhattacharya > wrote: > > Hi Alon, > > > > No this will not be guarantee that same set of messages will come in same > > RDD. This fix just re-play the messages from last proc

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Dibyendu Bhattacharya
Hi Alon, No, this will not guarantee that the same set of messages will come in the same RDD. This fix just re-plays the messages from the last processed offset in the same order. Again this is just an interim fix we needed to solve our use case . If you do not need this message re-play feature, just do not perfo

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Dibyendu Bhattacharya
MB) > > The question that I have now is: how to prevent the > MemoryStore/BlockManager of dropping the block inputs? And should they be > logged in the level WARN/ERROR? > > > Thanks. > > > On Fri, Sep 12, 2014 at 4:45 PM, Dibyendu Bhattacharya [via Apache Spark >

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Dibyendu Bhattacharya
oolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > > > > > -- > Nan Zhu > > On Thursday, September 11, 2014 at 10:42 AM, Nan Zhu wrote: >

Re: How to scale more consumer to Kafka stream

2014-09-11 Thread Dibyendu Bhattacharya
I agree Gerard. Thanks for pointing this.. Dib On Thu, Sep 11, 2014 at 5:28 PM, Gerard Maas wrote: > This pattern works. > > One note, thought: Use 'union' only if you need to group the data from all > RDDs into one RDD for processing (like count distinct or need a groupby). > If your process c

Re: How to scale more consumer to Kafka stream

2014-09-10 Thread Dibyendu Bhattacharya
Hi, You can use this Kafka Spark Consumer. https://github.com/dibbhatt/kafka-spark-consumer It does exactly that . It creates parallel Receivers for every Kafka topic partition. You can look at Consumer.java under the consumer.kafka.client package for an example of how to use it. There is some

Re: Low Level Kafka Consumer for Spark

2014-09-10 Thread Dibyendu Bhattacharya
data getting written to the target system at a very, very slow pace. Regards, Dibyendu On Mon, Sep 8, 2014 at 12:08 AM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Hi Tathagata, > > I have managed to implement the logic into the Kafka-Spark consumer to > rec

Re: Low Level Kafka Consumer for Spark

2014-09-07 Thread Dibyendu Bhattacharya
s helps solve the problem that you all are facing. I am > going to add this to the streaming programming guide to make sure this > common mistake is avoided. > > TD > > > > > On Wed, Sep 3, 2014 at 10:38 AM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com>

Re: Low Level Kafka Consumer for Spark

2014-09-03 Thread Dibyendu Bhattacharya
Hi, Sorry for the little delay . As discussed in this thread, I have modified the Kafka-Spark-Consumer ( https://github.com/dibbhatt/kafka-spark-consumer) code to have a dedicated Receiver for every Topic Partition. You can see the example of how to create a Union of these receivers in consumer.kafka.client.C

Re: Kafka stream receiver stops input

2014-08-27 Thread Dibyendu Bhattacharya
I think this is a known issue in the existing KafkaUtils. Even we had this issue. The problem is that in the existing KafkaUtils there is no way to control the message flow. You can refer to another mail thread on the Low Level Kafka Consumer which I have written to solve this issue along with many others.. Dib On

Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread Dibyendu Bhattacharya
I agree. This issue should be fixed in Spark rather than relying on replay of Kafka messages. Dib On Aug 28, 2014 6:45 AM, "RodrigoB" wrote: > Dibyendu, > > Tnks for getting back. > > I believe you are absolutely right. We were under the assumption that the > raw data was being computed again and that's

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Dibyendu Bhattacharya
> and isn't it fun proposing a popular piece of code? the question > floodgates have opened! haha. :) > > -chris > > > > On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: > >> Hi Bharat, >> >>

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Dibyendu Bhattacharya
Hi Bharat, Thanks for your email. If the "Kafka Reader" worker process dies, it will be replaced on a different machine, and it will start consuming from the offset where it left off ( for each partition). The same would happen even if I had an individual Receiver for every partition. Regard

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Dibyendu Bhattacharya
Hi, As I understand, your problem is similar to this JIRA. https://issues.apache.org/jira/browse/SPARK-1647 The issue in this case is that Kafka can not replay the messages as offsets are already committed. Also I think the existing KafkaUtils ( the default High Level Kafka Consumer) also has this issue.

Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Dibyendu Bhattacharya
Dear All, Recently I have written a Spark Kafka Consumer to solve this problem. We have also seen issues with KafkaUtils, which uses the Highlevel Kafka Consumer, where the consumer code has no handle on offset management. The below code solves this problem, and this is being tested in our Spark Clus

Re: Low Level Kafka Consumer for Spark

2014-08-05 Thread Dibyendu Bhattacharya
l.com >> +1 (206) 849-4108 >> >> >> On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell >> wrote: >> >>> I'll let TD chime on on this one, but I'm guessing this would be a >>> welcome addition. It's great to see community effort on adding new &

Re: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-05 Thread Dibyendu Bhattacharya
You can try this Kafka Spark Consumer which I recently wrote. This uses the Low Level Kafka Consumer https://github.com/dibbhatt/kafka-spark-consumer Dibyendu On Tue, Aug 5, 2014 at 12:52 PM, rafeeq s wrote: > Hi, > > I am new to Apache Spark and Trying to Develop spark streaming program to

Low Level Kafka Consumer for Spark

2014-08-02 Thread Dibyendu Bhattacharya
Hi, I have implemented a Low Level Kafka Consumer for Spark Streaming using Kafka Simple Consumer API. This API will give better control over the Kafka offset management and recovery from failures. As the present Spark KafkaUtils uses HighLevel Kafka Consumer API, I wanted to have a better control