Hi,
I am trying something like this:
val sesDS: Dataset[XXX] = hiveContext.sql(select).as[XXX]
The select statement is something like: "select * from sometable DISTRIBUTE BY col1, col2, col3"
Then comes groupByKey:
val gpbyDS = sesDS.groupByKey(x => (x.col1, x.col2, x.col3))
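A fuller sketch of the pipeline being described, under assumptions: `Session` stands in for the poster's `XXX` case class, the columns and the aggregation in `mapGroups` are invented for illustration, and encoders are assumed in scope via `import hiveContext.implicits._`.

```scala
// Hypothetical stand-in for the poster's XXX case class.
case class Session(col1: String, col2: String, col3: String, value: Long)

val select = "select * from sometable DISTRIBUTE BY col1, col2, col3"
val sesDS: Dataset[Session] = hiveContext.sql(select).as[Session]

// Group on the same three columns used in DISTRIBUTE BY, then reduce
// each group; the summation here is just an illustrative aggregation.
val summed = sesDS
  .groupByKey(x => (x.col1, x.col2, x.col3))
  .mapGroups { (key, rows) => (key, rows.map(_.value).sum) }
```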
Hi,
You could also use this Receiver :
https://github.com/dibbhatt/kafka-spark-consumer
This is part of spark-packages also :
https://spark-packages.org/package/dibbhatt/kafka-spark-consumer
You do not need to enable the WAL with this receiver, and you can still recover from Driver failure with no data loss. You can re
Hi ,
Released the latest version of the Receiver-based Kafka Consumer for Spark Streaming.
Available at Spark Packages: https://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Also at github : https://github.com/dibbhatt/kafka-spark-consumer
Some key features:
- Tuned for better perform
Dibyendu
On Thu, Aug 25, 2016 at 7:01 PM, wrote:
> Hi Dibyendu,
>
> Looks like it is available in 2.0, but we are using an older version of Spark, 1.5.
> Could you please let me know how to use this with older versions.
>
> Thanks,
> Asmath
>
> Sent from my iPhone
>
Hi ,
Released the latest version of the Receiver-based Kafka Consumer for Spark Streaming.
The Receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x, and all Spark versions.
Available at Spark Packages:
https://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Also at github : https://g
You can get some good pointers in this JIRA
https://issues.apache.org/jira/browse/SPARK-15796
Dibyendu
On Thu, Jul 14, 2016 at 12:53 AM, Sunita wrote:
> I am facing the same issue. Upgrading to Spark 1.6 is causing a huge
> performance loss. Could you solve this issue? I am also attempting mem
In a Spark Streaming job, I see a Batch failed with the following error. I haven't
seen anything like this earlier.
This happened during a Shuffle for one Batch (it hasn't reoccurred after
that). Just curious to know what can cause this error. I am running Spark
1.5.1
Regards,
Dibyendu
Job aborted due
:39 AM, Dibyendu Bhattacharya
> wrote:
> > You are using the low level spark kafka consumer. I am the author of the
> same.
>
> If I may ask, what are the differences between this and the direct
> version shipped with spark? I've just started toying with it, and
> would
> On Thu, Jan 7, 2016 at 2:39 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> You are using the low level spark kafka consumer. I am the author of the
>> same.
>>
>> Are you using the spark-packages version? If yes, which one?
>>
With the direct stream, the checkpoint location is not recoverable if you modify your
driver code. So if you rely only on the checkpoint to commit offsets, you can
possibly lose messages if you modify the driver code and you select offsets
from the "largest" offset. If you do not want to lose messages, you need to
com
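The offset bookkeeping being described can be sketched against the Spark 1.x `spark-streaming-kafka` direct-stream API; `saveOffsets` is a hypothetical helper for whatever external store (ZooKeeper, a DB, etc.) you choose.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Capture the exact offset ranges this batch covers.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  // Persist offsets to your own store only after processing succeeds,
  // so a modified driver can resume from them instead of relying on the
  // (code-sensitive) checkpoint directory.
  saveOffsets(offsetRanges) // hypothetical helper
}
```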
brary thinks that my approach of failing (so they KNOW
> there was data loss and can adjust their system) is the right thing to do,
> how do they do that?
>
> On Wed, Dec 2, 2015 at 9:49 PM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>
ly addressing it".
>
> The only really correct way to deal with that situation is to recognize
> why it's happening, and either increase your Kafka retention or increase
> the speed at which you are consuming.
>
> On Wed, Dec 2, 2015 at 7:13 PM, Dibyendu Bhattacharya <
data because your system
> isn't working right is not the same as "recover from any Kafka leader
> changes and offset out of ranges issue".
>
>
>
> On Tue, Dec 1, 2015 at 11:27 PM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
Hi, if you use the Receiver-based consumer which is available in spark-packages
(http://spark-packages.org/package/dibbhatt/kafka-spark-consumer), it
has built-in failure recovery and can recover from any Kafka leader
changes and offset-out-of-range issues.
Here is the package from github:
If you do not need one-to-one semantics and do not want a strict ordering
guarantee, you can very well use the Receiver-based approach, and this
consumer from Spark-Packages (
https://github.com/dibbhatt/kafka-spark-consumer) can give a much better
alternative in terms of performance and reliability.
Hi,
I have raised a JIRA (https://issues.apache.org/jira/browse/SPARK-11045)
to track the discussion, but am also mailing the user group.
This Kafka consumer has been around for a while in spark-packages (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ) and I see
many have started using it. I a
ng the kafka flow.
> (I use spark 1.3.1)
>
> Tks
> Nicolas
>
>
> - Original Mail -
> From: "Dibyendu Bhattacharya"
> To: nib...@free.fr
> Cc: "Cody Koeninger" , "user"
> Sent: Friday, 2 October 2015 18:21:35
> Subject: Re: Spa
RDD will be
> distributed over the nodes/core.
> For example in my example I have 4 nodes (with 2 cores)
>
> Tks
> Nicolas
>
>
> - Original Mail -
> From: "Dibyendu Bhattacharya"
> To: nib...@free.fr
> Cc: "Cody Koeninger" , "user&qu
Hi,
If you need to use the Receiver-based approach, you can try this one:
https://github.com/dibbhatt/kafka-spark-consumer
This is also part of Spark packages:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
You just need to specify the number of Receivers you want for the desired
parallelism.
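Wiring up the receiver count looks roughly like the following. This is a sketch from memory of the project README: the property keys and the `ReceiverLauncher.launch` signature are assumptions and should be verified against https://github.com/dibbhatt/kafka-spark-consumer before use.

```scala
import org.apache.spark.storage.StorageLevel
import consumer.kafka.ReceiverLauncher // package name assumed from the README

// Property names below are assumptions; check the project README.
val props = new java.util.Properties()
props.put("zookeeper.hosts", "zk1,zk2")
props.put("zookeeper.port", "2181")
props.put("kafka.topic", "mytopic")
props.put("kafka.consumer.id", "my-consumer-group")

// One joined stream across however many receivers you ask for.
val numberOfReceivers = 3
val kafkaStream = ReceiverLauncher.launch(
  ssc, props, numberOfReceivers, StorageLevel.MEMORY_ONLY)
```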
m not sure I understand completely. But are you suggesting that
> currently there is no way to enable Checkpoint directory to be in Tachyon?
>
> Thanks
> Nikunj
>
>
> On Fri, Sep 25, 2015 at 11:49 PM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
ne go about configuring spark streaming to use tachyon as its
> place for storing checkpoints? Also, can one do this with tachyon running
> on a completely different node than where spark processes are running?
>
> Thanks
> Nikunj
>
>
> On Thu, May 21, 2015 at 8:35 PM, Dib
Hi Michal,
If you use https://github.com/dibbhatt/kafka-spark-consumer , it comes
with int own built-in back pressure mechanism. By default this is disabled,
you need to enable it to use this feature with this consumer. It does
control the rate based on Scheduling Delay at runtime..
Regards,
Dib
Hi,
Just to clarify one point which may not be clear to many: if someone
decides to use the Receiver-based approach, the best option at this point is
to use https://github.com/dibbhatt/kafka-spark-consumer. This will also
work with the WAL like any other receiver-based consumer. The major issue with
K
Hi,
This has been running in production in many organizations that have adopted
this consumer as an alternative option. The Consumer will run with Spark
1.3.1.
It has been running in Pearson for some time in production.
This is part of spark packages and you can see how to include it in your
mvn
Dear All,
Just released version 1.0.4 of the Low Level Receiver-based
Kafka-Spark-Consumer on spark-packages.org. You can find the latest
release here:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Here is github location : https://github.com/dibbhatt/kafka-spark-consumer
The URL seems to have changed .. here is the one ..
http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html
On Wed, Aug 26, 2015 at 12:32 PM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:
> Sometime back I was playing with Spark and Tachyon and I a
Sometime back I was playing with Spark and Tachyon and I also found this
issue. The issue here is that TachyonBlockManager puts the blocks in with the
WriteType.TRY_CACHE configuration. Because of this, blocks are evicted
from the Tachyon cache when memory is full, and when Spark tries to find the
block it throws
I think you can also give this consumer a try:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer in your
environment. This has been running fine for topics with a large number of
Kafka partitions (> 200) like yours without any issue. No issue with
connections either, as this consumer re-use
Hi,
You can try this Kafka Consumer for Spark, which is also part of Spark
Packages: https://github.com/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Thu, Aug 6, 2015 at 6:48 AM, Sourabh Chandak
wrote:
> Thanks Tathagata. I tried that but BlockGenerator internally uses
> SystemClock which
If you want the offsets of individual Kafka messages, you can use this
consumer from Spark Packages:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Tue, Jul 28, 2015 at 6:18 PM, Shushant Arora
wrote:
> Hi
>
> I am processing kafka messages using spark str
Hi ,
I would just like to clarify a few doubts I have about how the BlockManager behaves.
This is mostly in regard to the Spark Streaming Context.
There are two possible cases where blocks may get dropped / not stored in memory:
Case 1. While writing the block with MEMORY_ONLY settings, if the node's
BlockManager does no
Hi,
There is another option to try: the Receiver-based Low Level Kafka Consumer,
which is part of Spark-Packages (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer). This can
be used with the WAL as well for end-to-end zero data loss.
This is also a Reliable Receiver and commits offsets to
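Enabling the WAL for any receiver-based stream is plain Spark configuration; a minimal sketch (the app name and checkpoint path are examples):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-receiver-wal") // example name
  // Write-ahead log for receiver-based streams (Spark 1.2+): received
  // data is persisted to the checkpoint directory before acknowledgement.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// The WAL lives under the checkpoint directory, so one must be set,
// typically on HDFS or another fault-tolerant filesystem.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // example path
```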
Hi,
Can you please share a more detailed stack trace from your receiver logs, and also
the consumer settings you used? I have never tested the consumer with
Kafka 0.7.3, so I'm not sure if the Kafka version is the issue. Have you tried
building the consumer against Kafka 0.7.3?
Regards,
Dibyendu
On Wed, Jun 10, 2
Seems to be related to this JIRA :
https://issues.apache.org/jira/browse/SPARK-3612 ?
On Tue, Jun 9, 2015 at 7:39 AM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:
> Hi Snehal
>
> Are you running the latest kafka consumer from github/spark-packages ? If
> no
Hi Snehal
Are you running the latest kafka consumer from github/spark-packages? If
not, can you take the latest changes? This low-level receiver will keep
retrying if the underlying BlockManager gives an error. Do you see
those retry cycles in the log? If yes, then there is an issue writing blocks
Hi,
Sometime back I played with distributed rule processing by integrating
Drools with HBase Co-Processors, invoking rules on any incoming data:
https://github.com/dibbhatt/hbase-rule-engine
You can get some idea of how to use Drools rules if you see this
RegionObserverCoprocessor:
https://g
m interface, is returning zero.
>
> On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Just to follow up this thread further .
>>
>> I was doing some fault tolerant testing of Spark Streaming with Tachyon
>>
; https://github.com/apache/spark/pull/6307
>
>
> On Tue, May 19, 2015 at 12:51 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Thanks Sean, you are right. If the driver program is running then I can
>> handle shutdown in the main exit path
Just to add, there is a Receiver based Kafka consumer which uses Kafka Low
Level Consumer API.
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Tue, May 19, 2015 at 9:00 PM, Akhil Das
wrote:
>
> On Tue, May 19, 2015 at 8:10 PM, Shushant Arora wrote:
>
>>
1:03 PM, Sean Owen wrote:
> I don't think you should rely on a shutdown hook. Ideally you try to
> stop it in the main exit path of your program, even in case of an
> exception.
>
> On Tue, May 19, 2015 at 7:59 AM, Dibyendu Bhattacharya
> wrote:
> > You mea
By the way, this happens when I stopped the Driver process ...
On Tue, May 19, 2015 at 12:29 PM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:
> You mean to say within Runtime.getRuntime().addShutdownHook I call
> ssc.stop(stopSparkContext = true, stopGracef
s called or not.
>
> TD
>
> On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Hi,
>>
>> Just figured out that if I want to perform graceful shutdown of Spark
>> Streaming 1.4 ( from master ) , the
Hi,
Just figured out that if I want to perform a graceful shutdown of Spark
Streaming 1.4 (from master), Runtime.getRuntime().addShutdownHook no
longer works. In Spark 1.4 there is Utils.addShutdownHook defined for
Spark Core, which gets called anyway, which leads to graceful shutdown fro
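The alternative discussed later in the thread, stopping from the main exit path instead of a JVM shutdown hook, can be sketched like this (the batch interval and the stop condition are illustrative):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(10))
// ... set up streams ...
ssc.start()
try {
  // Block until termination is requested (e.g. by a marker file or a
  // signal translated by your ops tooling).
  ssc.awaitTermination()
} finally {
  // stopGracefully = true lets in-flight batches finish before exit,
  // instead of relying on Runtime.getRuntime().addShutdownHook, which
  // Spark 1.4's own Utils.addShutdownHook can pre-empt.
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
```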
or you can use this Receiver as well :
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Where you can specify how many Receivers you need for your topic and it
will divide the partitions among the Receivers and return the joined stream
for you.
Say you specified 20 receivers, in
esults
> per partition on the executors, after a shuffle of some kind. You can avoid
> this pretty easily by just using normal scala code to do your
> transformation on the iterator inside a foreachPartition. Again, this
> isn't a "con" of the direct stream api, this
The low level consumer which Akhil mentioned has been running in Pearson
for the last 4-5 months without any downtime. I think this one is the most reliable
"Receiver Based" Kafka consumer for Spark as of today, if you want to put it
that way.
Prior to Spark 1.3 other Receiver based consumers have used Kafka
You can probably try the Low Level Consumer from spark-packages (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) .
How many partitions are there for your topics? Let's say you have 10 topics,
each having 3 partitions; ideally you can create a maximum of 30 parallel
Receivers and 30 str
ith kafka-spark-consumer.
>
> Thanks!
> -neelesh
>
> On Wed, Apr 1, 2015 at 9:49 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Hi,
>>
>> Just to let you know, I have made some enhancement in Low Level Reliable
Hi,
Just to let you know, I have made some enhancements to the Low Level Reliable
Receiver-based Kafka Consumer (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer).
The earlier version used as many Receiver tasks as there are partitions of your
Kafka topic. Now you can configure the desired
omepage now.
>>>
>>> One quick question I care about is:
>>>
>>> If the receivers failed for some reasons(for example, killed brutally by
>>> someone else), is there any mechanism for the receiver to fail over
>>> automatically?
>>>
Which version of Spark you are running ?
You can try this Low Level Consumer :
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
This is designed to recover from various failures and has a very good
fault-recovery mechanism built in. It is being used by many users and at
present we
Thanks Akhil for mentioning this Low Level Consumer (
https://github.com/dibbhatt/kafka-spark-consumer ). Yes, it has a better
fault-tolerance mechanism than any existing Kafka consumer available. It
has no data loss on receiver failure and has the ability to replay or restart
itself in case of failure
its working nicely. I think there is merit in make it
> the default Kafka Receiver for spark streaming.
>
> -neelesh
>
> On Mon, Feb 2, 2015 at 5:25 PM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Or you can use this Low Level Kafka Consu
Or you can use this Low Level Kafka Consumer for Spark:
https://github.com/dibbhatt/kafka-spark-consumer
This is now part of http://spark-packages.org/ and has been running successfully
for the past few months in the Pearson production environment. Being a Low Level
consumer, it does not have this re-balancing
You can probably try the Low Level Consumer option with Spark 1.2:
https://github.com/dibbhatt/kafka-spark-consumer
This Consumer can recover from any underlying failure of the Spark platform or
Kafka, and will either retry or restart the receiver. This has been working
nicely for us.
Regards,
Dibyendu
O
Hi Max,
Is it possible for you to try the Kafka Low Level Consumer which I have written,
which is also part of Spark-Packages? This Consumer is also a Reliable
Consumer, having no data loss on Receiver failure. I have tested this with
Spark 1.2 with spark.streaming.receiver.writeAheadLog.enable set to "true"
example
>> <https://github.com/dibbhatt/kafka-spark-consumer/blob/master/examples/scala/LowLevelKafkaConsumer.scala#L45>
>> which you can run after changing few lines of configurations.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jan 16, 2015 at 12:23
On Fri, Jan 16, 2015 at 11:50 AM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:
> Hi Kidong,
>
> No , I have not tried yet with Spark 1.2 yet. I will try this out and let
> you know how this goes.
>
> By the way, is there any change in Receiver Store method
Hi Kidong,
No , I have not tried yet with Spark 1.2 yet. I will try this out and let
you know how this goes.
By the way, is there any change in Receiver Store method happened in Spark
1.2 ?
Regards,
Dibyendu
On Fri, Jan 16, 2015 at 11:25 AM, mykidong wrote:
> Hi Dibyendu,
>
> I am using k
Hi,
Yes, as Jerry mentioned, SPARK-3129 (
https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature,
which solves the Driver failure problem. The way SPARK-3129 is designed, it
solves the driver failure problem agnostic of the source of the stream (
like Kafka or Flume, etc.). But with
I believe this is something to do with how the Kafka High Level API manages
consumers within a Consumer group and how it re-balances during failure. You
can find some mention of it in this Kafka wiki:
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design
Due to various issues in Kafka
Hi,
For the last few days I have been working on a Spark - Apache Blur connector to index
Kafka messages into Apache Blur using Spark Streaming. We have been working
on building a distributed search platform for our NRT use cases, and we have
been playing with Spark Streaming and Apache Blur for the same. We are
> Right?
>
> Thanks,
>
> Tim
>
>
> On Mon, Sep 15, 2014 at 4:33 AM, Dibyendu Bhattacharya
> wrote:
> > Hi Alon,
> >
> > No this will not be guarantee that same set of messages will come in same
> > RDD. This fix just re-play the messages from last proc
Hi Alon,
No, this will not guarantee that the same set of messages will come in the same
RDD. This fix just re-plays the messages from the last processed offset in the same
order. Again, this is just an interim fix we needed to solve our use case.
If you do not need this message re-play feature, just do not perfo
MB)
>
> The question that I have now is: how to prevent the
> MemoryStore/BlockManager of dropping the block inputs? And should they be
> logged in the level WARN/ERROR?
>
>
> Thanks.
>
>
> On Fri, Sep 12, 2014 at 4:45 PM, Dibyendu Bhattacharya [via Apache Spark
>
oolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
>
>
>
> --
> Nan Zhu
>
> On Thursday, September 11, 2014 at 10:42 AM, Nan Zhu wrote:
>
I agree Gerard. Thanks for pointing this..
Dib
On Thu, Sep 11, 2014 at 5:28 PM, Gerard Maas wrote:
> This pattern works.
>
> One note, though: Use 'union' only if you need to group the data from all
> RDDs into one RDD for processing (like count distinct or need a groupby).
> If your process c
Hi,
You can use this Kafka Spark Consumer:
https://github.com/dibbhatt/kafka-spark-consumer
This does exactly that. It creates parallel Receivers for every Kafka
topic partition. You can see Consumer.java under the consumer.kafka.client
package for an example of how to use it.
There is some
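For comparison, the stock high-level KafkaUtils achieves receiver parallelism via the well-known union pattern (a sketch; the ZooKeeper host, group id, topic name, and receiver count are illustrative):

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver per createStream call, unioned into a single DStream.
val numReceivers = 4 // illustrative
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("mytopic" -> 1))
}
val unified = ssc.union(streams)
```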
data getting written to the target system at a very, very slow
pace.
Regards,
Dibyendu
On Mon, Sep 8, 2014 at 12:08 AM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:
> Hi Tathagata,
>
> I have managed to implement the logic into the Kafka-Spark consumer to
> rec
s helps solve the problem that you all the facing. I am
> going to add this to the stremaing programming guide to make sure this
> common mistake is avoided.
>
> TD
>
>
>
>
> On Wed, Sep 3, 2014 at 10:38 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com>
Hi,
Sorry for the little delay. As discussed in this thread, I have modified the
Kafka-Spark-Consumer ( https://github.com/dibbhatt/kafka-spark-consumer)
code to have a dedicated Receiver for every Topic Partition. You can see an
example of how to create a Union of these receivers
in consumer.kafka.client.C
I think this is a known issue in the existing KafkaUtils. We had this
issue too. The problem is that in the existing KafkaUtils there is no way to control the
message flow.
You can refer to another mail thread on the Low Level Kafka Consumer which I
have written, which solves this issue along with many others.
Dib
On
I agree. This issue should be fixed in Spark rather than relying on replay of Kafka
messages.
Dib
On Aug 28, 2014 6:45 AM, "RodrigoB" wrote:
> Dibyendu,
>
> Tnks for getting back.
>
> I believe you are absolutely right. We were under the assumption that the
> raw data was being computed again and that's
> and isn't it fun proposing a popular piece of code? the question
> floodgates have opened! haha. :)
>
> -chris
>
>
>
> On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Hi Bharat,
>>
>>
Hi Bharat,
Thanks for your email. If the "Kafka Reader" worker process dies, it will
be replaced on a different machine, and it will start consuming from the
offset where it left off (for each partition). The same can happen even
if I tried to have an individual Receiver for every partition.
Regard
Hi,
As I understand, your problem is similar to this JIRA.
https://issues.apache.org/jira/browse/SPARK-1647
The issue in this case is that Kafka cannot replay the messages, as the offsets are
already committed. Also, I think the existing KafkaUtils (the default High
Level Kafka Consumer) has this issue.
Dear All,
Recently I have written a Spark Kafka Consumer to solve this problem. We
too have seen issues with KafkaUtils, which uses the High Level Kafka Consumer,
where the consumer code has no handle on offset management.
The below code solves this problem, and it is being tested in our
Spark Clus
l.com
>> +1 (206) 849-4108
>>
>>
>> On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell
>> wrote:
>>
>>> I'll let TD chime on on this one, but I'm guessing this would be a
>>> welcome addition. It's great to see community effort on adding new
&
You can try this Kafka Spark Consumer which I recently wrote. This uses the
Low Level Kafka Consumer
https://github.com/dibbhatt/kafka-spark-consumer
Dibyendu
On Tue, Aug 5, 2014 at 12:52 PM, rafeeq s wrote:
> Hi,
>
> I am new to Apache Spark and Trying to Develop spark streaming program to
Hi,
I have implemented a Low Level Kafka Consumer for Spark Streaming using the
Kafka Simple Consumer API. This API gives better control over Kafka
offset management and recovery from failures. As the present Spark
KafkaUtils uses the High Level Kafka Consumer API, I wanted to have better
control