Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread kant kodali
the results out to many sinks. Look up "Kafka Connect". Regarding event-at-a-time vs micro-batch: I hear arguments from one group of people saying Spark Streaming is real time and another group saying Kafka streaming is the true real time. So do we say micro-batch is real time or event

RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread Mohammed Guller
Just to elaborate more on what Vincent wrote – Kafka streaming provides true record-at-a-time processing capabilities whereas Spark Streaming provides micro-batching capabilities on top of Spark. Depending on your use case, you may find one better than the other. Both provide stateless and stateful
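
A minimal Scala sketch of the micro-batch model described above (source, app name, and interval are assumed placeholders): records are grouped into small batches (here one second) and each batch is processed as an RDD, whereas a Kafka Streams topology would invoke its logic once per record as it arrives.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1)) // batch interval = the latency floor

val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
lines.flatMap(_.split(" "))
  .map(w => (w, 1L))
  .reduceByKey(_ + _)
  .print() // runs once per one-second batch, not once per record

ssc.start()
ssc.awaitTermination()
```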

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vincent gromakowski
cons to argue on this topic. *Yohann Jardin* On 6/11/2017 at 7:08 PM, vaquar khan wrote: Hi Kant, Kafka is a message broker used by producers and consumers, and Spark Streaming is used for real-time processing; Kafka and Spark Streaming work together, not as competitors. Spark Streaming is r

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread yohann jardin
to argue on this topic. Yohann Jardin On 6/11/2017 at 7:08 PM, vaquar khan wrote: Hi Kant, Kafka is a message broker used by producers and consumers, and Spark Streaming is used for real-time processing; Kafka and Spark Streaming work together, not as competitors. Spark Streaming

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vaquar khan
Hi Kant, Kafka is a message broker used by producers and consumers, and Spark Streaming is used for real-time processing; Kafka and Spark Streaming work together, not as competitors. Spark Streaming reads data from Kafka and processes it into micro batches for streaming data. In easy terms

What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread kant kodali
Hi All, I am trying hard to figure out what the real difference is between Kafka Streaming and Spark Streaming, other than saying one can be used as part of microservices (since Kafka streaming is just a library) and the other is a standalone framework by itself. If I can accomplish the same job with one

[Spark STREAMING]: Can not kill job gracefully on spark standalone cluster

2017-06-08 Thread Mariusz D.
There is a problem with killing jobs gracefully in Spark 2.1.0 with spark.streaming.stopGracefullyOnShutdown enabled. I tested killing Spark jobs in many ways and reached some conclusions. 1. With the command spark-submit --master spark:// --kill driver-id, results: it killed all workers almost
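
A minimal sketch of the shutdown setting this thread discusses; with the flag on, a SIGTERM to the driver lets in-flight batches finish before the context stops (app name and interval are placeholders).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("GracefulStop")
  .set("spark.streaming.stopGracefullyOnShutdown", "true") // finish in-flight batches on SIGTERM

val ssc = new StreamingContext(conf, Seconds(10))
// ... define dstreams ...
ssc.start()
ssc.awaitTermination()

// Programmatic alternative: stop and wait for running batches to complete.
// ssc.stop(stopSparkContext = true, stopGracefully = true)
```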

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-07 Thread swetha kasireddy
>>> you are actually mutating it in "set1 ++set2". I suggest creating a new >>> HashMap in the function (and add both maps into it), rather than mutating >>> one of them. >>> >>> On Tue, Jun 6, 2017 at 11:30 AM, SRK <swethakasire...@gmail.co
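
A minimal sketch of the suggestion quoted above: keep the reduce function passed to reduceByKeyAndWindow pure by building a fresh map that merges both inputs instead of mutating either one (immutable Map shown; key and value types assumed).

```scala
// Merge two per-key count maps without touching either input.
val mergeCounts: (Map[String, Long], Map[String, Long]) => Map[String, Long] =
  (m1, m2) =>
    m2.foldLeft(m1) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0L) + v)) }
```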

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-06 Thread Tathagata Das
HashMap in the function (and add both maps into it), rather than mutating >> one of them. >> >> On Tue, Jun 6, 2017 at 11:30 AM, SRK <swethakasire...@gmail.com> wrote: >> >>> Hi, >>> >>> I see the following error when I use ReduceByKeyAndWin

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-06 Thread swetha kasireddy
eating a new > HashMap in the function (and add both maps into it), rather than mutating > one of them. > > On Tue, Jun 6, 2017 at 11:30 AM, SRK <swethakasire...@gmail.com> wrote: > >> Hi, >> >> I see the following error when I use ReduceByKeyAndWindow in my Spark

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-06 Thread Tathagata Das
...@gmail.com> wrote: > Hi, > > I see the following error when I use ReduceByKeyAndWindow in my Spark > Streaming app. I use reduce, invReduce and filterFunction as shown below. > Any idea as to why I get the error? > > java.lang.Exception: Neither previous window has

Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-06 Thread SRK
Hi, I see the following error when I use ReduceByKeyAndWindow in my Spark Streaming app. I use reduce, invReduce and filterFunction as shown below. Any idea as to why I get the error? java.lang.Exception: Neither previous window has value for key, nor new values found. Are you sure your key
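
A minimal sketch of the windowed call in question (source, durations, and checkpoint path assumed). When an inverse function is supplied, checkpointing must be enabled, and the filter function should evict keys whose count reaches zero; keys lingering with no remaining values are a common trigger for the "Neither previous window has value for key" error.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("WindowedCounts"), Seconds(10))
ssc.checkpoint("hdfs:///tmp/checkpoints") // required by the inverse-reduce form

val pairs = ssc.socketTextStream("localhost", 9999).map(w => (w, 1L))

val windowed = pairs.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,  // add counts entering the window
  (a: Long, b: Long) => a - b,  // subtract counts leaving the window
  Seconds(60), Seconds(10),
  filterFunc = (kv: (String, Long)) => kv._2 > 0) // evict keys that drop to zero
windowed.print()
```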

Re: Spark Streaming Job Stuck

2017-06-06 Thread Richard Moorhead
From: Jain, Nishit <nja...@underarmour.com> Sent: Tuesday, June 6, 2017 9:54 AM To: Tathagata Das Cc: user@spark.apache.org Sub

Re: Spark Streaming Job Stuck

2017-06-06 Thread Jain, Nishit
"user@spark.apache.org" <user@spark.apache.org> Subject: Re: Spark Streaming Job Stuck http://spark.apache.org/docs/latest/streaming-programming-guide.html#points-to-remember-1 Hope this helps. On Mon, Jun 5, 2017 at 2:51 PM, Jain, Nishit

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
need to update them transactionally to handle possible recomputations. This is true for both spark streaming and structured streaming. Hope this helps. On Jun 6, 2017 5:56 AM, "ALunar Beach" <alunarbe...@gmail.com> wrote: > Thanks TD. > In pre-structured streaming,
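
A minimal sketch of the transactional-update idea above for the direct stream (spark-streaming-kafka-0-10 shown; directStream and saveAtomically are assumed placeholders): persist each batch's results together with its offset range in one atomic write, so a batch replayed after recovery can be detected and skipped.

```scala
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Placeholder for a sink that writes results and offsets in one transaction.
def saveAtomically(results: Map[String, Long], offsets: Array[OffsetRange]): Unit = ???

directStream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges // exact range of this batch
  val results = rdd.map(_.value).countByValue().toMap          // a deterministic result
  // One transaction for results + offsets: after a restart, already-stored
  // offsets identify a replayed batch so the sink can skip the write.
  saveAtomically(results, offsets)
}
```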

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread ALunar Beach
y with Spark Streaming, i highly recommend > learning Structured Streaming > <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> > instead. > > On Mon, Jun 5, 2017 at 11:16 AM, anbucheeralan <alunarbe...@gmail.com> > wrote: > >> I a

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
This is the expected behavior. There are some confusing corner cases. If you are starting to play with Spark Streaming, I highly recommend learning Structured Streaming <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> instead. On Mon, Jun 5, 2017 at 11

Re: Spark Streaming Job Stuck

2017-06-06 Thread Tathagata Das
http://spark.apache.org/docs/latest/streaming-programming-guide.html#points-to-remember-1 Hope this helps. On Mon, Jun 5, 2017 at 2:51 PM, Jain, Nishit <nja...@underarmour.com> wrote: > I have a very simple spark streaming job running locally in standalone > mode. There is a custo

Spark Streaming Job Stuck

2017-06-05 Thread Jain, Nishit
I have a very simple spark streaming job running locally in standalone mode. There is a custom receiver which reads from a database and passes it to the main job, which prints the total. Not an actual use case, but I am playing around to learn. The problem is that the job gets stuck forever; the logic is very
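
A minimal sketch of the usual cause of this symptom, per the "points to remember" link in the reply above: each receiver permanently occupies one core, so the app needs more cores than receivers or no processing ever runs (master and interval are placeholders).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReceiverDemo")
  .setMaster("local[2]") // must exceed the receiver count; local[1] hangs forever

val ssc = new StreamingContext(conf, Seconds(5))
```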

Fwd: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread anbucheeralan
I am using Spark Streaming Checkpoint and Kafka Direct Stream. It uses a 30 sec batch duration and normally the job is successful in 15-20 sec. If the spark application fails after the successful completion (149668428ms in the log below) and restarts, it's duplicating the last batch again

Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread ALunar Beach
I am using Spark Streaming Checkpoint and Kafka Direct Stream. It uses a 30 sec batch duration and normally the job is successful in 15-20 sec. If the spark application fails after the successful completion (149668428ms in the log below) and restarts, it's duplicating the last batch again

Kafka + Spark Streaming consumer API offsets

2017-06-05 Thread Nipun Arora
I need some clarification for Kafka consumers in Spark or otherwise. I have the following Kafka Consumer. The consumer is reading from a topic, and I have a mechanism which blocks the consumer from time to time. The producer is a separate thread which is continuously sending data. I want to

Re: Message getting lost in Kafka + Spark Streaming

2017-06-01 Thread Vikash Pareek
other and that way > you'll have less final events. > > -Original Message- > From: Vikash Pareek [mailto:vikash.par...@infoobjects.com] > Sent: Tuesday, May 30, 2017 4:00 PM > To: user@spark.apache.org > Subject: Message getting lost in Kafka + Spark Streaming >

RE: Message getting lost in Kafka + Spark Streaming

2017-05-31 Thread Sidney Feiner
. -Original Message- From: Vikash Pareek [mailto:vikash.par...@infoobjects.com] Sent: Tuesday, May 30, 2017 4:00 PM To: user@spark.apache.org Subject: Message getting lost in Kafka + Spark Streaming I am facing an issue related to spark streaming with kafka, my use case is as follow: 1. Spark

Re: Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Cody Koeninger
First thing I noticed, you should be using a singleton kafka producer, not recreating one every partition. It's threadsafe. On Tue, May 30, 2017 at 7:59 AM, Vikash Pareek <vikash.par...@infoobjects.com> wrote: > I am facing an issue related to spark streaming with kafka, my
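
A minimal sketch of the singleton-producer advice above: initialize one KafkaProducer lazily per executor JVM in an object, instead of constructing one inside every partition (broker address and topic are placeholders).

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerHolder {
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props) // created once per JVM; thread-safe
  }
}

// rdd.foreachPartition { part =>
//   part.foreach(m => ProducerHolder.producer.send(new ProducerRecord("out", m)))
// }
```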

Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Vikash Pareek
I am facing an issue related to spark streaming with kafka; my use case is as follows: 1. Spark streaming (DirectStream) application reading data/messages from a kafka topic and processing them 2. On the basis of the processed message, the app will write the processed message to different kafka topics, e.g

Re: [Spark Streaming] DAG Output Processing mechanism

2017-05-29 Thread Nipun Arora
Sending out the message again.. Hopefully someone can clarify :) I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all parts

Re: [Spark Streaming] DAG Output Processing mechanism

2017-05-28 Thread Nipun Arora
Apologies - resending, as the previous mail went out with some unnecessary copy-paste. I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all

[Spark Streaming] DAG Output Processing mechanism

2017-05-28 Thread Nipun Arora
I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all parts of the DAG. Let me give an example: I have

[Spark Streaming] DAG Execution Model Clarification

2017-05-26 Thread Nipun Arora
Hi, I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all parts of the DAG. Let me give an example: I have a dstream A, I do map

Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-23 Thread Manish Malhotra
s. >> >> But we can assume the receiver, is equivalent to writing a JMS based >> custom receiver, where we register a message listener and for each message >> delivered by JMS will be stored by calling store method of listener. >> >> >> Something like :

Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-22 Thread kant kodali
vered by JMS will be stored by calling store method of listener. > > > Something like : https://github.com/tbfenet/spark-jms-receiver/blob/ > master/src/main/scala/org/apache/spark/streaming/jms/JmsReceiver.scala > > Though the diff is here this JMS receiver is using block generator

Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-22 Thread Manish Malhotra
: https://github.com/tbfenet/spark-jms-receiver/blob/master/src/main/scala/org/apache/spark/streaming/jms/JmsReceiver.scala Though the difference here is that this JMS receiver is using the block generator and then calling store. I'm calling store when I receive a message. And also I'm not using the block generator

Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-21 Thread Alonso Isidoro Roman
could you share the code? Alonso Isidoro Roman about.me/alonso.isidoro.roman 2017-05-20 7:54 GMT+02:00 Manish Malhotra : > Hello, >

Spark Streaming: Custom Receiver OOM consistently

2017-05-19 Thread Manish Malhotra
Hello, I have implemented a Java-based custom receiver, which consumes from a messaging system, say JMS. Once a message is received, I call store(object) ... I'm storing a Spark Row object. It runs for around 8 hrs and then goes OOM, and the OOM is happening on the receiver nodes. I also tried to run multiple
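
A minimal sketch of a custom receiver along the lines described (the polling client is a stub): storing batches via an iterator rather than one store(object) call per message, plus a serialized disk-spilling storage level, are the usual first steps against receiver-side OOM.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class QueueReceiver(url: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    new Thread("queue-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {} // the receiving thread checks isStopped and exits

  private def receive(): Unit = {
    while (!isStopped) {
      val batch = pollMessages(url) // placeholder for the JMS/EMS client call
      if (batch.nonEmpty) store(batch.iterator) // fewer, larger blocks per store
    }
  }

  private def pollMessages(u: String): Seq[String] = Seq.empty // stub
}
```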

Spark Streaming: NullPointerException when restoring Spark Streaming job from hdfs/s3 checkpoint

2017-05-16 Thread Richard Moorhead
I'm having some difficulty reliably restoring a streaming job from a checkpoint. When restoring a streaming job constructed from the following snippet, I receive NullPointerExceptions when `map` is called on the restored RDD. lazy val ssc = StreamingContext.getOrCreate(checkpointDir,
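
A minimal sketch of the getOrCreate contract this NPE usually traces back to: every dstream transformation must be defined inside the creating function so a restore rebuilds the same lineage; referencing outer state such as a lazy val context is a common source of NullPointerExceptions after recovery (path and source assumed).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/app-checkpoint" // placeholder

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("Restorable"), Seconds(10))
  ssc.checkpoint(checkpointDir)
  val stream = ssc.socketTextStream("localhost", 9999)
  stream.map(_.length).print() // all transformations defined in here
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```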

Spark Streaming 2.1 recovery

2017-05-16 Thread Dominik Safaric
Hi, currently I am exploring Spark’s fault tolerance capabilities in terms of fault recovery. Namely I run a Spark 2.1 standalone cluster on a master and four worker nodes. The application pulls data using the Kafka direct stream API from a Kafka topic over a (sliding) window of time, and

Spark streaming app leaking memory?

2017-05-16 Thread Srikanth
Hi, I have a Spark streaming(Spark 2.1.0) app where I see these logs in executor. Does this point to some memory leak? 17/05/16 15:11:13 WARN Executor: Managed memory leak detected; size = 67108864 bytes, TID = 7752 17/05/16 15:11:13 WARN Executor: Managed memory leak detected; size = 67108864

Re: Spark streaming - TIBCO EMS

2017-05-15 Thread Piotr Smoliński
<pradeep.mi...@mail.com> wrote: > What is the best way to connect to TIBCO EMS using spark streaming? > > Do we need to write custom receivers or any libraries already exist. > > Thanks, > Pradeep > > -

Spark streaming - TIBCO EMS

2017-05-15 Thread Pradeep
What is the best way to connect to TIBCO EMS using spark streaming? Do we need to write custom receivers, or do any libraries already exist? Thanks, Pradeep

[Spark Streaming] Unknown delay in event timeline

2017-05-10 Thread Zhiwen Sun
Env : Spark 1.6.2 + kafka 0.8.2 DirectStream. Spark streaming job with 1s interval. Process time of a micro batch suddenly became 4s while it is usually 0.4s. When we checked where the time was spent, we found an unknown delay in the job. There is no executor computing or shuffle reading. It is about 4s

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-05 Thread Pierce Lamb
t; As Tim pointed out, Alluxio (renamed from Tachyon) may be able to help > you. Here is some documentation on how to run Alluxio and Spark together > <http://www.alluxio.org/docs/1.4/en/Running-Spark-on-Alluxio.html>, and > here is a blog post on a Spark streaming + A

Spark Streaming 2.1 - slave parallel recovery

2017-05-04 Thread Dominik Safaric
Hi all, I'm running a cluster consisting of a master and four slaves. The cluster runs a Spark application that reads data from a Kafka topic over a window of time, and writes the data back to Kafka. Checkpointing is enabled by using HDFS. However, although Spark periodically commits checkpoints

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-04 Thread Gene Pang
As Tim pointed out, Alluxio (renamed from Tachyon) may be able to help you. Here is some documentation on how to run Alluxio and Spark together <http://www.alluxio.org/docs/1.4/en/Running-Spark-on-Alluxio.html>, and here is a blog post on a Spark streaming + Alluxio use case

RE: [Spark Streaming] - Killing application from within code

2017-05-04 Thread Sidney Feiner
dnesday, May 3, 2017 10:25 PM To: Sidney Feiner <sidney.fei...@startapp.com> Cc: user@spark.apache.org Subject: Re: [Spark Streaming] - Killing application from within code There isnt a clean programmatic way to kill the application running in the driver from the executor. You will ha

Re: In-order processing using spark streaming

2017-05-03 Thread JayeshLalwani
partition. This will add latency to the system though, because you won't process the messages until the watermark has expired. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/In-order-processing-using-spark-streaming-tp28457p28646.html Sent from the Apache
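
A minimal Structured Streaming sketch of the watermark idea mentioned above (column names assumed): events are held back until the watermark passes, which buys complete, ordered windows at the cost of the extra latency the reply warns about.

```scala
import org.apache.spark.sql.functions.{col, window}

// events: a streaming DataFrame with an eventTime column (assumed)
val perDevice = events
  .withWatermark("eventTime", "10 minutes") // tolerate up to 10 min of lateness
  .groupBy(window(col("eventTime"), "5 minutes"), col("deviceId"))
  .count()
```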

Re: [Spark Streaming] - Killing application from within code

2017-05-03 Thread Tathagata Das
fei...@startapp.com> wrote: > Hey, I'm using connections to Elasticsearch from within my Spark Streaming > application. > > I'm using Futures to maximize performance when it sends network requests > to the ES cluster. > > Basically, I want my app to crash if any one of the executo

[Spark Streaming] - Killing application from within code

2017-05-03 Thread Sidney Feiner
Hey, I'm using connections to Elasticsearch from within my Spark Streaming application. I'm using Futures to maximize performance when it sends network requests to the ES cluster. Basically, I want my app to crash if any one of the executors fails to connect to ES. The exception gets caught

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Tim Smith
<nipunarora2...@gmail.com> wrote: > Hi All, > > To support our Spark Streaming based anomaly detection tool, we have made > a patch in Spark 1.6.2 to dynamically update broadcast variables. > > I'll first explain our use-case, which I believe should be common to > several p

[Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Nipun Arora
Hi All, To support our Spark Streaming based anomaly detection tool, we have made a patch in Spark 1.6.2 to dynamically update broadcast variables. I'll first explain our use-case, which I believe should be common to several people using Spark Streaming applications. Broadcast variables
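
A minimal sketch of the stock-Spark approximation of this use case, not the patch the thread proposes: hold the broadcast in a driver-side wrapper that can unpersist and re-broadcast, and let each batch's closures capture the current handle (the loader function is assumed).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import scala.reflect.ClassTag

// Driver-side holder; closures should capture `current`, not the holder itself.
class RefreshableBroadcast[T: ClassTag](sc: SparkContext, load: () => T) {
  @volatile private var bc: Broadcast[T] = sc.broadcast(load())
  def current: Broadcast[T] = bc
  def refresh(): Unit = {           // driver only, e.g. at the top of foreachRDD
    bc.unpersist(blocking = false)
    bc = sc.broadcast(load())
  }
}

// val holder = new RefreshableBroadcast(sc, () => loadLookupTable()) // assumed loader
// dstream.foreachRDD { rdd =>
//   holder.refresh()                   // when the side data has changed
//   val b = holder.current
//   rdd.map(x => (x, b.value)).count() // executors see the fresh value
// }
```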

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
t;>>>> mind that I do not care about exactly-once, hence having messages >>>>> replayed is perfectly fine. >>>>> >>>>>> On 26 Apr 2017, at 19:26, Cody Koeninger <c...@koeninger.org> wrote: >>>>>> >>>>

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
>>>> On 26 Apr 2017, at 19:26, Cody Koeninger <c...@koeninger.org> wrote: >>>>> >>>>> What is it you're actually trying to accomplish? >>>>> >>>>> You can get topic, partition, and offset bounds from an offset range like

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
...@koeninger.org> wrote: >>>> >>>> What is it you're actually trying to accomplish? >>>> >>>> You can get topic, partition, and offset bounds from an offset range like >>>> >>>> http://spark.apache.org/docs/latest/streaming-ka

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
-0-10-integration.html#obtaining-offsets >>> >>> Timestamp isn't really a meaningful idea for a range of offsets. >>> >>> >>> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric >>> <dominiksafa...@gmail.com> wrote: >>>> Hi

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
>> Timestamp isn't really a meaningful idea for a range of offsets. >> >> >> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric >> <dominiksafa...@gmail.com> wrote: >>> Hi all, >>> >>> Because the Spark Streaming direct Kafka consume

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Dominik Safaric
ntegration.html#obtaining-offsets > > Timestamp isn't really a meaningful idea for a range of offsets. > > > On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric > <dominiksafa...@gmail.com> wrote: >> Hi all, >> >> Because the Spark Streaming direct Kafka consumer maps

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
25, 2017 at 2:43 PM, Dominik Safaric <dominiksafa...@gmail.com> wrote: > Hi all, > > Because the Spark Streaming direct Kafka consumer maps offsets for a given > Kafka topic and a partition internally while having enable.auto.commit set > to false, how can I retrieve th

Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-25 Thread Dominik Safaric
Hi all, Because the Spark Streaming direct Kafka consumer maps offsets for a given Kafka topic and a partition internally while having enable.auto.commit set to false, how can I retrieve the offset of each of the consumer's poll calls using the offset ranges of an RDD? More precisely
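
A minimal sketch (spark-streaming-kafka-0-10 API; stream is assumed to be the direct stream) of reading the offsets each micro-batch consumed and committing them back to Kafka by hand, the closest observable equivalent of a per-poll offset when enable.auto.commit is false.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
  }
  // Commit only after the batch's own work has succeeded.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
}
```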

spark streaming resiliency

2017-04-25 Thread vincent gromakowski
Hi, I have a question regarding Spark streaming resiliency, and the documentation is ambiguous: it says that the default configuration uses a replication factor of 2 for received data, but the recommendation is to use write ahead logs to guarantee data resiliency with receivers
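
A minimal sketch of the write-ahead-log setting the documentation describes; with the WAL on, received blocks are persisted to the checkpoint directory before acknowledgement, and the docs then suggest a non-replicated, serialized storage level since the log already provides durability.

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
// Pair with e.g. StorageLevel.MEMORY_AND_DISK_SER (no _2 replication) when
// creating receiver-based streams, as in-memory replication becomes redundant.
```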

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread Sam Elamin
icsearch + kibana too (actually don't know > the differences between ELK and elasticsearch). > I was wondering about pros and cons of using a document indexer vs NoSQL > database. > > > > -- > View this message in context: http://apache-spark-user-list. > 1001560.n

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread tencas
.n3.nabble.com/Spark-Streaming-Real-time-save-data-and-visualize-on-dashboard-tp28587p28589.html

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread Gaurav1809
Maybe you can ingest your data into ELK and use Kibana for live reporting. Of course there can be better ways of doing this. Waiting for others to share their opinions. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Real-time-save-data

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-11 Thread Pierce Lamb
I just don't know if is it possible to use a no-SQL database like Mongo, > Cassandra in collaboration with a monitoring tools like Grafana,Kabana, and > it make sense. > > > > -- > View this message in context: http://apache-spark-user-list. > 1001560.n3.nabble.com/Spark-Strea

Spark Streaming. Real-time save data and visualize on dashboard

2017-04-11 Thread tencas
I've developed an application using Apache Spark Streaming that reads simple info from plane sensors, like acceleration, via TCP sockets in JSON format, and analyses it. I'd like to be able to persist this info from each "flight" in real time, while it is shown on any responsive dashboard

Re: Is checkpointing in Spark Streaming Synchronous or Asynchronous ?

2017-04-11 Thread kant kodali
all your state every time. And we will be making this > asynchronous soon. > > On Fri, Apr 7, 2017 at 3:19 AM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> Is checkpointing in Spark Streaming Synchronous or Asynchronous ? other >> words can

Re: Is checkpointing in Spark Streaming Synchronous or Asynchronous ?

2017-04-10 Thread Tathagata Das
nth...@gmail.com> wrote: > Hi All, > > Is checkpointing in Spark Streaming Synchronous or Asynchronous ? other > words can spark continue processing the stream while checkpointing? > > Thanks! >

Is checkpointing in Spark Streaming Synchronous or Asynchronous ?

2017-04-07 Thread kant kodali
Hi All, Is checkpointing in Spark Streaming synchronous or asynchronous? In other words, can Spark continue processing the stream while checkpointing? Thanks!

RE: How to use ManualClock with Spark streaming

2017-04-05 Thread Mendelson, Assaf
ManualClock with Spark streaming Any updates on how I can use ManualClock other than editing the Spark source code? On Wed, Mar 1, 2017 at 10:19 AM, Hemalatha A <hemalatha.amru...@googlemail.com> wrote: It is certainly possible through a hack. I w
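
A minimal sketch of the hack referenced in this thread: point the streaming scheduler at ManualClock through configuration. ManualClock is an internal, test-only class (private to Spark's packages), so this is a fragile trick rather than a supported API.

```scala
val conf = new org.apache.spark.SparkConf()
  .setMaster("local[2]")
  .setAppName("ManualClockTest")
  // Internal test hook: the scheduler ticks only when the test advances the clock.
  .set("spark.streaming.clock", "org.apache.spark.util.ManualClock")
```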

Spark Streaming Kafka Job has strange behavior for certain tasks

2017-04-05 Thread Justin Miller
Greetings! I've been running various spark streaming jobs to persist data from kafka topics and one persister in particular seems to have issues. I've verified that the number of messages is the same per partition (roughly of course) and the volume of data is a fraction of the volume of other

Re: How to use ManualClock with Spark streaming

2017-04-05 Thread Hemalatha A
...@gmail.com> > wrote: > >> I don't think using ManualClock is a right way to fix your problem here >> in Spark Streaming. >> >> ManualClock in Spark is mainly used for unit test, it should manually >> advance the time to make the unit test work. The usage looks

Re: Spark streaming + kafka error with json library

2017-03-30 Thread Srikanth
Thanks for the tip. That worked. When would one use the assembly? On Wed, Mar 29, 2017 at 7:13 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote: > Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly) > > On Wed, Mar 29, 2017 at 9:59 AM, Srikan

Re: Spark streaming + kafka error with json library

2017-03-29 Thread Tathagata Das
Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly) On Wed, Mar 29, 2017 at 9:59 AM, Srikanth <srikanth...@gmail.com> wrote: > Hello, > > I'm trying to use "org.json4s" % "json4s-native" library in a spark > streaming + k

Spark streaming + kafka error with json library

2017-03-29 Thread Srikanth
Hello, I'm trying to use the "org.json4s" % "json4s-native" library in a spark streaming + kafka direct app. When I use the latest version of the lib I get an error similar to this <https://github.com/json4s/json4s/issues/316>. The workaround suggested there is to use ve
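
A minimal build.sbt sketch of the fix suggested in the reply: depend on the plain spark-streaming-kafka artifact rather than the assembly, so json4s resolves from your own dependency list (versions are examples only).

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0", // not the -assembly
  "org.json4s" %% "json4s-native" % "3.2.11" // align with the version Spark itself ships
)
```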

spark streaming write orc suddenly slow?

2017-03-27 Thread 446463...@qq.com
Hi All: when I use Spark Streaming to consume Kafka data and store it in HDFS in ORC file format, it's fast in the beginning, but becomes slow several hours later. environment: spark version: 2.1.0 kafka version 0.10.1.1 spark-streaming-kafka jar's version: spark-streaming-kafka-0-8-2.11

Re: Spark streaming to kafka exactly once

2017-03-23 Thread Maurin Lenglart
ically be possible to handle this in Spark but you'll probably have a > better time handling duplicates in the service that reads from Kafka. > > On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart <mau...@cuberonlabs.com> > wrote: >> >> Hi

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Cody Koeninger
9 PM, Maurin Lenglart <mau...@cuberonlabs.com> > wrote: >> >> Hi, >> we are trying to build a spark streaming solution that subscribe and push >> to kafka. >> >> But we are running into the problem of duplicates events. >> >> Right now, I a

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Matt Deaver
wrote: > Hi, > we are trying to build a spark streaming solution that subscribe and push > to kafka. > > But we are running into the problem of duplicates events. > > Right now, I am doing a “forEachRdd” and loop over the message of each > partition and send those message to ka

Spark streaming to kafka exactly once

2017-03-22 Thread Maurin Lenglart
Hi, we are trying to build a spark streaming solution that subscribes to and pushes to kafka. But we are running into the problem of duplicate events. Right now, I am doing a “forEachRdd”, looping over the messages of each partition and sending those messages to kafka. Is there any good way of solving

[ Spark Streaming & Kafka 0.10 ] Possible bug

2017-03-22 Thread Afshartous, Nick
Hi, I think I'm seeing a bug in the context of upgrading to using the Kafka 0.10 streaming API. Code fragments follow. -- Nick JavaInputDStream> rawStream = getDirectKafkaStream(); JavaDStream> messagesTuple =

Re: [Spark Streaming+Kafka][How-to]

2017-03-22 Thread Cody Koeninger
.foreach(... code to write to cassandra ...) >> >> On Fri, Mar 17, 2017 at 7:35 AM, OUASSAIDI, Sami <sami.ouassa...@mind7.fr >> > wrote: >> >>> @Cody : Duly noted. >>> @Michael Ambrust : A repartition is out of the question for our project >

Re: [Spark Streaming+Kafka][How-to]

2017-03-21 Thread OUASSAIDI, Sami
<sami.ouassa...@mind7.fr> > wrote: > >> @Cody : Duly noted. >> @Michael Ambrust : A repartition is out of the question for our project >> as it would be a fairly expensive operation. We tried looking into >> targeting a specific executor so as to avoid this extra c

Spark Streaming questions, just 2

2017-03-21 Thread shyla deshpande
Hello all, I have a couple of spark streaming questions. Thanks. 1. In the case of stateful operations, the data is, by default, persisted in memory. Does in-memory mean MEMORY_ONLY? When is it removed from memory? 2. I do not see any documentation for spark.cleaner.ttl

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-21 Thread shyla deshpande
for the response. Can you please provide more explanation. I am >> having multiple streams in the spark streaming application (Spark 2.0.2 >> using DStreams). I know many people using this setting. So your >> explanation will help a lot of people. >> >> Thanks >>

Re: How to use ManualClock with Spark streaming

2017-03-20 Thread ??????????
.com>; Cc: "spark users"<user@spark.apache.org>; Subject: Re: How to use ManualClock with Spark streaming I don't think using ManualClock is a right way to fix your problem here in Spark Streaming. ManualClock in Spark is mainly used for unit test, it should manually advance the time to ma

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-20 Thread darin
This issue on stackoverflow maybe help https://stackoverflow.com/questions/42641573/why-does-memory-usage-of-spark-worker-increases-with-time/42642233#42642233 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-exectors-memory-increasing

Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-20 Thread Cody Koeninger
You want spark.streaming.kafka.maxRatePerPartition for the direct stream. On Sat, Mar 18, 2017 at 3:37 PM, Mal Edwin wrote: > > Hi, > You can enable backpressure to handle this. > > spark.streaming.backpressure.enabled > spark.streaming.receiver.maxRate > > Thanks, >
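
A minimal sketch combining the two suggestions in this thread: cap every batch of the direct stream (records per second per partition) and let backpressure tune the rate once the backlog drains.

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")      // adaptive rate control
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // hard cap, direct stream
// spark.streaming.receiver.maxRate is the analogous cap for receiver-based streams.
```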

Re: Foreachpartition in spark streaming

2017-03-20 Thread Ryan
foreachPartition is an action but runs on each worker, which means you won't see anything on the driver. mapPartitions is a transformation which is lazy and won't do anything until an action. It depends on the specific use case which is better. To output something (like a print on a single machine) you could
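
A minimal sketch of the distinction drawn above (the connection type is a placeholder, and rdd is assumed to be an RDD[String]): foreachPartition is an action whose side effects happen on the workers, while mapPartitions stays lazy until some action pulls on it.

```scala
// Placeholder connection type for the sketch.
class Conn { def write(s: String): Unit = println(s); def close(): Unit = () }
def openConnection(): Conn = new Conn

// Action: runs now, per partition, on the workers; nothing returns to the driver.
rdd.foreachPartition { part =>
  val conn = openConnection() // one connection per partition, created on the worker
  part.foreach(conn.write)
  conn.close()
}

// Transformation: nothing happens yet...
val lengths = rdd.mapPartitions(_.map(_.length))
lengths.count() // ...until an action like count() pulls on it
```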

Foreachpartition in spark streaming

2017-03-20 Thread Diwakar Dhanuskodi
Just wanted to clarify!!! Is foreachPartition in Spark an output operation? Which one is better to use, mapPartitions or foreachPartition? Regards Diwakar

Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-18 Thread Mal Edwin
Hi, You can enable backpressure to handle this. spark.streaming.backpressure.enabled spark.streaming.receiver.maxRate Thanks, Edwin On Mar 18, 2017, 12:53 AM -0400, sagarcasual . , wrote: > Hi, we have spark 1.6.1 streaming from Kafka (0.10.1) topic using direct >

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-18 Thread Bill Schwanitz
C -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/javadump.hprof" --conf "spark.kryoserializer.buffer.max=512m" --class com.dtise.data.streaming.ad.DTStreamingStatistics hdfs://nameservice1/user/yanghb/spark-streaming-1.0.jar* ``` And this is the main code: ``` val originalStream

Spark Streaming from Kafka, deal with initial heavy load.

2017-03-17 Thread sagarcasual .
Hi, we have spark 1.6.1 streaming from Kafka (0.10.1) topic using direct approach. The streaming part works fine but when we initially start the job, we have to deal with really huge Kafka message backlog, millions of messages, and that first batch runs for over 40 hours, and after 12 hours or so

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-17 Thread darin
I added this code in the foreachRDD block. ``` rdd.persist(StorageLevel.MEMORY_AND_DISK) ``` The exception no longer occurs, but many dead executors show in the Spark Streaming UI. ``` User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 1194.0

HyperLogLogMonoid for unique visitor count in Spark Streaming

2017-03-17 Thread SRK
Hi, We have a requirement to calculate unique visitors in Spark Streaming. Can HyperLogLogMonoid be applied to a sliding window in Spark Streaming to calculate unique visitors? Any example on how to do that would be of great help. Thanks, Swetha -- View this message in context: http://apache
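
A minimal sketch (Algebird's HyperLogLogMonoid; the source stream is assumed) of the sliding-window unique-visitor estimate asked about: map each visitor id to an HLL, merge with the monoid's + across the window, and read the estimate. 12 bits gives roughly 1.6% standard error.

```scala
import com.twitter.algebird.HyperLogLogMonoid
import org.apache.spark.streaming.Seconds

val hll = new HyperLogLogMonoid(12) // 2^12 registers, ~1.6% standard error

// visitorIds: DStream[String] of visitor identifiers (assumed upstream)
val uniques = visitorIds
  .map(id => hll.create(id.getBytes("UTF-8")))       // one-element HLL per id
  .reduceByWindow(_ + _, Seconds(3600), Seconds(60)) // 1 h window, 1 min slide
uniques.map(h => f"~${h.estimatedSize}%.0f unique visitors").print()
```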

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
ion is out of the question for our project as > it would be a fairly expensive operation. We tried looking into targeting a > specific executor so as to avoid this extra cost and directly have well > partitioned data after consuming the kafka topics. Also we are using Spark > stream

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread OUASSAIDI, Sami
we are using Spark streaming to save to the cassandra DB and try to keep shuffle operations to a strict minimum (at best none). As of now we are not entirely pleased with our current performance; that's why I'm doing a kafka topic sharding POC and getting the executor to handle the specified

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
Sorry, typo. Should be a repartition not a groupBy. > spark.readStream > .format("kafka") > .option("kafka.bootstrap.servers", "...") > .option("subscribe", "t0,t1") > .load() > .repartition($"partition") > .writeStream > .foreach(... code to write to cassandra ...) >

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-16 Thread Yong Zhang
With this kind of question, you always want to tell us the Spark version. Yong From: darin <lidal...@foxmail.com> Sent: Thursday, March 16, 2017 9:59 PM To: user@spark.apache.org Subject: spark streaming executors memory increasing and executor killed by yarn

spark streaming executors memory increasing and executor killed by yarn

2017-03-16 Thread darin
C -XX:+UseParNewGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/javadump.hprof" --conf "spark.kryoserializer.buffer.max=512m" --class com.dtise.data.streaming.ad.DTStreamingStatistics hdfs://nameservice1/user/yanghb/spark-streaming-1.0.jar* ``` And this is the main code: ``

[Spark Streaming] Checkpoint backup (.bk) file purpose

2017-03-16 Thread Bartosz Konieczny
Hello, Actually I'm studying the metadata checkpoint implementation in Spark Streaming and I was wondering about the purpose of the so-called "backup files": CheckpointWriter snippet: > // We will do checkpoint when generating a batch and completing a batch. > When the processing >
