Re: Graceful shutdown for Spark Streaming

2015-08-10 Thread Tathagata Das
July 2015 at 08:19, anshu shukla anshushuk...@gmail.com wrote: Yes, I was doing the same; if you mean that this is the correct way to do it, then I will verify it once more in my case. On Thu, Jul 30, 2015 at 1:02 PM, Tathagata Das t...@databricks.com wrote: How is sleep not working? Are you

Re: stopping spark stream app

2015-08-10 Thread Tathagata Das
at end of each batch and if set then gracefully stop the application ? On Tue, Aug 11, 2015 at 12:43 AM, Tathagata Das t...@databricks.com wrote: In general, it is a little risky to put long running stuff in a shutdown hook as it may delay shutdown of the process which may delay other things

Re: Spark Kinesis Checkpointing/Processing Delay

2015-08-10 Thread Tathagata Das
You are correct. The earlier Kinesis receiver (as of Spark 1.4) was not saving checkpoints correctly and was in general not reliable (even with WAL enabled). We have improved this in Spark 1.5 with an updated Kinesis receiver that keeps track of the Kinesis sequence numbers as part of the Spark

Re: Checkpoint Dir Error in Yarn

2015-08-07 Thread Tathagata Das
Have you tried to do what it's suggesting? If you want to learn more about checkpointing, you can see the programming guide - http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing For more in-depth understanding, you can see my talk -

Re: [Spark Streaming] Session based windowing like in google dataflow

2015-08-07 Thread Tathagata Das
You can use Spark Streaming's updateStateByKey to do arbitrary sessionization. See the example - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala All it does is store a single number (count of each word seeing
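
A minimal sketch of the updateStateByKey pattern referenced above — a running count per key, as in the linked example; the function and variable names here are illustrative:

    import org.apache.spark.streaming.dstream.DStream

    // Keep a running count per key across batches. Requires ssc.checkpoint(...)
    // to be set, since the state RDDs are periodically checkpointed.
    def runningCounts(words: DStream[String]): DStream[(String, Int)] = {
      // newValues = values for this key in the current batch, state = previous count
      def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
        Some(newValues.sum + state.getOrElse(0))
      words.map((_, 1)).updateStateByKey[Int](updateCount _)
    }

For sessionization the state can be any serializable object (for example a session with a last-seen timestamp) instead of a count.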

Re: Reliable Streaming Receiver

2015-08-05 Thread Tathagata Das
You could very easily strip out the BlockGenerator code from the Spark source code and use it directly in the same way the Reliable Kafka Receiver uses it. BTW, you should know that we will be deprecating the receiver based approach for the Direct Kafka approach. That is quite flexible, can give

Re: Checkpoint file not found

2015-08-03 Thread Tathagata Das
Can you tell us more about the streaming app? Which DStream operations are you using? On Sun, Aug 2, 2015 at 9:14 PM, Anand Nalya anand.na...@gmail.com wrote: Hi, I'm writing a Streaming application in Spark 1.3. After running for some time, I'm getting the following exception. I'm sure that no other

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-31 Thread Tathagata Das
: Tathagata, Could the bottleneck possibly be the number of executor nodes in our cluster? Since we are creating 500 DStreams based off 500 textfile directories, do we need at least 500 executors / nodes to be receivers for each one of the streams? On Tue, Jul 28, 2015 at 6:09 PM, Tathagata Das t

Re: Problems with JobScheduler

2015-07-30 Thread Tathagata Das
Yes, and that is indeed the problem. It is trying to process all the data in Kafka, and therefore taking 60 seconds. You need to set the rate limits for that. On Thu, Jul 30, 2015 at 8:51 AM, Cody Koeninger c...@koeninger.org wrote: If you don't set it, there is no maximum rate, it will get

Re: Does Spark Streaming need to list all the files in a directory?

2015-07-30 Thread Tathagata Das
For the first time it needs to list them. After that the list should be cached by the file stream implementation (as far as I remember). On Thu, Jul 30, 2015 at 3:55 PM, Brandon White bwwintheho...@gmail.com wrote: Is this a known bottleneck for Spark Streaming textFileStream? Does it need

Re: How RDD lineage works

2015-07-30 Thread Tathagata Das
You have to read the original Spark paper to understand how RDD lineage works. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf On Thu, Jul 30, 2015 at 9:25 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at: core/src/test/scala/org/apache/spark/CheckpointSuite.scala

Re: Heatmap with Spark Streaming

2015-07-30 Thread Tathagata Das
I do suggest that the non-Spark-related discussions be taken to a different forum, as they do not directly contribute to the contents of this user list. On Thu, Jul 30, 2015 at 8:52 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Thanks for the valuable suggestion. I also started with

Re: Getting the number of slaves

2015-07-30 Thread Tathagata Das
To clarify, that is the number of executors requested by the SparkContext from the cluster manager. On Tue, Jul 28, 2015 at 5:18 PM, amkcom amk...@gmail.com wrote: try sc.getConf.getInt("spark.executor.instances", 1) -- View this message in context:

Re: Re: How RDD lineage works

2015-07-30 Thread Tathagata Das
. -- bit1...@163.com From: bit1...@163.com Date: 2015-07-31 13:11 To: Tathagata Das tathagata.das1...@gmail.com; yuzhihong yuzhih...@gmail.com CC: user user@spark.apache.org Subject: Re: Re: How RDD lineage works Thanks TD and Zhihong for the guide

Re: Graceful shutdown for Spark Streaming

2015-07-30 Thread Tathagata Das
in logic, in my case sleep(t.s.) is not working.) So I used to kill the coarseGrained job at each slave by a script. Please suggest something. On Thu, Jul 30, 2015 at 5:14 AM, Tathagata Das t...@databricks.com wrote: StreamingContext.stop(stopGracefully = true) stops the streaming context

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Tathagata Das
Rather than using the accumulator directly, what you can do is something like this to lazily create an accumulator and use it (will get lazily recreated if the driver restarts from a checkpoint) dstream.transform { rdd => val accum = SingletonObject.getOrCreateAccumulator() // single object method to
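
A sketch of the lazily created singleton accumulator described above, using the Spark 1.x accumulator API; the object and accumulator names (and the isValid helper) are made up for illustration:

    import org.apache.spark.{Accumulator, SparkContext}

    object DroppedRecordsCounter {
      @volatile private var instance: Accumulator[Long] = null
      // Created lazily on first use, so it is recreated automatically when the
      // driver restarts from a checkpoint.
      def getOrCreate(sc: SparkContext): Accumulator[Long] = {
        if (instance == null) synchronized {
          if (instance == null) instance = sc.accumulator(0L, "DroppedRecords")
        }
        instance
      }
    }

    // Usage inside the DStream operation (evaluated on the driver each batch);
    // isValid is a placeholder for your own check:
    // dstream.transform { rdd =>
    //   val accum = DroppedRecordsCounter.getOrCreate(rdd.sparkContext)
    //   rdd.filter { r => val ok = isValid(r); if (!ok) accum += 1; ok }
    // }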

Re: restart from last successful stage

2015-07-29 Thread Tathagata Das
. sc.saveAs) and then run a modified version of the job that skips Stage 0, assuming you have a full understanding of the breakdown of stages in your job. On Tue, Jul 28, 2015 at 9:28 PM, Tathagata Das t...@databricks.com wrote: Okay, maybe I am confused on the word would be useful to restart

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Tathagata Das
There is a known issue that Kafka cannot return a leader if there is no data in the topic. I think it was raised in another thread in this forum. Is that the issue? On Wed, Jul 29, 2015 at 10:38 AM, unk1102 umesh.ka...@gmail.com wrote: Hi I have Spark Streaming code which streams from Kafka

Re: Graceful shutdown for Spark Streaming

2015-07-29 Thread Tathagata Das
StreamingContext.stop(stopGracefully = true) stops the streaming context gracefully. Then you can safely terminate the Spark cluster. They are two different steps and need to be done separately, ensuring that the driver process has been completely terminated before the Spark cluster is the
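
For reference, a minimal sketch of the first step (graceful stop of the streaming context); the surrounding application code is assumed:

    import org.apache.spark.streaming.StreamingContext

    def shutDownGracefully(ssc: StreamingContext): Unit = {
      // Finishes processing all the data that has already been received,
      // then stops the SparkContext as well. Terminate the cluster only
      // after this call (and the driver process) has completed.
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }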

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Tathagata Das
previous batch interval's offsets.. On Thu, Jul 30, 2015 at 2:49 AM, Tathagata Das t...@databricks.com wrote: Rather than using accumulator directly, what you can do is something like this to lazily create an accumulator and use it (will get lazily recreated if driver restarts from

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Tathagata Das
(streamTuple._1, SaveMode.Append) } } On Tue, Jul 28, 2015 at 3:42 PM, Tathagata Das t...@databricks.com wrote: I don't think anyone has really run 500 text streams. And parSequences do nothing out there, you are only parallelizing the setup code which does not really compute anything. Also

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Tathagata Das
I don't think anyone has really run 500 text streams. And parSequences do nothing out there: you are only parallelizing the setup code, which does not really compute anything. Also it sets up 500 foreachRDD operations that will get executed in each batch sequentially, so it does not make sense. The

Re: Spark Streaming Json file groupby function

2015-07-28 Thread Tathagata Das
If you are trying to keep such long term state, it will be more robust in the long term to use a dedicated data store (cassandra/HBase/etc.) that is designed for long term storage. On Tue, Jul 28, 2015 at 4:37 PM, swetha swethakasire...@gmail.com wrote: Hi TD, We have a requirement to

Re: Writing streaming data to cassandra creates duplicates

2015-07-28 Thread Tathagata Das
You have to partition that data on the Spark Streaming side by the primary key, and then make sure you insert data into Cassandra atomically per key, or per set of keys in the partition. You can use the combination of the (batch time, partition id) of the RDD inside foreachRDD as the unique id for the
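
A rough sketch of the (batch time, partition id) idea, assuming a generic key-value write; saveAtomically stands in for whatever Cassandra write you use and is not a real connector API:

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.dstream.DStream

    def writeWithoutDuplicates(events: DStream[(String, String)]): Unit = {
      events.foreachRDD { (rdd, batchTime) =>
        rdd.foreachPartition { records =>
          // Unique per (batch, partition): a re-executed task overwrites its own
          // previous attempt instead of appending duplicate rows.
          val uniqueId = s"${batchTime.milliseconds}-${TaskContext.get.partitionId()}"
          // saveAtomically(uniqueId, records.toSeq)  // placeholder for the actual write
        }
      }
    }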

Re: restart from last successful stage

2015-07-28 Thread Tathagata Das
If you are using the same RDDs in the both the attempts to run the job, the previous stage outputs generated in the previous job will indeed be reused. This applies to core though. For dataframes, depending on what you do, the physical plan may get generated again leading to new RDDs which may

Re: How to deal with the spark streaming application while upgrade spark

2015-07-23 Thread Tathagata Das
Currently that is the best way. On Thu, Jul 23, 2015 at 12:51 AM, JoneZhang joyoungzh...@gmail.com wrote: My Spark Streaming on Kafka application is running in Spark 1.3. I want to upgrade Spark to 1.4 now. How to deal with the Spark Streaming application? Save the Kafka topic partition

Re: user threads in executors

2015-07-22 Thread Tathagata Das
under the hood? If the complete batch is not loaded, will using a custom size of 100-200 requests in one batch and posting that help, instead of the whole partition? On Wed, Jul 22, 2015 at 12:18 AM, Tathagata Das t...@databricks.com wrote: If you can post multiple items at a time, then use foreachPartition

Re: spark streaming 1.3 issues

2015-07-22 Thread Tathagata Das
For Java, do OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges(); If you fix that error, you should be seeing data. You can call arbitrary RDD operations on a DStream, using DStream.transform. Take a look at the docs. For the direct kafka approach you are doing, - tasks

Re: spark streaming 1.3 coalesce on kafkadirectstream

2015-07-22 Thread Tathagata Das
With DirectKafkaStream there are two approaches. 1. you increase the number of Kafka partitions; Spark will automatically read in parallel 2. if that's not possible, then explicitly repartition only if there are more cores in the cluster than the number of Kafka partitions, AND the first map-like

Re: Does Spark streaming support is there with RabbitMQ

2015-07-22 Thread Tathagata Das
You could contact the authors of the spark-packages.. maybe that will help? On Mon, Jul 20, 2015 at 6:41 AM, Jeetendra Gangele gangele...@gmail.com wrote: Thanks Todd, I'm not sure whether somebody has used it or not. Can somebody confirm if this integrates nicely with Spark Streaming? On

Re: Convert Simple Kafka Consumer to standalone Spark JavaStream Consumer

2015-07-21 Thread Tathagata Das
From what I understand about your code, it is getting data from different partitions of a topic - get all data from partition 1, then from partition 2, etc. Though you have configured it to read from just one partition (topicCount has count = 1). So I am not sure what your intention is, read all

Re: user threads in executors

2015-07-21 Thread Tathagata Das
If you can post multiple items at a time, then use foreachPartition to post the whole partition in a single request. On Tue, Jul 21, 2015 at 9:35 AM, Richard Marscher rmarsc...@localytics.com wrote: You can certainly create threads in a map transformation. We do this to do concurrent DB
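
A small sketch of the foreachPartition pattern; postBatch is a hypothetical client call, not a Spark or HTTP-library API:

    import org.apache.spark.streaming.dstream.DStream

    def postInBatches(dstream: DStream[String]): Unit = {
      dstream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val batch = records.toSeq
          // One request per partition instead of one per record.
          // if (batch.nonEmpty) postBatch(batch)
        }
      }
    }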

Re: spark streaming disk hit

2015-07-21 Thread Tathagata Das
Most shuffle files are really kept around in the OS's buffer/disk cache, so it is still pretty much in memory. If you are concerned about performance, you have to do a holistic comparison for end-to-end performance. You could take a look at this.

Re: Spark streaming Processing time keeps increasing

2015-07-17 Thread Tathagata Das
to provide sufficient information as part of the value so that you can take that decision in the filter function. As always, thanks for your help Nikunj On Thu, Jul 16, 2015 at 1:16 AM, Tathagata Das t...@databricks.com wrote: MAke sure you provide the filterFunction with the invertible

Re: it seem like the exactly once feature not work on spark1.4

2015-07-17 Thread Tathagata Das
Yes. More information in my talk - https://www.youtube.com/watch?v=d5UJonrruHk On Fri, Jul 17, 2015 at 1:15 AM, JoneZhang joyoungzh...@gmail.com wrote: I see now. There are three steps in SparkStreaming + Kafka date processing 1.Receiving the data 2.Transforming the data 3.Pushing out the

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread Tathagata Das
Make sure you provide the filterFunction with the invertible reduceByKeyAndWindow. Otherwise none of the keys will get removed, and the key space will continue to increase. This is what is leading to the lag. So use the filtering function to filter out the keys that are not needed any more. On Thu,
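
A sketch of the invertible reduceByKeyAndWindow with a filter function, assuming integer counts per key; the window and slide durations are illustrative:

    import org.apache.spark.streaming.{Minutes, Seconds}
    import org.apache.spark.streaming.dstream.DStream

    def windowedCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
      pairs.reduceByKeyAndWindow(
        (a: Int, b: Int) => a + b,   // reduce: add counts entering the window
        (a: Int, b: Int) => a - b,   // inverse reduce: subtract counts leaving it
        Minutes(30),                 // window duration
        Seconds(10),                 // slide duration
        4,                           // number of partitions
        // Drop keys whose count has fallen to zero so the key space stays bounded.
        filterFunc = { case (_, count) => count > 0 }
      )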

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Tathagata Das
Your streaming job may have been seemingly running ok, but the DStream checkpointing must have been failing in the background. It would have been visible in the log4j logs. In 1.4.0, we enabled fast-failure for that so that checkpointing failures don't get hidden in the background. The fact that

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Tathagata Das
/SPARK-7180 and squashes the following commits: On Wed, Jul 15, 2015 at 2:37 PM, Chen Song chen.song...@gmail.com wrote: Thanks Can you point me to the patch to fix the serialization stack? Maybe I can pull it in and rerun my job. Chen On Wed, Jul 15, 2015 at 4:40 PM, Tathagata Das t

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Yep :) On Tue, Jul 14, 2015 at 2:44 PM, algermissen1971 algermissen1...@icloud.com wrote: On 14 Jul 2015, at 23:26, Tathagata Das t...@databricks.com wrote: Just to be clear, you mean the Spark Standalone cluster manager's master and not the applications driver, right. Sorry, by now I

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Just to be clear, you mean the Spark Standalone cluster manager's master and not the application's driver, right? In that case, the earlier responses are correct. TD On Tue, Jul 14, 2015 at 11:26 AM, Mohammed Guller moham...@glassbeam.com wrote: The master node does not have to be similar to

Re: Sessionization using updateStateByKey

2015-07-14 Thread Tathagata Das
[Apologies for repost, for those who have seen this response already in the dev mailing list] 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app
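
A sketch of the usual checkpoint setup the answer refers to, including recovery of the context on restart; the checkpoint path and batch interval are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointedAppSketch {
      val checkpointDir = "hdfs:///checkpoints/sessionization"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("SessionizationSketch")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)  // enables periodic state RDD checkpointing
        // ... define the DStream operations here, including updateStateByKey ...
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Rebuilds the context from the checkpoint on restart, creates it otherwise.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }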

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
, Chen Song chen.song...@gmail.com wrote: Thanks TD. As for 1), if timing is not guaranteed, how does exactly once semantics supported? It feels like exactly once receiving is not necessarily exactly once processing. Chen On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das t...@databricks.com

Re: Stopping StreamingContext before receiver has started

2015-07-14 Thread Tathagata Das
This is a known race condition - root cause of SPARK-5681 https://issues.apache.org/jira/browse/SPARK-5681 On Mon, Jul 13, 2015 at 3:35 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, I have noticed that when StreamingContext.stop is called when no receiver has

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
PM, Tathagata Das t...@databricks.com wrote: In the receiver based approach, If the receiver crashes for any reason (receiver crashed or executor crashed) the receiver should get restarted on another executor and should start reading data from the offset present in the zookeeper

Re: Spark Streaming - Inserting into Tables

2015-07-14 Thread Tathagata Das
Why is .remember not ideal? On Sun, Jul 12, 2015 at 7:22 PM, Brandon White bwwintheho...@gmail.com wrote: Hi Yin, Yes there were no new rows. I fixed it by doing a .remember on the context. Obviously, this is not ideal. On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai yh...@databricks.com wrote:

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Tathagata Das
I do not recommend using IndexedRDD for state management in Spark Streaming. What it does not solve out-of-the-box is checkpointing of IndexedRDDs, which is important because long-running streaming jobs can lead to an infinite chain of RDDs. Spark Streaming solves it for the updateStateByKey operation which

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
High Level API. Regards, Dibyendu On Sat, Jun 27, 2015 at 3:04 PM, Tathagata Das t...@databricks.com wrote: In the receiver based approach, If the receiver crashes for any reason (receiver crashed or executor crashed) the receiver should get restarted on another executor and should

Re: Ordering of Batches in Spark streaming

2015-07-14 Thread Tathagata Das
This has been discussed in a number of threads in this mailing list. Here is a summary. 1. Processing of batch T+1 always starts after all the processing of batch T has completed. But here a batch is defined by the data of all the receivers running in the system receiving within the batch

Re: rest on streaming

2015-07-14 Thread Tathagata Das
You can do this. // global variable to keep track of latest stuff var latestTime = _ var latestRDD = _ dstream.foreachRDD((rdd: RDD[..], time: Time) => { latestTime = time latestRDD = rdd }) Now you can asynchronously access the latest RDD. However if you are going to run jobs on the
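
The same snippet, expanded into compilable form (the element type String and the holder object are illustrative):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    object LatestBatchHolder {
      // Global (driver-side) variables tracking the latest batch.
      @volatile var latestTime: Time = null
      @volatile var latestRDD: RDD[String] = null

      def track(dstream: DStream[String]): Unit = {
        dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
          latestTime = time
          latestRDD = rdd
        }
      }
    }
    // A REST handler or any other driver thread can now read
    // LatestBatchHolder.latestRDD asynchronously and run jobs on it; those jobs
    // share the same SparkContext as the streaming jobs, as the answer notes.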

Re: fileStream with old files

2015-07-14 Thread Tathagata Das
It was added, but it's not documented publicly. I am planning to change the name of the conf to spark.streaming.fileStream.minRememberDuration to make it easier to understand. On Mon, Jul 13, 2015 at 9:43 PM, Terry Hole hujie.ea...@gmail.com wrote: A new configuration named

Re: Problems after upgrading to spark 1.4.0

2015-07-13 Thread Tathagata Das
Spark 1.4.0 added shutdown hooks in the driver to cleanly shut down the SparkContext in the driver, which would shut down the executors. I am not sure whether this is related or not, but somehow the executor's shutdown hook is being called. Can you check the driver logs to see if the driver's shutdown

Re: Debug Spark Streaming in PyCharm

2015-07-10 Thread Tathagata Das
spark-submit does a lot of magic configurations (classpaths etc) underneath the covers to enable pyspark to find Spark JARs, etc. I am not sure how you can start running things directly from the PyCharm IDE. Others in the community may be able to answer. For now the main way to run pyspark stuff

Re: reduceByKeyAndWindow with initial state

2015-07-10 Thread Tathagata Das
Are you talking about reduceByKeyAndWindow with or without inverse reduce? TD On Fri, Jul 10, 2015 at 2:07 AM, Imran Alam im...@newscred.com wrote: We have a streaming job that makes use of reduceByKeyAndWindow

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Tathagata Das
1. There will be a long-running job with description start() as that is the job that is running the receivers. It will never end. 2. You need to set the number of cores given to the Spark executors by the YARN container. That is SparkConf spark.executor.cores, --executor-cores in spark-submit.

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Tathagata Das
If you are continuously unioning RDDs, then you are accumulating ever increasing data, and you are processing ever increasing amount of data in every batch. Obviously this is going to not last for very long. You fundamentally cannot keep processing ever increasing amount of data with finite

Re: Problem in Understanding concept of Physical Cores

2015-07-09 Thread Tathagata Das
Aniruddh On Thu, Jul 9, 2015 at 4:43 AM, Tathagata Das t...@databricks.com wrote: There are several levels of indirection going on here, let me clarify. In the local mode, Spark runs tasks (which includes receivers) using the number of threads defined in the master (either local, or local[2

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Tathagata Das
Do you have enough cores in the configured number of executors in YARN? On Thu, Jul 9, 2015 at 2:29 AM, Bin Wang wbi...@gmail.com wrote: I'm using spark streaming with Kafka, and submit it to YARN cluster with mode yarn-cluster. But it hangs at SparkContext.start(). The Kafka config is

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
What were the number of cores in the executor? It could be that you had only one core in the executor which did all the 50 tasks serially, so 50 tasks × 15 ms = ~1 second. Could you take a look at the task details in the stage page to see when the tasks were added to see whether it explains the 5

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Tathagata Das
the same keys that are in myRDD, so the reduceByKey after union keeps the overall tuple count in myRDD fixed. Or even with fixed tuple count, it will keep consuming more resources? On 9 July 2015 at 16:19, Tathagata Das t...@databricks.com wrote: If you are continuously unioning RDDs, then you

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
I am not sure why you are getting node_local and not process_local. Also there is probably not a good documentation other than that configuration page - http://spark.apache.org/docs/latest/configuration.html (search for locality) On Thu, Jul 9, 2015 at 5:51 AM, Michel Hubert

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
You can use DStream.transform for some stuff. Transform takes an RDD => RDD function that allows arbitrary RDD operations to be done on the RDDs of a DStream. This function gets evaluated on the driver on every batch interval. If you are smart about writing the function, it can do different stuff at

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
want to know how makers of Spark Streaming think about this drawback of checkpointing. If anyone had similar experience, suggestions will be appreciated. Jong Wook On 9 July 2015 at 02:13, Tathagata Das t...@databricks.com wrote: You can use DStream.transform for some stuff. Transform

Re: spark core/streaming doubts

2015-07-08 Thread Tathagata Das
Responses inline. On Wed, Jul 8, 2015 at 10:26 AM, Shushant Arora shushantaror...@gmail.com wrote: 1. Is creation of a read-only singleton object in each map function the same as a broadcast object, as a singleton never gets garbage collected unless the executor gets shut down? Aim is to avoid creation

Re: Problem in Understanding concept of Physical Cores

2015-07-08 Thread Tathagata Das
There are several levels of indirection going on here, let me clarify. In the local mode, Spark runs tasks (which includes receivers) using the number of threads defined in the master (either local, or local[2], or local[*]). local or local[1] = single thread, so only one task at a time local[2]

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Tathagata Das
This is also discussed in the programming guide. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Sean. are you asking about foreach vs

Re: pause and resume streaming app

2015-07-08 Thread Tathagata Das
Currently the only way to pause it is to stop it. The way I would do this is use the Direct Kafka API to access the Kafka offsets, and save them to a data store as batches finish. If you see a batch job failing because downstream is down, stop the context. When it comes back up, start a new
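
A sketch of the direct-Kafka pattern described above (Spark 1.x kafka package); saveOffsets is a placeholder for your own data store, and the Kafka parameters are assumed to be supplied by the caller:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    def startWithOffsetTracking(ssc: StreamingContext,
                                kafkaParams: Map[String, String],
                                topics: Set[String]): Unit = {
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)
      stream.foreachRDD { rdd =>
        val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        // ... process the batch ...
        // Persist the offsets once the batch has finished, so a new context can
        // resume from them after a planned stop:
        // saveOffsets(offsets)
      }
    }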

Re: Regarding master node failure

2015-07-07 Thread Tathagata Das
This talk may help - https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/ On Tue, Jul 7, 2015 at 9:51 AM, swetha swethakasire...@gmail.com wrote: Hi, What happens if the master node fails in the case of Spark Streaming? Would the data be lost?

Re: Spark Kafka Direct Streaming

2015-07-07 Thread Tathagata Das
When you enable checkpointing by setting the checkpoint directory, you enable metadata checkpointing. Data checkpointing kicks in only if you are using a DStream operation that requires it, or you are enabling Write Ahead Logs to prevent data loss on driver failure. More discussion -

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-07-06 Thread Tathagata Das
, Juan 2015-06-23 21:57 GMT+02:00 Tathagata Das t...@databricks.com: Aaah, this could potentially be a major issue as it may prevent metrics from a restarted streaming context from being published. Can you make it a JIRA. TD On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá

Re: How to recover in case user errors in streaming

2015-07-06 Thread Tathagata Das
Assudani aassud...@impetus.com wrote: Also, how do you suggest catching exceptions while using with connector API like, saveAsNewAPIHadoopFiles ? From: amit assudani aassud...@impetus.com Date: Monday, June 29, 2015 at 9:55 AM To: Tathagata Das t...@databricks.com Cc: Cody Koeninger c

Re: writing to kafka using spark streaming

2015-07-06 Thread Tathagata Das
Yeah, creating a new producer at the granularity of partitions may not be that costly. On Mon, Jul 6, 2015 at 6:40 AM, Cody Koeninger c...@koeninger.org wrote: Use foreachPartition, and allocate whatever the costly resource is once per partition. On Mon, Jul 6, 2015 at 6:11 AM, Shushant
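
A sketch of the per-partition producer, using the new Kafka producer API; the broker address and topic name are illustrative:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    def writeToKafka(dstream: DStream[String]): Unit = {
      dstream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // One producer per partition per batch, created on the executor.
          val props = new Properties()
          props.put("bootstrap.servers", "broker1:9092")
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          try records.foreach(r => producer.send(new ProducerRecord[String, String]("output-topic", r)))
          finally producer.close()
        }
      }
    }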

Re: writing to kafka using spark streaming

2015-07-06 Thread Tathagata Das
is an operation and another is an action, but if I call an operation afterwards, mapPartitions also, which one is more efficient and recommended? On Tue, Jul 7, 2015 at 12:21 AM, Tathagata Das t...@databricks.com wrote: Yeah, creating a new producer at the granularity of partitions may

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Tathagata Das
...@gmail.com wrote: I had a similar inquiry, copied below. I was also looking into making an SQS Receiver reliable: http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming Hope this helps. -- Forwarded message -- From: Tathagata Das t

Re: Spark driver using Spark Streaming shows increasing memory/CPU usage

2015-06-30 Thread Tathagata Das
Could you give more information on the operations that you are using? The code outline? And what do you mean by Spark Driver receiver events? If the driver is receiving events, how is it being sent to the executors? BTW, for memory usages, I strongly recommend using jmap -histo:live to see what

Re: Explanation of the numbers on Spark Streaming UI

2015-06-30 Thread Tathagata Das
Well, the scheduling delay is the time a batch has to wait for getting resources. So even if there is no backlog in processing and scheduling delay is 0, there is one batch that is being processed at any point of time, which explains the difference. On Tue, Jun 30, 2015 at 2:42 AM,

Re: Spark streaming on standalone cluster

2015-06-30 Thread Tathagata Das
How many receivers do you have in the streaming program? You have to have more cores reserved by your Spark application than the number of receivers. That would explain the receiving output after stopping. TD On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear kazebo...@gmail.com

Re: Serialization Exception

2015-06-30 Thread Tathagata Das
I am guessing one of the two things might work. 1. Either define the pattern SPACE inside the process() 2. Mark the streamingContext field and inputStream field as transient. The problem is that functions like PairFunction need to be serialized to be sent to the tasks. And the whole closure of
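
A sketch of option 1 above in Scala terms: declare the pattern inside the function that runs on the executors, so the enclosing class (with its StreamingContext and input stream fields) is not pulled into the closure; the word-splitting shown here is illustrative:

    import java.util.regex.Pattern
    import org.apache.spark.streaming.dstream.DStream

    def splitWords(lines: DStream[String]): DStream[String] =
      lines.mapPartitions { iter =>
        val space = Pattern.compile(" ")  // local to the task, nothing else is captured
        iter.flatMap(line => space.split(line))
      }

    // Option 2 keeps the fields but excludes them from serialization, e.g.
    //   @transient private val ssc: StreamingContext = ...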

Re: How to recover in case user errors in streaming

2015-06-29 Thread Tathagata Das
On Mon, Jun 29, 2015 at 8:44 AM, Amit Assudani aassud...@impetus.com wrote: Also, how do you suggest catching exceptions while using with connector API like, saveAsNewAPIHadoopFiles ? From: amit assudani aassud...@impetus.com Date: Monday, June 29, 2015 at 9:55 AM To: Tathagata Das t

Re: Checkpoint FS failure or connectivity issue

2015-06-29 Thread Tathagata Das
Yes, the observation is correct. That connectivity is assumed to be HA. On Mon, Jun 29, 2015 at 2:34 PM, Amit Assudani aassud...@impetus.com wrote: Hi All, While using Checkpoints ( using HDFS ), if connectivity to hadoop cluster is lost for a while and gets restored in some time, what

Re: spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-27 Thread Tathagata Das
Could you also provide the code where you set up the Kafka dstream? I don't see it in the snippet. On Fri, Jun 26, 2015 at 2:45 PM, Ashish Nigam ashnigamt...@gmail.com wrote: Here's code - def createStreamingContext(checkpointDirectory: String) : StreamingContext = { val conf = new

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Tathagata Das
= rdd.mapPartitionsWithIndex { (i, kafkaEvent) => . } I understand that I just need a checkpoint if I need to recover the task if something goes wrong, right? 2015-06-27 9:39 GMT+02:00 Tathagata Das t...@databricks.com: How are you trying to execute the code again? From checkpoints

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Tathagata Das
SPARK_CLASSPATH The code doesn't have any stateful operation, so I guess that it's okay it doesn't have a checkpoint. I have executed this code hundreds of times in a VM from Cloudera and never got this error. 2015-06-27 11:21 GMT+02:00 Tathagata Das t...@databricks.com: 1. you need checkpointing mostly

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Tathagata Das
How are you trying to execute the code again? From checkpoints, or otherwise? Also cc'ed Hari who may have a better idea of YARN related issues. On Sat, Jun 27, 2015 at 12:35 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Hi, I'm executing Spark Streaming code with Kafka. The code was

Re: Time is ugly in Spark Streaming....

2015-06-27 Thread Tathagata Das
Could you print the time on the driver (that is, in foreachRDD but before RDD.foreachPartition) and see if it is behaving weird? TD On Fri, Jun 26, 2015 at 3:57 PM, Emrehan Tüzün emrehan.tu...@gmail.com wrote: On Fri, Jun 26, 2015 at 12:30 PM, Sea 261810...@qq.com wrote: Hi, all I find

Re: spark streaming with kafka reset offset

2015-06-27 Thread Tathagata Das
In the receiver based approach, if the receiver crashes for any reason (receiver crashed or executor crashed) the receiver should get restarted on another executor and should start reading data from the offset present in the zookeeper. There is some chance of data loss which can be alleviated using

Re: spark streaming - checkpoint

2015-06-27 Thread Tathagata Das
Do you have SPARK_CLASSPATH set in both cases? Before and after checkpoint? If yes, then you should not be using SPARK_CLASSPATH, it has been deprecated since Spark 1.0 because of its ambiguity. Also where do you have spark.executor.extraClassPath set? I don't see it in the spark-submit command.

Re: How to recover in case user errors in streaming

2015-06-27 Thread Tathagata Das
it in map, but the whole point here is to handle something that is breaking in action ). Please help. :( From: amit assudani aassud...@impetus.com Date: Friday, June 26, 2015 at 11:41 AM To: Cody Koeninger c...@koeninger.org Cc: user@spark.apache.org user@spark.apache.org, Tathagata Das t

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
You must not include spark-core and spark-streaming in the assembly. They are already present in the installation and the presence of multiple versions of spark may throw off the classloaders in weird ways. So make the assembly marking those dependencies as scope=provided. On Tue, Jun 23,

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-06-23 Thread Tathagata Das
Aaah, this could potentially be a major issue as it may prevent metrics from a restarted streaming context from being published. Can you make it a JIRA. TD On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, I'm running a program in Spark 1.4 where

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Tathagata Das
Yes, this is a known behavior. Some static stuff is not serialized as part of a task. On Tue, Jun 23, 2015 at 10:24 AM, Nipun Arora nipunarora2...@gmail.com wrote: I found the error so just posting on the list. It seems broadcast variables cannot be declared static. If you do, you get a null

Re: Calculating tuple count /input rate with time

2015-06-23 Thread Tathagata Das
This should give an accurate count for each batch, though for getting the rate you have to make sure that your streaming app is stable, that is, batches are processed as fast as they are received (scheduling delay in the Spark Streaming UI is approx 0). TD On Tue, Jun 23, 2015 at 2:49 AM, anshu
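
A sketch of per-batch counting; the record type and the way the rate is derived (count divided by the batch interval) are illustrative:

    import org.apache.spark.streaming.dstream.DStream

    def logBatchCounts(dstream: DStream[String], batchIntervalSeconds: Long): Unit = {
      // DStream.count() yields one RDD per batch containing that batch's record count.
      dstream.count().foreachRDD { rdd =>
        val n = rdd.collect().headOption.getOrElse(0L)
        println(s"records in batch: $n, input rate: ${n.toDouble / batchIntervalSeconds} records/s")
      }
    }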

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
/build /project And when I pass streaming jar using --jar option , it threw same java.lang.NoClassDefFoundError: org/apache/spark/util/ThreadUtils$. Thanks On Wed, Jun 24, 2015 at 1:17 AM, Tathagata Das t...@databricks.com wrote: You must not include spark-core and spark-streaming

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
How are you adding that to the classpath? Through spark-submit or otherwise? On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri kmurt...@gmail.com wrote: Yes I have the producer in the class path. And I am using in standalone mode. Sent from my iPhone On 23-Jun-2015, at 3:31 am, Tathagata

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
at 12:17 PM, Tathagata Das t...@databricks.com wrote: How are you adding that to the classpath? Through spark-submit or otherwise? On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri kmurt...@gmail.com wrote: Yes I have the producer in the class path. And I am using in standalone mode

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-23 Thread Tathagata Das
queue stream does not support driver checkpoint recovery since the RDDs in the queue are arbitrarily generated by the user and it's hard for Spark Streaming to keep track of the data in the RDDs (that's necessary for recovering from a checkpoint). Anyway, queue stream is meant for testing and

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-23 Thread Tathagata Das
ConnectionStateManager: There are no ConnectionStateListeners registered. 15/06/22 10:44:01 INFO ZooKeeperLeaderElectionAgent: We have lost leadership 15/06/22 10:44:01 ERROR Master: Leadership has been revoked -- master shutting down. On Thu, Jun 11, 2015 at 8:59 PM, Tathagata Das t...@databricks.com

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
for loading the classes or some thing like that? On Tue, Jun 23, 2015 at 12:28 PM, Tathagata Das t...@databricks.com wrote: So you have Kafka in your classpath in you Java application, where you are creating the sparkContext with the spark standalone master URL, right? The recommended way

Re: Kafka createDirectStream ​issue

2015-06-23 Thread Tathagata Das
Please run it in your own application and not in the spark shell. I see that you are trying to stop the Spark context and create a new StreamingContext. That will lead to the unexpected issues that you are seeing. Please make a standalone SBT/Maven app for Spark Streaming. On Tue, Jun 23, 2015 at

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Tathagata Das
Do you have the Kafka producer in your classpath? If so, how are you adding that library? Are you running on YARN, Mesos, Standalone, or local? These details will be very useful. On Mon, Jun 22, 2015 at 8:34 AM, Murthy Chelankuri kmurt...@gmail.com wrote: I am using Spark Streaming. What I am trying

Re: Serial batching with Spark Streaming

2015-06-20 Thread Tathagata Das
to failures? If not, please could you suggest workarounds and point me to the code? One more thing was not 100% clear to me from the documentation: Is there exactly 1 RDD published per batch interval in a DStream? On 19 June 2015 at 16:58, Tathagata Das t...@databricks.com wrote: I see
