July 2015 at 08:19, anshu shukla anshushuk...@gmail.com wrote:
Yes, I was doing the same. If you mean that this is the correct way to do it,
then I will verify it once more in my case.
On Thu, Jul 30, 2015 at 1:02 PM, Tathagata Das t...@databricks.com
wrote:
How is sleep not working? Are you
at end of each batch and if set then
gracefully stop the application ?
On Tue, Aug 11, 2015 at 12:43 AM, Tathagata Das t...@databricks.com
wrote:
In general, it is a little risky to put long running stuff in a shutdown
hook as it may delay shutdown of the process which may delay other things
You are correct. The earlier Kinesis receiver (as of Spark 1.4) was not
saving checkpoints correctly and was in general not reliable (even with WAL
enabled). We have improved this in Spark 1.5 with updated Kinesis receiver,
that keeps track of the Kinesis sequence numbers as part of the Spark
Have you tried to do what it's suggesting?
If you want to learn more about checkpointing, you can see the programming
guide -
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
For more in depth understanding, you can see my talk -
You can use Spark Streaming's updateStateByKey to do arbitrary
sessionization.
See the example -
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
All it does is store a single number (count of each word seeing
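As a rough illustration of the kind of update function updateStateByKey expects, here is a minimal sketch that keeps a running count per key. The object name RunningCount and the wordPairs stream in the closing comment are hypothetical and are not taken from the linked example.

```scala
// Hypothetical sketch; RunningCount is an illustrative name.
object RunningCount {
  // Shape expected by updateStateByKey[Int]: the new values seen for a key in
  // this batch, plus the previous state (None if the key is new).
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (newValues, state) => Some(newValues.sum + state.getOrElse(0))
}

// With Spark on the classpath and a checkpoint directory set, this would be
// plugged in roughly as (wordPairs is an assumed DStream[(String, Int)]):
//   val counts = wordPairs.updateStateByKey[Int](RunningCount.updateFunc)
```

Returning None from the update function instead of Some(...) drops the key from the state, which is how a session could be expired.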
You could very easily strip out the BlockGenerator code from the Spark
source code and use it directly in the same way the Reliable Kafka Receiver
uses it. BTW, you should know that we will be deprecating the receiver
based approach for the Direct Kafka approach. That is quite flexible, can
give
Can you tell us more about your streaming app? Which DStream operations are you
using?
On Sun, Aug 2, 2015 at 9:14 PM, Anand Nalya anand.na...@gmail.com wrote:
Hi,
I'm writing a Streaming application in Spark 1.3. After running for some
time, I'm getting the following exception. I'm sure that no other
:
Tathagata,
Could the bottleneck possibility be the number of executor nodes in our
cluster? Since we are creating 500 DStreams based off 500 textfile
directories, do we need at least 500 executors / nodes to be receivers for
each one of the streams?
On Tue, Jul 28, 2015 at 6:09 PM, Tathagata Das t
Yes, and that is indeed the problem. It is trying to process all the data
in Kafka, and therefore taking 60 seconds. You need to set the rate limits
for that.
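To make the effect of a rate limit concrete, here is a small sketch of the arithmetic, assuming the direct Kafka approach where the cap is set per partition through spark.streaming.kafka.maxRatePerPartition. The object name and the numbers in the closing comment are illustrative assumptions.

```scala
// Hypothetical helper; RateLimit is an illustrative name.
object RateLimit {
  // Upper bound on the records in one batch when the direct Kafka stream is
  // capped at maxRatePerPartition records per second per partition.
  def maxRecordsPerBatch(maxRatePerPartition: Long,
                         kafkaPartitions: Int,
                         batchSeconds: Long): Long =
    maxRatePerPartition * kafkaPartitions * batchSeconds
}

// The cap itself would be set on the SparkConf, for example (value assumed):
//   conf.set("spark.streaming.kafka.maxRatePerPartition", "1000")
```

With 1000 records per second per partition, 4 partitions and a 10 second batch, the first batch after a restart is bounded at 40000 records rather than the whole backlog.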
On Thu, Jul 30, 2015 at 8:51 AM, Cody Koeninger c...@koeninger.org wrote:
If you don't set it, there is no maximum rate, it will get
For the first time it needs to list them. After that the list should be
cached by the file stream implementation (as far as I remember).
On Thu, Jul 30, 2015 at 3:55 PM, Brandon White bwwintheho...@gmail.com
wrote:
Is this a known bottleneck for Spark Streaming textFileStream? Does it
need
You have to read the original Spark paper to understand how RDD lineage
works.
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
On Thu, Jul 30, 2015 at 9:25 PM, Ted Yu yuzhih...@gmail.com wrote:
Please take a look at:
core/src/test/scala/org/apache/spark/CheckpointSuite.scala
I do suggest that non-Spark-related discussions be taken to a different
forum, as they do not directly contribute to the contents of this user
list.
On Thu, Jul 30, 2015 at 8:52 PM, UMESH CHAUDHARY umesh9...@gmail.com
wrote:
Thanks for the valuable suggestion.
I also started with
To clarify, that is the number of executors requested by the SparkContext
from the cluster manager.
On Tue, Jul 28, 2015 at 5:18 PM, amkcom amk...@gmail.com wrote:
try sc.getConf.getInt("spark.executor.instances", 1)
--
bit1...@163.com
*From:* bit1...@163.com
*Date:* 2015-07-31 13:11
*To:* Tathagata Das tathagata.das1...@gmail.com; yuzhihong
yuzhih...@gmail.com
*CC:* user user@spark.apache.org
*Subject:* Re: Re: How RDD lineage works
Thanks TD and Zhihong for the guide
in logic, in my case sleep(t.s.) is not
working.) So I used to kill the coarseGrained job at each slave by script.
Please suggest something.
On Thu, Jul 30, 2015 at 5:14 AM, Tathagata Das t...@databricks.com
wrote:
StreamingContext.stop(stopGracefully = true) stops the streaming context
Rather than using accumulator directly, what you can do is something like
this to lazily create an accumulator and use it (will get lazily recreated
if driver restarts from checkpoint)
dstream.transform { rdd =>
val accum = SingletonObject.getOrCreateAccumulator() // single object
method to
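The singleton mentioned above is not shown in full in this snippet, so the following is only a sketch of one way such a lazily created holder could look. The class name LazySingleton and the CounterHolder usage in the closing comment are assumptions made for illustration.

```scala
// Hypothetical holder; a generic stand-in for the SingletonObject above.
class LazySingleton[T] {
  @volatile private var instance: Option[T] = None

  // Double-checked locking: the factory runs at most once per JVM, so after a
  // driver restart the value is simply created again on first use.
  def getOrCreate(create: => T): T = instance match {
    case Some(value) => value
    case None =>
      synchronized {
        if (instance.isEmpty) instance = Some(create)
        instance.get
      }
  }
}

// Assumed usage with Spark 1.x on the classpath:
//   object CounterHolder extends LazySingleton[Accumulator[Long]]
//   dstream.transform { rdd =>
//     val accum = CounterHolder.getOrCreate(rdd.sparkContext.accumulator(0L, "counter"))
//     ...
//   }
```

Because the accumulator is looked up inside the transform function rather than captured from outside, it is not serialized into the checkpoint and gets rebuilt when the driver recovers.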
.
sc.saveAs) and then run a modified version of the job that skips Stage
0, assuming you have a full understanding of the breakdown of stages in
your job.
On Tue, Jul 28, 2015 at 9:28 PM, Tathagata Das t...@databricks.com
wrote:
Okay, may I am confused on the word would be useful to *restart
There is a known issue that Kafka cannot return leader if there is no data
in the topic. I think it was raised in another thread in this forum. Is
that the issue?
On Wed, Jul 29, 2015 at 10:38 AM, unk1102 umesh.ka...@gmail.com wrote:
Hi I have Spark Streaming code which streams from Kafka
StreamingContext.stop(stopGracefully = true) stops the streaming context
gracefully.
Then you can safely terminate the Spark cluster. They are two different
steps and need to be done separately, ensuring that the driver process has
been completely terminated before the Spark cluster is the
previous batch interval's
offsets..
On Thu, Jul 30, 2015 at 2:49 AM, Tathagata Das t...@databricks.com
wrote:
Rather than using accumulator directly, what you can do is something like
this to lazily create an accumulator and use it (will get lazily recreated
if driver restarts from
(streamTuple._1, SaveMode.Append)
}
}
On Tue, Jul 28, 2015 at 3:42 PM, Tathagata Das t...@databricks.com
wrote:
I don't think anyone has really run 500 text streams.
And parSequences do nothing out there, you are only parallelizing the
setup code which does not really compute anything. Also
I don't think anyone has really run 500 text streams.
And parSequences do nothing out there, you are only parallelizing the setup
code which does not really compute anything. Also it sets up 500 foreachRDD
operations that will get executed in each batch sequentially, so does not
make sense. The
If you are trying to keep such long term state, it will be more robust in
the long term to use a dedicated data store (cassandra/HBase/etc.) that is
designed for long term storage.
On Tue, Jul 28, 2015 at 4:37 PM, swetha swethakasire...@gmail.com wrote:
Hi TD,
We have a requirement to
You have to partition the data in Spark Streaming by the primary key,
and then make sure you insert data into Cassandra atomically per key, or per
set of keys in the partition. You can use the combination of the (batch
time, partition id) of the RDD inside foreachRDD as the unique id for
the
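A minimal sketch of the (batch time, partition id) idea follows. The object name, the string format of the id, and the Spark calls in the closing comment are assumptions for illustration; the message itself does not say how the id should be encoded.

```scala
// Hypothetical helper; BatchWriteId is an illustrative name.
object BatchWriteId {
  // Deterministic id: recomputing the same partition of the same batch gives
  // the same id, so a retried write can be detected or safely overwritten.
  def of(batchTimeMs: Long, partitionId: Int): String =
    s"$batchTimeMs-$partitionId"
}

// Assumed usage inside Spark:
//   dstream.foreachRDD { (rdd, time) =>
//     rdd.foreachPartition { iter =>
//       val id = BatchWriteId.of(time.milliseconds, TaskContext.get.partitionId)
//       // write iter to the store together with id
//     }
//   }
```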
If you are using the same RDDs in both attempts to run the job, the
previous stage outputs generated in the previous job will indeed be reused.
This applies to core though. For dataframes, depending on what you do, the
physical plan may get generated again leading to new RDDs which may
Currently that is the best way.
On Thu, Jul 23, 2015 at 12:51 AM, JoneZhang joyoungzh...@gmail.com wrote:
My Spark Streaming on Kafka application is running in Spark 1.3.
I want to upgrade Spark to 1.4 now.
How should I deal with the Spark Streaming application?
Save the kafka topic partition
under the hood? If the
complete batch is not loaded, will using a custom size of 100-200 requests in
one batch and posting that help, instead of the whole partition?
On Wed, Jul 22, 2015 at 12:18 AM, Tathagata Das t...@databricks.com
wrote:
If you can post multiple items at a time, then use foreachPartition
For Java, do
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
If you fix that error, you should be seeing data.
You can call arbitrary RDD operations on a DStream, using
DStream.transform. Take a look at the docs.
For the direct kafka approach you are doing,
- tasks
With DirectKafkaStream there are two approaches.
1. You increase the number of Kafka partitions; Spark will automatically
read them in parallel.
2. If that's not possible, then explicitly repartition only if there are
more cores in the cluster than the number of Kafka partitions, AND the
first map-like
You could contact the authors of the spark-packages.. maybe that will help?
On Mon, Jul 20, 2015 at 6:41 AM, Jeetendra Gangele gangele...@gmail.com
wrote:
Thanks Todd,
I'm not sure whether somebody has used it or not. Can somebody confirm if
this integrates nicely with Spark Streaming?
On
From what I understand about your code, it is getting data from different
partitions of a topic - get all data from partition 1, then from partition
2, etc. Though you have configured it to read from just one partition
(topicCount has count = 1). So I am not sure what your intention is, read
all
If you can post multiple items at a time, then use foreachPartition to post
the whole partition in a single request.
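As a sketch of what posting per partition could look like, the helper below sends the items of one partition in chunks. A batch size at least as large as the partition gives the single request described above, while a smaller size gives the 100-200 item requests asked about elsewhere in the thread. The helper name and the post callback are hypothetical stand-ins for a real client.

```scala
// Hypothetical helper; BatchedPost and the post callback are illustrative.
object BatchedPost {
  // Sends the items of one partition in chunks of at most batchSize and
  // returns the number of requests that were made.
  def postInBatches[T](items: Iterator[T], batchSize: Int)(post: Seq[T] => Unit): Int = {
    var requests = 0
    items.grouped(batchSize).foreach { chunk =>
      post(chunk)
      requests += 1
    }
    requests
  }
}

// Assumed usage inside Spark (client is a placeholder for a real HTTP client):
//   rdd.foreachPartition { iter =>
//     BatchedPost.postInBatches(iter, 200)(chunk => client.post(chunk))
//   }
```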
On Tue, Jul 21, 2015 at 9:35 AM, Richard Marscher rmarsc...@localytics.com
wrote:
You can certainly create threads in a map transformation. We do this to do
concurrent DB
Most shuffle files are really kept around in the OS's buffer/disk cache, so
it is still pretty much in memory. If you are concerned about performance,
you have to do a holistic comparison for end-to-end performance. You could
take a look at this.
to provide sufficient
information as part of the value so that you can take that decision in
the filter function.
As always, thanks for your help
Nikunj
On Thu, Jul 16, 2015 at 1:16 AM, Tathagata Das t...@databricks.com
wrote:
Make sure you provide the filterFunction with the invertible
Yes. More information in my talk -
https://www.youtube.com/watch?v=d5UJonrruHk
On Fri, Jul 17, 2015 at 1:15 AM, JoneZhang joyoungzh...@gmail.com wrote:
I see now.
There are three steps in Spark Streaming + Kafka data processing
1. Receiving the data
2. Transforming the data
3. Pushing out the
Make sure you provide the filterFunction with the invertible
reduceByKeyAndWindow. Otherwise none of the keys will get removed, and the
key space will continue to increase. This is what is leading to the lag. So
use the filtering function to filter out the keys that are not needed any
more.
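To show why the filter matters, here is a plain-Scala imitation of one slide of an invertible window over counts. It is not Spark's implementation; the object name, the count-based functions and the durations in the closing comment are assumptions for illustration.

```scala
// Hypothetical sketch; WindowedCounts is an illustrative name.
object WindowedCounts {
  val reduceFunc: (Long, Long) => Long = _ + _     // counts entering the window
  val invReduceFunc: (Long, Long) => Long = _ - _  // counts leaving the window
  // Keep a key only while its windowed count is still positive.
  val filterFunc: ((String, Long)) => Boolean = { case (_, count) => count > 0 }

  // One slide of the window: add what entered, subtract what left, then drop
  // the keys the filter rejects so the key space does not grow forever.
  def slide(previous: Map[String, Long],
            entering: Map[String, Long],
            leaving: Map[String, Long]): Map[String, Long] = {
    val keys = previous.keySet ++ entering.keySet
    keys.map { k =>
      val added = reduceFunc(previous.getOrElse(k, 0L), entering.getOrElse(k, 0L))
      k -> invReduceFunc(added, leaving.getOrElse(k, 0L))
    }.filter(filterFunc).toMap
  }
}

// Assumed usage with Spark (pairs is a DStream[(String, Long)], durations made up):
//   pairs.reduceByKeyAndWindow(WindowedCounts.reduceFunc, WindowedCounts.invReduceFunc,
//     Seconds(60), Seconds(10), filterFunc = WindowedCounts.filterFunc)
```

Without the filter step, a key whose count has fallen back to zero would stay in the state for the life of the application.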
On Thu,
Your streaming job may have been seemingly running ok, but the DStream
checkpointing must have been failing in the background. It would have been
visible in the log4j logs. In 1.4.0, we enabled fast-failure for that so
that checkpointing failures don't get hidden in the background.
The fact that
/SPARK-7180 and squashes the following commits:
On Wed, Jul 15, 2015 at 2:37 PM, Chen Song chen.song...@gmail.com wrote:
Thanks
Can you point me to the patch to fix the serialization stack? Maybe I can
pull it in and rerun my job.
Chen
On Wed, Jul 15, 2015 at 4:40 PM, Tathagata Das t
Yep :)
On Tue, Jul 14, 2015 at 2:44 PM, algermissen1971 algermissen1...@icloud.com
wrote:
On 14 Jul 2015, at 23:26, Tathagata Das t...@databricks.com wrote:
Just to be clear, you mean the Spark Standalone cluster manager's
master and not the application's driver, right?
Sorry, by now I
Just to be clear, you mean the Spark Standalone cluster manager's master
and not the application's driver, right?
In that case, the earlier responses are correct.
TD
On Tue, Jul 14, 2015 at 11:26 AM, Mohammed Guller moham...@glassbeam.com
wrote:
The master node does not have to be similar to
[Apologies for repost, for those who have seen this response already in the
dev mailing list]
1. When you set ssc.checkpoint(checkpointDir), Spark Streaming
periodically saves the state RDD (which is a snapshot of all the state
data) to HDFS using RDD checkpointing. In fact, a streaming app
, Chen Song chen.song...@gmail.com wrote:
Thanks TD.
As for 1), if timing is not guaranteed, how are exactly-once semantics
supported? It feels like exactly-once receiving is not necessarily exactly-once
processing.
Chen
On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das t...@databricks.com
This is a known race condition - root cause of SPARK-5681
https://issues.apache.org/jira/browse/SPARK-5681
On Mon, Jul 13, 2015 at 3:35 AM, Juan Rodríguez Hortalá
juan.rodriguez.hort...@gmail.com wrote:
Hi,
I have noticed that when StreamingContext.stop is called when no receiver
has
PM, Tathagata Das t...@databricks.com
wrote:
In the receiver based approach, If the receiver crashes for any
reason (receiver crashed or executor crashed) the receiver should get
restarted on another executor and should start reading data from the
offset
present in the zookeeper
Why is .remember not ideal?
On Sun, Jul 12, 2015 at 7:22 PM, Brandon White bwwintheho...@gmail.com
wrote:
Hi Yin,
Yes there were no new rows. I fixed it by doing a .remember on the
context. Obviously, this is not ideal.
On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai yh...@databricks.com wrote:
I do not recommend using IndexRDD for state management in Spark Streaming.
What it does not solve out-of-the-box is checkpointing of indexRDDs, which is
important because long running streaming jobs can lead to an infinite chain of
RDDs. Spark Streaming solves it for the updateStateByKey operation which
High Level API.
Regards,
Dibyendu
On Sat, Jun 27, 2015 at 3:04 PM, Tathagata Das
t...@databricks.com wrote:
In the receiver based approach, If the receiver crashes for any
reason (receiver crashed or executor crashed) the receiver should
get
restarted on another executor and should
This has been discussed in a number of threads in this mailing list. Here
is a summary.
1. Processing of batch T+1 always starts after all the processing of batch
T has completed. But here a batch is defined by data of all the receivers
running in the system receiving within the batch
You can do this.
// global variables to keep track of the latest batch
var latestTime: Time = _
var latestRDD: RDD[..] = _
dstream.foreachRDD((rdd: RDD[..], time: Time) => {
  latestTime = time
  latestRDD = rdd
})
Now you can asynchronously access the latest RDD. However if you are going
to run jobs on the
It was added, but it's not documented publicly. I am planning to change the
name of the conf to spark.streaming.fileStream.minRememberDuration to make
it easier to understand
On Mon, Jul 13, 2015 at 9:43 PM, Terry Hole hujie.ea...@gmail.com wrote:
A new configuration named
Spark 1.4.0 added shutdown hooks in the driver to cleanly shutdown the
SparkContext in the driver, which would shutdown the executors. I am not
sure whether this is related or not, but somehow the executor's shutdown
hook is being called.
Can you check the driver logs to see if driver's shutdown
spark-submit does a lot of magic configurations (classpaths etc) underneath
the covers to enable pyspark to find Spark JARs, etc. I am not sure how you
can start running things directly from the PyCharm IDE. Others in the
community may be able to answer. For now the main way to run pyspark stuff
Are you talking about reduceByKeyAndWindow with or without inverse reduce?
TD
On Fri, Jul 10, 2015 at 2:07 AM, Imran Alam im...@newscred.com wrote:
We have a streaming job that makes use of reduceByKeyAndWindow
1. There will be a long running job with description start() as that is
the job that is running the receivers. It will never end.
2. You need to set the number of cores given to the Spark executors by the
YARN container. That is SparkConf spark.executor.cores, --executor-cores
in spark-submit.
If you are continuously unioning RDDs, then you are accumulating ever
increasing data, and you are processing ever increasing amount of data in
every batch. Obviously this is going to not last for very long. You
fundamentally cannot keep processing ever increasing amount of data with
finite
Aniruddh
On Thu, Jul 9, 2015 at 4:43 AM, Tathagata Das t...@databricks.com wrote:
There are several levels of indirection going on here, let me clarify.
In the local mode, Spark runs tasks (which includes receivers) using the
number of threads defined in the master (either local, or local[2
Do you have enough cores in the configured number of executors in YARN?
On Thu, Jul 9, 2015 at 2:29 AM, Bin Wang wbi...@gmail.com wrote:
I'm using spark streaming with Kafka, and submit it to YARN cluster with
mode yarn-cluster. But it hangs at SparkContext.start(). The Kafka config
is
What were the number of cores in the executor? It could be that you had
only one core in the executor which did all the 50 tasks serially so 50
task X 15 ms = ~ 1 second.
Could you take a look at the task details in the stage page to see when the
tasks were added to see whether it explains the 5
the same keys that are in myRDD, so
the reduceByKey after union keeps the overall tuple count in myRDD
fixed. Or even with fixed tuple count, it will keep consuming more
resources?
On 9 July 2015 at 16:19, Tathagata Das t...@databricks.com wrote:
If you are continuously unioning RDDs, then you
I am not sure why you are getting node_local and not process_local. Also
there is probably no good documentation other than that configuration
page - http://spark.apache.org/docs/latest/configuration.html (search for
locality)
On Thu, Jul 9, 2015 at 5:51 AM, Michel Hubert
You can use DStream.transform for some stuff. Transform takes an RDD => RDD
function that allows arbitrary RDD operations to be done on RDDs of a
DStream. This function gets evaluated on the driver on every batch
interval. If you are smart about writing the function, it can do different
stuff at
want to know how makers of Spark Streaming think about this
drawback of checkpointing.
If anyone had similar experience, suggestions will be appreciated.
Jong Wook
On 9 July 2015 at 02:13, Tathagata Das t...@databricks.com wrote:
You can use DStream.transform for some stuff. Transform
Responses inline.
On Wed, Jul 8, 2015 at 10:26 AM, Shushant Arora shushantaror...@gmail.com
wrote:
1. Is creating a read-only singleton object in each map function the same
as a broadcast object, since a singleton never gets garbage collected unless
the executor gets shut down? The aim is to avoid creation
There are several levels of indirection going on here, let me clarify.
In the local mode, Spark runs tasks (which includes receivers) using the
number of threads defined in the master (either local, or local[2], or
local[*]).
local or local[1] = single thread, so only one task at a time
local[2]
This is also discussed in the programming guide.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Thanks, Sean.
are you asking about foreach vs
Currently the only way to pause it is to stop it. The way I would do this
is use the Direct Kafka API to access the Kafka offsets, and save them to a
data store as batches finish. If you see a batch job failing because
downstream is down, stop the context. When it comes back up, start a new
This talk may help -
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/
On Tue, Jul 7, 2015 at 9:51 AM, swetha swethakasire...@gmail.com wrote:
Hi,
What happens if the master node fails in the case of Spark Streaming? Would
the data be lost?
When you enable checkpointing by setting the checkpoint directory, you
enable metadata checkpointing. Data checkpointing kicks in only if you are
using a DStream operation that requires it, or you are enabling Write Ahead
Logs to prevent data loss on driver failure.
More discussion -
,
Juan
2015-06-23 21:57 GMT+02:00 Tathagata Das t...@databricks.com:
Aaah, this could potentially be a major issue, as it may prevent metrics from
a restarted streaming context from being published. Can you make it a JIRA?
TD
On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá
Assudani aassud...@impetus.com
wrote:
Also, how do you suggest catching exceptions while using a connector
API like saveAsNewAPIHadoopFiles?
From: amit assudani aassud...@impetus.com
Date: Monday, June 29, 2015 at 9:55 AM
To: Tathagata Das t...@databricks.com
Cc: Cody Koeninger c
Yeah, creating a new producer at the granularity of partitions may not be
that costly.
On Mon, Jul 6, 2015 at 6:40 AM, Cody Koeninger c...@koeninger.org wrote:
Use foreachPartition, and allocate whatever the costly resource is once
per partition.
On Mon, Jul 6, 2015 at 6:11 AM, Shushant
is an operation and another is action but if I call an operation
afterwards, mapPartitions also, which one is more efficient and
recommended?
On Tue, Jul 7, 2015 at 12:21 AM, Tathagata Das t...@databricks.com
wrote:
Yeah, creating a new producer at the granularity of partitions may
...@gmail.com wrote:
I had a similar inquiry, copied below.
I was also looking into making an SQS Receiver reliable:
http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming
Hope this helps.
-- Forwarded message --
From: Tathagata Das t
Could you give more information on the operations that you are using? The
code outline?
And what do you mean by Spark Driver receiver events? If the driver is
receiving events, how are they being sent to the executors?
BTW, for memory usages, I strongly recommend using jmap -histo:live to see
what
Well, the scheduling delay is the time a batch has to wait for getting
resources. So even if there is no backlog in processing and scheduling
delay is 0, there is one batch that is being processed at any point of
time, which explains the difference.
On Tue, Jun 30, 2015 at 2:42 AM,
How many receivers do you have in the streaming program? You have to have
more cores reserved by your Spark application than the number
of receivers. That would explain the receiving output after stopping.
TD
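The rule above can be turned into a quick sanity check. The sketch below reads the thread count from a local master string, following the local, local[N] and local[*] convention, and checks that there are more cores than receivers. The object and method names are hypothetical.

```scala
// Hypothetical helper; ReceiverCores is an illustrative name.
object ReceiverCores {
  // Threads available for a local master string; None if the string is not a
  // local master of the form local, local[N] or local[*].
  def localThreads(master: String): Option[Int] = {
    val Bracketed = """local\[(\d+|\*)\]""".r
    master match {
      case "local"        => Some(1)
      case Bracketed("*") => Some(Runtime.getRuntime.availableProcessors())
      case Bracketed(n)   => Some(n.toInt)
      case _              => None
    }
  }

  // Each receiver occupies one core for the lifetime of the application, so
  // at least one more core is needed to process the received data.
  def canProcess(cores: Int, receivers: Int): Boolean = cores > receivers
}
```

For example, local[2] with one receiver leaves one thread for processing, while local with one receiver leaves none, so data is received but never processed.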
On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear kazebo...@gmail.com
I am guessing one of the two things might work.
1. Either define the pattern SPACE inside the process()
2. Mark the streamingContext field and the inputStream field as transient.
The problem is that a function like PairFunction needs to be serialized
to be sent to the tasks. And the whole closure of
On Mon, Jun 29, 2015 at 8:44 AM, Amit Assudani aassud...@impetus.com
wrote:
Also, how do you suggest catching exceptions while using with connector
API like, saveAsNewAPIHadoopFiles ?
From: amit assudani aassud...@impetus.com
Date: Monday, June 29, 2015 at 9:55 AM
To: Tathagata Das t
Yes, the observation is correct. That connectivity is assumed to be HA.
On Mon, Jun 29, 2015 at 2:34 PM, Amit Assudani aassud...@impetus.com
wrote:
Hi All,
While using Checkpoints ( using HDFS ), if connectivity to hadoop
cluster is lost for a while and gets restored in some time, what
Could you also provide the code where you set up the Kafka dstream? I don't
see it in the snippet.
On Fri, Jun 26, 2015 at 2:45 PM, Ashish Nigam ashnigamt...@gmail.com
wrote:
Here's code -
def createStreamingContext(checkpointDirectory: String) :
StreamingContext = {
val conf = new
= rdd.mapPartitionsWithIndex { (i, kafkaEvent) =>
.
}
I understand that I just need a checkpoint if I need to recover the task
if something goes wrong, right?
2015-06-27 9:39 GMT+02:00 Tathagata Das t...@databricks.com:
How are you trying to execute the code again? From checkpoints
SPARK_CLASSPATH
The code doesn't have any stateful operations, so I guess that it's okay that it
doesn't have a checkpoint. I have executed this code hundreds of times in a VM
from Cloudera and never got this error.
2015-06-27 11:21 GMT+02:00 Tathagata Das t...@databricks.com:
1. you need checkpointing mostly
How are you trying to execute the code again? From checkpoints, or
otherwise?
Also cc'ed Hari who may have a better idea of YARN related issues.
On Sat, Jun 27, 2015 at 12:35 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:
Hi,
I'm executing Spark Streaming code with Kafka. The code was
Could you print the time on the driver (that is, in foreachRDD but before
RDD.foreachPartition) and see if it is behaving weird?
TD
On Fri, Jun 26, 2015 at 3:57 PM, Emrehan Tüzün emrehan.tu...@gmail.com
wrote:
On Fri, Jun 26, 2015 at 12:30 PM, Sea 261810...@qq.com wrote:
Hi, all
I find
In the receiver based approach, If the receiver crashes for any reason
(receiver crashed or executor crashed) the receiver should get restarted on
another executor and should start reading data from the offset present in
the zookeeper. There is some chance of data loss which can be alleviated using
Do you have SPARK_CLASSPATH set in both cases? Before and after checkpoint?
If yes, then you should not be using SPARK_CLASSPATH, it has been
deprecated since Spark 1.0 because of its ambiguity.
Also where do you have spark.executor.extraClassPath set? I don't see it in
the spark-submit command.
it in map, but the whole point here is to handle
something that is breaking in action ). Please help. :(
From: amit assudani aassud...@impetus.com
Date: Friday, June 26, 2015 at 11:41 AM
To: Cody Koeninger c...@koeninger.org
Cc: user@spark.apache.org user@spark.apache.org, Tathagata Das
t
You must not include spark-core and spark-streaming in the assembly. They
are already present in the installation and the presence of multiple
versions of Spark may throw off the classloaders in weird ways. So make the
assembly marking those dependencies as scope=provided.
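The message uses Maven's scope=provided wording. If the assembly is built with sbt instead, the equivalent could look roughly like the fragment below. The version number is an assumption and should match the Spark installation on the cluster.

```scala
// build.sbt fragment (sketch); "1.4.0" is an assumed version.
// "provided" keeps these artifacts on the compile classpath but out of the
// assembly jar, so only the installation's copy of Spark is loaded at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.4.0" % "provided"
)
```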
On Tue, Jun 23,
Aaah, this could potentially be a major issue, as it may prevent metrics from
a restarted streaming context from being published. Can you make it a JIRA?
TD
On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá
juan.rodriguez.hort...@gmail.com wrote:
Hi,
I'm running a program in Spark 1.4 where
Yes, this is a known behavior. Some static stuff are not serialized as part
of a task.
On Tue, Jun 23, 2015 at 10:24 AM, Nipun Arora nipunarora2...@gmail.com
wrote:
I found the error so just posting on the list.
It seems broadcast variables cannot be declared static.
If you do you get a null
This should give accurate count for each batch, though for getting the rate
you have to make sure that your streaming app is stable, that is, batches
are processed as fast as they are received (scheduling delay in the spark
streaming UI is approx 0).
TD
On Tue, Jun 23, 2015 at 2:49 AM, anshu
/build
/project
And when I pass streaming jar using --jar option , it threw
same java.lang.NoClassDefFoundError: org/apache/spark/util/ThreadUtils$.
Thanks
On Wed, Jun 24, 2015 at 1:17 AM, Tathagata Das t...@databricks.com
wrote:
You must not include spark-core and spark-streaming
How are you adding that to the classpath? Through spark-submit or otherwise?
On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri kmurt...@gmail.com
wrote:
Yes I have the producer in the class path. And I am using in standalone
mode.
Sent from my iPhone
On 23-Jun-2015, at 3:31 am, Tathagata
at 12:17 PM, Tathagata Das t...@databricks.com
wrote:
How are you adding that to the classpath? Through spark-submit or
otherwise?
On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri kmurt...@gmail.com
wrote:
Yes I have the producer in the class path. And I am using in standalone
mode
queue stream does not support driver checkpoint recovery since the RDDs in
the queue are arbitrarily generated by the user and it's hard for Spark
Streaming to keep track of the data in the RDDs (that's necessary for
recovering from checkpoint). Anyways queue stream is meant for testing and
ConnectionStateManager: There are no
ConnectionStateListeners registered.
15/06/22 10:44:01 INFO ZooKeeperLeaderElectionAgent: We have lost
leadership
15/06/22 10:44:01 ERROR Master: Leadership has been revoked -- master
shutting down.
On Thu, Jun 11, 2015 at 8:59 PM, Tathagata Das t...@databricks.com
for loading the classes or some thing like that?
On Tue, Jun 23, 2015 at 12:28 PM, Tathagata Das t...@databricks.com
wrote:
So you have Kafka in your classpath in your Java application, where you
are creating the sparkContext with the spark standalone master URL, right?
The recommended way
Please run it in your own application and not in the spark shell. I see
that you are trying to stop the Spark context and create a new
StreamingContext. That will lead to unexpected issues, like the one you are seeing.
Please make a standalone SBT/Maven app for Spark Streaming.
On Tue, Jun 23, 2015 at
Do you have the Kafka producer in your classpath? If so, how are you adding that
library? Are you running on YARN, Mesos, Standalone, or local? These
details will be very useful.
On Mon, Jun 22, 2015 at 8:34 AM, Murthy Chelankuri kmurt...@gmail.com
wrote:
I am using spark streaming. what i am trying
to failures*? If not, please could you suggest workarounds and point
me to the code?
One more thing was not 100% clear to me from the documentation: Is there
exactly *1 RDD* published *per a batch interval* in a DStream?
On 19 June 2015 at 16:58, Tathagata Das t...@databricks.com wrote:
I see