Can you post the code including the values of kafkaParams and topicSet,
ideally the relevant output of kafka-topics.sh --describe as well
On Wed, Jul 29, 2015 at 11:39 PM, Umesh Kacha wrote:
> Hi thanks for the response. Like I already mentioned in the question kafka
> topic is valid and it has
Hi Ferriad,
Can you share the code? because its hard to judge any problem with this
little information.
Thank you
Regards
Himanshu
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Cannot-Work-On-Next-Interval-tp24045p24075.html
Sent from th
Hi thanks for the response. Like I already mentioned in the question kafka
topic is valid and it has data I can see data in it using another kafka
consumer.
On Jul 30, 2015 7:31 AM, "Cody Koeninger" wrote:
> The last time someone brought this up on the mailing list, the issue
> actually was that
The last time someone brought this up on the mailing list, the issue
actually was that the topic(s) didn't exist in Kafka at the time the spark
job was running.
On Wed, Jul 29, 2015 at 6:17 PM, Tathagata Das wrote:
> There is a known issue that Kafka cannot return leader if there is not
> da
There is a known issue that Kafka cannot return leader if there is not data
in the topic. I think it was raised in another thread in this forum. Is
that the issue?
On Wed, Jul 29, 2015 at 10:38 AM, unk1102 wrote:
> Hi I have Spark Streaming code which streams from Kafka topic it used to
> work
>
A side question: Any reason why you're using window(Seconds(10), Seconds(10))
instead of new StreamingContext(conf, Seconds(10)) ?
Making the micro-batch interval 10 seconds instead of 1 will provide you
the same 10-second window with less complexity. Of course, this might just
be a test for the w
If you are trying to keep such long term state, it will be more robust in
the long term to use a dedicated data store (cassandra/HBase/etc.) that is
designed for long term storage.
On Tue, Jul 28, 2015 at 4:37 PM, swetha wrote:
>
>
> Hi TD,
>
> We have a requirement to maintain the user sessio
Hi TD,
We have a requirement to maintain the user session state and to
maintain/update the metrics for minute, day and hour granularities for a
user session in our Streaming job. Can I keep those granularities in the
state and recalculate each time there is a change? How would the performance
You don't have to use some other package in order to get access to the
offsets.
Shushant, have you read the available documentation at
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
or watched
https:/
If you want the offset of individual kafka messages , you can use this
consumer form Spark Packages ..
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Tue, Jul 28, 2015 at 6:18 PM, Shushant Arora
wrote:
> Hi
>
> I am processing kafka messages using spark str
In spark streaming 1.3 -
Say I have 10 executors each with 4 cores so in total 40 tasks in parllel
at once. If I repartition kafka directstream to 40 partitions vs say I have
in kafka topic 300 partitions - which one will be more efficient , Should I
repartition the kafka stream equal to num of co
With DirectKafkaStream there are two approaches.
1. you increase the number of KAfka partitions Spark will automatically
read in parallel
2. if that's not possible, then explicitly repartition only if there are
more cores in the cluster than the number of Kafka partitions, AND the
first map-like st
For Java, do
OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd*.rdd()*).offsetRanges();
If you fix that error, you should be seeing data.
You can call arbitrary RDD operations on a DStream, using
DStream.transform. Take a look at the docs.
For the direct kafka approach you are doing,
- tasks d
Thanks TD - appreciate the response !
On Jul 21, 2015, at 1:54 PM, Tathagata Das wrote:
> Most shuffle files are really kept around in the OS's buffer/disk cache, so
> it is still pretty much in memory. If you are concerned about performance,
> you have to do a holistic comparison for end-to-e
Most shuffle files are really kept around in the OS's buffer/disk cache, so
it is still pretty much in memory. If you are concerned about performance,
you have to do a holistic comparison for end-to-end performance. You could
take a look at this.
https://spark-summit.org/2015/events/towards-benchm
Thank you for your reply. I will consider hdfs for the checkpoint storage.
Le mar. 21 juil. 2015 à 17:51, Dean Wampler a
écrit :
> TD's Spark Summit talk offers suggestions (
> https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/).
> He recommend
TD's Spark Summit talk offers suggestions (
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/).
He recommends using HDFS, because you get the triplicate resiliency it
offers, albeit with extra overhead. I believe the driver doesn't need
visibility
I'd suggest you upgrading to 1.4 as it has better metrices and UI.
Thanks
Best Regards
On Mon, Jul 20, 2015 at 7:01 PM, Shushant Arora
wrote:
> Is coalesce not applicable to kafkaStream ? How to do coalesce on
> kafkadirectstream its not there in api ?
> Shall calling repartition on directstrea
Is coalesce not applicable to kafkaStream ? How to do coalesce on
kafkadirectstream its not there in api ?
Shall calling repartition on directstream with number of executors as
numpartitions will imrove perfromance ?
Does in 1.3 tasks get launched for partitions which are empty? Does driver
makes
Hi TD,
Yay! Thanks for the help. That solved our issue of ever increasing
processing time. I added filter functions to all our reduceByKeyAndWindow()
operations and now its been stable for over 2 days already! :-).
One small feedback about the API though. The one that accepts the filter
function
It resorts to the following method for finding region location:
private RegionLocations locateRegionInMeta(TableName tableName, byte[]
row,
boolean useCache, boolean retry, int replicaId) throws
IOException {
Note: useCache value is true in this call path.
Meaning the client
Is this map creation happening on client side ?
But how does it know which RS will contain that row key in put operation
until asking the .Meta. table .
Does Hbase client first gets that ranges of keys of each Reagionservers
and then group put objects based on Region servers ?
On Fri, Jul 17, 20
Internally AsyncProcess uses a Map which is keyed by server name:
Map> actionsByServer =
new HashMap>();
Here MultiAction would group Put's in your example which are destined for
the same server.
Cheers
On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora
wrote:
> Thanks !
>
> My key
Thanks !
My key is random (hexadecimal). So hot spot should not be created.
Is there any concept of bulk put. Say I want to raise a one put request for
a 1000 size batch which will hit a region server instead of individual put
for each key.
Htable.put(List) Does this handles batching of put bas
Hi TD,
Thanks for the response. I do believe I understand the concept and the need
for the filterfunction now. I made the requisite code changes and keeping
it running overnight to see the effect of it. Hopefully this should fix our
issue.
However, there was one place where I encountered a follow
Responses inline.
On Thu, Jul 16, 2015 at 9:27 PM, N B wrote:
> Hi TD,
>
> Yes, we do have the invertible function provided. However, I am not sure I
> understood how to use the filterFunction. Is there an example somewhere
> showing its usage?
>
> The header comment on the function says :
>
> *
Hi TD,
Yes, we do have the invertible function provided. However, I am not sure I
understood how to use the filterFunction. Is there an example somewhere
showing its usage?
The header comment on the function says :
* @param filterFunc function to filter expired key-value pairs;
*
You ask an interesting question…
Lets set aside spark, and look at the overall ingestion pattern.
Its really an ingestion pattern where your input in to the system is from a
queue.
Are the events discrete or continuous? (This is kinda important.)
If the events are continuous then more than
MAke sure you provide the filterFunction with the invertible
reduceByKeyAndWindow. Otherwise none of the keys will get removed, and the
key space will continue increase. This is what is leading to the lag. So
use the filtering function to filter out the keys that are not needed any
more.
On Thu, J
Thanks Akhil. For doing reduceByKeyAndWindow, one has to have checkpointing
enabled. So, yes we do have it enabled. But not Write Ahead Log because we
don't have a need for recovery and we do not recover the process state on
restart.
I don't know if IO Wait fully explains the increasing processing
What is your data volume? Are you having checkpointing/WAL enabled? In that
case make sure you are having SSD disks as this behavior is mainly due to
the IO wait.
Thanks
Best Regards
On Thu, Jul 16, 2015 at 8:43 AM, N B wrote:
> Hello,
>
> We have a Spark streaming application and the problem t
There are there connector packages listed on spark packages web site:
http://spark-packages.org/?q=hbase
HTH.
-Todd
On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora
wrote:
> Hi
>
> I have a requirement of writing in hbase table from Spark streaming app
> after some processing.
> Is Hbase put o
Of course, exactly once receiving is not same as exactly once. In case of
direct kafka stream, the data may actually be pulled multiple time. But
even if the data of a batch is pulled twice because of some failure, the
final result (that is, transformed data accessed through foreachRDD) will
always
Why is .remember not ideal?
On Sun, Jul 12, 2015 at 7:22 PM, Brandon White
wrote:
> Hi Yin,
>
> Yes there were no new rows. I fixed it by doing a .remember on the
> context. Obviously, this is not ideal.
>
> On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai wrote:
>
>> Hi Brandon,
>>
>> Can you explai
Thanks TD.
As for 1), if timing is not guaranteed, how does exactly once semantics
supported? It feels like exactly once receiving is not necessarily exactly
once processing.
Chen
On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das wrote:
>
>
> On Tue, Jul 14, 2015 at 6:42 PM, Chen Song wrote:
>
On Tue, Jul 14, 2015 at 6:42 PM, Chen Song wrote:
> Thanks TD and Cody. I saw that.
>
> 1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets
> on HDFS at the end of each batch interval?
>
The timing is not guaranteed.
> 2. In the code, if I first apply transformations and a
Thanks TD and Cody. I saw that.
1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets on
HDFS at the end of each batch interval?
2. In the code, if I first apply transformations and actions on the
directKafkaStream and then use foreachRDD on the original KafkaDStream to
commit o
Relevant documentation -
https://spark.apache.org/docs/latest/streaming-kafka-integration.html,
towards the end.
directKafkaStream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges]
// offsetRanges.length = # of Kafka partitions being consumed
...
}
On Tue,
You have access to the offset ranges for a given rdd in the stream by
typecasting to HasOffsetRanges. You can then store the offsets wherever
you need to.
On Tue, Jul 14, 2015 at 5:00 PM, Chen Song wrote:
> A follow up question.
>
> When using createDirectStream approach, the offsets are checkp
A follow up question.
When using createDirectStream approach, the offsets are checkpointed to
HDFS and it is understandable by Spark Streaming job. Is there a way to
expose the offsets via a REST api to end users. Or alternatively, is there
a way to have offsets committed to Kafka Offset Manager s
Hi Sushant/Cody,
For question 1 , following is my understanding ( I am not 100% sure and
this is only my understanding, I have asked this question in another words
to TD for confirmation which is not confirmed as of now).
Following is my understanding. In accordance with tasks created in
proporti
For second question
I am comparing 2 situtations of processing kafkaRDD.
case I - When I used foreachPartition to process kafka stream I am not able
to see any stream job timing interval like Time: 142905487 ms .
displayed on driver console at start of each stream batch. But it processed
each
Regarding your first question, having more partitions than you do executors
usually means you'll have better utilization, because the workload will be
distributed more evenly. There's some degree of per-task overhead, but as
long as you don't have a huge imbalance between number of tasks and numbe
Hi Yin,
Yes there were no new rows. I fixed it by doing a .remember on the context.
Obviously, this is not ideal.
On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai wrote:
> Hi Brandon,
>
> Can you explain what did you mean by "It simply does not work"? You did
> not see new data files?
>
> Thanks,
>
>
Hi Brandon,
Can you explain what did you mean by "It simply does not work"? You did not
see new data files?
Thanks,
Yin
On Fri, Jul 10, 2015 at 11:55 AM, Brandon White
wrote:
> Why does this not work? Is insert into broken in 1.3.1? It does not throw
> any errors, fail, or throw exceptions. I
On 10 Jul 2015, at 23:10, algermissen1971 wrote:
> Hi,
>
> initially today when moving my streaming application to the cluster the first
> time I ran in to newbie error of using a local file system for checkpointing
> and the RDD partition count differences (see exception below).
>
> Having
Thanks for the help. I set --executor-cores and it works now. I've used
--total-executor-cores and don't realize it changed.
Tathagata Das 于2015年7月10日周五 上午3:11写道:
> 1. There will be a long running job with description "start()" as that is
> the jobs that is running the receivers. It will never e
I am not sure why you are getting node_local and not process_local. Also
there is probably not a good documentation other than that configuration
page - http://spark.apache.org/docs/latest/configuration.html (search for
locality)
On Thu, Jul 9, 2015 at 5:51 AM, Michel Hubert
wrote:
>
>
>
>
>
>
>
1. There will be a long running job with description "start()" as that is
the jobs that is running the receivers. It will never end.
2. You need to set the number of cores given to the Spark executors by the
YARN container. That is SparkConf spark.executor.cores, --executor-cores
in spark-submit.
Yes, it should work, let us know if not.
On Thu, Jul 9, 2015 at 11:34 AM, Shushant Arora
wrote:
> Thanks cody, so is it means if old kafka consumer 0.8.1.1 works with
> kafka cluster version 0.8.2 then spark streaming 1.3 should also work?
>
> I have tested standalone consumer kafka consumer 0
Thanks cody, so is it means if old kafka consumer 0.8.1.1 works with kafka
cluster version 0.8.2 then spark streaming 1.3 should also work?
I have tested standalone consumer kafka consumer 0.8.0 with kafka cluster
0.8.2 and that works.
On Thu, Jul 9, 2015 at 9:58 PM, Cody Koeninger wrote:
> I
It's the consumer version. Should work with 0.8.2 clusters.
On Thu, Jul 9, 2015 at 11:10 AM, Shushant Arora
wrote:
> Does spark streaming 1.3 requires kafka version 0.8.1.1 and is not
> compatible with kafka 0.8.2 ?
>
> As per maven dependency of spark streaming 1.3 with kafka
>
>
> org.apache.
Do you have enough cores in the configured number of executors in YARN?
On Thu, Jul 9, 2015 at 2:29 AM, Bin Wang wrote:
> I'm using spark streaming with Kafka, and submit it to YARN cluster with
> mode "yarn-cluster". But it hangs at SparkContext.start(). The Kafka config
> is right since it c
What were the number of cores in the executor? It could be that you had
only one core in the executor which did all the 50 tasks serially so 50
task X 15 ms = ~ 1 second.
Could you take a look at the task details in the stage page to see when the
tasks were added to see whether it explains the 5 se
updateStateByKey will run for all keys, whether they have new data in a batch
or not so you should be able to still use it.
On 7/3/15, 7:34 AM, "micvog" wrote:
>UpdateStateByKey is useful but what if I want to perform an operation to all
>existing keys (not only the ones in this RDD).
>
>Word
ch conf parameter sets the worker thread count in cluster mode ? is it
>> spark.akka.threads ?
>>
>>
>>
>> *From:* Tathagata Das [mailto:t...@databricks.com]
>> *Sent:* 01 July 2015 01:32
>> *To:* Borja Garrido Bear
>> *Cc:* user
>> *Subj
nt in cluster mode ? is it
> spark.akka.threads ?
>
>
>
> *From:* Tathagata Das [mailto:t...@databricks.com]
> *Sent:* 01 July 2015 01:32
> *To:* Borja Garrido Bear
> *Cc:* user
> *Subject:* Re: Spark streaming on standalone cluster
>
>
>
> How many receivers do
...@databricks.com]
Sent: 01 July 2015 01:32
To: Borja Garrido Bear
Cc: user
Subject: Re: Spark streaming on standalone cluster
How many receivers do you have in the streaming program? You have to have more
numbers of core in reserver by your spar application than the number of
receivers. That
How many receivers do you have in the streaming program? You have to have
more numbers of core in reserver by your spar application than the number
of receivers. That would explain the receiving output after stopping.
TD
On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear
wrote:
> Hi all,
>
> I
You can't use different versions of spark in your application vs your
cluster.
For the direct stream, it's not 60 partitions per executor, it's 300
partitions, and executors work on them as they are scheduled. Yes, if you
have no messages you will get an empty partition. It's up to you whether
i
Is this 3 is no of parallel consumer threads per receiver , means in total
we have 2*3=6 consumer in same consumer group consuming from all 300
partitions.
3 is just parallelism on same receiver and recommendation is to use 1 per
receiver since consuming from kafka is not cpu bound rather NIC(netwo
What do you mean by "new file", do you upload an already existing file onto
HDFS or create a new one locally and then upload it to HDFS?
bit1...@163.com
From: ravi tella
Date: 2015-06-30 09:59
To: user
Subject: spark streaming HDFS file issue
I am running a spark streaming example from learnin
1. Here you are basically creating 2 receivers and asking each of them to
consume 3 kafka partitions each.
- In 1.2 we have high level consumers so how can we restrict no of kafka
partitions to consume from? Say I have 300 kafka partitions in kafka topic
and as in above I gave 2 receivers and 3 ka
Hi
Let me take ashot at your questions. (I am sure people like Cody and TD
will correct if I am wrong)
0. This is exact copy from the similar question in mail thread from Akhil D:
Since you set local[4] you will have 4 threads for your computation, and
since you are having 2 receivers, you are le
3. You need to use your own method, because you need to set up your job.
Read the checkpoint documentation.
4. Yes, if you want to checkpoint, you need to specify a url to store the
checkpoint at (s3 or hdfs). Yes, for the direct stream checkpoint it's
just offsets, not all the messages.
On Sun
on using yarn-cluster, it works good
On Mon, Jun 29, 2015 at 12:07 PM, ram kumar wrote:
> SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*
> in spark-env.sh
>
> I think i am facing the same issue
> https://issues.apache.org/jira/browse/SPARK-6203
>
>
>
> On Mon, Jun 29, 2015 a
SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*
in spark-env.sh
I think i am facing the same issue
https://issues.apache.org/jira/browse/SPARK-6203
On Mon, Jun 29, 2015 at 11:38 AM, ram kumar wrote:
> I am using Spark 1.2.0.2.2.0.0-82 (git revision de12451) built for Hadoop
Few doubts :
In 1.2 streaming when I use union of streams , my streaming application
getting hanged sometimes and nothing gets printed on driver.
[Stage 2:>
(0 + 2) / 2]
Whats is 0+2/2 here signifies.
1.Does no of streams in topicsMap.put("testSparkPartitio
Hi,
There is another option to try for Receiver Based Low Level Kafka Consumer
which is part of Spark-Packages (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) . This can
be used with WAL as well for end to end zero data loss.
This is also Reliable Receiver and Commit offset to
In the receiver based approach, If the receiver crashes for any reason
(receiver crashed or executor crashed) the receiver should get restarted on
another executor and should start reading data from the offset present in
the zookeeper. There is some chance of data loss which can alleviated using
Wr
Could you also provide the code where you set up the Kafka dstream? I dont
see it in the snippet.
On Fri, Jun 26, 2015 at 2:45 PM, Ashish Nigam
wrote:
> Here's code -
>
> def createStreamingContext(checkpointDirectory: String) :
> StreamingContext = {
>
> val conf = new SparkConf().setAppNam
Do you have SPARK_CLASSPATH set in both cases? Before and after checkpoint?
If yes, then you should not be using SPARK_CLASSPATH, it has been
deprecated since Spark 1.0 because of its ambiguity.
Also where do you have spark.executor.extraClassPath set? I dont see it in
the spark-submit command.
On
Read the spark streaming guide ad the kafka integration guide for a better
understanding of how the receiver based stream works.
Capacity planning is specific to your environment and what the job is
actually doing, youll need to determine it empirically.
On Friday, June 26, 2015, Shushant Arora
In 1.2 how to handle offset management after stream application starts in
each job . I should commit offset after job completion manually?
And what is recommended no of consumer threads. Say I have 300 partitions
in kafka cluster . Load is ~ 1 million events per second.Each event is of
~500bytes.
Here's code -
def createStreamingContext(checkpointDirectory: String) : StreamingContext
= {
val conf = new SparkConf().setAppName("KafkaConsumer")
conf.set("spark.eventLog.enabled", "false")
logger.info("Going to init spark context")
conf.getOption("spark.master") match {
Make sure you're following the docs regarding setting up a streaming
checkpoint.
Post your code if you can't get it figured out.
On Fri, Jun 26, 2015 at 3:45 PM, Ashish Nigam
wrote:
> I bring up spark streaming job that uses Kafka as input source.
> No data to process and then shut it down. And
The receiver-based kafka createStream in spark 1.2 uses zookeeper to store
offsets. If you want finer-grained control over offsets, you can update
the values in zookeeper yourself before starting the job.
createDirectStream in spark 1.3 is still marked as experimental, and
subject to change. Tha
that limits the number of cores
per Executor rather than the total cores for the job and hence will probably
not yield the effect you need
From: Wojciech Pituła [mailto:w.pit...@gmail.com]
Sent: Wednesday, June 24, 2015 10:49 AM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Spark
Ok, thanks. I have 1 worker process on each machine but I would like to run
my app on only 3 of them. Is it possible?
śr., 24.06.2015 o 11:44 użytkownik Evo Eftimov
napisał:
> There is no direct one to one mapping between Executor and Node
>
>
>
> Executor is simply the spark framework term for
There is no direct one to one mapping between Executor and Node
Executor is simply the spark framework term for JVM instance with some spark
framework system code running in it
A node is a physical server machine
You can have more than one JVM per node
And vice versa you can hav
Thanks a lot. It worked after keeping all versions to same.1.2.0
On Wed, Jun 24, 2015 at 2:24 AM, Tathagata Das wrote:
> Why are you mixing spark versions between streaming and core??
> Your core is 1.2.0 and streaming is 1.4.0.
>
> On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora > wrote:
>
>>
Why are you mixing spark versions between streaming and core??
Your core is 1.2.0 and streaming is 1.4.0.
On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora
wrote:
> It throws exception for WriteAheadLogUtils after excluding core and
> streaming jar.
>
> Exception in thread "main" java.lang.NoClass
It throws exception for WriteAheadLogUtils after excluding core and
streaming jar.
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/streaming/util/WriteAheadLogUtils$
at
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:84)
at
org
You must not include spark-core and spark-streaming in the assembly. They
are already present in the installation and the presence of multiple
versions of spark may throw off the classloaders in weird ways. So make the
assembly marking the those dependencies as scope=provided.
On Tue, Jun 23, 20
Yes, this is a known behavior. Some static stuff are not serialized as part
of a task.
On Tue, Jun 23, 2015 at 10:24 AM, Nipun Arora
wrote:
> I found the error so just posting on the list.
>
> It seems broadcast variables cannot be declared static.
> If you do you get a null pointer exception.
>
I found the error so just posting on the list.
It seems broadcast variables cannot be declared static.
If you do you get a null pointer exception.
Thanks
Nipun
On Tue, Jun 23, 2015 at 11:08 AM, Nipun Arora
wrote:
> btw. just for reference I have added the code in a gist:
>
> https://gist.githu
I don't think I have explicitly check-pointed anywhere. Unless it's
internal in some interface, I don't believe the application is checkpointed.
Thanks for the suggestion though..
Nipun
On Tue, Jun 23, 2015 at 11:05 AM, Benjamin Fradet wrote:
> Are you using checkpointing?
>
> I had a similar
Are you using checkpointing?
I had a similar issue when recreating a streaming context from checkpoint
as broadcast variables are not checkpointed.
On 23 Jun 2015 5:01 pm, "Nipun Arora" wrote:
> Hi,
>
> I have a spark streaming application where I need to access a model saved
> in a HashMap.
> I
btw. just for reference I have added the code in a gist:
https://gist.github.com/nipunarora/ed987e45028250248edc
and a stackoverflow reference here:
http://stackoverflow.com/questions/31006490/broadcast-variable-null-pointer-exception-in-spark-streaming
On Tue, Jun 23, 2015 at 11:01 AM, Nipun A
Thanks, will try this out and get back...
On Tue, Jun 23, 2015 at 2:30 AM, Tathagata Das wrote:
> Try adding the provided scopes
>
>
> org.apache.spark
> spark-core_2.10
> 1.4.0
>
> *provided *
>
> org.apache.spark
> spark-stre
I can not. I've already limited the number of cores to 10, so it gets 5
executors with 2 cores each...
wt., 23.06.2015 o 13:45 użytkownik Akhil Das
napisał:
> Use *spark.cores.max* to limit the CPU per job, then you can easily
> accommodate your third job also.
>
> Thanks
> Best Regards
>
> On T
Use *spark.cores.max* to limit the CPU per job, then you can easily
accommodate your third job also.
Thanks
Best Regards
On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła wrote:
> I have set up small standalone cluster: 5 nodes, every node has 5GB of
> memory an 8 cores. As you can see, node doe
queue stream does not support driver checkpoint recovery since the RDDs in
the queue are arbitrary generated by the user and its hard for Spark
Streaming to keep track of the data in the RDDs (thats necessary for
recovering from checkpoint). Anyways queue stream is meant of testing and
development,
Try adding the provided scopes
org.apache.spark
spark-core_2.10
1.4.0
*provided *
org.apache.spark
spark-streaming_2.10
1.4.0
*provided *
This prevents these artifacts from being included in the assemb
Hi Tathagata,
I am attaching a snapshot of my pom.xml. It would help immensely, if I can
include max, and min values in my mapper phase.
The question is still open at :
http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796
I see that there is a
It's a generated set of shell commands to run (written in C, highly
optimized numerical computer), which is create from a set of user provided
parameters.
The snippet above is:
task_outfiles_to_cmds = OrderedDict(run_sieving.leftover_tasks)
task_outfiles_to_cmds.update(generate_sieving_t
Where does "task_batches" come from?
On 22 Jun 2015 4:48 pm, "Shaanan Cohney" wrote:
> Thanks,
>
> I've updated my code to use updateStateByKey but am still getting these
> errors when I resume from a checkpoint.
>
> One thought of mine was that I used sc.parallelize to generate the RDDs
> for th
Thanks,
I've updated my code to use updateStateByKey but am still getting these
errors when I resume from a checkpoint.
One thought of mine was that I used sc.parallelize to generate the RDDs for
the queue, but perhaps on resume, it doesn't recreate the context needed?
--
Shaanan Cohney
PhD S
I would suggest you have a look at the updateStateByKey transformation in
the Spark Streaming programming guide which should fit your needs better
than your update_state function.
On 22 Jun 2015 1:03 pm, "Shaanan Cohney" wrote:
> Counts is a list (counts = []) in the driver, used to collect the r
Counts is a list (counts = []) in the driver, used to collect the results.
It seems like it's also not the best way to be doing things, but I'm new to
spark and editing someone else's code so still learning.
Thanks!
def update_state(out_files, counts, curr_rdd):
try:
for c in curr_rdd
901 - 1000 of 1784 matches
Mail list logo