Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Cody Koeninger
Can you post the code including the values of kafkaParams and topicSet, ideally the relevant output of kafka-topics.sh --describe as well On Wed, Jul 29, 2015 at 11:39 PM, Umesh Kacha wrote: > Hi thanks for the response. Like I already mentioned in the question kafka > topic is valid and it has
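
For reference, a minimal Scala sketch of the kind of setup being asked about (the broker list, topic name, and the surrounding ssc are placeholders, not values from the thread):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Placeholder values -- these are exactly what Cody asks the poster to share.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topicSet = Set("my-topic")

    // The direct stream fails with "couldn't find leader offsets for Set()" when the
    // topic/partition metadata cannot be resolved from these brokers.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicSet)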

Re: Spark Streaming Cannot Work On Next Interval

2015-07-30 Thread Himanshu Mehra
Hi Ferriad, Can you share the code? It's hard to judge the problem with this little information. Thank you Regards Himanshu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Cannot-Work-On-Next-Interval-tp24045p24075.html Sent from th

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Umesh Kacha
Hi, thanks for the response. As I already mentioned in the question, the Kafka topic is valid and it has data; I can see the data in it using another Kafka consumer. On Jul 30, 2015 7:31 AM, "Cody Koeninger" wrote: > The last time someone brought this up on the mailing list, the issue > actually was that

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Cody Koeninger
The last time someone brought this up on the mailing list, the issue actually was that the topic(s) didn't exist in Kafka at the time the spark job was running. On Wed, Jul 29, 2015 at 6:17 PM, Tathagata Das wrote: > There is a known issue that Kafka cannot return leader if there is not > da

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Tathagata Das
There is a known issue where Kafka cannot return a leader if there is no data in the topic. I think it was raised in another thread in this forum. Is that the issue? On Wed, Jul 29, 2015 at 10:38 AM, unk1102 wrote: > Hi I have Spark Streaming code which streams from Kafka topic it used to > work >

Re: Spark Streaming

2015-07-29 Thread Gerard Maas
A side question: Any reason why you're using window(Seconds(10), Seconds(10)) instead of new StreamingContext(conf, Seconds(10)) ? Making the micro-batch interval 10 seconds instead of 1 will provide you the same 10-second window with less complexity. Of course, this might just be a test for the w
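
A sketch of the two alternatives being contrasted (conf and lines are assumed to exist elsewhere):

    // Option A: 1-second batches plus a 10-second tumbling window on top.
    val ssc = new StreamingContext(conf, Seconds(1))
    val windowed = lines.window(Seconds(10), Seconds(10))

    // Option B: simply make the batch interval 10 seconds -- the same effective
    // 10-second window with less machinery.
    val ssc2 = new StreamingContext(conf, Seconds(10))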

Re: Spark Streaming Json file groupby function

2015-07-28 Thread Tathagata Das
If you are trying to keep such long term state, it will be more robust in the long term to use a dedicated data store (cassandra/HBase/etc.) that is designed for long term storage. On Tue, Jul 28, 2015 at 4:37 PM, swetha wrote: > > > Hi TD, > > We have a requirement to maintain the user sessio

Re: Spark Streaming Json file groupby function

2015-07-28 Thread swetha
Hi TD, We have a requirement to maintain the user session state and to maintain/update the metrics for minute, day and hour granularities for a user session in our Streaming job. Can I keep those granularities in the state and recalculate each time there is a change? How would the performance

Re: spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Cody Koeninger
You don't have to use some other package in order to get access to the offsets. Shushant, have you read the available documentation at http://spark.apache.org/docs/latest/streaming-kafka-integration.html https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md or watched https:/

Re: spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Dibyendu Bhattacharya
If you want the offsets of individual Kafka messages, you can use this consumer from Spark Packages: http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Tue, Jul 28, 2015 at 6:18 PM, Shushant Arora wrote: > Hi > > I am processing kafka messages using spark str

Re: spark streaming 1.3 issues

2015-07-22 Thread Shushant Arora
In spark streaming 1.3 - Say I have 10 executors, each with 4 cores, so in total 40 tasks in parallel at once. If I repartition the Kafka direct stream to 40 partitions vs. say I have 300 partitions in the Kafka topic - which one will be more efficient? Should I repartition the kafka stream equal to num of co

Re: spark streaming 1.3 coalesce on kafkadirectstream

2015-07-21 Thread Tathagata Das
With DirectKafkaStream there are two approaches. 1. You increase the number of Kafka partitions; Spark will automatically read them in parallel. 2. If that's not possible, then explicitly repartition only if there are more cores in the cluster than the number of Kafka partitions, AND the first map-like st
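
A sketch of option 2 under those conditions (directStream, the core count, and expensiveParse are illustrative assumptions, not from the thread):

    // Topic has only 8 partitions but the cluster has 40 cores available, and the
    // first transformation is expensive -- so pay for one shuffle to use every core.
    val widened = directStream.repartition(40)
    val parsed  = widened.map { case (_, value) => expensiveParse(value) }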

Re: spark streaming 1.3 issues

2015-07-21 Thread Tathagata Das
For Java, do OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges(); (note the .rdd() call). If you fix that error, you should be seeing data. You can call arbitrary RDD operations on a DStream, using DStream.transform. Take a look at the docs. For the direct kafka approach you are doing, - tasks d

Re: spark streaming disk hit

2015-07-21 Thread Abhishek R. Singh
Thanks TD - appreciate the response ! On Jul 21, 2015, at 1:54 PM, Tathagata Das wrote: > Most shuffle files are really kept around in the OS's buffer/disk cache, so > it is still pretty much in memory. If you are concerned about performance, > you have to do a holistic comparison for end-to-e

Re: spark streaming disk hit

2015-07-21 Thread Tathagata Das
Most shuffle files are really kept around in the OS's buffer/disk cache, so it is still pretty much in memory. If you are concerned about performance, you have to do a holistic comparison for end-to-end performance. You could take a look at this. https://spark-summit.org/2015/events/towards-benchm

Re: Spark Streaming Checkpointing solutions

2015-07-21 Thread Emmanuel Fortin
Thank you for your reply. I will consider HDFS for the checkpoint storage. On Tue, Jul 21, 2015 at 17:51, Dean Wampler wrote: > TD's Spark Summit talk offers suggestions ( > https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/). > He recommend

Re: Spark Streaming Checkpointing solutions

2015-07-21 Thread Dean Wampler
TD's Spark Summit talk offers suggestions ( https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/). He recommends using HDFS, because you get the triplicate resiliency it offers, albeit with extra overhead. I believe the driver doesn't need visibility
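
A minimal sketch of what that looks like (the HDFS path is a placeholder):

    // Checkpoint metadata -- and, for stateful operations, RDD data -- goes to a
    // replicated store so a restarted driver can recover it.
    ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints/my-streaming-app")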

Re: spark streaming 1.3 issues

2015-07-21 Thread Akhil Das
I'd suggest upgrading to 1.4 as it has better metrics and UI. Thanks Best Regards On Mon, Jul 20, 2015 at 7:01 PM, Shushant Arora wrote: > Is coalesce not applicable to kafkaStream ? How to do coalesce on > kafkadirectstream its not there in api ? > Shall calling repartition on directstrea

Re: spark streaming 1.3 issues

2015-07-20 Thread Shushant Arora
Is coalesce not applicable to kafkaStream? How do I do coalesce on kafkadirectstream? It's not there in the API. Will calling repartition on the direct stream with the number of executors as numPartitions improve performance? In 1.3, do tasks get launched for partitions which are empty? Does the driver make

Re: Spark streaming Processing time keeps increasing

2015-07-19 Thread N B
Hi TD, Yay! Thanks for the help. That solved our issue of ever-increasing processing time. I added filter functions to all our reduceByKeyAndWindow() operations and now it's been stable for over 2 days already! :-). One small piece of feedback about the API though. The one that accepts the filter function

Re: spark streaming job to hbase write

2015-07-17 Thread Ted Yu
It resorts to the following method for finding region location: private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, boolean useCache, boolean retry, int replicaId) throws IOException { Note: useCache value is true in this call path. Meaning the client

Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
Is this map creation happening on the client side? But how does it know which RS will contain that row key in a put operation without asking the .META. table? Does the HBase client first get the ranges of keys of each region server and then group Put objects by region server? On Fri, Jul 17, 20

Re: spark streaming job to hbase write

2015-07-17 Thread Ted Yu
Internally AsyncProcess uses a Map which is keyed by server name: Map<ServerName, MultiAction<Row>> actionsByServer = new HashMap<ServerName, MultiAction<Row>>(); Here MultiAction would group the Puts in your example which are destined for the same server. Cheers On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora wrote: > Thanks ! > > My key

Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
Thanks! My key is random (hexadecimal), so a hot spot should not be created. Is there any concept of bulk put? Say I want to issue one put request for a batch of 1000 that will hit a region server, instead of an individual put for each key. HTable.put(List) - does this handle batching of puts bas
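
A sketch of batching puts per partition from a streaming job, assuming the HBase 1.x client API (the table name, column family, and the shape of the stream are illustrative):

    import scala.collection.JavaConverters._
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One connection and table handle per partition, not per record.
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("events"))
        val puts = records.map { case (rowKey, value) =>
          new Put(Bytes.toBytes(rowKey))
            .addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(value))
        }.toList
        // Table.put(java.util.List[Put]) lets the client group the puts by region
        // server internally (the AsyncProcess grouping described above).
        table.put(puts.asJava)
        table.close()
        conn.close()
      }
    }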

Re: Spark streaming Processing time keeps increasing

2015-07-17 Thread N B
Hi TD, Thanks for the response. I do believe I understand the concept and the need for the filterfunction now. I made the requisite code changes and keeping it running overnight to see the effect of it. Hopefully this should fix our issue. However, there was one place where I encountered a follow

Re: Spark streaming Processing time keeps increasing

2015-07-17 Thread Tathagata Das
Responses inline. On Thu, Jul 16, 2015 at 9:27 PM, N B wrote: > Hi TD, > > Yes, we do have the invertible function provided. However, I am not sure I > understood how to use the filterFunction. Is there an example somewhere > showing its usage? > > The header comment on the function says : > > *

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread N B
Hi TD, Yes, we do have the invertible function provided. However, I am not sure I understood how to use the filterFunction. Is there an example somewhere showing its usage? The header comment on the function says : * @param filterFunc function to filter expired key-value pairs; *

Re: spark streaming job to hbase write

2015-07-16 Thread Michael Segel
You ask an interesting question… Let's set aside Spark and look at the overall ingestion pattern. It's really an ingestion pattern where your input into the system is from a queue. Are the events discrete or continuous? (This is kinda important.) If the events are continuous then more than

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread Tathagata Das
Make sure you provide the filterFunction with the invertible reduceByKeyAndWindow. Otherwise none of the keys will get removed, and the key space will continue to increase. This is what is leading to the lag. So use the filtering function to filter out the keys that are not needed any more. On Thu, J
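
A sketch of the invertible form with a filter function, roughly what is being suggested (counts and the eviction rule are illustrative; checkpointing must be enabled):

    import org.apache.spark.streaming.{Minutes, Seconds}

    // counts: DStream[(String, Long)] of per-key counts, assumed to exist upstream.
    val windowedCounts = counts.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,        // add counts entering the window
      (a: Long, b: Long) => a - b,        // subtract counts leaving the window
      Minutes(30),                        // window length
      Seconds(10),                        // slide interval
      4,                                  // numPartitions
      (kv: (String, Long)) => kv._2 > 0   // filterFunc: drop keys whose count fell to zero
    )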

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread N B
Thanks Akhil. For doing reduceByKeyAndWindow, one has to have checkpointing enabled. So, yes we do have it enabled. But not Write Ahead Log because we don't have a need for recovery and we do not recover the process state on restart. I don't know if IO Wait fully explains the increasing processing

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread Akhil Das
What is your data volume? Do you have checkpointing/WAL enabled? In that case make sure you have SSD disks, as this behavior is mainly due to IO wait. Thanks Best Regards On Thu, Jul 16, 2015 at 8:43 AM, N B wrote: > Hello, > > We have a Spark streaming application and the problem t

Re: spark streaming job to hbase write

2015-07-15 Thread Todd Nist
There are three connector packages listed on the Spark Packages web site: http://spark-packages.org/?q=hbase HTH. -Todd On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora wrote: > Hi > > I have a requirement of writing in hbase table from Spark streaming app > after some processing. > Is Hbase put o

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
Of course, exactly-once receiving is not the same as exactly once. In the case of the direct kafka stream, the data may actually be pulled multiple times. But even if the data of a batch is pulled twice because of some failure, the final result (that is, transformed data accessed through foreachRDD) will always

Re: Spark Streaming - Inserting into Tables

2015-07-14 Thread Tathagata Das
Why is .remember not ideal? On Sun, Jul 12, 2015 at 7:22 PM, Brandon White wrote: > Hi Yin, > > Yes there were no new rows. I fixed it by doing a .remember on the > context. Obviously, this is not ideal. > > On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai wrote: > >> Hi Brandon, >> >> Can you explai

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
Thanks TD. As for 1), if timing is not guaranteed, how are exactly-once semantics supported? It feels like exactly-once receiving is not necessarily exactly-once processing. Chen On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das wrote: > > > On Tue, Jul 14, 2015 at 6:42 PM, Chen Song wrote: >

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
On Tue, Jul 14, 2015 at 6:42 PM, Chen Song wrote: > Thanks TD and Cody. I saw that. > > 1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets > on HDFS at the end of each batch interval? > The timing is not guaranteed. > 2. In the code, if I first apply transformations and a

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
Thanks TD and Cody. I saw that. 1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets on HDFS at the end of each batch interval? 2. In the code, if I first apply transformations and actions on the directKafkaStream and then use foreachRDD on the original KafkaDStream to commit o

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
Relevant documentation - https://spark.apache.org/docs/latest/streaming-kafka-integration.html, towards the end. directKafkaStream.foreachRDD { rdd => val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges // offsetRanges.length = # of Kafka partitions being consumed ... } On Tue,

Re: spark streaming with kafka reset offset

2015-07-14 Thread Cody Koeninger
You have access to the offset ranges for a given rdd in the stream by typecasting to HasOffsetRanges. You can then store the offsets wherever you need to. On Tue, Jul 14, 2015 at 5:00 PM, Chen Song wrote: > A follow up question. > > When using createDirectStream approach, the offsets are checkp
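
A sketch of that pattern (saveOffsets is a placeholder for whatever store you choose -- ZooKeeper, a database, etc.):

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directKafkaStream.foreachRDD { rdd =>
      // The cast must be done on the RDD obtained directly from the direct stream.
      val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the rdd ...
      offsetRanges.foreach { o =>
        saveOffsets(o.topic, o.partition, o.fromOffset, o.untilOffset)   // placeholder
      }
    }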

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
A follow up question. When using createDirectStream approach, the offsets are checkpointed to HDFS and it is understandable by Spark Streaming job. Is there a way to expose the offsets via a REST api to end users. Or alternatively, is there a way to have offsets committed to Kafka Offset Manager s

Re: spark streaming doubt

2015-07-13 Thread Aniruddh Sharma
Hi Sushant/Cody, For question 1, the following is my understanding (I am not 100% sure and this is only my understanding; I have asked this question in other words to TD for confirmation, which is not confirmed as of now). In accordance with tasks created in proporti

Re: spark streaming doubt

2015-07-13 Thread Shushant Arora
For the second question I am comparing 2 situations of processing a kafkaRDD. Case I - when I used foreachPartition to process the kafka stream, I am not able to see any stream job timing interval like "Time: 142905487 ms" displayed on the driver console at the start of each stream batch. But it processed each

Re: spark streaming doubt

2015-07-13 Thread Cody Koeninger
Regarding your first question, having more partitions than you do executors usually means you'll have better utilization, because the workload will be distributed more evenly. There's some degree of per-task overhead, but as long as you don't have a huge imbalance between number of tasks and numbe

Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Brandon White
Hi Yin, Yes there were no new rows. I fixed it by doing a .remember on the context. Obviously, this is not ideal. On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai wrote: > Hi Brandon, > > Can you explain what did you mean by "It simply does not work"? You did > not see new data files? > > Thanks, > >

Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Yin Huai
Hi Brandon, Can you explain what you meant by "It simply does not work"? You did not see new data files? Thanks, Yin On Fri, Jul 10, 2015 at 11:55 AM, Brandon White wrote: > Why does this not work? Is insert into broken in 1.3.1? It does not throw > any errors, fail, or throw exceptions. I

Re: Spark Streaming and using Swift object store for checkpointing

2015-07-11 Thread algermissen1971
On 10 Jul 2015, at 23:10, algermissen1971 wrote: > Hi, > > initially today when moving my streaming application to the cluster the first > time I ran in to newbie error of using a local file system for checkpointing > and the RDD partition count differences (see exception below). > > Having

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
Thanks for the help. I set --executor-cores and it works now. I had been using --total-executor-cores and didn't realize it had changed. Tathagata Das wrote on Fri, Jul 10, 2015 at 3:11 AM: > 1. There will be a long running job with description "start()" as that is > the jobs that is running the receivers. It will never e

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
I am not sure why you are getting node_local and not process_local. Also, there is probably no better documentation than the configuration page - http://spark.apache.org/docs/latest/configuration.html (search for locality) On Thu, Jul 9, 2015 at 5:51 AM, Michel Hubert wrote: > > > > > > >

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Tathagata Das
1. There will be a long running job with description "start()" as that is the job that runs the receivers. It will never end. 2. You need to set the number of cores given to the Spark executors by the YARN container. That is SparkConf spark.executor.cores, --executor-cores in spark-submit.

Re: spark streaming kafka compatibility

2015-07-09 Thread Cody Koeninger
Yes, it should work, let us know if not. On Thu, Jul 9, 2015 at 11:34 AM, Shushant Arora wrote: > Thanks cody, so is it means if old kafka consumer 0.8.1.1 works with > kafka cluster version 0.8.2 then spark streaming 1.3 should also work? > > I have tested standalone consumer kafka consumer 0

Re: spark streaming kafka compatibility

2015-07-09 Thread Shushant Arora
Thanks Cody, so does that mean that if the old Kafka consumer 0.8.1.1 works with Kafka cluster version 0.8.2, then Spark Streaming 1.3 should also work? I have tested the standalone Kafka consumer 0.8.0 with Kafka cluster 0.8.2 and that works. On Thu, Jul 9, 2015 at 9:58 PM, Cody Koeninger wrote: > I

Re: spark streaming kafka compatibility

2015-07-09 Thread Cody Koeninger
It's the consumer version. Should work with 0.8.2 clusters. On Thu, Jul 9, 2015 at 11:10 AM, Shushant Arora wrote: > Does spark streaming 1.3 requires kafka version 0.8.1.1 and is not > compatible with kafka 0.8.2 ? > > As per maven dependency of spark streaming 1.3 with kafka > > > org.apache.

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Tathagata Das
Do you have enough cores in the configured number of executors in YARN? On Thu, Jul 9, 2015 at 2:29 AM, Bin Wang wrote: > I'm using spark streaming with Kafka, and submit it to YARN cluster with > mode "yarn-cluster". But it hangs at SparkContext.start(). The Kafka config > is right since it c

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
What were the number of cores in the executor? It could be that you had only one core in the executor which did all the 50 tasks serially so 50 task X 15 ms = ~ 1 second. Could you take a look at the task details in the stage page to see when the tasks were added to see whether it explains the 5 se

Re: Spark Streaming broadcast to all keys

2015-07-03 Thread Silvio Fiorito
updateStateByKey will run for all keys, whether they have new data in a batch or not, so you should still be able to use it. On 7/3/15, 7:34 AM, "micvog" wrote: >UpdateStateByKey is useful but what if I want to perform an operation to all >existing keys (not only the ones in this RDD). > >Word
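
A small sketch of that behavior (the stream shape and the eviction rule are illustrative; checkpointing must be enabled):

    // events: DStream[(String, Long)] of (key, increment) pairs.
    // The update function is invoked for every tracked key each batch, even when
    // newValues is empty, so logic such as expiring idle keys runs on all keys.
    val state = events.updateStateByKey[Long] { (newValues: Seq[Long], current: Option[Long]) =>
      val total = current.getOrElse(0L) + newValues.sum
      if (newValues.isEmpty && total == 0L) None    // illustrative: drop idle zero-count keys
      else Some(total)
    }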

Re: Spark streaming on standalone cluster

2015-07-01 Thread Wojciech Pituła
ch conf parameter sets the worker thread count in cluster mode ? is it >> spark.akka.threads ? >> >> >> >> *From:* Tathagata Das [mailto:t...@databricks.com] >> *Sent:* 01 July 2015 01:32 >> *To:* Borja Garrido Bear >> *Cc:* user >> *Subj

Re: Spark streaming on standalone cluster

2015-07-01 Thread Borja Garrido Bear
nt in cluster mode ? is it > spark.akka.threads ? > > > > *From:* Tathagata Das [mailto:t...@databricks.com] > *Sent:* 01 July 2015 01:32 > *To:* Borja Garrido Bear > *Cc:* user > *Subject:* Re: Spark streaming on standalone cluster > > > > How many receivers do

RE: Spark streaming on standalone cluster

2015-07-01 Thread prajod.vettiyattil
...@databricks.com] Sent: 01 July 2015 01:32 To: Borja Garrido Bear Cc: user Subject: Re: Spark streaming on standalone cluster How many receivers do you have in the streaming program? You have to have more numbers of core in reserver by your spar application than the number of receivers. That

Re: Spark streaming on standalone cluster

2015-06-30 Thread Tathagata Das
How many receivers do you have in the streaming program? You have to have more cores reserved by your Spark application than the number of receivers. That would explain receiving output only after stopping. TD On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear wrote: > Hi all, > > I

Re: spark streaming with kafka reset offset

2015-06-30 Thread Cody Koeninger
You can't use different versions of spark in your application vs your cluster. For the direct stream, it's not 60 partitions per executor, it's 300 partitions, and executors work on them as they are scheduled. Yes, if you have no messages you will get an empty partition. It's up to you whether i

Re: spark streaming with kafka reset offset

2015-06-30 Thread Shushant Arora
Is this 3 the number of parallel consumer threads per receiver, meaning in total we have 2*3=6 consumers in the same consumer group consuming from all 300 partitions? Is 3 just parallelism on the same receiver, and is the recommendation to use 1 per receiver since consuming from Kafka is not CPU bound but rather NIC (netwo

Re: spark streaming HDFS file issue

2015-06-29 Thread bit1...@163.com
What do you mean by "new file"? Do you upload an already existing file onto HDFS, or create a new one locally and then upload it to HDFS? bit1...@163.com From: ravi tella Date: 2015-06-30 09:59 To: user Subject: spark streaming HDFS file issue I am running a spark streaming example from learnin
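
For context, a minimal file-stream sketch (the path is a placeholder): Spark Streaming only picks up files that newly appear in the monitored directory, and they should be moved or renamed into it atomically after being fully written; appending to or re-uploading an existing file is not detected.

    val lines = ssc.textFileStream("hdfs://namenode:8020/user/spark/incoming")
    lines.count().print()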

Re: spark streaming with kafka reset offset

2015-06-29 Thread Shushant Arora
1. Here you are basically creating 2 receivers and asking each of them to consume 3 kafka partitions. - In 1.2 we have high-level consumers, so how can we restrict the number of Kafka partitions to consume from? Say I have 300 partitions in the Kafka topic and, as above, I gave 2 receivers and 3 ka

Re: spark streaming with kafka reset offset

2015-06-29 Thread ayan guha
Hi, Let me take a shot at your questions. (I am sure people like Cody and TD will correct me if I am wrong) 0. This is an exact copy from the similar question in a mail thread from Akhil D: Since you set local[4] you will have 4 threads for your computation, and since you are having 2 receivers, you are le

Re: spark streaming with kafka reset offset

2015-06-29 Thread Cody Koeninger
3. You need to use your own method, because you need to set up your job. Read the checkpoint documentation. 4. Yes, if you want to checkpoint, you need to specify a url to store the checkpoint at (s3 or hdfs). Yes, for the direct stream checkpoint it's just offsets, not all the messages. On Sun

Re: spark streaming - checkpoint

2015-06-29 Thread ram kumar
Using yarn-cluster, it works fine. On Mon, Jun 29, 2015 at 12:07 PM, ram kumar wrote: > SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/* > in spark-env.sh > > I think i am facing the same issue > https://issues.apache.org/jira/browse/SPARK-6203 > > > > On Mon, Jun 29, 2015 a

Re: spark streaming - checkpoint

2015-06-28 Thread ram kumar
SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/* in spark-env.sh I think i am facing the same issue https://issues.apache.org/jira/browse/SPARK-6203 On Mon, Jun 29, 2015 at 11:38 AM, ram kumar wrote: > I am using Spark 1.2.0.2.2.0.0-82 (git revision de12451) built for Hadoop

Re: spark streaming with kafka reset offset

2015-06-28 Thread Shushant Arora
A few doubts: In 1.2 streaming, when I use a union of streams, my streaming application sometimes hangs and nothing gets printed on the driver. [Stage 2:> (0 + 2) / 2] What does (0 + 2) / 2 signify here? 1. Does the no of streams in topicsMap.put("testSparkPartitio

Re: spark streaming with kafka reset offset

2015-06-27 Thread Dibyendu Bhattacharya
Hi, There is another option to try: the receiver-based Low Level Kafka Consumer which is part of Spark Packages ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ). This can be used with WAL as well for end-to-end zero data loss. This is also a reliable receiver and commits offsets to

Re: spark streaming with kafka reset offset

2015-06-27 Thread Tathagata Das
In the receiver-based approach, if the receiver crashes for any reason (receiver crashed or executor crashed) the receiver should get restarted on another executor and should start reading data from the offset present in ZooKeeper. There is some chance of data loss, which can be alleviated using Wr

Re: spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-27 Thread Tathagata Das
Could you also provide the code where you set up the Kafka dstream? I don't see it in the snippet. On Fri, Jun 26, 2015 at 2:45 PM, Ashish Nigam wrote: > Here's code - > > def createStreamingContext(checkpointDirectory: String) : > StreamingContext = { > > val conf = new SparkConf().setAppNam

Re: spark streaming - checkpoint

2015-06-27 Thread Tathagata Das
Do you have SPARK_CLASSPATH set in both cases? Before and after checkpoint? If yes, then you should not be using SPARK_CLASSPATH; it has been deprecated since Spark 1.0 because of its ambiguity. Also, where do you have spark.executor.extraClassPath set? I don't see it in the spark-submit command. On

Re: spark streaming with kafka reset offset

2015-06-26 Thread Cody Koeninger
Read the spark streaming guide and the kafka integration guide for a better understanding of how the receiver-based stream works. Capacity planning is specific to your environment and what the job is actually doing; you'll need to determine it empirically. On Friday, June 26, 2015, Shushant Arora

Re: spark streaming with kafka reset offset

2015-06-26 Thread Shushant Arora
In 1.2, how do I handle offset management after the stream application starts in each job? Should I commit offsets manually after job completion? And what is the recommended number of consumer threads? Say I have 300 partitions in the Kafka cluster. Load is ~1 million events per second. Each event is ~500 bytes.

Re: spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-26 Thread Ashish Nigam
Here's code - def createStreamingContext(checkpointDirectory: String) : StreamingContext = { val conf = new SparkConf().setAppName("KafkaConsumer") conf.set("spark.eventLog.enabled", "false") logger.info("Going to init spark context") conf.getOption("spark.master") match {

Re: spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-26 Thread Cody Koeninger
Make sure you're following the docs regarding setting up a streaming checkpoint. Post your code if you can't get it figured out. On Fri, Jun 26, 2015 at 3:45 PM, Ashish Nigam wrote: > I bring up spark streaming job that uses Kafka as input source. > No data to process and then shut it down. And
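
The checkpoint setup in the docs looks roughly like this (the directory is a placeholder; all DStream setup must happen inside the factory function so it is captured by the checkpoint):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("KafkaConsumer")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints/kafka-consumer")
      // create the Kafka dstream and all transformations here, not outside
      ssc
    }

    // Recovers from the checkpoint on restart; calls the factory on first run.
    val ssc = StreamingContext.getOrCreate(
      "hdfs://namenode:8020/user/spark/checkpoints/kafka-consumer", createContext _)
    ssc.start()
    ssc.awaitTermination()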

Re: spark streaming with kafka reset offset

2015-06-26 Thread Cody Koeninger
The receiver-based kafka createStream in spark 1.2 uses zookeeper to store offsets. If you want finer-grained control over offsets, you can update the values in zookeeper yourself before starting the job. createDirectStream in spark 1.3 is still marked as experimental, and subject to change. Tha

RE: Spark Streaming: limit number of nodes

2015-06-24 Thread Evo Eftimov
that limits the number of cores per Executor rather than the total cores for the job and hence will probably not yield the effect you need From: Wojciech Pituła [mailto:w.pit...@gmail.com] Sent: Wednesday, June 24, 2015 10:49 AM To: Evo Eftimov; user@spark.apache.org Subject: Re: Spark
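
A sketch of the distinction (numbers are illustrative; assumes a standalone cluster where spark.executor.cores is honored, i.e. Spark 1.4+): spark.cores.max caps the total cores the application takes from the cluster, while spark.executor.cores caps the cores per executor, and together they bound how many executors the app spreads over.

    val conf = new org.apache.spark.SparkConf()
      .setAppName("limited-app")
      .set("spark.cores.max", "6")        // total cores for the whole application
      .set("spark.executor.cores", "2")   // cores per executor -> at most 6 / 2 = 3 executors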

Re: Spark Streaming: limit number of nodes

2015-06-24 Thread Wojciech Pituła
Ok, thanks. I have 1 worker process on each machine but I would like to run my app on only 3 of them. Is it possible? On Wed, 24 Jun 2015 at 11:44, Evo Eftimov wrote: > There is no direct one to one mapping between Executor and Node > > > > Executor is simply the spark framework term for

RE: Spark Streaming: limit number of nodes

2015-06-24 Thread Evo Eftimov
There is no direct one-to-one mapping between Executor and Node. An Executor is simply the spark framework term for a JVM instance with some spark framework system code running in it. A node is a physical server machine. You can have more than one JVM per node, and vice versa you can hav

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
Thanks a lot. It worked after keeping all versions the same: 1.2.0. On Wed, Jun 24, 2015 at 2:24 AM, Tathagata Das wrote: > Why are you mixing spark versions between streaming and core?? > Your core is 1.2.0 and streaming is 1.4.0. > > On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora > wrote: > >>

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
Why are you mixing spark versions between streaming and core?? Your core is 1.2.0 and streaming is 1.4.0. On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora wrote: > It throws exception for WriteAheadLogUtils after excluding core and > streaming jar. > > Exception in thread "main" java.lang.NoClass

Re: spark streaming with kafka jar missing

2015-06-23 Thread Shushant Arora
It throws exception for WriteAheadLogUtils after excluding core and streaming jar. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/util/WriteAheadLogUtils$ at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:84) at org

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
You must not include spark-core and spark-streaming in the assembly. They are already present in the installation, and the presence of multiple versions of spark may throw off the classloaders in weird ways. So make the assembly marking those dependencies as scope=provided. On Tue, Jun 23, 20

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Tathagata Das
Yes, this is a known behavior. Some static state is not serialized as part of a task. On Tue, Jun 23, 2015 at 10:24 AM, Nipun Arora wrote: > I found the error so just posting on the list. > > It seems broadcast variables cannot be declared static. > If you do you get a null pointer exception. >
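
A common workaround is to hold the broadcast in a lazily initialized singleton instead of a static field, so it can be re-created after (de)serialization or checkpoint recovery. A sketch (the model type is illustrative, and this mirrors the pattern in the streaming programming guide rather than code from this thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object ModelBroadcast {
      @volatile private var instance: Broadcast[Map[String, Double]] = _

      // Re-creates the broadcast on demand instead of relying on a static field,
      // which is not serialized with the task and so shows up as null elsewhere.
      def getInstance(sc: SparkContext, model: => Map[String, Double]): Broadcast[Map[String, Double]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(model)
            }
          }
        }
        instance
      }
    }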

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I found the error so just posting on the list. It seems broadcast variables cannot be declared static. If you do you get a null pointer exception. Thanks Nipun On Tue, Jun 23, 2015 at 11:08 AM, Nipun Arora wrote: > btw. just for reference I have added the code in a gist: > > https://gist.githu

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I don't think I have explicitly check-pointed anywhere. Unless it's internal in some interface, I don't believe the application is checkpointed. Thanks for the suggestion though.. Nipun On Tue, Jun 23, 2015 at 11:05 AM, Benjamin Fradet wrote: > Are you using checkpointing? > > I had a similar

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Benjamin Fradet
Are you using checkpointing? I had a similar issue when recreating a streaming context from checkpoint as broadcast variables are not checkpointed. On 23 Jun 2015 5:01 pm, "Nipun Arora" wrote: > Hi, > > I have a spark streaming application where I need to access a model saved > in a HashMap. > I

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
btw. just for reference I have added the code in a gist: https://gist.github.com/nipunarora/ed987e45028250248edc and a stackoverflow reference here: http://stackoverflow.com/questions/31006490/broadcast-variable-null-pointer-exception-in-spark-streaming On Tue, Jun 23, 2015 at 11:01 AM, Nipun A

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-23 Thread Nipun Arora
Thanks, will try this out and get back... On Tue, Jun 23, 2015 at 2:30 AM, Tathagata Das wrote: > Try adding the provided scopes > > > org.apache.spark > spark-core_2.10 > 1.4.0 > > *provided * > > org.apache.spark > spark-stre

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Wojciech Pituła
I cannot. I've already limited the number of cores to 10, so it gets 5 executors with 2 cores each... On Tue, 23 Jun 2015 at 13:45, Akhil Das wrote: > Use *spark.cores.max* to limit the CPU per job, then you can easily > accommodate your third job also. > > Thanks > Best Regards > > On T

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Akhil Das
Use *spark.cores.max* to limit the CPU per job, then you can easily accommodate your third job also. Thanks Best Regards On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła wrote: > I have set up a small standalone cluster: 5 nodes, every node has 5GB of > memory and 8 cores. As you can see, node doe

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-23 Thread Tathagata Das
Queue stream does not support driver checkpoint recovery since the RDDs in the queue are arbitrarily generated by the user and it's hard for Spark Streaming to keep track of the data in the RDDs (that's necessary for recovering from a checkpoint). Anyway, queue stream is meant for testing and development,

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-22 Thread Tathagata Das
Try adding the provided scopes:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.4.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.4.0</version>
      <scope>provided</scope>
    </dependency>

This prevents these artifacts from being included in the assemb

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-22 Thread Nipun Arora
Hi Tathagata, I am attaching a snapshot of my pom.xml. It would help immensely if I could include max and min values in my mapper phase. The question is still open at: http://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java/30909796#30909796 I see that there is a

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
It's a generated set of shell commands to run (a highly optimized numerical program written in C), which is created from a set of user-provided parameters. The snippet above is: task_outfiles_to_cmds = OrderedDict(run_sieving.leftover_tasks) task_outfiles_to_cmds.update(generate_sieving_t

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
Where does "task_batches" come from? On 22 Jun 2015 4:48 pm, "Shaanan Cohney" wrote: > Thanks, > > I've updated my code to use updateStateByKey but am still getting these > errors when I resume from a checkpoint. > > One thought of mine was that I used sc.parallelize to generate the RDDs > for th

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
Thanks, I've updated my code to use updateStateByKey but am still getting these errors when I resume from a checkpoint. One thought of mine was that I used sc.parallelize to generate the RDDs for the queue, but perhaps on resume, it doesn't recreate the context needed? -- Shaanan Cohney PhD S

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
I would suggest you have a look at the updateStateByKey transformation in the Spark Streaming programming guide which should fit your needs better than your update_state function. On 22 Jun 2015 1:03 pm, "Shaanan Cohney" wrote: > Counts is a list (counts = []) in the driver, used to collect the r

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
Counts is a list (counts = []) in the driver, used to collect the results. It seems like it's also not the best way to be doing things, but I'm new to spark and editing someone else's code so still learning. Thanks! def update_state(out_files, counts, curr_rdd): try: for c in curr_rdd
