Re: Spark Streaming: BatchDuration and Processing time

2016-01-18 Thread Ricardo Paiva
If you are using Kafka as the message queue, Spark will process the time slices accordingly, even if it is late, like in your example. But it will eventually fail, due to the fact that your process will ask for a message that is older than the oldest message in Kafka. If your process takes longer than t

Re: Spark Streaming on mesos

2016-01-18 Thread Iulian Dragoș
On Mon, Nov 30, 2015 at 4:09 PM, Renjie Liu wrote: > Hi, Lulian: > Please, it's Iulian, not Lulian. > Are you sure that it'll be a long running process in fine-grained mode? I > think you have a misunderstanding about it. An executor will be launched > for some tasks, but not a long running pr

Re: Spark Streaming: BatchDuration and Processing time

2016-01-17 Thread Silvio Fiorito
It will just queue up the subsequent batches; however, if this delay is constant, you may start losing batches. It can handle spikes in processing time, but if you know you're consistently running over your batch duration you either need to increase the duration or look at enabling back pressure s
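A minimal sketch of enabling back pressure (the property spark.streaming.backpressure.enabled exists from Spark 1.5 onwards; the app name and batch duration are just placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("my-streaming-app")                       // placeholder name
      .set("spark.streaming.backpressure.enabled", "true")  // let Spark throttle the ingestion rate
    val ssc = new StreamingContext(conf, Seconds(10))       // batch duration chosen for illustration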

Re: Spark streaming: Fixed time aggregation & handling driver failures

2016-01-15 Thread Cody Koeninger
You can't really use spark batches as the basis for any kind of reliable time aggregation. Time of batch processing in general has nothing to do with time of event. You need to filter / aggregate by the time interval you care about, in your own code, or use a data store that can do the aggregatio
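A rough sketch of aggregating by event time rather than batch time, assuming each record carries its own timestamp (the Event schema and the 1-minute bucket size are hypothetical):

    // bucket each event by its own timestamp, not by when the batch happened to run
    case class Event(userId: String, timestamp: Long, value: Long)   // hypothetical schema

    val perMinute = events.map { e =>                // events: DStream[Event], assumed
      val bucket = (e.timestamp / 60000L) * 60000L   // truncate to 1-minute intervals
      (bucket, e.value)
    }.reduceByKey(_ + _)                             // sum per event-time interval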

Re: Spark Streaming + Kafka + scala job message read issue

2016-01-15 Thread vivek.meghanathan
bryan.jeff...@gmail.com Cc: duc.was.h...@gmail.com; vivek.meghanat...@wipro.com; user@spark.apache.org Subject: Re: Spark Streaming + Kafka

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Lin Zhao
; Date: Thursday, January 14, 2016 at 4:41 PM To: Lin Zhao Cc: user@spark.apache.org Subject: Re: Spark Streaming: custom actor receiver losing vast majority of data Could you post the code of MessageRetriever? And by the way, could you post t

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Shixiong(Ryan) Zhu
16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 39 > > > From: "Shixiong(Ryan) Zhu" > Date: Thursday, January 14, 2016 at 4:13 PM > To: Lin Zhao > Cc: user > Subject: Re: Spark Streaming: custom actor receiver losing vast majority > of data > > MEMORY_AND_DISK_SER_2 >

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Lin Zhao
l...@exabeam.com Cc: user@spark.apache.org Subject: Re: Spark Streaming: custom actor receiver losing vast majority of data MEMORY_AND_DISK_SER_2

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you change MEMORY_ONLY_SER to MEMORY_AND_DISK_SER_2 and see if this still happens? It may be because you don't have enough memory to cache the events. On Thu, Jan 14, 2016 at 4:06 PM, Lin Zhao wrote: > Hi, > > I'm testing spark streaming with actor receiver. The actor keeps calling > store
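If the receiver is created via ssc.actorStream, the storage level is passed alongside the actor props; a sketch of the suggested change (the actor class and stream name are placeholders):

    import akka.actor.Props
    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK_SER_2 spills blocks to disk (and replicates them) instead of dropping them
    val lines = ssc.actorStream[String](
      Props[MyReceiverActor],              // hypothetical actor implementing the receiver
      "my-receiver",
      StorageLevel.MEMORY_AND_DISK_SER_2)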

Re: Spark streaming routing

2016-01-07 Thread Lin Zhao
a low cpu:memory ratio. From: Tathagata Das Date: Thursday, January 7, 2016 at 1:56 PM To: Lin Zhao Cc: user@spark.apache.org Subject: Re: Spark streaming routing You cannot guarantee that each key will forever be on

Re: Spark streaming routing

2016-01-07 Thread Tathagata Das
You cannot guarantee that each key will forever be on the same executor. That is a flawed approach to designing an application if you have to ensure fault tolerance toward executor failures. On Thu, Jan 7, 2016 at 9:34 AM, Lin Zhao wrote: > I have a need to route the dstream through the strem

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Thanks again Cody for your answer. The idea here is to process all events but only launch the big job (which takes longer than the batch duration) if they are the last events for an id, considering the current state of data. Knowing whether they are the last is my issue, in fact. So I think I need two jobs. One

Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
If your job consistently takes longer than the batch time to process, you will keep lagging longer and longer behind. That's not sustainable, you need to increase batch sizes or decrease processing time. In your case, probably increase batch size, since you're pre-filtering it down to only 1 even

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
The following lines are my understanding of Spark Streaming, AFAIK; I could be wrong: Spark Streaming processes data from a stream in micro-batches, one at a time. When processing takes time, the DStream's RDDs accumulate. So in my case (my processing takes time) the DStream's RDDs accumulate. What I wan

Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
if you don't have hot users, you can use the user id as the hash key for publishing into kafka. That will put all events for a given user in the same partition per batch. Then you can do foreachPartition with a local map to store just a single event per user, e.g. foreachPartition { p => val m =
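A minimal sketch of that pattern, assuming events were published to Kafka keyed by user id so that each partition of a batch holds all of a given user's events (the event type and processing function are hypothetical):

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // keep only the last event seen per user within this partition of the batch
        val latest = scala.collection.mutable.Map[String, MyEvent]()        // MyEvent is a placeholder type
        partition.foreach { case (userId, event) => latest(userId) = event }
        latest.values.foreach(processEvent)                                 // processEvent is a placeholder
      }
    }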

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Thanks for your answer, As I understand it, a consumer that stays caught-up will read every message even with compaction. So for a pure Kafka Spark Streaming It will not be a solution. Perhaps I could reconnect to the Kafka topic after each process to get the last state of events and then compare

Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
Have you read http://kafka.apache.org/documentation.html#compaction On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour wrote: > Context: Process data coming from Kafka and send back results to Kafka. > > Issue: Each events could take several seconds to process (Work in progress > to improve that).

RE: Spark Streaming + Kafka + scala job message read issue

2016-01-05 Thread vivek.meghanathan
Meghanathan (WT01 - NEP) Sent: 27 December 2015 11:08 To: Bryan Cc: Vivek Meghanathan (WT01 - NEP) ; duc.was.h...@gmail.com; user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue Hi Bryan, Yes we are using only 1 thread per topic as we have only one Kafka

Re: Spark Streaming Application is Stuck Under Heavy Load Due to DeadLock

2016-01-04 Thread Shixiong Zhu
Hi Rachana, could you provide the full jstack outputs? Maybe it's the same as https://issues.apache.org/jira/browse/SPARK-11104 Best Regards, Shixiong Zhu 2016-01-04 12:56 GMT-08:00 Rachana Srivastava < rachana.srivast...@markmonitor.com>: > Hello All, > > > > I am running my application on Spark c

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-26 Thread vivek.meghanathan
.@gmail.com> Cc: duc.was.h...@gmail.com; vivek.meghanat...@wipro.com; user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue Hi Brian, PhuDuc, All 8 jobs

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-26 Thread Bryan
...@gmail.com Cc: duc.was.h...@gmail.com; vivek.meghanat...@wipro.com; user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue Hi Brian,PhuDuc, All 8 jobs are consuming 8 different IN topics. 8 different Scala jobs running each topic map mentioned below has only 1 thread

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread vivek.meghanathan
; for Windows 10 phone From: PhuDuc Nguyen <duc.was.h...@gmail.com> Sent: Friday, December 25, 2015 3:35 PM To: vivek.meghanat...@wipro.com Cc: user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan
Agreed. I did not see that they were using the same group name. Sent from Outlook Mail for Windows 10 phone From: PhuDuc Nguyen Sent: Friday, December 25, 2015 3:35 PM To: vivek.meghanat...@wipro.com Cc: user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread PhuDuc Nguyen
> > > > > > Regards, > Vivek M > > *From:* Bryan [mailto:bryan.jeff...@gmail.com] > *Sent:* 24 December 2015 17:20 > *To:* Vivek Meghanathan (WT01 - NEP) ; > user@spark.apache.org > *Subject:* RE: Spark Streaming + Kafka + scala job message read issue > > >

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan
...@wipro.com Sent: Friday, December 25, 2015 2:18 PM To: bryan.jeff...@gmail.com; user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue Any help is highly appreciated, i am completely stuck here.. From: Vivek Meghanathan (WT01 - NEP) Sent: Thursday, December 24

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread vivek.meghanathan
Any help is highly appreciated, I am completely stuck here... From: Vivek Meghanathan (WT01 - NEP) Sent: Thursday, December 24, 2015 7:50 PM To: Bryan; user@spark.apache.org Subject: RE: Spark Streaming + Kafka + scala job message read issue We are using the

Re: Spark Streaming - print accumulators value every period as logs

2015-12-25 Thread Ali Gouta
Something like stream.foreachRDD(rdd => rdd.collect.foreach(print accum)) should answer your question. You get things printed in each batch interval. Ali Gouta On 25 Dec 2015 at 04:22, "Roberto Coluccio" wrote: > Hello, > > I have a batch and a streaming driver using the same functions (Scala). I u
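A slightly more concrete sketch of that idea with a 1.x accumulator, driving the print from foreachRDD so something is logged every batch interval:

    val counter = ssc.sparkContext.accumulator(0L, "events")   // 1.x accumulator API

    stream.foreachRDD { rdd =>
      rdd.foreach(_ => counter += 1L)                 // incremented on the executors
      println(s"running total = ${counter.value}")    // read on the driver, once per batch
    }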

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-24 Thread vivek.meghanathan
anathan (WT01 - NEP) ; user@spark.apache.org Subject: RE: Spark Streaming + Kafka + scala job message read issue Are you using a direct stream consumer, or the older receiver based consumer? If the latter, do the number of partitions you’ve specified for your topic match the number of partitions

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-24 Thread Bryan
Are you using a direct stream consumer, or the older receiver-based consumer? If the latter, does the number of partitions you’ve specified for your topic match the number of partitions in the topic on Kafka? That would be a possible cause – as you might receive all data from a given partition

Re: Spark Streaming 1.5.2+Kafka+Python. Strange reading

2015-12-24 Thread Akhil Das
Would you mind posting the relevant code snippet? Thanks Best Regards On Wed, Dec 23, 2015 at 7:33 PM, Vyacheslav Yanuk wrote: > Hi. > I have very strange situation with direct reading from Kafka. > For example. > I have 1000 messages in Kafka. > After submitting my application I read this data

Re: Spark Streaming 1.5.2+Kafka+Python (docs)

2015-12-23 Thread Cody Koeninger
Read the documentation spark.apache.org/docs/latest/streaming-kafka-integration.html If you still have questions, read the resources linked from https://github.com/koeninger/kafka-exactly-once On Wed, Dec 23, 2015 at 7:24 AM, Vyacheslav Yanuk wrote: > Colleagues > Documents written about create

Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread our...@cnsuning.com
rt() ssc.awaitTermination() Ricky Ou(欧 锐) From: our...@cnsuning.com Date: 2015-12-23 14:19 To: Dean Wampler CC: user; t...@databricks.com Subject: Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list? as the following code mo

Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread our...@cnsuning.com
lean, org.apache.spark.rdd.RDD[(String, Seq[Int])]) val stateDstream = wordDstream.updateStateByKey[Seq[Int]](newUpdateFunc, Ricky Ou(欧 锐) Department: Suning Commerce IT HQ Technology Support R&D Center, Big Data Center, Data Platform Development tel: 18551600418 email: our...@cnsuning.com From: Dean Wampler Date: 2015-12-23 00:46

Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread our...@cnsuning.com
So sorry, it should be Seq, not sql. Thanks for your help. Ricky Ou(欧 锐) From: Dean Wampler Date: 2015-12-23 00:46 To: our...@cnsuning.com CC: user; t...@databricks.com Subject: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list? There are

Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread Dean Wampler
There are ClassTags for Array, List, and Map, as well as for Int, etc. that you might have inside those collections. What do you mean by sql? Could you post more of your code? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) T
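A small sketch of updateStateByKey with a Seq[Int] state, roughly the shape being discussed in this thread (the merge logic is just an example; wordDstream is assumed to be a DStream[(String, Int)] as in the quoted code):

    // append each batch's new counts to the per-key history
    val newUpdateFunc: (Seq[Int], Option[Seq[Int]]) => Option[Seq[Int]] =
      (values, state) => Some(state.getOrElse(Seq.empty[Int]) ++ values)

    val stateDstream = wordDstream.updateStateByKey[Seq[Int]](newUpdateFunc)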

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Saisai Shao
Yes, basically from the current implementation it should be. On Mon, Dec 21, 2015 at 6:39 PM, Arun Patel wrote: > So, Does that mean only one RDD is created by all receivers? > > > > On Sun, Dec 20, 2015 at 10:23 PM, Saisai Shao > wrote: > >> Normally there will be one RDD in each batch. >> >

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Arun Patel
So, Does that mean only one RDD is created by all receivers? On Sun, Dec 20, 2015 at 10:23 PM, Saisai Shao wrote: > Normally there will be one RDD in each batch. > > You could refer to the implementation of DStream#getOrCompute. > > > On Mon, Dec 21, 2015 at 11:04 AM, Arun Patel > wrote: > >>

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-20 Thread Saisai Shao
Normally there will be one RDD in each batch. You could refer to the implementation of DStream#getOrCompute. On Mon, Dec 21, 2015 at 11:04 AM, Arun Patel wrote: > It may be simple question...But, I am struggling to understand this > > DStream is a sequence of RDDs created in a batch window

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-17 Thread Shixiong Zhu
Streaming checkpoint doesn't support Accumulator or Broadcast. See https://issues.apache.org/jira/browse/SPARK-5206 Here is a workaround: https://issues.apache.org/jira/browse/SPARK-5206?focusedCommentId=14506806&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1450680
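The workaround referenced there is a lazily-instantiated singleton that re-creates the accumulator or broadcast after a restart from checkpoint; a rough sketch for a broadcast (the wrapped MyReporterConfig class is hypothetical):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object ReporterHolder {
      @volatile private var instance: Broadcast[MyReporterConfig] = _
      def getInstance(sc: SparkContext): Broadcast[MyReporterConfig] = {
        if (instance == null) synchronized {
          if (instance == null) instance = sc.broadcast(new MyReporterConfig)  // rebuilt after restart
        }
        instance
      }
    }

    // inside foreachRDD / mapPartitions, always go through the holder:
    // val config = ReporterHolder.getInstance(rdd.sparkContext).value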

Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-17 Thread Ted Yu
As far as I can tell, it is not in 1.6.0 RC. You can comment on the JIRA, requesting backport to 1.6.1 Cheers On Thu, Dec 17, 2015 at 5:28 AM, Saiph Kappa wrote: > I am not sure how the process works and if patches are applied to all > upcoming versions of spark. Is it likely that the fix is av

Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-17 Thread Saiph Kappa
I am not sure how the process works and if patches are applied to all upcoming versions of spark. Is it likely that the fix is available in this build (spark 1.6.0 17-Dec-2015 09:02)? http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/ Thanks! On Wed, Dec 16, 2015 at 9:22 P

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-17 Thread Bartłomiej Alberski
I prepared a simple example to help reproduce the problem: https://github.com/alberskib/spark-streaming-broadcast-issue I think that way it will be easier for you to understand the problem and find a solution (if any exists). Thanks, Bartek 2015-12-16 23:34 GMT+01:00 Bartłomiej Alberski : > Fir

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-16 Thread Bartłomiej Alberski
First of all, thanks @tdas for looking into my problem. Yes, I checked it separately and it is working fine. For the below piece of code there is not a single exception and values are sent correctly. val reporter = new MyClassReporter(...) reporter.send(...) val out = new FileOutputStream("
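For reference, a minimal in-memory round-trip check along those lines might look like this (the MyClassReporter constructor arguments are elided exactly as in the thread):

    import java.io._

    val reporter = new MyClassReporter(/* ... */)
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    oos.writeObject(reporter)        // throws here if anything in the object graph isn't serializable
    oos.close()

    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    val restored = ois.readObject().asInstanceOf[MyClassReporter]
    restored.send(/* ... */)         // verify the deserialized copy still works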

Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Ted Yu
Since both scala and java files are involved in the PR, I don't see an easy way around without building yourself. Cheers On Wed, Dec 16, 2015 at 10:18 AM, Saiph Kappa wrote: > Exactly, but it's only fixed for the next spark version. Is there any work > around for version 1.5.2? > > On Wed, Dec

Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Saiph Kappa
Exactly, but it's only fixed for the next spark version. Is there any work around for version 1.5.2? On Wed, Dec 16, 2015 at 4:36 PM, Ted Yu wrote: > This seems related: > [SPARK-10123][DEPLOY] Support specifying deploy mode from configuration > > FYI > > On Wed, Dec 16, 2015 at 7:31 AM, Saiph K

Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Ted Yu
This seems related: [SPARK-10123][DEPLOY] Support specifying deploy mode from configuration FYI On Wed, Dec 16, 2015 at 7:31 AM, Saiph Kappa wrote: > Hi, > > I have a client application running on host0 that is launching multiple > drivers on multiple remote standalone spark clusters (each clus

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-15 Thread Tathagata Das
Could you test serializing and deserializing the MyClassReporter class separately? On Mon, Dec 14, 2015 at 8:57 AM, Bartłomiej Alberski wrote: > Below is the full stacktrace(real names of my classes were changed) with > short description of entries from my code: > > rdd.mapPartitions{ partition

Re: Spark Streaming having trouble writing checkpoint

2015-12-14 Thread Robert Towne
I forgot to include the data node logs for this time period: 2015-12-14 00:14:52,836 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: server51:50010:DataXceiver error processing unknown operation src: /127.0.0.1:39442 dst: /127.0.0.1:50010 java.io.EOFException at java.io.DataInputStream.

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-14 Thread Bartłomiej Alberski
Below is the full stacktrace (real names of my classes were changed) with a short description of entries from my code: rdd.mapPartitions{ partition => //this is the line to which the second stacktrace entry is pointing val sender = broadcastedValue.value // this is the main place to which the first stack

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-14 Thread Ted Yu
Can you show the complete stack trace for the ClassCastException ? Please see the following thread: http://search-hadoop.com/m/q3RTtgEUHVmJA1T1 Cheers On Mon, Dec 14, 2015 at 7:33 AM, alberskib wrote: > Hey all, > > When my streaming application is restarting from failure (from checkpoint) > I

RE: Spark streaming driver java process RSS memory constantly increasing using cassandra driver

2015-12-14 Thread Singh, Abhijeet
could be using the > native memory. We didn't get any pics of JConsole. > > Thanks. > > -Original Message- > From: Conor Fennell [mailto:conorapa...@gmail.com] > Sent: Monday, December 14, 2015 4:15 PM > To: user@spark.apache.org > Subject: Re: Spark streaming

Re: Spark streaming driver java process RSS memory constantly increasing using cassandra driver

2015-12-14 Thread Conor Fennell
Thanks. > > -Original Message- > From: Conor Fennell [mailto:conorapa...@gmail.com] > Sent: Monday, December 14, 2015 4:15 PM > To: user@spark.apache.org > Subject: Re: Spark streaming driver java process RSS memory constantly > increasing using cassandra driver > > Just

RE: Spark streaming driver java process RSS memory constantly increasing using cassandra driver

2015-12-14 Thread Singh, Abhijeet
any pics of JConsole. Thanks. -Original Message- From: Conor Fennell [mailto:conorapa...@gmail.com] Sent: Monday, December 14, 2015 4:15 PM To: user@spark.apache.org Subject: Re: Spark streaming driver java process RSS memory constantly increasing using cassandra driver Just bumping the i

Re: Spark streaming driver java process RSS memory constantly increasing using cassandra driver

2015-12-14 Thread Conor Fennell
Just bumping the issue I am having, if anyone can provide direction? I have been stuck on this for a while now. Thanks, Conor On Fri, Dec 11, 2015 at 5:10 PM, Conor Fennell wrote: > Hi, > > I have a memory leak in the spark driver which is not in the heap or > the non-heap. > Even though neither

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
Yes, it's against master: https://github.com/apache/spark/pull/10256 I'll push the KCL version bump after my local tests finish. On Fri, Dec 11, 2015 at 10:42 AM Nick Pentreath wrote: > Is that PR against master branch? > > S3 read comes from Hadoop / jet3t afaik > > — > Sent from Mailbox

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
Is that PR against master branch? S3 read comes from Hadoop / jet3t afaik — Sent from Mailbox On Fri, Dec 11, 2015 at 5:38 PM, Brian London wrote: > That's good news I've got a PR in to up the SDK version to 1.10.40 and the > KCL to 1.6.1 which I'm running tests on locally now. > Is the

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
That's good news. I've got a PR in to bump the SDK version to 1.10.40 and the KCL to 1.6.1, which I'm running tests on locally now. Is the AWS SDK not used for reading/writing from S3 or do we get that for free from the Hadoop dependencies? On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath wrote: > c

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
cc'ing dev list Ok, looks like when the KCL version was updated in https://github.com/apache/spark/pull/8957, the AWS SDK version was not, probably leading to dependency conflict, though as Burak mentions its hard to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally and on my

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Nick Pentreath
Yeah also the integration tests need to be specifically run - I would have thought the contributor would have run those tests and also tested the change themselves using live Kinesis :( — Sent from Mailbox On Fri, Dec 11, 2015 at 6:18 AM, Burak Yavuz wrote: > I don't think the Kinesis tests

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I don't think the Kinesis tests specifically ran when that was merged into 1.5.2 :( https://github.com/apache/spark/pull/8957 https://github.com/apache/spark/commit/883bd8fccf83aae7a2a847c9a6ca129fac86e6a3 AFAIK pom changes don't trigger the Kinesis tests. Burak On Thu, Dec 10, 2015 at 8:09 PM,

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Nick Pentreath
Yup, also works for me on the master branch as I've been testing DynamoDB Streams integration. In fact it works with the latest KCL 1.6.1 also, which I was using. So the KCL version does seem like it could be the issue - somewhere along the line an exception must be getting swallowed. Though the tests shou

Re: Spark Streaming Shuffle to Disk

2015-12-10 Thread manasdebashiskar
how often do you checkpoint? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Shuffle-to-Disk-tp25567p25682.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Yes, it worked in the 1.6 branch as of commit db5165246f2888537dd0f3d4c5a515875c7358ed. That makes it much less serious of an issue, although it would be nice to know what the root cause is to avoid a regression. On Thu, Dec 10, 2015 at 4:03 PM Burak Yavuz wrote: > I've noticed this happening w

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I've noticed this happening when there was some dependency conflicts, and it is super hard to debug. It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0, but it is 1.2.1 in Spark 1.5.1. I feel like that seems to be the problem... Brian, did you verify that it works with the 1.6.

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Nick's symptoms sound identical to mine. I should mention that I just pulled the latest version from github and it seems to be working there. To reproduce: 1. Download spark 1.5.2 from http://spark.apache.org/downloads.html 2. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTes

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Jean-Baptiste Onofré
Hi Nick, Just to be sure: don't you see some ClassCastException in the log ? Thanks, Regards JB On 12/10/2015 07:56 PM, Nick Pentreath wrote: Could you provide an example / test case and more detail on what issue you're facing? I've just tested a simple program reading from a dev Kinesis stre

Re: Spark Streaming Shuffle to Disk

2015-12-07 Thread Akhil Das
UpdateStateByKey and your batch data could be filling up your executor memory and hence it might be hitting the disk. You can verify it by looking at the memory footprint while your job is running. Looking at the executor logs will also give you a better understanding of what's going on. Thanks Bes

Re: Spark Streaming - controlling Cached table registered in memory created from each RDD of a windowed stream

2015-12-05 Thread manasdebashiskar
Ans1) It is the same table name. Ans2) I think you mean persist(memory) = cache, and unpersist. If your program is caching a dataframe, you unpersist it manually. I think if your cached data structure is not being utilized then the new cache will evict the old one. But if memory is your con

Re: Spark Streaming Specify Kafka Partition

2015-12-04 Thread Cody Koeninger
So createDirectStream will give you a JavaInputDStream of R, where R is the return type you chose for your message handler. If you want a JavaPairInputDStream, you may have to call .mapToPair in order to convert the stream, even if the type you chose for R was already Tuple2 (note that I try to s

Re: Spark Streaming from S3

2015-12-04 Thread Steve Loughran
only one being worked on (we're too scared of breaking s3n), it's the one to try —and complain about if it underperforms -Steve From: Steve Loughran Date: Thursday, December 3, 2015 at 4:12 AM Cc: SPARK-USERS Subject:

Re: Spark Streaming Specify Kafka Partition

2015-12-03 Thread Alan Braithwaite
One quick newbie question since I got another chance to look at this today. We're using java for our spark applications. The createDirectStream we were using previously [1] returns a JavaPairInputDStream, but the createDirectStream with fromOffsets expects an argument recordClass to pass into the

Re: Spark Streaming Running Out Of Memory in 1.5.0.

2015-12-03 Thread Ted Yu
bq. lambda part: save_sets(part, KEY_SET_NAME, Where do you save the part to ? For OutOfMemoryError, the last line was from Utility.scala Anything before that ? Thanks On Thu, Dec 3, 2015 at 11:47 AM, Augustus Hong wrote: > Hi All, > > I'm running Spark Streaming (Python) with Direct Kafka an

Re: Spark Streaming from S3

2015-12-03 Thread Michele Freschi
Hi Steve, I'm on hadoop 2.7.1 using the s3n From: Steve Loughran Date: Thursday, December 3, 2015 at 4:12 AM Cc: SPARK-USERS Subject: Re: Spark Streaming from S3 > On 3 Dec 2015, at 00:42, Michele Freschi wrote: > > Hi all, > > I have an app streaming from s3 (text

Re: Spark Streaming from S3

2015-12-03 Thread Steve Loughran
On 3 Dec 2015, at 00:42, Michele Freschi wrote: Hi all, I have an app streaming from s3 (textFileStream) and recently I've observed increasing delay and long time to list files: INFO dstream.FileInputDStream: Finding new files took 394160 ms ... INFO scheduler.J

Re: Spark Streaming 1.6 accumulating state across batches for joins

2015-12-02 Thread Aris
Please disregard the "window" functions...it turns out that was development code. Everything else is correct. val rawLEFT: DStream[String] = ssc.textFileStream(dirLEFT). window(Seconds(30)) val rawRIGHT: DStream[String] = ssc.textFileStream(dirRIGHT). window(Seconds(30)) should be val rawLEFT

Re: Spark Streaming and JMS

2015-12-02 Thread SamyaMaiti
Hi All, Is there any Pub-Sub for JMS provided by Spark out of box like Kafka? Thanks. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-JMS-tp5371p25548.html Sent from the Apache Spark User List mailing list archive at Nabbl

Re: Spark Streaming - History UI

2015-12-02 Thread patcharee
I meant there is no streaming tab at all. It looks like I need version 1.6. Patcharee On 02 Dec 2015 at 11:34, Steve Loughran wrote: The history UI doesn't update itself for live apps (SPARK-7889) -though I'm working on it Are you trying to view a running streaming job? On 2 Dec 2015, at 05:2

Re: Spark Streaming - History UI

2015-12-02 Thread Steve Loughran
The history UI doesn't update itself for live apps (SPARK-7889) -though I'm working on it Are you trying to view a running streaming job? > On 2 Dec 2015, at 05:28, patcharee wrote: > > Hi, > > On my history server UI, I cannot see "streaming" tab for any streaming jobs? > I am using version

Re: spark streaming count msg in batch

2015-12-01 Thread Gerard Maas
dstream.count() See: http://spark.apache.org/docs/latest/programming-guide.html#actions -kr, Gerard. On Tue, Dec 1, 2015 at 6:32 PM, patcharee wrote: > Hi, > > In spark streaming how to count the total number of message (from Socket) > in one batch? > > Thanks, > Patcharee > >
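That is, count() produces a new DStream with a single element per batch, which can then be printed; a tiny sketch (host and port are placeholders):

    val messages = ssc.socketTextStream("localhost", 9999)   // placeholder socket source
    messages.count().print()                                  // prints the number of records in each batch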

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Cody Koeninger
I actually haven't tried that, since I tend to do the offset lookups if necessary. It's possible that it will work, try it and let me know. Be aware that if you're doing a count() or take() operation directly on the rdd it'll definitely give you the wrong result if you're using -1 for one of the

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Alan Braithwaite
Neat, thanks. If I specify something like -1 as the offset, will it consume from the latest offset or do I have to instrument that manually? - Alan On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger wrote: > Yes, there is a version of createDirectStream that lets you specify > fromOffsets: Map[Top

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Cody Koeninger
Yes, there is a version of createDirectStream that lets you specify fromOffsets: Map[TopicAndPartition, Long] On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite wrote: > Is there any mechanism in the kafka streaming source to specify the exact > partition id that we want a streaming job to consum
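A sketch of that overload against the 1.x Kafka direct API (topic, partition and offset values are placeholders; kafkaParams is assumed to be defined as usual):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 12345L)   // start partition 0 at offset 12345

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))   // messageHandler determines R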

Re: Spark streaming job hangs

2015-12-01 Thread Archit Thakur
Which version of Spark are you running? Have you created a Kafka direct stream? I am asking because you might or might not be using receivers. Also, when you say it hangs, do you mean there is no other log after this and the process is still up? Or do you mean it kept on adding the jobs but did nothing else? (I am op

Re: Spark streaming job hangs

2015-12-01 Thread Paul Leclercq
You might not have enough cores to process data from Kafka > When running a Spark Streaming program locally, do not use “local” or > “local[1]” as the master URL. Either of these means that only one thread > will be used for running tasks locally. If you are using a input DStream > based on a rec
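In other words, when running locally with a receiver-based input DStream, give the master URL at least two threads; a tiny sketch:

    val conf = new SparkConf()
      .setMaster("local[2]")           // >= 2: one thread for the receiver, one for processing
      .setAppName("streaming-job")     // placeholder name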

Re: Spark Streaming on mesos

2015-11-30 Thread Renjie Liu
Hi, Lulian: Are you sure that it'll be a long-running process in fine-grained mode? I think you have a misunderstanding about it. An executor will be launched for some tasks, but not as a long-running process. When a group of tasks finishes, it will get shut down. On Mon, Nov 30, 2015 at 6:25 PM Iulia

Re: Spark Streaming on mesos

2015-11-30 Thread Iulian Dragoș
Hi, Latency isn't as big an issue as it sounds. Did you try it out and fail to meet some performance metrics? In short, the *Mesos* executor on a given slave is going to be long-running (consuming memory, but no CPUs). Each Spark task will be scheduled using Mesos CPU resources, but they don't suffer

Re: Spark Streaming on mesos

2015-11-29 Thread Renjie Liu
Hi, Tim: Fine-grained mode is not suitable for streaming applications since it needs to start up an executor each time. When will the revamp get released? In the coming 1.6.0? On Sun, Nov 29, 2015 at 6:16 PM Timothy Chen wrote: > Hi Renjie, > > You can set number of cores per executor with spark exe

Re: Spark Streaming on mesos

2015-11-29 Thread Timothy Chen
Hi Renjie, You can set the number of cores per executor with spark executor cores in fine-grained mode. If you want coarse-grained mode to support that, it will be supported in the near term as the coarse-grained scheduler is getting revamped now. Tim > On Nov 28, 2015, at 7:31 PM, Renjie Liu wrote: >

Re: Spark Streaming on mesos

2015-11-28 Thread Renjie Liu
Hi, Nagaraj: Thanks for the response, but this does not solve my problem. I think executor memory should be proportional to the number of cores, or the number of cores in each executor should be the same. On Sat, Nov 28, 2015 at 1:48 AM Nagaraj Chandrashekar < nchandrashe...@innominds.com> wrote: > Hi Ren

Re: Spark Streaming on mesos

2015-11-27 Thread Nagaraj Chandrashekar
Hi Renjie, I have not setup Spark Streaming on Mesos but there is something called reservations in Mesos. It supports both Static and Dynamic reservations. Both types of reservations must have role defined. You may want to explore these options. Excerpts from the Apache Mesos documentation.

Re: [Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread Ted Yu
trackStateByKey API is in branch-1.6 FYI On Wed, Nov 25, 2015 at 6:03 AM, Todd Nist wrote: > Perhaps the new trackStateByKey targeted for very 1.6 may help you here. > I'm not sure if it is part of 1.6 or not for sure as the jira does not > specify a fixed version. The jira describing it is he

Re: Spark Streaming idempotent writes to HDFS

2015-11-25 Thread Steve Loughran
On 25 Nov 2015, at 07:01, Michael wrote: so basically writing them into a temporary directory named with the batch time and then move the files to their destination on success ? I wished there was a way to skip moving files around and be able to set the output filena

Re: [Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread Todd Nist
Perhaps the new trackStateByKey targeted for version 1.6 may help you here. I'm not sure if it is part of 1.6, as the jira does not specify a fix version. The jira describing it is here: https://issues.apache.org/jira/browse/SPARK-2629, and the design doc that discusses the API chang

Re: Spark Streaming idempotent writes to HDFS

2015-11-24 Thread Michael
so basically writing them into a temporary directory named with the batch time and then move the files to their destination on success ? I wished there was a way to skip moving files around and be able to set the output filenames. Thanks Burak :) -Michael On Mon, Nov 23, 2015, at 09:19 PM, Bura

Re: Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Burak Yavuz
Not sure if it would be the most efficient, but maybe you can think of the filesystem as a key value store, and write each batch to a sub-directory, where the directory name is the batch time. If the directory already exists, then you shouldn't write it. Then you may have a following batch job that
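A rough sketch of that idea, keying the output directory off the batch time and skipping directories that already exist (the output root is a placeholder):

    import org.apache.hadoop.fs.{FileSystem, Path}

    stream.foreachRDD { (rdd, batchTime) =>
      val dir = s"hdfs:///output/batch-${batchTime.milliseconds}"      // placeholder output root
      val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
      if (!fs.exists(new Path(dir))) {      // already written by an earlier (replayed) run of this batch
        rdd.saveAsTextFile(dir)
      }
    }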

Re: Spark Streaming - stream between 2 applications

2015-11-21 Thread Christian
Instead of sending the results of the one spark app directly to the other one, you could write the results to a Kafka topic which is consumed by your other spark application. On Fri, Nov 20, 2015 at 12:07 PM Saiph Kappa wrote: > I think my problem persists whether I use Kafka or sockets. Or am I

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Cody Koeninger
You're confused about which parts of your code are running on the driver vs the executor, which is why you're getting serialization errors. Read http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Fri, Nov 20, 2015 at 1:07 PM, Saiph Kapp
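Following that design pattern, the producer has to be created on the executors (inside foreachPartition), not on the driver, so nothing non-serializable is shipped in the closure; a hedged sketch using the plain Kafka producer client (broker address, topic and the results stream are placeholders):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    results.foreachRDD { rdd =>                           // results: DStream[String], assumed
      rdd.foreachPartition { partition =>
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")    // placeholder broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)   // created per partition, never serialized
        partition.foreach(msg => producer.send(new ProducerRecord[String, String]("results-topic", msg)))
        producer.close()
      }
    }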

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Saiph Kappa
I think my problem persists whether I use Kafka or sockets. Or am I wrong? How would you use Kafka here? On Fri, Nov 20, 2015 at 7:12 PM, Christian wrote: > Have you considered using Kafka? > > On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa wrote: > >> Hi, >> >> I have a basic spark streaming appl

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Christian
Have you considered using Kafka? On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa wrote: > Hi, > > I have a basic spark streaming application like this: > > « > ... > > val ssc = new StreamingContext(sparkConf, Duration(batchMillis)) > val rawStreams = (1 to numStreams).map(_ => > ssc.rawSocketStrea

Re: spark streaming problem saveAsTextFiles() does not write valid JSON to HDFS

2015-11-19 Thread Andy Davidson
Turns out the data is in Python format. The ETL pipeline was overwriting the original data. Andy From: Andrew Davidson Date: Thursday, November 19, 2015 at 6:58 PM To: "user @spark" Subject: spark streaming problem saveAsTextFiles() does not write valid JSON to HDFS > I am working on a simple POS. I a
