Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Shixiong(Ryan) Zhu
Did you enable "spark.speculation"? On Tue, Jan 5, 2016 at 9:14 AM, Prasad Ravilla wrote: > I am using Spark 1.5.2. > > I am not using Dynamic allocation. > > Thanks, > Prasad. > > > > > On 1/5/16, 3:24 AM, "Ted Yu" wrote: > > >Which version of Spark do

Re: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Shixiong(Ryan) Zhu
Hey Rachana, There are actually two jobs in your code: `rdd.isEmpty` and `rdd.saveAsTextFile`. Since you don't cache or checkpoint this RDD, it will execute your map function twice for each record. You can move "accum.add(1)" to "rdd.saveAsTextFile" like this: JavaDStream lines =

Re: How to register a Tuple3 with KryoSerializer?

2015-12-30 Thread Shixiong(Ryan) Zhu
You can use "_", e.g., sparkConf.registerKryoClasses(Array(classOf[scala.Tuple3[_, _, _]])) Best Regards, Shixiong(Ryan) Zhu Software Engineer Databricks Inc. shixi...@databricks.com databricks.com <http://databricks.com/> On Wed, Dec 30, 2015 at 10:16 AM, Russ <russ.br..

Re: Spark Streaming : Output frequency different from read frequency in StatefulNetworkWordCount

2015-12-30 Thread Shixiong(Ryan) Zhu
You can use "reduceByKeyAndWindow", e.g., val lines = ssc.socketTextStream("localhost", ) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(60), Seconds(60)) wordCounts.print() On Wed, Dec

Re: Kryo serializer Exception during serialization: java.io.IOException: java.lang.IllegalArgumentException:

2016-01-08 Thread Shixiong(Ryan) Zhu
Could you disable `spark.kryo.registrationRequired`? Some classes may not be registered but they work well with Kryo's default serializer. On Fri, Jan 8, 2016 at 8:58 AM, Ted Yu wrote: > bq. try adding scala.collection.mutable.WrappedArray > > But the hint said registering

Re: Too many tasks killed the scheduler

2016-01-11 Thread Shixiong(Ryan) Zhu
Could you use "coalesce" to reduce the number of partitions? Shixiong Zhu On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue wrote: > Here is more info. > > The job stuck at: > INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks > > Then got the error: > Caused

Re: [Spark 1.6][Streaming] About the behavior of mapWithState

2016-01-15 Thread Shixiong(Ryan) Zhu
Hey Terry, That's expected. If you want to only output (1, 3), you can use "reduceByKey" before "mapWithState" like this: dstream.reduceByKey(_ + _).mapWithState(spec) On Fri, Jan 15, 2016 at 1:21 AM, Terry Hoo wrote: > Hi, > I am doing a simple test with mapWithState,

Re: RE: RE: RE: spark streaming context trigger invoke stop why?

2016-01-15 Thread Shixiong(Ryan) Zhu
> allKafkaWindowData = > this.sparkReceiverDStream.createReceiverDStream(jsc,this.streamingConf.getWindowDuration(), > > this.streamingConf.getSlideDuration()); > > > > this.businessProcess(allKafkaWindowData); > > this.sleep(); > >jsc.start(); &

Re: Calling SparkContext methods in scala Future

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey Marco, Since the code in the Future runs asynchronously, you cannot call "sparkContext.stop" at the end of "fetch" because the code in the Future may not have finished. However, the exception seems weird. Do you have a simple reproducer? On Mon, Jan 18, 2016 at 9:13 AM, Ted Yu

Re: Spark Streaming: Does mapWithState implicitly partition the dsteram?

2016-01-18 Thread Shixiong(Ryan) Zhu
mapWithState uses HashPartitioner by default. You can use "StateSpec.partitioner" to set your custom partitioner. On Sun, Jan 17, 2016 at 11:00 AM, Lin Zhao wrote: > When the state is passed to the task that handles a mapWithState for a > particular key, if the key is

Re: Incorrect timeline for Scheduling Delay in Streaming page in web UI?

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey, did you mean that the scheduling delay timeline is incorrect because it's too short and some values are missing? A batch won't have a scheduling delay until it starts to run. In your example, a lot of batches are still waiting, so they don't have a scheduling delay yet. On Sun, Jan 17, 2016 at

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you change MEMORY_ONLY_SER to MEMORY_AND_DISK_SER_2 and see if this still happens? It may be because you don't have enough memory to cache the events. On Thu, Jan 14, 2016 at 4:06 PM, Lin Zhao wrote: > Hi, > > I'm testing spark streaming with actor receiver. The actor

Re: How to bind webui to localhost?

2016-01-14 Thread Shixiong(Ryan) Zhu
Yeah, it's hard-coded as "0.0.0.0". Could you send a PR to add a configuration for it? On Thu, Jan 14, 2016 at 2:51 PM, Zee Chen wrote: > Hi, what is the easiest way to configure the Spark webui to bind to > localhost or 127.0.0.1? I intend to use this with ssh socks proxy to >

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Shixiong(Ryan) Zhu
:31:31 INFO storage.BlockManager: Removing RDD 44 > 16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 43 > 16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 42 > 16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 41 > 16/01/15 00:31:31 INFO storage.BlockMana

Re: NPE when using Joda DateTime

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you try to use "Kryo.setDefaultSerializer" like this: class YourKryoRegistrator extends KryoRegistrator { override def registerClasses(kryo: Kryo) { kryo.setDefaultSerializer(classOf[com.esotericsoftware.kryo.serializers.JavaSerializer]) } } On Thu, Jan 14, 2016 at 12:54 PM, Durgesh

Re: RE: RE: spark streaming context trigger invoke stop why?

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you show your code? Did you use `StreamingContext.awaitTermination`? If so, it will return if any exception happens. On Wed, Jan 13, 2016 at 11:47 PM, Triones,Deng(vip.com) < triones.d...@vipshop.com> wrote: > What’s more, I am running a 7*24 hours job , so I won’t call System.exit() > by

Re: RE: RE: RE: RE: spark streaming context trigger invoke stop why?

2016-01-18 Thread Shixiong(Ryan) Zhu
oint(RDD.scala:300) > > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > > at org.a

Re: Spark Streaming : Limiting number of receivers per executor

2016-02-10 Thread Shixiong(Ryan) Zhu
You can't. The number of cores must be greater than the number of receivers. On Wed, Feb 10, 2016 at 2:34 AM, ajay garg wrote: > Hi All, > I am running 3 executors in my spark streaming application with 3 > cores per executors. I have written my custom receiver

Re: Spark Streaming - 1.6.0: mapWithState Kinesis huge memory usage

2016-02-04 Thread Shixiong(Ryan) Zhu
tion.Iterator$$anon$ > org.apache.spark.InterruptibleIterator# > scala.collection.IndexedSeqLike$Elements# > scala.collection.mutable.ArrayOps$ofRef# > java.lang.Object[]# > > > > > On Thu, Feb 4, 2016 at 7:14 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrot

Re: Spark Streaming - 1.6.0: mapWithState Kinesis huge memory usage

2016-02-04 Thread Shixiong(Ryan) Zhu
Hey Udo, mapWithState usually uses much more memory than updateStateByKey since it caches the states in memory. However, from your description, it looks like BlockGenerator cannot push data into BlockManager; there may be something wrong in BlockGenerator. Could you share the top 50 objects in the heap

Re: [Spark 1.5+] ReceiverTracker seems not to stop Kinesis receivers

2016-02-09 Thread Shixiong(Ryan) Zhu
Could you do a thread dump in the executor that runs the Kinesis receiver and post it? It would be great if you could provide the executor log as well. On Tue, Feb 9, 2016 at 3:14 PM, Roberto Coluccio wrote: > Hello, > > can anybody kindly help me out a little bit

Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
Are you using a custom input dstream? If so, you can make the `compute` method return None to skip a batch. On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu wrote: > I was wondering if there is there any way to skip batches with zero events > when streaming? > By skip I
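A minimal sketch of such a custom input DStream (the fetch function and element type are placeholders, not from the original thread):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    // Returning None from compute() means no RDD, and therefore no batch output, for that interval.
    class SkippingInputDStream[T: ClassTag](streamingContext: StreamingContext, fetch: Time => Seq[T])
      extends InputDStream[T](streamingContext) {

      override def start(): Unit = {}
      override def stop(): Unit = {}

      override def compute(validTime: Time): Option[RDD[T]] = {
        val data = fetch(validTime)
        if (data.isEmpty) None
        else Some(streamingContext.sparkContext.parallelize(data))
      }
    }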

Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
is behaviour > > Thanks! > On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" <shixi...@databricks.com> > wrote: > >> Are you using a custom input dstream? If so, you can make the `compute` >> method return None to skip a batch. >> >> On Thu, Feb 11

Re: PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Shixiong(Ryan) Zhu
Thanks for reporting it. I will take a look. On Thu, Feb 4, 2016 at 6:56 AM, Yuval.Itzchakov wrote: > Hi, > I've been playing with the expiramental PairDStreamFunctions.mapWithState > feature and I've seem to have stumbled across a bug, and was wondering if > anyone else has

Re: Spark Streaming from existing RDD

2016-01-29 Thread Shixiong(Ryan) Zhu
Do you just want to write some unit tests? If so, you can use "queueStream" to create a DStream from a queue of RDDs. However, because it doesn't support metadata checkpointing, it's better to only use it in unit tests. On Fri, Jan 29, 2016 at 7:35 AM, Sateesh Karuturi <
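For reference, a minimal test-only sketch of queueStream (the data pushed into the queue is just illustrative):

    import scala.collection.mutable
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("local[2]", "queue-stream-test")
    val ssc = new StreamingContext(sc, Seconds(1))

    val rddQueue = new mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(rddQueue)   // consumes one queued RDD per batch by default
    stream.map(_ * 2).print()

    ssc.start()
    rddQueue += sc.makeRDD(1 to 100)         // push test data after the context has started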

Re: streaming in 1.6.0 slower than 1.5.1

2016-01-28 Thread Shixiong(Ryan) Zhu
Hey Jesse, Could you provide the operators you are using? For the heap dump, it may not be a real memory leak. Since batches started to queue up, the memory usage should increase. On Thu, Jan 28, 2016 at 11:54 AM, Ted Yu wrote: > bq. The total size by class B is 3GB in 1.5.1

Re: Data not getting printed in Spark Streaming with print().

2016-01-28 Thread Shixiong(Ryan) Zhu
fileStream has a parameter "newFilesOnly". By default, it's true, which means it processes only new files and ignores files that already exist in the directory. So you need to ***move*** the files into the directory, otherwise it will ignore existing files. You can also set "newFilesOnly" to false. Then in the
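A hedged sketch of the newFilesOnly=false variant (directory path and input types are illustrative, assuming an existing StreamingContext ssc):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // newFilesOnly = false makes the stream also pick up files already present in the directory.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///user/someone/input",   // assumed directory
      (path: Path) => true,           // accept every file
      newFilesOnly = false
    ).map { case (_, text) => text.toString }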

Re: local class incompatible: stream classdesc serialVersionUID

2016-02-01 Thread Shixiong(Ryan) Zhu
I guess he used client mode and the local Spark version is 1.5.2 but the standalone Spark version is 1.5.1. In other words, he used a 1.5.2 driver to talk with 1.5.1 executors. On Mon, Feb 1, 2016 at 2:08 PM, Holden Karau wrote: > So I'm a little confused to exactly how

Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
hat all threads inside the worker clean up the >> ThreadLocal once they are done with processing this task? >> >> Thanks >> NB >> >> >> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>> Sp

Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
anyways. > > Thanks > NB > > > On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"? >> You don't need to use ThreadLocal if ther

Re: mapWithState: remove key

2016-01-29 Thread Shixiong(Ryan) Zhu
1. To remove a state, you need to call "state.remove()". If you return a None in the function, it just means don't output it as the DStream's output, but the state won't be removed if you don't call "state.remove()". 2. For NoSuchElementException, here is the doc for "State.get": /** * Get
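A minimal sketch of a mapping function that removes state explicitly (key/value types and pairStream are illustrative):

    import org.apache.spark.streaming.{State, StateSpec}

    val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
      if (value.isEmpty) {
        if (state.exists()) state.remove()   // actually drops the state for this key
        None                                 // the returned value only affects the DStream output
      } else {
        val sum = value.get + state.getOption().getOrElse(0)
        state.update(sum)
        Some((key, sum))
      }
    }

    val stateStream = pairStream.mapWithState(StateSpec.function(mappingFunc))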

Re: spark.kryo.classesToRegister

2016-01-27 Thread Shixiong(Ryan) Zhu
It depends. The default Kryo serializer cannot handle all cases. If you encounter any issue, you can follow the Kryo doc to set up custom serializer: https://github.com/EsotericSoftware/kryo/blob/master/README.md On Wed, Jan 27, 2016 at 3:13 AM, amit tewari wrote: > This

Re: Getting Exceptions/WARN during random runs for same dataset

2016-01-29 Thread Shixiong(Ryan) Zhu
It's a known issue. See https://issues.apache.org/jira/browse/SPARK-10719 On Thu, Jan 28, 2016 at 5:44 PM, Khusro Siddiqui wrote: > It is happening on random executors on random nodes. Not on any specific > node everytime. > Or not happening at all > > On Thu, Jan 28, 2016 at

Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
Of course. If you use a ThreadLocal in a long-living thread and forget to remove it, it's definitely a memory leak. On Thu, Jan 28, 2016 at 9:31 PM, N B wrote: > Hello, > > Does anyone know if there are any potential pitfalls associated with using > ThreadLocal variables in

Re: Spark streaming and ThreadLocal

2016-01-29 Thread Shixiong(Ryan) Zhu
> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Of cause. If you use a ThreadLocal in a long living thread and forget to >> remove it, it's definitely a memory leak. >> >> On Thu, Jan 28, 2016 at 9:31 PM,

Re: Spark Streaming not reading input coming from the other ip

2016-02-22 Thread Shixiong(Ryan) Zhu
What's the error info reported by Streaming? And could you use "telnet" to test if the network is normal? On Mon, Feb 22, 2016 at 6:59 AM, Vinti Maheshwari wrote: > For reference, my program: > > def main(args: Array[String]): Unit = { > val conf = new

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Shixiong(Ryan) Zhu
Hey Abhi, Could you post how you use mapWithState? By default, it should do checkpointing every 10 batches. However, there is a known issue that prevents mapWithState from checkpointing in some special cases: https://issues.apache.org/jira/browse/SPARK-6847 On Mon, Feb 22, 2016 at 5:47 AM,

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Shixiong(Ryan) Zhu
=> rdd.count()) On Mon, Feb 22, 2016 at 12:25 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Fix for SPARK-6847 is not in branch-1.6 > > Should the fix be ported to branch-1.6 ? > > Thanks > > On Feb 22, 2016, at 11:55 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com>

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Shixiong(Ryan) Zhu
urrently I am using reducebykeyandwindow without the inverse function and > I am able to get the correct data. But, issue the might arise is when I > have to restart my application from checkpoint and it repartitions and > computes the previous 120 partitions, which delays the incoming batch

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-28 Thread Shixiong(Ryan) Zhu
This is because the Snappy library cannot load the native library. Did you forget to install the snappy native library in your new machines? On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand wrote: > Any insights on this ? > > On Fri, Feb 26, 2016 at 1:21 PM, Abhishek

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-28 Thread Shixiong(Ryan) Zhu
; > Thanks > Abhi > > > On Tue, Feb 23, 2016 at 2:36 AM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Hey Abhi, >> >> Using reducebykeyandwindow and mapWithState will trigger the bug >> in SPARK-6847. Here is a workaround to trigger c

Re: Please help: installation of spark 1.6.0 on ubuntu fails

2016-02-25 Thread Shixiong(Ryan) Zhu
Please use Java 7 instead. On Thu, Feb 25, 2016 at 1:54 PM, Marco Mistroni wrote: > Hello all > could anyone help? > i have tried to install spark 1.6.0 on ubuntu, but the installation failed > Here are my steps > > 1. download spark (successful) > > 31 wget

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Shixiong(Ryan) Zhu
You can use `DStream.map` to transform objects to anything you want. On Thu, Feb 25, 2016 at 11:06 AM, Mohammad Tariq wrote: > Hi group, > > I have just started working with confluent platform and spark streaming, > and was wondering if it is possible to access individual
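For example, assuming the DStream already carries decoded Avro GenericRecord values, a sketch of pulling one field out by name could be:

    import org.apache.avro.generic.GenericRecord
    import org.apache.spark.streaming.dstream.DStream

    // `recordStream` and the field name "user_id" are illustrative.
    def userIds(recordStream: DStream[GenericRecord]): DStream[String] =
      recordStream.map(record => record.get("user_id").toString)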

Re: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Shixiong(Ryan) Zhu
Do you mean you cannot access Master UI after your application completes? Could you check the master log? On Mon, Feb 29, 2016 at 3:48 PM, Sumona Routh wrote: > Hi there, > I've been doing some performance tuning of our Spark application, which is > using Spark 1.2.1

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-29 Thread Shixiong(Ryan) Zhu
ype of operation that we can perform > inside mapWithState ? > > Really need to resolve this one as currently if my application is > restarted from checkpoint it has to repartition 120 previous stages which > takes hell lot of time. > > Thanks !! > Abhi > > On Mon, Feb

Re: streaming textFileStream problem - got only ONE line

2016-01-25 Thread Shixiong(Ryan) Zhu
Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or write into it directly? `textFileStream` requires that files must be written to the monitored directory by "moving" them from another location within the same file system. On Mon, Jan 25, 2016 at 6:30 AM, patcharee

Re: mapWithState and context start when checkpoint exists

2016-01-25 Thread Shixiong(Ryan) Zhu
Hey Andrey, `ConstantInputDStream` doesn't support checkpoint as it contains an RDD field. It cannot resume from checkpoints. On Mon, Jan 25, 2016 at 1:12 PM, Andrey Yegorov wrote: > Hi, > > I am new to spark (and scala) and hope someone can help me with the issue > I

Re: Spark Streaming Write Ahead Log (WAL) not replaying data after restart

2016-01-25 Thread Shixiong(Ryan) Zhu
You need to define a create function and use StreamingContext.getOrCreate. See the example here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#how-to-configure-checkpointing On Thu, Jan 21, 2016 at 3:32 AM, Patrick McGloin wrote: > Hi all, > >
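The shape of that pattern, roughly (checkpoint path, app name, and batch interval are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/someone/checkpoints"   // assumed path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("wal-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define all DStreams and output operations inside this function
      ssc
    }

    // First run calls createContext(); after a restart the context (and WAL data)
    // is rebuilt from checkpointDir instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()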

Re: updateStateByKey not persisting in Spark 1.5.1

2016-01-20 Thread Shixiong(Ryan) Zhu
Could you share your log? On Wed, Jan 20, 2016 at 7:55 AM, Brian London wrote: > I'm running a streaming job that has two calls to updateStateByKey. When > run in standalone mode both calls to updateStateByKey behave as expected. > When run on a cluster, however, it

Re: Container exited with a non-zero exit code 1-SparkJOb on YARN

2016-01-20 Thread Shixiong(Ryan) Zhu
Could you share your log? On Wed, Jan 20, 2016 at 5:37 AM, Siddharth Ubale < siddharth.ub...@syncoms.com> wrote: > > > Hi, > > > > I am running a Spark Job on the yarn cluster. > > The spark job is a spark streaming application which is reading JSON from > a kafka topic , inserting the JSON

Re: using spark context in map funciton TASk not serilizable error

2016-01-20 Thread Shixiong(Ryan) Zhu
You should not use SparkContext or RDD directly in your closures. Could you show the code of "method1"? Maybe you only need a join or something else. E.g., val cassandraRDD = sc.cassandraTable("keySpace", "tableName") reRDD.join(cassandraRDD).map().saveAsTextFile(outputDir) On Tue, Jan 19,

Re: FAIR scheduler in Spark Streaming

2016-01-26 Thread Shixiong(Ryan) Zhu
The number of concurrent Streaming jobs is controlled by "spark.streaming.concurrentJobs". It's 1 by default. However, you need to keep in mind that setting it to a bigger number will allow jobs of several batches to run at the same time. It's hard to predict the behavior and sometimes will
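For reference, the setting is a plain Spark conf entry (the value here is only an example):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.streaming.concurrentJobs", "2")   // default is 1; >1 lets jobs from several batches overlap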

Re: Need a sample code to load XML files into cassandra database using spark streaming

2016-01-26 Thread Shixiong(Ryan) Zhu
You can use spark-xml to read the xml files. https://github.com/databricks/spark-xml has some examples. To save your results to cassandra, you can use spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector On Tue, Jan 26, 2016 at 10:10 AM, Sree Eedupuganti

Re: SparkListener onApplicationEnd processing an RDD throws exception because of stopped SparkContext

2016-02-17 Thread Shixiong(Ryan) Zhu
`onApplicationEnd` is posted when SparkContext is stopping, and you cannot submit any job to a stopping SparkContext. In general, SparkListener is used to monitor the job progress and collect job information, and you should not submit jobs there. Why not submit your jobs in the main thread? On

Re: listening to recursive folder structures in s3 using pyspark streaming (textFileStream)

2016-02-17 Thread Shixiong(Ryan) Zhu
textFileStream doesn't support that. It only supports monitoring one folder. On Wed, Feb 17, 2016 at 7:20 AM, in4maniac wrote: > Hi all, > > I am new to pyspark streaming and I was following a tutorial I saw in the > internet > ( >

Re: Access to broadcasted variable

2016-02-19 Thread Shixiong(Ryan) Zhu
The broadcasted object is serialized in driver and sent to the executors. And in the executor, it will deserialize the bytes to get the broadcasted object. On Fri, Feb 19, 2016 at 5:54 AM, jeff saremi wrote: > could someone please comment on this? thanks > >

Re: Using a non serializable third party JSON serializable on a spark worker node throws NotSerializableException

2016-03-01 Thread Shixiong(Ryan) Zhu
Don't know where "argonaut.EncodeJson$$anon$2" comes from. However, you can always put your codes into an method of an "object". Then just call it like a Java static method. On Tue, Mar 1, 2016 at 10:30 AM, Yuval.Itzchakov wrote: > I have a small snippet of code which relays

Re: Using a non serializable third party JSON serializable on a spark worker node throws NotSerializableException

2016-03-01 Thread Shixiong(Ryan) Zhu
lared inside a companion object of a case class. > > The problem is that Spark will still try to serialize the method, as it > needs to execute on the worker. How will that change the fact that > `EncodeJson[T]` is not serializable? > > > On Tue, Mar 1, 2016, 21:12 Shixiong(Ryan) Zhu

Re: Converting array to DF

2016-03-01 Thread Shixiong(Ryan) Zhu
For Array, you need to call `toSeq` first. Scala can convert Array to ArrayOps automatically. However, it's not a `Seq` and you need to call `toSeq` explicitly. On Tue, Mar 1, 2016 at 1:02 AM, Ashok Kumar wrote: > Thank you sir > > This works OK > import
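Sketch of the call (column names are illustrative; Spark 1.6-style implicits, in 2.x it would be spark.implicits._):

    import sqlContext.implicits._

    val arr = Array((1, "a"), (2, "b"))
    val df = arr.toSeq.toDF("id", "value")   // toSeq first: an Array is only an ArrayOps, not a Seq
    df.show()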

Re: Spark executor killed without apparent reason

2016-03-01 Thread Shixiong(Ryan) Zhu
Could you search "OutOfMemoryError" in the executor logs? It could be "OufOfMemoryError: Direct Buffer Memory" or something else. On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel wrote: > Hi, > > We are using spark 1.5.2 or yarn. We have a spark application utilizing > about

Re: Docker configuration for akka spark streaming

2016-03-14 Thread Shixiong(Ryan) Zhu
Could you use netstat to show the ports that the driver is listening? On Mon, Mar 14, 2016 at 1:45 PM, David Gomez Saavedra wrote: > hi everyone, > > I'm trying to set up spark streaming using akka with a similar example of > the word count provided. When using spark master

Re: how to deploy new code with checkpointing

2016-04-11 Thread Shixiong(Ryan) Zhu
You cannot. Streaming doesn't support it because code changes will break Java serialization. On Mon, Apr 11, 2016 at 4:30 PM, Siva Gudavalli wrote: > hello, > > i am writing a spark streaming application to read data from kafka. I am > using no receiver approach and enabled

Re: spark streaming

2016-03-02 Thread Shixiong(Ryan) Zhu
Hey, KafkaUtils.createDirectStream doesn't need a StorageLevel as it doesn't store blocks to BlockManager. However, the error is not related to StorageLevel. It may be a bug. Could you provide more info about it? E.g., Spark version, your codes, logs. On Wed, Mar 2, 2016 at 3:02 AM, Vinti

Re: Spark Streaming 1.6 mapWithState not working well with Kryo Serialization

2016-03-02 Thread Shixiong(Ryan) Zhu
See https://issues.apache.org/jira/browse/SPARK-12591 After applying the patch, it should work. However, if you want to enable "registrationRequired", you still need to register "org.apache.spark.streaming.util.OpenHashMapBasedStateMap", "org.apache.spark.streaming.util.EmptyStateMap" and
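If registrationRequired stays on, the registration could look roughly like this (the state-map classes are package-private, hence Class.forName; verify the exact class list against your Spark version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        Class.forName("org.apache.spark.streaming.util.OpenHashMapBasedStateMap"),
        Class.forName("org.apache.spark.streaming.util.EmptyStateMap")
      ))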

Re: Can not kill driver properly

2016-03-21 Thread Shixiong(Ryan) Zhu
Could you post the log of Master? On Mon, Mar 21, 2016 at 9:25 AM, Hao Ren wrote: > Update: > > I am using --supervise flag for fault tolerance. > > > > On Mon, Mar 21, 2016 at 4:16 PM, Hao Ren wrote: > >> Using spark 1.6.1 >> Spark Streaming Jobs are

Re: Spark and Kafka direct approach problem

2016-05-04 Thread Shixiong(Ryan) Zhu
It's because the Scala version of Spark and the Scala version of Kafka don't match. Please check them. On Wed, May 4, 2016 at 6:17 AM, أنس الليثي wrote: > NoSuchMethodError usually appears because of a difference in the library > versions. > > Check the version of the

Re: Unresponsive Spark Streaming UI in YARN cluster mode - 1.5.2

2016-07-08 Thread Shixiong(Ryan) Zhu
Hey Tom, Could you provide all blocked threads? Perhaps due to some potential deadlock. On Fri, Jul 8, 2016 at 10:30 AM, Ellis, Tom (Financial Markets IT) < tom.el...@lloydsbanking.com.invalid> wrote: > Hi There, > > > > We’re currently using HDP 2.3.4, Spark 1.5.2 with a Spark Streaming job in

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
> } > > private void initialize() { > pool = new ConcurrentLinkedQueue<KafkaProducer<String, String>>(); > > for (int i = 0; i < minIdle; i++) { > pool.add(createProducer()); > } >} > >public void close

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
Could you provide your Spark version please? On Tue, Jan 31, 2017 at 10:37 AM, Nipun Arora wrote: > Hi, > > I get a resource leak, where the number of file descriptors in spark > streaming keeps increasing. We end up with a "too many file open" error > eventually

Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread Shixiong(Ryan) Zhu
You can create lazily instantiated singleton instances. See http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints for examples of accumulators and broadcast variables. You can use the same approach to create your cached RDD. On Tue,
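A sketch of that pattern applied to a cached RDD (the source path and key extraction are illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object LookupRDD {
      @volatile private var instance: RDD[(String, String)] = _

      // Built lazily once per driver, so a restart from checkpoint simply recreates it.
      def getInstance(sc: SparkContext): RDD[(String, String)] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.textFile("hdfs:///data/lookup.txt")   // assumed source
                .map(line => (line.split(",")(0), line))
                .cache()
            }
          }
        }
        instance
      }
    }

    // e.g. inside transform/foreachRDD: rdd.join(LookupRDD.getInstance(rdd.sparkContext))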

Re: Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Shixiong(Ryan) Zhu
It's documented here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints On Tue, Feb 7, 2017 at 8:12 AM, Amit Sela wrote: > Hi all, > > I was wondering if anyone ever used a broadcast variable within > an

Re: Question about "Output Op Duration" in SparkStreaming Batch details UX

2017-01-31 Thread Shixiong(Ryan) Zhu
It means the total time to run a batch, including the Spark job duration + time spent on the driver. E.g., foreachRDD { rdd => rdd.count() // say this takes 1 second. Thread.sleep(10000) // sleep 10 seconds } In the above example, the Spark job duration is 1 second and the output op

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
> On Tue, Jan 31, 2017 at 1:45 PM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Could you provide your Spark version please? >> >> On Tue, Jan 31, 2017 at 10:37 AM, Nipun Arora <nipunarora2...@gmail.com> >> wrote: >> >>

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
t; String>(topic, message); > > if (async) { > producer.send(record); > } else { > try { > producer.send(record).get(); > } catch (Exception e) { > e.printStackTrace(); > }

Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-22 Thread Shixiong(Ryan) Zhu
Which Spark version are you using? If you are using 2.1.0, could you use the monitoring APIs ( http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries) to check the input rate and the processing rate? One possible issue is that the Kafka source
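For reference, the two rates can be read off the query's progress objects (assuming `query` is the StreamingQuery returned by writeStream.start()):

    val progress = query.lastProgress          // may be null before the first trigger completes
    println(progress.inputRowsPerSecond)       // how fast data is arriving
    println(progress.processedRowsPerSecond)   // how fast it is being processed
    query.recentProgress.foreach(p => println(p.json))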

Re: kafka structured streaming source refuses to read

2017-01-27 Thread Shixiong(Ryan) Zhu
Thanks for reporting this. Which Spark version are you using? Could you provide the full log, please? On Fri, Jan 27, 2017 at 10:24 AM, Koert Kuipers wrote: > i checked my topic. it has 5 partitions but all the data is written to a > single partition: wikipedia-2 > i turned

Fwd: Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-20 Thread Shixiong(Ryan) Zhu
-- Forwarded message -- From: Shixiong(Ryan) Zhu <shixi...@databricks.com> Date: Fri, Jan 20, 2017 at 12:06 PM Subject: Re: Spark streaming app that processes Kafka DStreams produces no output and no error To: shyla deshpande <deshpandesh...@gmail.com> That's how K

Re: Using mapWithState without a checkpoint

2017-01-23 Thread Shixiong(Ryan) Zhu
Even if you don't need the checkpointing data for recovery, "mapWithState" still needs to use "checkpoint" to cut off the RDD lineage. On Mon, Jan 23, 2017 at 12:30 AM, shyla deshpande wrote: > Hello spark users, > > I do have the same question as Daniel. > > I would
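So a checkpoint directory still has to be set even when recovery is not needed, e.g. (path, pairStream and stateSpec are illustrative):

    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")    // any reliable directory works
    val stateStream = pairStream.mapWithState(stateSpec)   // now has somewhere to cut the lineage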

Re: How to access metrics for Structured Streaming 2.1

2017-01-17 Thread Shixiong(Ryan) Zhu
You can use the monitoring APIs of Structured Streaming to get metrics. See http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries On Tue, Jan 17, 2017 at 5:01 PM, Heji Kim wrote: > Hello. We are trying to
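One way to consume those metrics programmatically is the listener API described on the same page, roughly (`spark` is an assumed SparkSession):

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = {}
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        println(event.progress.json)   // per-trigger metrics: rows/sec, durations, sources
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
    })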

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-28 Thread Shixiong(Ryan) Zhu
The REST APIs are not just for Spark history server. When an application is running, you can use the REST APIs to talk to Spark UI HTTP server as well. On Tue, Feb 28, 2017 at 10:46 AM, Parag Chaudhari wrote: > ping... > > > > *Thanks,Parag Chaudhari,**USC Alumnus (Fight

Re: Map with state keys serialization

2016-10-10 Thread Shixiong(Ryan) Zhu
Conf and I registered the class. Is there a different > configuration setting for the state map keys? > > Thanks! > > -Joey > > On Sun, Oct 9, 2016 at 10:58 PM, Shixiong(Ryan) Zhu > <shixi...@databricks.com> wrote: > > You can use Kryo. It also implements KryoSerializa

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-10 Thread Shixiong(Ryan) Zhu
Yeah, the KafkaRDD cannot be reused. It's better to document it. On Thu, Nov 10, 2016 at 8:26 AM, Ivan von Nagy wrote: > Ok, I have split he KafkaRDD logic to each use their own group and bumped > the poll.ms to 10 seconds. Anything less then 2 seconds on the poll.ms > ends up

Re: Map with state keys serialization

2016-10-12 Thread Shixiong(Ryan) Zhu
> wrote: > I tried with 1.6.2 and saw the same behavior. > > -Joey > > On Tue, Oct 11, 2016 at 5:18 PM, Shixiong(Ryan) Zhu > <shixi...@databricks.com> wrote: > > There are some known issues in 1.6.0, e.g., > > https://issues.apache.org/jira/browse/SPARK-12591

Re: getting error on spark streaming : java.lang.OutOfMemoryError: unable to create new native thread

2016-11-22 Thread Shixiong(Ryan) Zhu
Possibly https://issues.apache.org/jira/browse/SPARK-17396 On Tue, Nov 22, 2016 at 1:42 PM, Mohit Durgapal wrote: > Hi Everyone, > > > I am getting the following error while running a spark streaming example > on my local machine, the being ingested is only 506kb. > > >

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-23 Thread Shixiong(Ryan) Zhu
> On 22 November 2016 at 22:13, Shixiong(Ryan) Zhu <shixi...@databricks.com> > wrote: > >> The workaround is defining the imports and class together using ":paste". >> >> On Tue, Nov 22, 2016 at 11:12 AM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com&

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-22 Thread Shixiong(Ryan) Zhu
This relates to a known issue: https://issues.apache.org/jira/browse/SPARK-14146 and https://issues.scala-lang.org/browse/SI-9799 On Tue, Nov 22, 2016 at 6:37 AM, dbolshak wrote: > Hello, > > We have the same issue, > > We use latest release 2.0.2. > > Setup with

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-22 Thread Shixiong(Ryan) Zhu
The workaround is defining the imports and class together using ":paste". On Tue, Nov 22, 2016 at 11:12 AM, Shixiong(Ryan) Zhu < shixi...@databricks.com> wrote: > This relates to a known issue: https://issues.apache. > org/jira/browse/SPARK-14146 and https://issues.scala

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread Shixiong(Ryan) Zhu
Could you provide the Person class? On Wed, Nov 16, 2016 at 1:19 PM, shyla deshpande <deshpandesh...@gmail.com> wrote: > I am using 2.11.8. Thanks > > On Wed, Nov 16, 2016 at 1:15 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Which Scala versio

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread Shixiong(Ryan) Zhu
Which Scala version are you using? Is it Scala 2.10? Scala 2.10 has some known race conditions in reflection and the Scala community doesn't have plan to fix it ( http://docs.scala-lang.org/overviews/reflection/thread-safety.html) AFAIK, the only way to fix it is upgrading to Scala 2.11. On Wed,

Re: HiveContext.getOrCreate not accessible

2016-11-17 Thread Shixiong(Ryan) Zhu
`SQLContext.getOrCreate` will return the HiveContext you created. On Mon, Nov 14, 2016 at 11:17 PM, Praseetha wrote: > > Hi All, > > > I have a streaming app and when i try invoking the > HiveContext.getOrCreate, it errors out with the following stmt. 'object >
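For example, inside a streaming job the context is usually fetched from the RDD's SparkContext (sketch; `dstream` is illustrative):

    import org.apache.spark.sql.SQLContext

    dstream.foreachRDD { rdd =>
      // Returns the already-created context (the HiveContext, if that is what was built first).
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      // ... build DataFrames from rdd here
    }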

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-17 Thread Shixiong(Ryan) Zhu
aultInstance = com.example.protos.demo.Person( >> ) >> implicit class PersonLens[UpperPB](_l: com.trueaccord.lenses.Lens[UpperPB, >> com.example.protos.demo.Person]) extends >> com.trueaccord.lenses.ObjectLens[UpperPB, >> com.example.protos.demo.Person](_l) { >>

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Shixiong(Ryan) Zhu
rs for me to > reproduce the error so I can get back as early as possible. > > Thanks a lot! > > On Mon, Oct 31, 2016 at 12:41 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Then it should not be a Receiver issue. Could you use `jstack` to find >> ou

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Shixiong(Ryan) Zhu
Dstream "Window" uses "union" to combine multiple RDDs in one window into a single RDD. On Tue, Nov 1, 2016 at 2:59 AM kant kodali wrote: > @Sean It looks like this problem can happen with other RDD's as well. Not > just unionRDD > > On Tue, Nov 1, 2016 at 2:52 AM, kant

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Shixiong(Ryan) Zhu
mode). > > Thanks! > > On Mon, Oct 31, 2016 at 12:32 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Sorry, there is a typo in my previous email: this may **not** be the >> root cause if the leak threads are in the driver side. >> >> Doe

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Shixiong(Ryan) Zhu
Yes, try 2.0.1! On Tue, Nov 1, 2016 at 11:25 AM, kant kodali <kanth...@gmail.com> wrote: > AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0 > > On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Dstream "Wi

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Shixiong(Ryan) Zhu
So in your code, each Receiver will start a new thread. Did you stop the receiver properly in `Receiver.onStop`? Otherwise, you may leak threads after a receiver crashes and is restarted by Spark. However, this may not be the root cause if the leaked threads are on the driver side. Could you use
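A hedged sketch of a receiver that cleans up its thread in onStop (the data source is just a placeholder):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class SimpleReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      @volatile private var worker: Thread = _

      override def onStart(): Unit = {
        worker = new Thread("simple-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store("event")      // placeholder for pulling real data
              Thread.sleep(100)
            }
          }
        }
        worker.start()
      }

      // Without this, the thread started in onStart keeps running (and leaks) after a restart.
      override def onStop(): Unit = {
        if (worker != null) worker.interrupt()
      }
    }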

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Shixiong(Ryan) Zhu
cs.oracle.com/javase/8/docs/api/java/lang/Thread.html doesn't seem > to have any method where I can clean up the threads created during OnStart. > any ideas? > > Thanks! > > > On Mon, Oct 31, 2016 at 11:58 AM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: &g

Re: Map with state keys serialization

2016-10-11 Thread Shixiong(Ryan) Zhu
; Kryo. Also, this is with 1.6.0 so if this behavior changed/got fixed > > in a later release let me know. > > > > -Joey > > > > On Mon, Oct 10, 2016 at 9:54 AM, Shixiong(Ryan) Zhu > > <shixi...@databricks.com> wrote: > >> That's enough. Did you see a

Re: structured streaming polling timeouts

2017-01-11 Thread Shixiong(Ryan) Zhu
You can increase the timeout using the option "kafkaConsumer.pollTimeoutMs". In addition, I would recommend you try Spark 2.1.0 as there are many improvements in Structured Streaming. On Wed, Jan 11, 2017 at 11:05 AM, Timothy Chan wrote: > I'm using Spark 2.0.2 and running
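For reference, the option is set on the Kafka source like any other read option (broker and topic values are illustrative; `spark` is an assumed SparkSession):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("kafkaConsumer.pollTimeoutMs", "120000")   // raise the executor-side poll timeout
      .load()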

Re: [Spark Streaming] NoClassDefFoundError : StateSpec

2017-01-12 Thread Shixiong(Ryan) Zhu
You can find the Spark version of spark-submit in the log. Could you check whether it's consistent? On Thu, Jan 12, 2017 at 7:35 AM Ramkumar Venkataraman < ram.the.m...@gmail.com> wrote: > Spark: 1.6.1 > > I am trying to use the new mapWithState API and I am getting the following > error: > >
