SparkR driver side JNI

2015-08-06 Thread Renyi Xiong
Why did SparkR eventually choose an inter-process socket solution on the driver side, instead of the in-process JNI approach shown in one of its docs below (around page 20)? https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf

Re: SparkR driver side JNI

2015-09-11 Thread Renyi Xiong
> Yeah in addition to the downside of having 2 JVMs the command line
> arguments and SparkConf etc. will be set by spark-submit in the first
> JVM which won't be available in the second JVM.
>
> Shivaram
>
> On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong wrote: …

Re: SparkR driver side JNI

2015-09-11 Thread Renyi Xiong
…through JNI?
>
> On Thu, Sep 10, 2015 at 9:29 PM, Shivaram Venkataraman wrote:
>> Yeah in addition to the downside of having 2 JVMs the command line
>> arguments and SparkConf etc. will be set by spark-submit in the first …

pyspark streaming DStream compute

2015-09-15 Thread Renyi Xiong
Can anybody help me understand why pyspark streaming uses a py4j callback to execute Python code, while pyspark batch uses worker.py? Regarding pyspark streaming, is the py4j callback only used for DStream, with worker.py still used for RDD? Thanks, Renyi.

SparkR streaming source code

2015-09-16 Thread Renyi Xiong
SparkR streaming is mentioned at about page 17 in the pdf below; can anyone share the source code? (I could not find it on GitHub.) https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-19-Hao-Lin-Haichuan-Wang.pdf Thanks, Renyi.

Re: SparkR streaming source code

2015-09-16 Thread Renyi Xiong
Reynold Xin wrote:
> You should reach out to the speakers directly.
>
> On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong wrote:
>> SparkR streaming is mentioned at about page 17 in below pdf, can anyone
>> share source code? (could not find it on GitHub)

Spark streaming DStream state on worker

2015-09-16 Thread Renyi Xiong
Hi, I want to do a temporal join operation on a DStream across RDDs. My question is: are RDDs from the same DStream always computed on the same worker (except for failover)? Thanks, Renyi.
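
Whatever the locality answer turns out to be, the portable way to join across RDDs of a DStream is to key the state and carry it through the stream, rather than rely on RDDs landing on the same worker. Below is a minimal pure-Python model of updateStateByKey-style semantics; all names are illustrative, none of this is Spark API.

```python
def update_state(batches, update_func):
    """Fold a sequence of keyed batches into per-key state, the way
    updateStateByKey folds successive micro-batches."""
    state = {}
    for batch in batches:
        # group the new values by key, as Spark would per partition
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # apply the user update function to (new_values, old_state)
        for key, values in grouped.items():
            state[key] = update_func(values, state.get(key))
    return state

# running count per key across three "micro-batches"
batches = [
    [("a", 1), ("b", 1)],
    [("a", 1)],
    [("a", 1), ("b", 1)],
]
counts = update_state(batches, lambda vals, old: (old or 0) + sum(vals))
print(counts)  # {'a': 3, 'b': 2}
```

Because the state travels with the keyed data, the scheduler is free to place each batch's tasks on any worker.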

failed to run spark sample on windows

2015-09-28 Thread Renyi Xiong
I tried to run the HdfsTest sample on windows with spark-1.4.0: bin\run-example org.apache.spark.examples.HdfsTest, but got the exception below. Anybody have an idea what went wrong here? 15/09/28 16:33:56.565 ERROR SparkContext: Error initializing SparkContext. java.lang.NullPointerException at java.lang…

Re: failed to run spark sample on windows

2015-09-29 Thread Renyi Xiong
…with the one which was used to build Spark 1.4.0?
>
> Cheers
>
> On Mon, Sep 28, 2015 at 4:36 PM, Renyi Xiong wrote:
>> I tried to run HdfsTest sample on windows spark-1.4.0
>>
>> bin\run-example org.apache.spark.examples.HdfsTest
>>
>> but got …

Re: failed to run spark sample on windows

2015-09-30 Thread Renyi Xiong
Thanks a lot, it works now after I set %HADOOP_HOME%. On Tue, Sep 29, 2015 at 1:22 PM, saurfang wrote:
> See
> http://stackoverflow.com/questions/26516865/is-it-possible-to-run-hadoop-jobs-like-the-wordcount-sample-in-the-local-mode ,
> https://issues.apache.org/jira/browse/SPARK-6961 and fin…
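
For reference, the fix tracked in SPARK-6961 is that on Windows HADOOP_HOME must point at a directory containing bin\winutils.exe, otherwise SparkContext initialization dies with a NullPointerException. Here is a small hypothetical helper (not part of Spark) that sanity-checks that setup before launching a sample:

```python
import os

def check_windows_hadoop_setup():
    """Check the Windows prerequisites from SPARK-6961: HADOOP_HOME
    must be set and contain bin\\winutils.exe."""
    hadoop_home = os.environ.get("HADOOP_HOME")
    if not hadoop_home:
        return "HADOOP_HOME is not set"
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    if not os.path.exists(winutils):
        return "winutils.exe missing under HADOOP_HOME/bin"
    return "ok"
```

Running such a check before bin\run-example turns the opaque NullPointerException into an actionable message.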

SparkR dataframe UDF

2015-10-02 Thread Renyi Xiong
Hi Shiva, is the DataFrame UDF implemented in SparkR yet? I could not find it in the URL below: https://github.com/hlin09/spark/tree/SparkR-streaming/R/pkg/R Thanks, Renyi.

Re: failure notice

2015-10-05 Thread Renyi Xiong
Date: Mon, 28 Sep 2015 16:31:42 -0700
Subject: Re: Spark streaming DStream state on worker
From: Renyi Xiong

Re: failure notice

2015-10-06 Thread Renyi Xiong
…should not be any node-specific long term state that
> you rely on unless you can recover that state on a different node.
>
> On Mon, Oct 5, 2015 at 3:03 PM, Renyi Xiong wrote:
>> if RDDs from same DStream not guaranteed to run on same worker, then the
>> question becom…

Re: [Streaming] join events in last 10 minutes

2015-10-14 Thread Renyi Xiong
Hi TD, the scenario here is to let events from topic1 wait a fixed 10 minutes for events with the same key from topic2 to arrive, and then left outer join them by the key. Does the query do what is expected? If not, what is the right way to achieve this? Thanks, Renyi. On Tue, Oct 13, 2015 at 5:14 PM, Danie…
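
To frame what "expected" means here, the intended result can be modelled in plain Python: every topic1 event is emitted exactly once, paired with any same-key topic2 events that arrived within the wait window, or with None otherwise. This is only a batch-level sketch of the desired join semantics, not streaming code:

```python
def delayed_left_outer_join(topic1, topic2):
    """Left outer join topic1 events against the topic2 events seen
    during the wait window, by key. topic1/topic2 are lists of
    (key, value) pairs; unmatched topic1 events keep None."""
    lookup = {}
    for key, value in topic2:
        lookup.setdefault(key, []).append(value)
    joined = []
    for key, value in topic1:
        if key in lookup:
            joined.extend((key, (value, v2)) for v2 in lookup[key])
        else:
            joined.append((key, (value, None)))
    return joined

result = delayed_left_outer_join(
    [("k1", "a"), ("k2", "b")],   # topic1 events held for the window
    [("k1", "x")],                # topic2 events seen in the window
)
print(result)  # [('k1', ('a', 'x')), ('k2', ('b', None))]
```

In a streaming query, the fixed 10-minute wait corresponds to how long topic1 events are buffered before the join is evaluated.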

let spark streaming sample come to stop

2015-11-13 Thread Renyi Xiong
Hi, I tried to run the following 1.4.1 sample by putting a words.txt under localdir:

bin\run-example org.apache.spark.examples.streaming.HdfsWordCount localdir

Two questions:
1. It does not pick up words.txt, because it's 'old' I guess. Is there any option to let it be picked up?
2. I managed to put a 'new' file …

DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
Hi, I hit the following exception when the driver program tried to recover from a checkpoint; it looks like the logic relies on zeroTime being set, which doesn't seem to happen here. Am I missing anything, or is it a bug in 1.4.1? org.apache.spark.SparkException: org.apache.spark.streaming.api.csharp.CSharp…

Re: let spark streaming sample come to stop

2015-12-09 Thread Renyi Xiong
…push them into a dstream. It is meant to be run
> indefinitely, unless interrupted by ctrl-c, for example.
>
> -bryan
>
> On Nov 13, 2015 10:52 AM, "Renyi Xiong" wrote:
>> Hi,
>>
>> I try to run the following 1.4.1 s…

Re: DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
bmit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Wed, Dec 9, 2015 at 12:45 PM, Renyi Xiong wrote: > hi, > > I met following exception when the driver program tried

Re: DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
Never mind, one of my peers corrected the driver program for me: all dstream operations need to be within the scope of the getOrCreate API. On Wed, Dec 9, 2015 at 3:32 PM, Renyi Xiong wrote:
> following scala program throws same exception, I know people are running
> streaming jobs against ka…

pyspark streaming 1.6 mapWithState?

2015-12-21 Thread Renyi Xiong
Hi TD, I noticed mapWithState is available in Spark 1.6. Is there any plan to enable it in pyspark as well? Thanks, Renyi.
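
For readers unfamiliar with the API: unlike updateStateByKey, mapWithState emits a mapped record per input while updating per-key state on the side. A pure-Python model of those semantics (illustrative only, not the pyspark API):

```python
def map_with_state(batches, func):
    """func(key, value, old_state) -> (mapped_record, new_state).
    Returns the mapped records per batch plus the final state map."""
    state = {}
    out = []
    for batch in batches:
        mapped = []
        for key, value in batch:
            record, new_state = func(key, value, state.get(key))
            state[key] = new_state
            mapped.append(record)
        out.append(mapped)
    return out, state

# emit (key, running_total) for every input record
def running_total(key, value, old):
    total = (old or 0) + value
    return (key, total), total

out, state = map_with_state([[("a", 1), ("b", 2)], [("a", 3)]], running_total)
print(out)    # [[('a', 1), ('b', 2)], [('a', 4)]]
print(state)  # {'a': 4, 'b': 2}
```

The per-record output stream is what distinguishes it from updateStateByKey, which only produces the state itself.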

Spark Streaming KafkaUtils missing Save API?

2016-01-15 Thread Renyi Xiong
Hi, we noticed there is no save method in KafkaUtils, yet we do have scenarios where we want to save an RDD back to a Kafka queue to be consumed by downstream streaming applications. I wonder if this is a common scenario; if yes, is there any plan to add it? Thanks, Renyi.
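
Until such an API exists, the usual workaround is rdd.foreachPartition with one producer per partition. A sketch of what a hypothetical per-partition save would do; the producer here is a stand-in object, not a real Kafka client:

```python
class FakeProducer:
    """Stand-in for a Kafka producer client (hypothetical interface)."""
    def __init__(self):
        self.sent = []
        self.flushed = False

    def send(self, topic, key, value):
        self.sent.append((topic, key, value))

    def flush(self):
        self.flushed = True

def save_partition_to_kafka(partition, producer, topic):
    """Send every (key, value) record of one partition through a
    single reused producer, then flush so buffered records survive."""
    count = 0
    for key, value in partition:
        producer.send(topic, key, value)
        count += 1
    producer.flush()
    return count

producer = FakeProducer()
n = save_partition_to_kafka([("k1", "v1"), ("k2", "v2")], producer, "out")
print(n)  # 2
```

Reusing one producer per partition (rather than per record) is the part that matters for throughput.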

pyspark worker concurrency

2016-02-06 Thread Renyi Xiong
Hi, is it a good idea to have 2 threads in the pyspark worker: a main thread responsible for receiving and sending data over the socket, while the other thread calls user functions to process the data? Since the CPU is idle (?) during network I/O, this should improve concurrency quite a bit. Can an expert answer t…
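
The proposed two-thread layout can be sketched in plain Python: a reader thread simulates blocking socket reads into a bounded queue while the main thread applies the user function, so CPU work overlaps I/O wait. This is only an illustration of the idea, not pyspark's actual worker:

```python
import queue
import threading

def pipelined_worker(records, process, slow_read):
    q = queue.Queue(maxsize=16)  # bounded, to apply backpressure
    sentinel = object()

    def reader():
        for record in records:
            slow_read()          # stand-in for a blocking socket recv
            q.put(record)
        q.put(sentinel)          # signal end of stream

    t = threading.Thread(target=reader)
    t.start()
    out = []
    while True:
        record = q.get()
        if record is sentinel:
            break
        out.append(process(record))  # user function overlaps I/O
    t.join()
    return out

result = pipelined_worker(range(5), lambda x: x * x, lambda: None)
print(result)  # [0, 1, 4, 9, 16]
```

With CPython's GIL the win comes specifically from overlapping I/O wait with computation; pure CPU work on two threads would not run in parallel.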

Re: pyspark worker concurrency

2016-02-08 Thread Renyi Xiong
Never mind, I think pyspark already does async socket read/write, but on the scala side in PythonRDD.scala. On Sat, Feb 6, 2016 at 6:27 PM, Renyi Xiong wrote:
> Hi,
>
> is it a good idea to have 2 threads in pyspark worker? - main thread
> responsible for receive and send data …

DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-10 Thread Renyi Xiong
Hi TD, Thanks a lot for offering to look at our PR (if we file one) at the conference in NYC. As we briefly discussed the issues of unbalanced and under-distributed Kafka partitions when developing Spark streaming applications in Mobius (C# for Spark), we're trying the option of repartitioning within…

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-14 Thread Renyi Xiong
…a lot of potential change already underway.
>
> See
>
> https://issues.apache.org/jira/browse/SPARK-12177
>
> On Thu, Mar 10, 2016 at 1:59 PM, Renyi Xiong wrote:
>> Hi TD,
>>
>> Thanks a lot for offering to look at our PR (if we file one) at the
>> conf…

Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2016-04-01 Thread Renyi Xiong
Hi Sean, we're upgrading Mobius (the C# binding for Spark) at Microsoft to align with Spark 1.6.2 and noticed some API changes you made in https://github.com/apache/spark/commit/6f81eae24f83df51a99d4bb2629dd7daadc01519, mostly on APIs with the Approx postfix (still marked as experimental in pyspark t…

RE: Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2016-04-01 Thread Renyi Xiong
Thanks a lot, Sean, really appreciate your comments. Sent from my Windows 10 phone.
From: Sean Owen
Sent: Friday, April 1, 2016 12:55 PM
To: Renyi Xiong
Cc: Tathagata Das; dev
Subject: Re: Declare rest of @Experimental items non-experimental if they've existed since 1.2.0
The change ther…

Spark Streaming UI reporting a different task duration

2016-04-05 Thread Renyi Xiong
Hi TD, we noticed that the Spark Streaming UI reports a different task duration from time to time. For example, here is the standard output of the application, which reports the duration of the longest task as about 3.3 minutes: 16/04/01 16:07:19 INFO TaskSetManager: Finished task 1077.0 in stage 0.0 (TI…

Spark streaming Kafka receiver WriteAheadLog question

2016-04-22 Thread Renyi Xiong
Hi, is it possible for the Kafka receiver generated WriteAheadLogBackedBlockRDD to hold the corresponding Kafka offset range, so that during recovery the RDD can refer back to the Kafka queue instead of paying the cost of the write ahead log? I guess there must be a reason here. Could anyone please help me underst…

Spark streaming concurrent job scheduling question

2016-04-28 Thread Renyi Xiong
Hi, I am trying to run an I/O-intensive RDD in parallel with a CPU-intensive RDD within an application, through a window like below: var ssc = new StreamingContext(sc, 1min); var ds1 = ...; var ds2 = ds1.Window(2min).ForeachRDD(...); ds1.ForeachRDD(...); I hope ds1 starts its job at a 1-min interval ev…

persist versus checkpoint

2016-04-30 Thread Renyi Xiong
Hi, is RDD.persist equivalent to RDD.checkpoint if they save the same number of copies (say 3) to disk? (I assume persist saves copies on different machines?) Thanks, Renyi.
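
They are not equivalent even with the same replication: persist (including replicated storage levels such as MEMORY_AND_DISK_2) stores blocks on executors and keeps the full lineage, so lost blocks are recomputed from the source, while checkpoint writes to reliable storage and truncates the lineage. A toy model of that difference, with RDDs as dicts carrying a "parent" pointer (illustrative only):

```python
def lineage_depth(rdd):
    """Count how many ancestors a (toy) RDD would have to recompute."""
    depth = 0
    while rdd["parent"] is not None:
        rdd = rdd["parent"]
        depth += 1
    return depth

source = {"parent": None}
mapped = {"parent": source}
filtered = {"parent": mapped}

persisted = filtered             # persist: data cached, lineage intact
checkpointed = {"parent": None}  # checkpoint: lineage cut at this RDD

print(lineage_depth(persisted), lineage_depth(checkpointed))  # 2 0
```

The truncated lineage is what lets a checkpointed RDD survive scenarios where recomputation is impossible, such as recovery of a streaming driver.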

MetadataFetchFailedException if executorLost when spark.speculation enabled ?

2016-05-01 Thread Renyi Xiong
Hi, we observed MetadataFetchFailedException during executorLost when spark.speculation is enabled. It looks like something is out of sync between the original task on the failed executor and its speculative counterpart? Is this a known issue? Please let me know if you need more details. Thanks, Renyi.

Re: Spark streaming Kafka receiver WriteAheadLog question

2016-05-02 Thread Renyi Xiong
> …inconsistencies between data reliably received by Spark Streaming and
> offsets tracked by Zookeeper.
>
> thanks
> Mario
>
> Renyi Xiong wrote on 01/05/2016 03:34:51 am: Hi, Thanks a lot, Cody and Mario, for your comments.

Spark shuffling OutOfMemoryError Java heap space

2016-05-15 Thread Renyi Xiong
Hi, I am consistently observing a driver OutOfMemoryError (Java heap space) during a shuffle operation, indicated by this log line: 16/05/14 21:57:03 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 36060250 bytes -> the shuffle metadata size is big and the full metadata will be sent…