Why did SparkR eventually choose an inter-process socket solution on the
driver side instead of the in-process JNI shown in one of its docs below
(about page 20)?
https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf
> Yeah, in addition to the downside of having 2 JVMs, the command line
> arguments and SparkConf etc. will be set by spark-submit in the first
> JVM, which won't be available in the second JVM.
>
> Shivaram
>
> On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong wrote:
Can anybody help me understand why pyspark streaming uses a py4j callback to
execute Python code, while pyspark batch uses worker.py?
Regarding pyspark streaming, is the py4j callback only used for
DStream operations, with worker.py still used for RDDs?
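For reference, a minimal standalone sketch of the py4j callback mechanism as
I understand it (the interface name below is made up for illustration; I
believe pyspark streaming registers its real TransformFunction the same way):

    from py4j.java_gateway import JavaGateway, CallbackServerParameters

    class EchoCallback(object):
        # object the JVM can invoke back across the py4j callback socket
        def call(self, message):
            return "python saw: " + message

        class Java:
            # made-up interface name, for illustration only
            implements = ["example.Callback"]

    # the callback server is what allows JVM -> Python calls; it needs a
    # JVM-side GatewayServer running to actually connect
    gateway = JavaGateway(
        callback_server_parameters=CallbackServerParameters())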
thanks,
Renyi.
SparkR streaming is mentioned at about page 17 in the PDF below; can anyone
share the source code? (I could not find it on GitHub.)
https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-19-Hao-Lin-Haichuan-Wang.pdf
Thanks,
Renyi.
Reynold Xin wrote:
> You should reach out to the speakers directly.
Hi,
I want to do a temporal join operation on a DStream across RDDs. My question
is: are RDDs from the same DStream always computed on the same worker (except
during failover)?
thanks,
Renyi.
I tried to run the HdfsTest sample on Windows with spark-1.4.0:
bin\run-example org.apache.spark.examples.HdfsTest
but got the exception below. Does anybody have any idea what was wrong here?
15/09/28 16:33:56.565 ERROR SparkContext: Error initializing SparkContext.
java.lang.NullPointerException
at java.lang.
> …with the one which was used to build Spark 1.4.0?
>
> Cheers
>
> On Mon, Sep 28, 2015 at 4:36 PM, Renyi Xiong wrote:
Thanks a lot, it works now after I set %HADOOP_HOME%.
On Tue, Sep 29, 2015 at 1:22 PM, saurfang wrote:
> See
>
> http://stackoverflow.com/questions/26516865/is-it-possible-to-run-hadoop-jobs-like-the-wordcount-sample-in-the-local-mode
> ,
> https://issues.apache.org/jira/browse/SPARK-6961 and fin
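For anyone else hitting this: the fix amounts to pointing %HADOOP_HOME% at a
directory that has bin\winutils.exe in it before launching (the path below is
illustrative):

    set HADOOP_HOME=C:\hadoop
    bin\run-example org.apache.spark.examples.HdfsTest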
Hi Shiva,
Is the DataFrame UDF implemented in SparkR yet? I could not find it in the
URL below:
https://github.com/hlin09/spark/tree/SparkR-streaming/R/pkg/R
Thanks,
Renyi.
> Subject: Re: Spark streaming DStream state on worker
> From: Renyi Xiong
> There should not be any node-specific long-term state that
> you rely on unless you can recover that state on a different node.
>
> On Mon, Oct 5, 2015 at 3:03 PM, Renyi Xiong wrote:
>
>> if RDDs from same DStream not guaranteed to run on same worker, then the
>> question becom
Hi TD,
The scenario here is to let events from topic1 wait a fixed 10 minutes for
events with the same key from topic2 to arrive, and left-outer-join them by
the key.
Does the query do what is expected? If not, what is the right way to
achieve this?
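For concreteness, a minimal pyspark streaming sketch of the shape I have in
mind (queueStream stands in for the two Kafka topics; the batch and window
sizes are illustrative):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "TemporalJoinSketch")
    ssc = StreamingContext(sc, 60)  # 1-minute batches

    # stand-ins for the two Kafka topic streams
    stream1 = ssc.queueStream([sc.parallelize([("k1", "a"), ("k2", "b")])])
    stream2 = ssc.queueStream([sc.parallelize([("k1", "x")])])

    # hold 10 minutes of topic1 events, sliding every batch, so each event
    # waits up to 10 minutes for a matching key from topic2
    joined = stream1.window(600, 60).leftOuterJoin(stream2)
    joined.pprint()  # unmatched topic1 events come out paired with None

    ssc.start()
    ssc.awaitTerminationOrTimeout(180)
    ssc.stop()

One caveat with this shape: each topic1 event appears in up to 10 successive
windows, so duplicate join outputs would need handling downstream.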
thanks,
Renyi.
On Tue, Oct 13, 2015 at 5:14 PM, Danie
Hi,
I try to run the following 1.4.1 sample by putting a words.txt under
localdir:
bin\run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
Two questions:
1. It does not pick up words.txt, because it's 'old' I guess; is there any
option to let it be picked up? (See the sketch after this list.)
2. I managed to put a 'new' file
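A sketch of the workaround I'm testing for question 1: copy the file into the
watched directory after the stream starts, so it counts as new (paths and
batch interval are illustrative):

    import shutil
    import time
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "HdfsWordCountSketch")
    ssc = StreamingContext(sc, 2)

    counts = ssc.textFileStream("localdir") \
                .flatMap(lambda line: line.split(" ")) \
                .map(lambda w: (w, 1)) \
                .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    time.sleep(3)  # let the stream start before the file appears
    shutil.copy("words.txt", "localdir/words-new.txt")  # now it is 'new'
    ssc.awaitTerminationOrTimeout(30)
    ssc.stop()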
Hi,
I met the following exception when the driver program tried to recover from
a checkpoint. It looks like the logic relies on zeroTime being set, which
doesn't seem to happen here. Am I missing anything, or is it a bug in 1.4.1?
org.apache.spark.SparkException:
org.apache.spark.streaming.api.csharp.CSharp
> …push them into a dstream. It is meant to be run
> indefinitely, unless interrupted by ctrl-c, for example.
>
> -bryan
> On Nov 13, 2015 10:52 AM, "Renyi Xiong" wrote:
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
On Wed, Dec 9, 2015 at 12:45 PM, Renyi Xiong wrote:
Never mind, one of my peers corrected the driver program for me: all dstream
operations need to be within the scope of the getOrCreate API.
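Roughly, the working shape is this (a minimal pyspark sketch of the same fix;
the checkpoint path and the source are illustrative):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    checkpoint_dir = "/tmp/recoverable-checkpoint"

    def create_context():
        sc = SparkContext(appName="RecoverableSketch")
        ssc = StreamingContext(sc, 10)
        ssc.checkpoint(checkpoint_dir)
        # every dstream operation must live inside this factory, so that
        # recovery rebuilds the full graph instead of an empty context
        ssc.socketTextStream("localhost", 9999).count().pprint()
        return ssc

    # first run: the factory executes; on restart the graph is rebuilt
    # from the checkpoint and the factory is skipped
    ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
    ssc.start()
    ssc.awaitTermination()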
On Wed, Dec 9, 2015 at 3:32 PM, Renyi Xiong wrote:
> following scala program throws same exception, I know people are running
> streaming jobs against ka
Hi TD,
I noticed mapWithState is available in Spark 1.6. Is there any plan to
enable it in pyspark as well?
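For reference, the stateful primitive pyspark does have today is
updateStateByKey; a minimal sketch (the source, interval, and checkpoint path
are illustrative):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StateSketch")
    ssc = StreamingContext(sc, 5)
    ssc.checkpoint("/tmp/state-checkpoint")  # required for stateful ops

    def update(new_values, running_total):
        # new_values: this batch's counts; running_total: prior state
        return sum(new_values) + (running_total or 0)

    ssc.socketTextStream("localhost", 9999) \
       .flatMap(lambda line: line.split(" ")) \
       .map(lambda w: (w, 1)) \
       .updateStateByKey(update) \
       .pprint()

    ssc.start()
    ssc.awaitTermination()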
thanks,
Renyi.
Hi,
We noticed there's no save method in KafkaUtils. We do have scenarios where
we want to save an RDD back to a Kafka queue to be consumed by downstream
streaming applications.
I wonder if this is a common scenario; if yes, is there any plan to add it?
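The workaround I can think of is a plain producer inside foreachPartition,
roughly like this (uses the third-party kafka-python package; the broker and
topic names are illustrative):

    from pyspark import SparkContext
    from kafka import KafkaProducer  # third-party kafka-python package

    sc = SparkContext("local[2]", "KafkaSaveSketch")
    rdd = sc.parallelize(["a", "b", "c"])  # stand-in for the real output RDD

    def send_partition(records):
        # one producer per partition; connections cannot be shared
        # across executors
        producer = KafkaProducer(bootstrap_servers="broker:9092")
        for record in records:
            producer.send("downstream-topic", record.encode("utf-8"))
        producer.flush()
        producer.close()

    rdd.foreachPartition(send_partition)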
Thanks,
Renyi.
Hi,
Is it a good idea to have 2 threads in the pyspark worker: a main thread
responsible for receiving and sending data over the socket, while the other
thread calls user functions to process the data?
Since the CPU is idle during network I/O, this should improve concurrency
quite a bit.
Can an expert answer this?
Never mind, I think pyspark is already doing async socket read/write,
but on the Scala side in PythonRDD.scala.
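For illustration, the overlap I was after looks like this in plain Python
(pyspark gets a similar effect from the dedicated writer thread in
PythonRDD.scala):

    import threading
    import queue

    buf = queue.Queue(maxsize=1024)

    def reader():
        # stand-in for the thread that reads records off the socket
        for item in range(10000):
            buf.put(item)
        buf.put(None)  # sentinel: end of stream

    threading.Thread(target=reader, daemon=True).start()

    # the main thread runs the user function while the reader keeps the
    # buffer full, so CPU work overlaps network I/O
    while True:
        item = buf.get()
        if item is None:
            break
        _ = item * item  # stand-in for the user function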
On Sat, Feb 6, 2016 at 6:27 PM, Renyi Xiong wrote:
Hi TD,
Thanks a lot for offering to look at our PR (if we file one) at the
conference in NYC.
As we discussed briefly, given the issues of unbalanced and
under-distributed kafka partitions when developing a Spark streaming
application in Mobius (C# for Spark), we're trying the option of
repartitioning within
> …a lot of potential change already underway.
>
> See
> https://issues.apache.org/jira/browse/SPARK-12177
> On Thu, Mar 10, 2016 at 1:59 PM, Renyi Xiong wrote:
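A minimal sketch of the repartition option discussed in this thread
(queueStream stands in for the Kafka source; the partition counts are
illustrative):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[4]", "RepartitionSketch")
    ssc = StreamingContext(sc, 10)

    # 2 partitions stand in for an under-distributed kafka topic
    source = ssc.queueStream([sc.parallelize(range(1000), 2)])

    # shuffle each batch out to 16 partitions before the heavy per-record work
    source.repartition(16).count().pprint()

    ssc.start()
    ssc.awaitTerminationOrTimeout(30)
    ssc.stop()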
Hi Sean,
We're upgrading Mobius (C# binding for Spark) in Microsoft to align with
Spark 1.6.2 and noticed some changes you made to the API in
https://github.com/apache/spark/commit/6f81eae24f83df51a99d4bb2629dd7daadc01519
mostly on APIs with the Approx postfix (still marked as experimental in
pyspark t
Thanks a lot, Sean, really appreciate your comments.
Sent from my Windows 10 phone
From: Sean Owen
Sent: Friday, April 1, 2016 12:55 PM
To: Renyi Xiong
Cc: Tathagata Das; dev
Subject: Re: Declare rest of @Experimental items non-experimental if
they've existed since 1.2.0
The change ther
Hi TD,
We noticed that the Spark Streaming UI is reporting a different task
duration from time to time.
E.g., here's the standard output of the application, which reports the
duration of the longest task as about 3.3 minutes:
16/04/01 16:07:19 INFO TaskSetManager: Finished task 1077.0 in stage 0.0
(TI
Hi,
Is it possible for the Kafka-receiver-generated WriteAheadLogBackedBlockRDD
to hold the corresponding Kafka offset range, so that during recovery the
RDD can refer back to the Kafka queue instead of paying the cost of the
write ahead log?
I guess there must be a reason here. Could anyone please help me underst
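For comparison, the direct (receiver-less) Kafka stream does essentially this
already: each batch's RDD carries its offset ranges and is recomputed from
Kafka on failure, with no write-ahead log (the broker and topic below are
illustrative):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext("local[2]", "DirectKafkaSketch")
    ssc = StreamingContext(sc, 10)

    direct = KafkaUtils.createDirectStream(
        ssc, ["topic1"], {"metadata.broker.list": "broker:9092"})
    direct.count().pprint()

    ssc.start()
    ssc.awaitTermination()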
Hi,
I am trying to run an I/O-intensive RDD in parallel with a CPU-intensive RDD
within an application through a window, like below:
var ssc = new StreamingContext(sc, 1min);
var ds1 = ...
var ds2 = ds1.Window(2min).ForeachRDD(...)
ds1.ForeachRDD(...)
I hope ds1 starts its job at a 1min interval ev
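In pyspark terms, what I'm attempting looks roughly like this
(spark.streaming.concurrentJobs is an internal, undocumented knob that lets
the two jobs overlap; the source and all durations are illustrative):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = SparkConf().set("spark.streaming.concurrentJobs", "2")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 60)  # 1-minute batches

    ds1 = ssc.queueStream([sc.parallelize(range(100))])  # stand-in source
    ds1.foreachRDD(lambda rdd: rdd.count())  # CPU-intensive job, every minute
    ds1.window(120).foreachRDD(lambda rdd: rdd.count())  # I/O job, 2-min window

    ssc.start()
    ssc.awaitTerminationOrTimeout(300)
    ssc.stop()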
Hi,
Is RDD.persist equivalent to RDD.checkpoint if they save the same number of
copies (say 3) to disk?
(I assume persist saves copies on different machines?)
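The two calls I'm comparing, side by side (a minimal sketch; the storage
level and path are illustrative):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[2]", "PersistVsCheckpointSketch")
    sc.setCheckpointDir("/tmp/rdd-checkpoint")

    rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

    # replicated on-disk cache: blocks on 2 workers, lineage is kept
    rdd.persist(StorageLevel.DISK_ONLY_2)

    # reliable checkpoint: written to the checkpoint dir, lineage dropped
    rdd.checkpoint()

    rdd.count()  # materializes both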
thanks,
Renyi.
Hi,
We observed a MetadataFetchFailedException during executorLost if
spark.speculation is enabled.
It looks like something is out of sync between the original task on the
failed executor and its speculative counterpart task.
Is it a known issue?
Please let me know if you need more details.
Thanks,
Renyi.
> …inconsistencies between data reliably received by Spark Streaming and
> offsets tracked by Zookeeper.
>
> thanks
> Mario
Hi,
I am consistently observing a driver OutOfMemoryError (Java heap space)
during a shuffle operation, indicated by the log:
16/05/14 21:57:03 INFO MapOutputTrackerMaster: Size of output statuses for
shuffle 2 is 36060250 bytes -> the shuffle metadata size is big and the full
metadata will be sent