I need to cache the DataFrame to speed up queries. In that case, two queries may run the DAG simultaneously before the data has actually been cached.
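
To make the concern concrete, here is a minimal sketch of one common way to avoid that race: force the cache to fill with a single action before any query thread starts. It assumes PySpark, where `cache()` is lazy and `count()` is an action. The helper name `run_queries_on_cached` and its parameters are hypothetical, made up for this illustration, and I have not checked how this behaves on every Spark version.

```python
from concurrent.futures import ThreadPoolExecutor


def run_queries_on_cached(df, queries, max_workers=4):
    """Materialize the cache once, then run each query in its own thread.

    df      -- a Spark DataFrame (any object with cache() and count() works)
    queries -- callables that take the cached DataFrame and run an action
    (Hypothetical helper, not part of Spark.)
    """
    cached = df.cache()   # lazy: only marks the DataFrame for caching
    cached.count()        # action: runs the DAG once and fills the cache

    # Start the concurrent queries only now, so they read the cached data
    # instead of each recomputing the source lineage.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(query, cached) for query in queries]
        return [future.result() for future in futures]
```

With a real session this would be called with the source DataFrame and a list of functions such as `lambda d: d.groupBy("key").count().collect()` (the column name is just a placeholder). The cost is one extra full pass over the data up front, in exchange for the threads not racing to compute the same partitions.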

Sonal Goyal <sonalgoy...@gmail.com> wrote on Tue, Nov 19, 2019 at 9:46 PM:

> The RDD or the DataFrame is distributed and partitioned by Spark so as to
> leverage all your workers (CPUs) effectively. So all the DataFrame
> operations are actually happening simultaneously on a section of the data.
> Why do you want to use threading here?
>
> Thanks,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Tue, Nov 12, 2019 at 7:18 AM Chang Chen <baibaic...@gmail.com> wrote:
>
>> Hi all
>>
>> I have a case where I need to cache a source RDD and then create different
>> DataFrames from it in different threads to accelerate queries.
>>
>> I know that SparkSession is thread-safe
>> (https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
>> whether an RDD is thread-safe or not.
>>
>> Thanks
>> Chang