Hi Jun,

Thank you for your reply. My question is: what are the best practices here? My for loop runs over 16,000 joins, and I get an out-of-memory exception.
What is the intended use of createOrReplaceTempView if I need to manage the cache or create a unique name each time?

Kind regards

Andy

On Tue, Dec 21, 2021 at 6:12 AM Jun Zhu <[email protected]> wrote:

> Hi
>
> As far as I know, the warning is caused by creating temp views with the
> same name:
>
> rawCountsSDF.createOrReplaceTempView( "rawCounts" )
>
> You create a view "rawCounts"; then, in the second round of the for loop,
> you create a new view named "rawCounts", and Spark 3 uncaches the
> previous "rawCounts".
>
> Correct me if I'm wrong.
>
> Regards
>
> On Tue, Dec 21, 2021 at 10:05 PM Andrew Davidson <[email protected]> wrote:
>
>> Happy Holidays
>>
>> I am a newbie
>>
>> I have 16,000 data files. All files have the same number of rows and
>> columns. The row ids are identical and are in the same order. I want to
>> create a new data frame that contains the 3rd column from each data file.
>> My pyspark script runs correctly when I test on a small number of files;
>> however, I get an OOM when I run on all 16,000.
>>
>> To try and debug, I ran a small test and set the log level to INFO. I
>> found the following:
>>
>> 2021-12-21 00:47:04 INFO CreateViewCommand:57 - Try to uncache
>> `rawCounts` before replacing.
>>
>> for i in range( 1, len( self.sampleNamesList ) ):
>>     sampleName = self.sampleNamesList[i]
>>
>>     # select the key and counts from the sample.
>>     qsdf = quantSparkDFList[i]
>>     sampleSDF = qsdf\
>>         .select( ["Name", "NumReads", ] )\
>>         .withColumnRenamed( "NumReads", sampleName )
>>
>>     sampleSDF.createOrReplaceTempView( "sample" )
>>
>>     # the sample name must be quoted, else column names with a '-'
>>     # like GTEX-1117F-0426-SM-5EGHI will generate an error:
>>     # Spark thinks the '-' is an expression. '_' is also
>>     # a special char for the sql like operator
>>     # https://stackoverflow.com/a/63899306/4586180
>>     sqlStmt = '\t\t\t\t\t\tselect rc.*, `{}` \n\
>>                     from \n\
>>                         rawCounts as rc, \n\
>>                         sample \n\
>>                     where \n\
>>                         rc.Name == sample.Name \n'.format( sampleName )
>>
>>     rawCountsSDF = self.spark.sql( sqlStmt )
>>     rawCountsSDF.createOrReplaceTempView( "rawCounts" )
>>
>> The way I wrote my script, I do a lot of transformations; the first
>> action is at the end of the script:
>>
>> retCountDF.coalesce(1).write.csv( outfileCount, mode='overwrite',
>>                                   header=True)
>>
>> Should I be calling sql.spark.sql( 'uncache table rawCountsSDF' ) before
>> calling rawCountsSDF.createOrReplaceTempView( "rawCounts" )? I expected
>> Spark to manage the cache automatically, given that I do not explicitly
>> call cache().
>>
>> How come I do not get a similar warning from the following?
>>
>> sampleSDF.createOrReplaceTempView( "sample" )
>>
>> Will this reduce my memory requirements?
>>
>> Kind regards
>>
>> Andy
>
> --
> Jun Zhu
> Sr. Engineer I, Data
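
For reference, here is one possible way to do the same column-wise merge without registering temp views at all, so there is no view name to replace or uncache. It is only a sketch, under the assumption that the out-of-memory error comes from the very deep query plan that thousands of chained joins build up rather than from cached data; nothing in the thread confirms that. The helper names `select_sample_column` and `join_in_batches`, the `batch_size` parameter, and the `truncate` hook are all made up for illustration. The only DataFrame methods used are `select`, `withColumnRenamed`, and `join`.

```python
from functools import reduce


def select_sample_column(qsdf, sample_name):
    """Keep the key column and rename NumReads to the sample name."""
    return qsdf.select("Name", "NumReads").withColumnRenamed("NumReads", sample_name)


def join_in_batches(quant_dfs, sample_names, batch_size=100, truncate=None):
    """Join one counts column per sample onto the shared "Name" key.

    quant_dfs    : list of DataFrames, each with "Name" and "NumReads" columns
    sample_names : list of output column names, one per DataFrame
    batch_size   : hypothetical tuning knob, number of samples joined per batch
    truncate     : optional callable applied to the running result after each
                   batch, e.g. lambda df: df.localCheckpoint()
    """
    if len(quant_dfs) != len(sample_names):
        raise ValueError("need exactly one sample name per DataFrame")

    result = None
    for start in range(0, len(quant_dfs), batch_size):
        stop = start + batch_size
        batch = [
            select_sample_column(df, name)
            for df, name in zip(quant_dfs[start:stop], sample_names[start:stop])
        ]
        if result is not None:
            # Carry the columns joined so far into this batch.
            batch.insert(0, result)

        # Joining on the column name keeps a single "Name" column in the output.
        result = reduce(
            lambda left, right: left.join(right, on="Name", how="inner"), batch
        )

        if truncate is not None:
            # Give the caller a chance to cut the lineage of the running result.
            result = truncate(result)
    return result
```

With the names from the original script, a call might look like `join_in_batches( quantSparkDFList, self.sampleNamesList, truncate=lambda df: df.localCheckpoint() )`. `localCheckpoint()` materializes the intermediate result on the executors and drops the plan that produced it, at the cost of some fault tolerance; whether that is an acceptable trade, and what batch size works, would need to be tested on the real data.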
