Hi,

As far as I know, the message is caused by reusing the same temp view name:

rawCountsSDF.createOrReplaceTempView( "rawCounts" )

You create a view named "rawCounts"; then, on the second pass through the for loop, you create a new view with the same name "rawCounts", so Spark 3 tries to uncache the previous "rawCounts" before replacing it.
Correct me if I'm wrong.

Regards

On Tue, Dec 21, 2021 at 10:05 PM Andrew Davidson <aedav...@ucsc.edu.invalid> wrote:

> Happy Holidays
>
> I am a newbie
>
> I have 16,000 data files; all files have the same number of rows and
> columns. The row ids are identical and are in the same order. I want to
> create a new data frame that contains the 3rd column from each data file.
> My pyspark script runs correctly when I test on a small number of files;
> however, I get an OOM when I run on all 16,000.
>
> To try and debug, I ran a small test and set the log level to INFO. I found
> the following:
>
> 2021-12-21 00:47:04 INFO CreateViewCommand:57 - Try to uncache
> `rawCounts` before replacing.
>
> for i in range( 1, len( self.sampleNamesList ) ):
>     sampleName = self.sampleNamesList[i]
>
>     # select the key and counts from the sample.
>     qsdf = quantSparkDFList[i]
>     sampleSDF = qsdf\
>         .select( ["Name", "NumReads", ] )\
>         .withColumnRenamed( "NumReads", sampleName )
>
>     sampleSDF.createOrReplaceTempView( "sample" )
>
>     # the sample name must be quoted, else column names with a '-'
>     # like GTEX-1117F-0426-SM-5EGHI will generate an error:
>     # spark thinks the '-' is an expression. '_' is also
>     # a special char for the sql like operator
>     # https://stackoverflow.com/a/63899306/4586180
>     sqlStmt = '\t\t\t\t\t\tselect rc.*, `{}` \n\
>                         from \n\
>                             rawCounts as rc, \n\
>                             sample \n\
>                         where \n\
>                             rc.Name == sample.Name \n'.format( sampleName )
>
>     rawCountsSDF = self.spark.sql( sqlStmt )
>     rawCountsSDF.createOrReplaceTempView( "rawCounts" )
>
> The way I wrote my script, I do a lot of transformations; the first action
> is at the end of the script:
>
> retCountDF.coalesce(1).write.csv( outfileCount, mode='overwrite',
> header=True)
>
> Should I be calling sql.spark.sql( 'uncache table rawCountsSDF' ) before
> calling rawCountsSDF.createOrReplaceTempView( "rawCounts" )? I
> expected Spark to manage the cache automatically, given that I do
> not explicitly call cache().
>
> How come I do not get a similar message from
>
> sampleSDF.createOrReplaceTempView( "sample" )
>
> Will this reduce my memory requirements?
>
> Kind regards
>
> Andy

--
Jun Zhu
Sr. Engineer I, Data