Hi

As far as I know, the warning is caused by creating temp views with the same
name:

rawCountsSDF.createOrReplaceTempView( "rawCounts" )

You create a view "rawCounts"; then, in the second iteration of the for loop,
you create a new view with the same name "rawCounts", so Spark 3 tries to
uncache the previous "rawCounts" before replacing it.
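
If you want to avoid replacing the view on every iteration, one option is to join the DataFrames directly and never register a temp view. Below is a minimal sketch, not tested against your data. The helper names select_sample_counts and join_all_samples are made up for illustration, and I assume quantSparkDFList and sampleNamesList are parallel lists and that every file has "Name" and "NumReads" columns, as in your loop. It only removes the view replacement; I have not checked whether it changes the memory behaviour.

```python
from functools import reduce


def select_sample_counts(qsdf, sample_name):
    """Keep the key column and rename NumReads to the sample name.

    withColumnRenamed takes the new name as a plain string, so names
    containing '-' need no backtick quoting here.
    """
    return qsdf.select("Name", "NumReads").withColumnRenamed("NumReads", sample_name)


def join_all_samples(quant_dfs, sample_names):
    """Join all per-sample DataFrames on Name without using temp views.

    Hypothetical helper: assumes quant_dfs and sample_names are parallel
    lists, and (unlike the original loop) it starts from the first sample.
    """
    per_sample = [
        select_sample_counts(df, name)
        for df, name in zip(quant_dfs, sample_names)
    ]
    # Fold the list into one wide DataFrame, one inner join per sample.
    return reduce(
        lambda left, right: left.join(right, on="Name", how="inner"),
        per_sample,
    )
```

Usage would be something like rawCountsSDF = join_all_samples( quantSparkDFList, self.sampleNamesList ). Since no view named "rawCounts" is created, there is nothing for CreateViewCommand to replace.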

Correct me if I'm wrong.

Regards


On Tue, Dec 21, 2021 at 10:05 PM Andrew Davidson <aedav...@ucsc.edu.invalid>
wrote:

> Happy Holidays
>
>
>
> I am a newbie
>
>
>
> I have 16,000 data files; all files have the same number of rows and
> columns. The row ids are identical and are in the same order. I want to
> create a new data frame that contains the 3rd column from each data file.
> My pyspark script runs correctly when I test on a small number of files;
> however, I get an OOM when I run on all 16,000.
>
>
>
> To try to debug, I ran a small test and set the log level to INFO. I found
> the following
>
>
>
> 2021-12-21 00:47:04 INFO  CreateViewCommand:57 - Try to uncache
> `rawCounts` before replacing.
>
>
>
>         for i in range( 1, len( self.sampleNamesList ) ):
>             sampleName = self.sampleNamesList[i]
>
>             # select the key and counts from the sample.
>             qsdf = quantSparkDFList[i]
>             sampleSDF = qsdf\
>                 .select( ["Name", "NumReads", ] )\
>                 .withColumnRenamed( "NumReads", sampleName )
>
>             sampleSDF.createOrReplaceTempView( "sample" )
>
>             # the sample name must be quoted else column names with a '-'
>             # like GTEX-1117F-0426-SM-5EGHI will generate an error
>             # spark think the '-' is an expression. '_' is also
>             # a special char for the sql like operator
>             # https://stackoverflow.com/a/63899306/4586180
>             sqlStmt = '\t\t\t\t\t\tselect rc.*, `{}` \n\
>                             from \n\
>                                rawCounts as rc, \n\
>                                sample  \n\
>                             where \n\
>                                 rc.Name == sample.Name \n'.format( sampleName )
>
>             rawCountsSDF = self.spark.sql( sqlStmt )
>             rawCountsSDF.createOrReplaceTempView( "rawCounts" )
>
>
>
>
>
> The way I wrote my script, I do a lot of transformations; the first action
> is at the end of the script
>
>     retCountDF.coalesce(1).write.csv( outfileCount, mode='overwrite', header=True)
>
>
>
> Should I be calling sql.spark.sql( "uncache table rawCountsSDF" ) before
> calling rawCountsSDF.createOrReplaceTempView( "rawCounts" )? I
> expected Spark to manage the cache automatically given that I do
> not explicitly call cache().
>
>
>
>
>
> How come I do not get a similar warning from the following?
>
>             sampleSDF.createOrReplaceTempView( "sample" )
>
>
>
> Will this reduce my memory requirements?
>
>
>
> Kind regards
>
>
>
> Andy
>


-- 
Jun Zhu
Sr. Engineer I, Data