Hi Jun

Thank you for your reply. My question is: what are the best practices? My
for loop runs over 16,000 joins, and I get an out-of-memory exception.

What is the intended use of createOrReplaceTempView if I need to manage the
cache myself or create a unique name each time?
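
For illustration, creating a unique view name on each iteration might look like the minimal sketch below. The helper names (`unique_view_name`, `build_join_sql`, `join_with_unique_views`) are hypothetical and are not taken from the original script or from Spark; the only PySpark calls assumed are `createOrReplaceTempView` and `spark.sql`. The sketch only avoids replacing an existing view. Whether it changes memory use for 16,000 joins is untested.

```python
import re


def unique_view_name(prefix, index):
    """Build a temp view name that is unique per loop iteration."""
    # Replace anything that is not a letter, digit, or '_' so the name
    # can be used unquoted in a SQL statement.
    safe_prefix = re.sub(r"\W", "_", prefix)
    return "{}_{}".format(safe_prefix, index)


def build_join_sql(counts_view, sample_view, sample_name):
    """Return the join statement that adds one sample column."""
    # Backticks quote column names that contain '-'.
    return (
        "select rc.*, s.`{col}` "
        "from {counts} as rc, {sample} as s "
        "where rc.Name == s.Name"
    ).format(col=sample_name, counts=counts_view, sample=sample_view)


def join_with_unique_views(spark, raw_counts_sdf, sample_sdfs, sample_names):
    """Join each sample onto the running result, using a fresh view name per step."""
    counts_view = unique_view_name("rawCounts", 0)
    raw_counts_sdf.createOrReplaceTempView(counts_view)
    for i, (sample_sdf, sample_name) in enumerate(
            zip(sample_sdfs, sample_names), start=1):
        sample_view = unique_view_name("sample", i)
        sample_sdf.createOrReplaceTempView(sample_view)
        raw_counts_sdf = spark.sql(
            build_join_sql(counts_view, sample_view, sample_name))
        # Register the new result under a name that has not been used yet,
        # so no existing view is replaced.
        counts_view = unique_view_name("rawCounts", i)
        raw_counts_sdf.createOrReplaceTempView(counts_view)
    return raw_counts_sdf
```

Note that every view registered this way stays in the session catalog until it is dropped, so this trades the "uncache before replacing" step for a growing list of view names.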



Kind regards

Andy

On Tue, Dec 21, 2021 at 6:12 AM Jun Zhu <[email protected]> wrote:

> Hi
>
> As far as I know, the message is caused by creating temp views with the
> same name:
>
>     rawCountsSDF.createOrReplaceTempView( "rawCounts" )
>
> You create a view "rawCounts"; then, on the second iteration of the for
> loop, you create a new view with the name "rawCounts", so Spark 3 uncaches
> the previous "rawCounts".
>
> Correct me if I'm wrong.
>
> Regards
>
>
> On Tue, Dec 21, 2021 at 10:05 PM Andrew Davidson <[email protected]>
> wrote:
>
>> Happy Holidays
>>
>>
>>
>> I am a newbie
>>
>>
>>
>> I have 16,000 data files. All files have the same number of rows and
>> columns. The row ids are identical and are in the same order. I want to
>> create a new data frame that contains the 3rd column from each data file.
>> My pyspark script runs correctly when I test on a small number of files;
>> however, I get an OOM when I run on all 16,000.
>>
>>
>>
>> To try to debug this, I ran a small test and set the log level to INFO. I
>> found the following:
>>
>>
>>
>> 2021-12-21 00:47:04 INFO  CreateViewCommand:57 - Try to uncache
>> `rawCounts` before replacing.
>>
>>
>>
>>         for i in range( 1, len( self.sampleNamesList ) ):
>>             sampleName = self.sampleNamesList[i]
>>
>>             # select the key and counts from the sample.
>>             qsdf = quantSparkDFList[i]
>>             sampleSDF = qsdf\
>>                 .select( ["Name", "NumReads", ] )\
>>                 .withColumnRenamed( "NumReads", sampleName )
>>
>>             sampleSDF.createOrReplaceTempView( "sample" )
>>
>>             # the sample name must be quoted else column names with a '-'
>>             # like GTEX-1117F-0426-SM-5EGHI will generate an error
>>             # spark thinks the '-' is an expression. '_' is also
>>             # a special char for the sql like operator
>>             # https://stackoverflow.com/a/63899306/4586180
>>             sqlStmt = '\t\t\t\t\t\tselect rc.*, `{}` \n\
>>                             from \n\
>>                                rawCounts as rc, \n\
>>                                sample  \n\
>>                             where \n\
>>                                 rc.Name == sample.Name \n'.format( sampleName )
>>
>>             rawCountsSDF = self.spark.sql( sqlStmt )
>>             rawCountsSDF.createOrReplaceTempView( "rawCounts" )
>>
>>
>>
>>
>>
>> The way I wrote my script, I do a lot of transformations; the first
>> action is at the end of the script:
>>
>>     retCountDF.coalesce(1).write.csv( outfileCount, mode='overwrite', header=True)
>>
>>
>>
>> Should I be calling self.spark.sql( "uncache table rawCounts" ) before
>> calling rawCountsSDF.createOrReplaceTempView( "rawCounts" )? I expected
>> Spark to manage the cache automatically, given that I do not explicitly
>> call cache().
>>
>>
>>
>>
>>
>> Why do I not get a similar message from the following call?
>>
>>             sampleSDF.createOrReplaceTempView( "sample" )
>>
>>
>>
>> Will this reduce my memory requirements?
>>
>>
>>
>> Kind regards
>>
>>
>>
>> Andy
>>
>
>
> --
> Jun Zhu
> Sr. Engineer I, Data
