Hi Sean

Debugging big data projects is always hard. It is a black art that takes a lot 
of experience.

Can you tell me more about “Why you're running out of mem is probably more a 
function of your parallelism, cluster size” ?

I have a cluster with 2 worker nodes, each with 1.4 TB of memory, 96 vCPUs, and 
as much SSD as I want. It was really hard to get quota for these machines on 
GCP. Would I be better off with dozens of smaller machines?

This has been an incredibly hard problem to debug. What I wound up doing is 
just using Spark to select the columns of interest and write them to 
individual part files.

Next I used a special research computer at my university with 64 cores and 1 
TB of memory. I copied the part files from GCP to that machine and used the 
UNIX paste command to create a single table. Finally, I am doing all my analysis 
on a single machine using R. paste took about 40 min; Spark would crash after 
about 12 hrs.

Column bind and row sums are common operations. It seems like there should be an 
easy solution. Maybe I should submit an RFE (request for enhancement).

Kind regards

Andy

From: Sean Owen <sro...@gmail.com>
Date: Tuesday, February 8, 2022 at 8:57 AM
To: Andrew Davidson <aedav...@ucsc.edu>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: Does spark have something like rowsum() in R?

That seems like a fine way to do it. Why you're running out of mem is probably 
more a function of your parallelism, cluster size, and the fact that R is a 
memory hog.
I'm not sure there are great alternatives in R and Spark; in other languages 
you might more directly get the array of (numeric?) row values and sum them 
efficiently. Certainly pandas UDFs would make short work of that.
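
For example, here is a rough sketch of that idea using the pandas function API 
(mapInPandas), assuming Spark 3.x and numeric count columns; the column and 
variable names below are just illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # toy stand-in for the real counts table
    countsDF = spark.createDataFrame(
        [ (1.0, 2.0, 3.0), (4.0, 5.0, 6.0) ], [ "c1", "c2", "c3" ] )
    countCols = [ "c1", "c2", "c3" ]

    def addRowSum( batches ):
        # each batch is a pandas DataFrame; sum the count columns row-wise
        for pdf in batches:
            pdf[ "rowSum" ] = pdf[ countCols ].fillna( 0 ).sum( axis=1 )
            yield pdf

    resultDF = countsDF.mapInPandas(
        addRowSum, schema="c1 double, c2 double, c3 double, rowSum double" )
    resultDF.show()

The work stays distributed and each partition is summed in vectorized pandas 
code, so an executor should only have to materialize one Arrow batch of rows at 
a time.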

On Tue, Feb 8, 2022 at 10:02 AM Andrew Davidson <aedav...@ucsc.edu.invalid> 
wrote:
As part of my data normalization process I need to calculate row sums. The 
following code works on smaller test data sets. It does not work on my big 
tables. When I run on a table with over 10,000 columns I get an OOM on a 
cluster with 2.8 TB of memory. Is there a better way to implement this?

Kind regards

Andy

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rowsum
“Compute column sums across rows of a numeric matrix-like object for each level 
of a grouping variable.”


    
    ###############################################################################
    def rowSums( self, countsSparkDF, newColName, columnNames ):
        '''
        calculates the actual sum of the columns for each row

        arguments
            countsSparkDF:
                the input counts DataFrame

            newColName:
                the column the row sums will be stored in

            columnNames:
                list of columns to sum

        returns
            amended countsSparkDF
        '''
        # requires:
        #     from functools import reduce
        #     from operator import add
        #     from pyspark.sql.functions import col
        self.logger.warn( "rowSumsImpl BEGIN" )

        # https://stackoverflow.com/a/54283997/4586180
        retDF = countsSparkDF.na.fill( 0 )\
                             .withColumn( newColName, reduce( add, [col( x ) for x in columnNames] ) )

        # self.logger.warn( "rowSums retDF numRows:{} numCols:{}"\
        #                  .format( retDF.count(), len( retDF.columns ) ) )
        #
        # self.logger.warn( "AEDWIP remove show" )
        # retDF.show()

        self.logger.warn( "rowSumsImpl END\n" )
        return retDF
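
A call would look something like this (the gene_id column name and the analysis 
object are hypothetical, just to illustrate how the method is used):

    countColumns = [ c for c in countsSparkDF.columns if c != "gene_id" ]
    countsSparkDF = analysis.rowSums( countsSparkDF, "rowSum", countColumns )
    countsSparkDF.select( "rowSum" ).show( 5 )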

