As part of my data normalization process I need to calculate row sums. The 
following code works on smaller test data sets, but it fails on my big 
tables: when I run it on a table with over 10,000 columns I get an 
out-of-memory (OOM) error on a cluster with 2.8 TB. Is there a better way to 
implement this?

Kind regards

Andy

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rowsum
“Compute column sums across rows of a numeric matrix-like object for each level 
of a grouping variable.”


    
###############################################################################

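    # Module-level imports this method relies on:
    #   from functools import reduce
    #   from operator import add
    #   from pyspark.sql.functions import col

    def rowSums( self, countsSparkDF, newColName, columnNames ):
        '''
        Calculates the sum of the given columns for each row.

        arguments
            countsSparkDF:
                the Spark DataFrame to amend

            newColName:
                name of the new column; the row sums will be stored here

            columnNames:
                list of columns to sum

        returns
            a copy of countsSparkDF with nulls filled with 0 and the
            newColName column added
        '''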

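        self.logger.warn( "rowSumsImpl BEGIN" )

        # https://stackoverflow.com/a/54283997/4586180
        # Replace nulls with 0, then add the columns together into newColName.
        # reduce( add, ... ) builds one nested expression: ((c1 + c2) + c3) + ...
        columnsToSum = [col( x ) for x in columnNames]
        retDF = countsSparkDF.na.fill( 0 ).withColumn( newColName, reduce( add, columnsToSum ) )
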
        # self.logger.warn( "rowSums retDF numRows:{} numCols:{}"\
        #                  .format( retDF.count(), len( retDF.columns ) ) )
        #
        # self.logger.warn("AEDWIP remove show")
        # retDF.show()

        self.logger.warn( "rowSumsImpl END\n" )
        return retDF
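
One direction that may be worth trying, sketched here under the assumption that the trouble comes from the shape of the expression rather than from the data itself: reduce( add, ... ) chains the columns into a single left-deep expression, so 10,000 columns means roughly 10,000 levels of nesting. The helper below combines items pairwise instead, which keeps the nesting at about log2(n) levels. The names balanced_reduce, nesting_depth and pair are hypothetical and not part of Spark. It is plain Python so it can be checked without a cluster; whether it actually avoids the OOM is untested.

```python
from functools import reduce
from operator import add


def balanced_reduce(op, items):
    """Combine items pairwise so the result is a balanced tree of op calls.

    reduce(op, items) nests n - 1 calls in one left-deep chain; this
    version nests only about log2(n) levels deep.
    """
    items = list(items)
    if not items:
        raise ValueError("balanced_reduce() needs at least one item")
    while len(items) > 1:
        # Combine neighbours: (0, 1), (2, 3), ...
        paired = [op(items[i], items[i + 1])
                  for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            # Odd count: carry the last item forward unchanged.
            paired.append(items[-1])
        items = paired
    return items[0]


def nesting_depth(expr):
    """Depth of a nested-tuple expression; a bare leaf has depth 0."""
    deepest = 0
    stack = [(expr, 0)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, tuple):
            stack.extend((child, depth + 1) for child in node)
        else:
            deepest = max(deepest, depth)
    return deepest


def pair(left, right):
    """Stand-in for Column + Column: records the tree shape as tuples."""
    return (left, right)


# Compare the two shapes for 10,000 hypothetical column names.
names = ["c{}".format(i) for i in range(10000)]
chained = reduce(pair, names)
balanced = balanced_reduce(pair, names)
print("left-deep depth:", nesting_depth(chained))
print("balanced depth: ", nesting_depth(balanced))
```

In rowSums this would mean replacing reduce( add, ... ) with balanced_reduce( add, [col( x ) for x in columnNames] ). The same columns are added; only the grouping of the additions changes, so floating-point results may differ in the last digits.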

