As part of my data normalization process I need to calculate row sums. The following code works on smaller test data sets, but it does not work on my big tables. When I run it on a table with over 10,000 columns I get an OOM on a cluster with 2.8 TB. Is there a better way to implement this?

Kind regards

Andy

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rowsum
“Compute column sums across rows of a numeric matrix-like object for each level of a grouping variable.”

###############################################################################
# imports the method below relies on
from functools import reduce
from operator import add
from pyspark.sql.functions import col

def rowSums( self, countsSparkDF, newColName, columnNames ):
    '''
    calculates actual sum of columns

    arguments
        countsSparkDF
        newColName: results from column sum will be stored here
        columnNames: list of columns to sum

    returns
        amended countsSparkDF
    '''
    self.logger.warn( "rowSumsImpl BEGIN" )

    # https://stackoverflow.com/a/54283997/4586180
    # replace nulls with 0, then add the listed columns together row by row
    retDF = countsSparkDF.na.fill( 0 ).withColumn( newColName , reduce( add, [col( x ) for x in columnNames] ) )

    # self.logger.warn( "rowSums retDF numRows:{} numCols:{}"\
    #                  .format( retDF.count(), len( retDF.columns ) ) )
    #
    # self.logger.warn("AEDWIP remove show")
    # retDF.show()

    self.logger.warn( "rowSumsImpl END\n" )
    return retDF
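
One thing that may be worth trying (this is a guess, I have not confirmed it is the cause of the OOM): reduce( add, ... ) over 10,000 columns builds a left-deep expression, ((c1 + c2) + c3) + ..., that is nested roughly 10,000 levels deep. Adding the columns pairwise instead keeps the nesting at about log2(n) levels. The sketch below is plain Python and works on anything that supports "+". The names balanced_sum and Node are hypothetical; Node is only a stand-in for a Spark Column so the nesting depth can be measured without a cluster.

```python
from functools import reduce
from operator import add


def balanced_sum(terms):
    """Add terms pairwise so the expression nests ~log2(n) deep, not n deep."""
    terms = list(terms)
    if not terms:
        raise ValueError("balanced_sum needs at least one term")
    while len(terms) > 1:
        # Combine neighbours; an odd trailing element is carried to the next round.
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])
        terms = paired
    return terms[0]


class Node:
    """Hypothetical stand-in for a Column: records only expression depth."""

    def __init__(self, depth=0):
        self.depth = depth

    def __add__(self, other):
        return Node(max(self.depth, other.depth) + 1)


# Compare nesting depth for 10,000 terms.
left_deep = reduce(add, [Node() for _ in range(10000)])
balanced = balanced_sum([Node() for _ in range(10000)])
print("reduce(add, ...) depth:", left_deep.depth)
print("balanced_sum depth:", balanced.depth)
```

In rowSums this would be a one-line change, assuming the rest of the method stays the same: pass balanced_sum( [col( x ) for x in columnNames] ) to withColumn in place of the reduce( add, ... ) call. The result is the same sum; only the shape of the expression changes.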