That seems like a fine way to do it. Why you're running out of mem is
probably more a function of your parallelism, cluster size, and the fact
that R is a memory hog.
I'm not sure there are great alternatives in R and Spark; in other
languages you might more directly get the array of (numeric?) row value and
sum them efficiently. Certainly pandas UDFs would make short work of that.

On Tue, Feb 8, 2022 at 10:02 AM Andrew Davidson <aedav...@ucsc.edu.invalid>
wrote:

> As part of my data normalization process I need to calculate row sums. The
> following code works on smaller test data sets. It does not work on my big
> tables. When I run on a table with over 10,000 columns I get an OOM on a
> cluster with 2.8 TB. Is there a better way to implement this
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
> https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rowsum
>
> “Compute column sums across rows of a numeric matrix-like object for each
> level of a grouping variable. “
>
>
>
>
> ###############################################################################
>
>     def rowSums( self, countsSparkDF, newColName, columnNames ):
>
>         '''
>
>         calculates actual sum of columns
>
>
>
>         arguments
>
>             countSparkDF
>
>
>
>             newColumName:
>
>                 results from column sum will be sorted here
>
>
>
>             columnNames:
>
>                 list of columns to sum
>
>
>
>         returns
>
>             amended countSparkDF
>
>         '''
>
>         self.logger.warn( "rowSumsImpl BEGIN" )
>
>
>
>         # https://stackoverflow.com/a/54283997/4586180
>
>         retDF = countsSparkDF.na.fill( 0 ).withColumn( newColName ,
> reduce( add, [col( x ) for x in columnNames] ) )
>
>
>
>         # self.logger.warn( "rowSums retDF numRows:{} numCols:{}"\
>
>         #                  .format( retDF.count(), len( retDF.columns ) )
> )
>
>         #
>
>         # self.logger.warn("AEDWIP remove show")
>
>         # retDF.show()
>
>
>
>         self.logger.warn( "rowSumsImpl END\n" )
>
>         return retDF
>
>
>
>
>

Reply via email to