What is the issue you are encountering? Are you memory bound? Is it the GCP
nodes (are you using a Dataproc cluster)? Have you checked the logs
<https://cloud.google.com/dataproc/docs/guides/logging> in GCP?

How about the Spark UI, what does it say? With a two-node cluster you are
doing vertical scaling more than horizontal distributed computing. Basically
you have two mainframes.


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>







On Wed, 9 Feb 2022 at 16:28, Andrew Davidson <aedav...@ucsc.edu.invalid>
wrote:

> Hi Sean
>
>
>
> I have 2 big for loops in my code. One for loop uses join to implement R's
> cbind(); the other implements R's rowsum(). Each for loop iterates 10411
> times.
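> 
> A rough sketch of the join-based cbind pattern (illustrative only; the list
> of per-sample DataFrames and the "gene_id" join key are placeholders, not my
> actual code):
> 
>     # each sampleDF holds gene_id plus one sample's count column.
>     # each join adds one column; the query plan grows with every iteration
>     combinedDF = sampleDFs[0]
>     for sampleDF in sampleDFs[1:]:
>         combinedDF = combinedDF.join( sampleDF, on="gene_id", how="inner" )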
>
>
>
> To debug, I added an action to each iteration of the loop. I think I
> used count() and logged the results, so I am confident this is where the
> problem is.
>
>
>
> In my experience you need to be really careful any time you use for loops
> with big data; there is a potential loss of computational efficiency. The
> idea of Spark's lazy evaluation and optimization is very appealing.
>
>
>
> Andy
>
>
>
>
>
> *From: *Sean Owen <sro...@gmail.com>
> *Date: *Wednesday, February 9, 2022 at 8:19 AM
> *To: *Andrew Davidson <aedav...@ucsc.edu>
> *Cc: *"user @spark" <user@spark.apache.org>
> *Subject: *Re: Does spark have something like rowsum() in R?
>
>
>
> It really depends on what is running out of memory. You can have all the
> workers in the world, but if something is blowing up the driver, they won't
> help. You can have a huge cluster, but data skew can make it impossible to
> break up the problem you express. Spark running out of memory is not the
> same as R running out of memory.
>
>
>
> You can definitely do this faster with Spark given enough parallelism. It
> can be harder to reason about a distributed system, for sure. Without a lot
> more detail, it is hard to say 'why'. For example, it's not clear that the
> operation you pasted fails. Did you collect huge results to the driver
> afterwards? etc.
>
>
>
>
>
> On Wed, Feb 9, 2022 at 10:10 AM Andrew Davidson <aedav...@ucsc.edu> wrote:
>
> Hi Sean
>
>
>
> Debugging big data projects is always hard. It is a black art that takes a
> lot of experience.
>
>
>
> Can you tell me more about “Why you're running out of mem is probably
> more a function of your parallelism, cluster size”?
>
>
>
> I have a cluster with 2 worker nodes, each with 1.4 TB of memory, 96 vCPUs,
> and as much SSD as I want. It was really hard to get quota for these
> machines on GCP. Would I be better off with dozens of smaller machines?
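> 
> For example, is adjusting the executor settings rather than the node count
> what you mean by parallelism? A sketch with made-up numbers (assuming Spark
> on YARN / Dataproc; not a recommendation):
> 
>     from pyspark.sql import SparkSession
> 
>     spark = ( SparkSession.builder
>                 .appName( "rowSums" )
>                 .config( "spark.executor.instances", "24" )  # many smaller executors
>                 .config( "spark.executor.cores", "8" )
>                 .config( "spark.executor.memory", "100g" )
>                 .config( "spark.sql.shuffle.partitions", "384" )
>                 .getOrCreate() )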
>
>
>
> This has been an incredibly hard problem to debug. What I wound up doing
> is just using Spark to select the columns of interest and write them to
> individual part files.
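> 
> Roughly what that step looked like (a sketch; the bucket path, column list,
> and loop are placeholders rather than my exact code):
> 
>     # write each sample's count column to its own part file so the columns
>     # can be pasted back together outside of Spark
>     for sampleName in sampleNames:
>         ( countsSparkDF
>             .select( sampleName )
>             .coalesce( 1 )
>             .write.mode( "overwrite" )
>             .csv( "gs://my-bucket/columns/" + sampleName ) )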
>
>
>
> Next I used a special research computer at my university with 64 cores and
> 1 TB of memory. I copied the part files from GCP to that machine and used
> the UNIX paste command to create a single table. Finally, I am doing all my
> analysis on a single machine using R. paste took about 40 min; Spark would
> crash after about 12 hrs.
>
>
>
> Column bind and row sums are common operations. It seems like there should
> be an easy solution? Maybe I should submit an RFE (request for enhancement).
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
> *From: *Sean Owen <sro...@gmail.com>
> *Date: *Tuesday, February 8, 2022 at 8:57 AM
> *To: *Andrew Davidson <aedav...@ucsc.edu>
> *Cc: *"user @spark" <user@spark.apache.org>
> *Subject: *Re: Does spark have something like rowsum() in R?
>
>
>
> That seems like a fine way to do it. Why you're running out of mem is
> probably more a function of your parallelism, cluster size, and the fact
> that R is a memory hog.
>
> I'm not sure there are great alternatives in R and Spark; in other
> languages you might more directly get the array of (numeric?) row values and
> sum them efficiently. Certainly pandas UDFs would make short work of that.
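> 
> Something roughly like this, for example (just a sketch, assuming Spark 3.x
> and that all the columns being summed are numeric; countsSparkDF, newColName
> and columnNames are borrowed from the code further down the thread):
> 
>     import pandas as pd
>     from pyspark.sql.functions import pandas_udf, struct
> 
>     @pandas_udf( "double" )
>     def row_sum( cols: pd.DataFrame ) -> pd.Series:
>         # the struct column arrives as a pandas DataFrame; sum across columns
>         return cols.sum( axis=1 )
> 
>     retDF = countsSparkDF.withColumn( newColName,
>                                       row_sum( struct( *columnNames ) ) )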
>
>
>
> On Tue, Feb 8, 2022 at 10:02 AM Andrew Davidson <aedav...@ucsc.edu.invalid>
> wrote:
>
> As part of my data normalization process I need to calculate row sums. The
> following code works on smaller test data sets, but it does not work on my
> big tables. When I run it on a table with over 10,000 columns I get an OOM
> on a cluster with 2.8 TB of memory. Is there a better way to implement this?
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
> https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rowsum
>
> “Compute column sums across rows of a numeric matrix-like object for each
> level of a grouping variable.”
>
>
>
>
> ###############################################################################
> 
>     # needs: from functools import reduce
>     #        from operator import add
>     #        from pyspark.sql.functions import col
>     def rowSums( self, countsSparkDF, newColName, columnNames ):
>         '''
>         calculates the actual sum of the listed columns for each row
> 
>         arguments
>             countsSparkDF
> 
>             newColName:
>                 results from the column sum will be stored here
> 
>             columnNames:
>                 list of columns to sum
> 
>         returns
>             amended countsSparkDF
>         '''
>         self.logger.warn( "rowSumsImpl BEGIN" )
> 
>         # https://stackoverflow.com/a/54283997/4586180
>         # build a single expression col(c1) + col(c2) + ... and add it as a
>         # new column; nulls are replaced with 0 first
>         retDF = countsSparkDF.na.fill( 0 ).withColumn(
>             newColName, reduce( add, [col( x ) for x in columnNames] ) )
> 
>         # self.logger.warn( "rowSums retDF numRows:{} numCols:{}"\
>         #                  .format( retDF.count(), len( retDF.columns ) ) )
>         #
>         # self.logger.warn( "AEDWIP remove show" )
>         # retDF.show()
> 
>         self.logger.warn( "rowSumsImpl END\n" )
>         return retDF
>
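> 
> In case it helps, this is roughly how it gets called (a sketch; the
> "gene_id" column name is a placeholder for the non-count key column):
> 
>     sampleColumns = [c for c in countsSparkDF.columns if c != "gene_id"]
>     countsSparkDF = self.rowSums( countsSparkDF, "rowSum", sampleColumns )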
>
>
>
>
>
