Re: Custom aggregations: modular and lightweight solutions?

2019-08-13 Thread Andrew Leverentz
uld find a way to get this example working (for arbitrary values of rowSize), I suspect that it would also give me a solution to the custom-aggregation issue I outlined in my previous email. Any suggestions would be much appreciated. Thanks, ~ Andrew On Mon, Aug 12, 2019 at 5:31 PM Andrew L

Custom aggregations: modular and lightweight solutions?

2019-08-12 Thread Andrew Leverentz
Hi All, I'm attempting to clean up some Spark code which performs groupByKey / mapGroups to compute custom aggregations, and I could use some help understanding the Spark APIs necessary to make my code more modular and maintainable. In particular, my current approach is as follows: - Start w

RandomForest - subsamplingRate parameter

2015-06-03 Thread Andrew Leverentz
When training a RandomForest model, the Strategy class (in mllib.tree.configuration) provides a subsamplingRate parameter. I was hoping to use this to cut down on processing time for large datasets (more than 2MM rows and 9K predictors), but I've found that the runtime stays approximately cons

RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
: Thursday, April 23, 2015 4:46 PM To: Andrew Leverentz Cc: user@spark.apache.org Subject: Re: Understanding Spark/MLlib failures Hi Andrew, I observed similar behavior under high GC pressure when running ALS. What happened to me was that there would be very long Full GC pauses (over 600 seconds

RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
cryptic error messages along the lines of “Missing an output location for shuffle.” Having some way to diagnose what’s really going on here would be helpful. ~ Andrew From: Reza Zadeh [mailto:r...@databricks.com] Sent: Thursday, April 23, 2015 4:58 PM To: Andrew Leverentz Cc: user Subject: Re