date:20161222

Best Practice for Spark Job Jar Generation

2016-12-22 Thread Chetan Khatri

Hello Spark Community, for Spark job creation I use SBT Assembly to build an Uber ("Super") JAR and then submit it with spark-submit. Example: bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar But other folks has debate with for Uber Less

Re: java.lang.AssertionError: assertion failed

2016-12-22 Thread Liang-Chi Hsieh

Hi, I think there is an issue in `ExternalAppendOnlyMap.forceSpill`, which is called to release memory when another memory consumer asks for more memory than is currently available. I created a Jira and submitted a PR for it. Please check out https://issues.apache.org/jira/browse/SPARK-18986

Re: stratified sampling scales poorly

2016-12-22 Thread Liang-Chi Hsieh

Hi, I quoted the description of `sampleByKeyExact`: "This method differs from [[sampleByKey]] in that we make additional passes over the RDD to create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate) over all key values with a 99.99% confidence. When sampling w
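The sample size named in the quoted description can be worked through with a small calculation. The following is a minimal sketch in plain Python (no Spark involved); the function name `exact_sample_size` and the per-key counts and sampling rates are hypothetical values chosen only for illustration.

```python
import math


def exact_sample_size(counts_by_key, fractions_by_key):
    """Return the sum of ceil(numItems * samplingRate) over all keys.

    Both arguments are plain dicts keyed by stratum; this only mirrors the
    formula in the quoted description, it does not perform any sampling.
    """
    return sum(
        math.ceil(counts_by_key[key] * fractions_by_key[key])
        for key in counts_by_key
    )


# Hypothetical strata: number of items per key and sampling rate per key.
counts = {"a": 10, "b": 25, "c": 3}
fractions = {"a": 0.5, "b": 0.1, "c": 0.5}

total = exact_sample_size(counts, fractions)
print(total)
```

Each key contributes its own rounded-up share, so the target is fixed per stratum rather than for the data set as a whole.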

Re: Aggregating over sorted data

2016-12-22 Thread Koert Kuipers

yes it's less optimal because an abstraction is missing and with mapPartitions it is done without optimizations. but aggregator is not the right abstraction to begin with, it assumes a monoid which means no ordering guarantees. you need a fold operation. On Dec 22, 2016 02:20, "Liang-Chi Hsieh" w
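To make the distinction concrete, here is a minimal sketch in plain Python (not the Spark API) of a fold applied to each key's values in sorted order. The record layout `(key, timestamp, value)`, the helper name `fold_sorted`, and the step function are assumptions made for illustration; the step is deliberately order-dependent, so merging values in arbitrary order would give a different answer.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (key, timestamp, value), deliberately out of order.
records = [("u1", 3, 3), ("u2", 1, 5), ("u1", 1, 1), ("u1", 2, 2), ("u2", 2, 7)]


def fold_sorted(rows, zero, step):
    """Fold each key's values left to right in timestamp order."""
    # Sort by key first, then by timestamp within each key.
    ordered = sorted(rows, key=itemgetter(0, 1))
    return {
        key: reduce(step, (value for _, _, value in group), zero)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


# The result depends on the order in which values are visited.
result = fold_sorted(records, 0, lambda acc, value: acc * 2 + value)
print(result)
```

The accumulator is threaded through the values one at a time, which is why the sort order has to be guaranteed before the fold runs.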

Re: Aggregating over sorted data

2016-12-22 Thread trsell

I would love this feature On Thu, 22 Dec 2016, 18:45 assaf.mendelson, wrote: > It seems that this aggregation is for dataset operations only. I would > have hoped to be able to do dataframe aggregation. Something along the line > of: sort_df(df).agg(my_agg_func) > > > > In any case, note that th

RE: Aggregating over sorted data

2016-12-22 Thread assaf.mendelson

It seems that this aggregation is for dataset operations only. I would have hoped to be able to do dataframe aggregation. Something along the line of: sort_df(df).agg(my_agg_func) In any case, note that this kind of sorting is less efficient than the sorting done in window functions for example

Re: Aggregating over sorted data

2016-12-22 Thread Liang-Chi Hsieh

You can't use existing aggregation functions with that. Besides, the execution plan of `mapPartitions` doesn't support wholestage codegen. Without that and some optimization around aggregation, there might be performance degradation. Also when you have more than one key in a partition, yo
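The snippet is cut off, so what follows is only an assumption about the concern it raises: a function applied to a whole partition has to detect key boundaries itself. A minimal sketch in plain Python (not the Spark API), with the hypothetical helper `aggregate_sorted_partition` walking an iterator of `(key, value)` pairs that is already sorted by key:

```python
def aggregate_sorted_partition(rows, zero, step):
    """Yield (key, folded_value) for each run of equal keys.

    Assumes `rows` is an iterator of (key, value) pairs already sorted by key,
    as a partition would be after a sort within partitions.
    """
    current_key = None
    acc = zero
    started = False
    for key, value in rows:
        if started and key != current_key:
            # Key boundary: emit the finished key and reset the accumulator.
            yield current_key, acc
            acc = zero
        current_key = key
        started = True
        acc = step(acc, value)
    if started:
        # Emit the last key once the iterator is exhausted.
        yield current_key, acc


# Hypothetical partition holding three keys.
partition = iter([("a", 1), ("a", 2), ("b", 10), ("c", 4), ("c", 6)])
out = list(aggregate_sorted_partition(partition, 0, lambda acc, v: acc + v))
print(out)
```

Because it is a generator, only one accumulator is held in memory at a time, regardless of how many keys the partition contains.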

7 matches
