Best Practice for Spark Job Jar Generation

2016-12-22 Thread Chetan Khatri
Hello Spark Community, to create Spark jobs I use SBT Assembly to build an uber ("super") JAR and then submit it with spark-submit. For example: bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar But other folks has debate with for Uber

Re: java.lang.AssertionError: assertion failed

2016-12-22 Thread Liang-Chi Hsieh
Hi, I think there is an issue in `ExternalAppendOnlyMap.forceSpill`, which is called to release memory when another memory consumer tries to request more memory than is currently available. I created a Jira and submitted a PR for it. Please check out

Re: stratified sampling scales poorly

2016-12-22 Thread Liang-Chi Hsieh
Hi, I quoted the description of `sampleByKeyExact`: "This method differs from [[sampleByKey]] in that we make additional passes over the RDD to create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate) over all key values with a 99.99% confidence. When sampling
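
The sample size quoted above can be worked through with a small example. This is a plain-Python sketch of the arithmetic only, not Spark's implementation; the function name `exact_sample_size` and the per-key counts and fractions are made up for illustration.

```python
import math

def exact_sample_size(counts, fractions):
    """Sum of ceil(numItems * samplingRate) over all keys (hypothetical helper)."""
    return sum(math.ceil(counts[key] * fractions[key]) for key in counts)

# Illustrative per-key item counts and sampling rates (assumed values).
counts = {"a": 10, "b": 25, "c": 3}
fractions = {"a": 0.5, "b": 0.1, "c": 0.5}

# ceil(5.0) + ceil(2.5) + ceil(1.5) = 5 + 3 + 2
total = exact_sample_size(counts, fractions)
print(total)
```

Because each key's term is rounded up separately, the total can exceed the rounded-up overall product when many keys have fractional expected counts.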

Re: Aggregating over sorted data

2016-12-22 Thread Koert Kuipers
yes it's less optimal because an abstraction is missing and with mapPartitions it is done without optimizations. but aggregator is not the right abstraction to begin with, it assumes a monoid which means no ordering guarantees. you need a fold operation. On Dec 22, 2016 02:20, "Liang-Chi Hsieh"
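
To make the point about needing a fold concrete, here is a minimal plain-Python sketch (no Spark involved) of folding each key's values in sorted order. The helper `fold_sorted`, the row layout `(key, order, value)`, and the sample data are assumptions for illustration; the step function is deliberately order-dependent, so merging partial results in arbitrary order would not give the same answer.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

def fold_sorted(rows, zero, step):
    """Sort rows by (key, order), then left-fold each key's values in that order."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    return {
        key: reduce(step, (value for _, _, value in group), zero)
        for key, group in groupby(ordered, key=itemgetter(0))
    }

# Rows are (key, order, value); they arrive unsorted on purpose.
rows = [("u1", 2, 2), ("u2", 1, 7), ("u1", 1, 1), ("u1", 3, 3)]

# Appending digits depends on the order in which values are seen.
result = fold_sorted(rows, 0, lambda acc, v: acc * 10 + v)
print(result)
```

The same step applied to u1's values in the order 3, 2, 1 would yield 321 instead of 123, which is why an ordering guarantee matters here.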

Re: Aggregating over sorted data

2016-12-22 Thread trsell
I would love this feature On Thu, 22 Dec 2016, 18:45 assaf.mendelson, wrote: > It seems that this aggregation is for dataset operations only. I would > have hoped to be able to do dataframe aggregation. Something along the lines > of: sort_df(df).agg(my_agg_func) > > > >

RE: Aggregating over sorted data

2016-12-22 Thread assaf.mendelson
It seems that this aggregation is for dataset operations only. I would have hoped to be able to do dataframe aggregation. Something along the lines of: sort_df(df).agg(my_agg_func) In any case, note that this kind of sorting is less efficient than the sorting done in window functions for

Re: Aggregating over sorted data

2016-12-22 Thread Liang-Chi Hsieh
You can't use existing aggregation functions with that. Besides, the execution plan of `mapPartitions` doesn't support whole-stage codegen. Without that and some optimization around aggregation, there might be performance degradation. Also when you have more than one key in a partition,