subject:"Dataset \- reduceByKey"

Re: Dataset - reduceByKey

2016-06-07 Thread Jacek Laskowski

Hi Bryan, What about groupBy [1] and agg [2]? What about UserDefinedAggregateFunction [3]? [1] https://home.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@groupBy(col1:String,cols:String*):org.apache.spark.sql.RelationalGroupedDatase

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey

All, Thank you for the replies. It seems as though the Dataset API is still far behind the RDD API. This is unfortunate as the Dataset API potentially provides a number of performance benefits. I will move to using it in a more limited set of cases for the moment. Thank you! Bryan Jeffrey On

Re: Dataset - reduceByKey

2016-06-07 Thread Richard Marscher

There certainly are some gaps between the richness of the RDD API and the Dataset API. I'm also migrating from RDD to Dataset and ran into reduceByKey and join scenarios. In the spark-dev list, one person was discussing reduceByKey being sub-optimal at the moment and it spawned this JIRA https://i

Re: Dataset - reduceByKey

2016-06-07 Thread Takeshi Yamamuro

Seems you can see docs for 2.0 for now; https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/spark-2.0.0-SNAPSHOT-2016_06_07_07_01-1e2c931-docs/ // maropu On Tue, Jun 7, 2016 at 11:40 AM, Bryan Jeffrey wrote: > It would also be nice if there was a better example of joining two

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey

It would also be nice if there was a better example of joining two Datasets. I am looking at the documentation here: http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems a little bit sparse - is there a better documentation source? Regards, Bryan Jeffrey On Tue, Jun 7, 2016 a

Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey

Hello. I am looking at the option of moving RDD based operations to Dataset based operations. We are calling 'reduceByKey' on some pair RDDs we have. What would the equivalent be in the Dataset interface - I do not see a simple reduceByKey replacement. Regards, Bryan Jeffrey
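
The replies in this thread point to groupBy and agg. The typed Dataset API in Spark 2.0 also offers groupByKey followed by reduceGroups, which is probably the closest analogue to reduceByKey. Below is a minimal sketch of both routes, assuming Spark 2.x running in local mode. The object name, sample data, and column names are illustrative and do not come from the thread.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the original thread.
object DatasetReduceByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-reduceByKey-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for a pair RDD's contents.
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()

    // Typed route: group by key, then reduce whole records within each group.
    // reduceGroups yields (key, reducedRecord), so keep only the record.
    val reduced = pairs
      .groupByKey(_._1)
      .reduceGroups((x, y) => (x._1, x._2 + y._2))
      .map(_._2)

    // Untyped route, as suggested in the replies: groupBy plus an aggregate.
    val aggregated = pairs.toDF("key", "value").groupBy("key").sum("value")

    reduced.show()
    aggregated.show()
    spark.stop()
  }
}
```

The untyped groupBy route lets Spark use its built-in aggregates, while reduceGroups runs an arbitrary Scala function per group; which one performs better will depend on the Spark version and the workload.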

Re: Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Sean McNamara

Hi Kane- http://spark.apache.org/docs/latest/tuning.html has excellent information that may be helpful. In particular increasing the number of tasks may help, as well as confirming that you don’t have more data than you're expecting landing on a key. Also, if you are using spark < 1.2.0, set

Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Kane Kim

I'm trying to process a large dataset. Mapping and filtering work fine, but as soon as I call reduceByKey I get out-of-memory errors: http://pastebin.com/70M5d0Bn Any ideas how I can fix that? Thanks.
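
The reply above suggests increasing the number of tasks and checking whether a single key receives more data than expected. A rough sketch of both checks with the RDD API follows. It assumes a recent Spark release running locally, and the partition count of 200, the sample data, and the object name are arbitrary placeholders rather than recommendations from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of the original thread.
object ReduceByKeyTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("reduceByKey-tuning-sketch")
      .setMaster("local[*]") // assumption: local run for illustration
    val sc = new SparkContext(conf)

    // Made-up sample data; a real job would load its own pair RDD.
    val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))

    // More reduce tasks: the second argument sets the number of output
    // partitions, so each task holds a smaller share of the keys.
    val reduced = pairs.reduceByKey(_ + _, 200)

    // Skew check: count records per key and look at the heaviest keys.
    val heaviestKeys = pairs
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .top(3)(Ordering.by[(String, Long), Long](_._2))

    reduced.collect().foreach(println)
    heaviestKeys.foreach(println)
    sc.stop()
  }
}
```

If one key dominates the counts, adding partitions alone is unlikely to help, because all values for that key still land in a single task.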


8 matches
