Hi Bryan,
What about groupBy [1] and agg [2]? What about UserDefinedAggregateFunction [3]?
[1]
https://home.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@groupBy(col1:String,cols:String*):org.apache.spark.sql.RelationalGroupedDatase
All,
Thank you for the replies. It seems as though the Dataset API is still far
behind the RDD API. This is unfortunate as the Dataset API potentially
provides a number of performance benefits. I will move to using it in a
more limited set of cases for the moment.
Thank you!
Bryan Jeffrey
There certainly are some gaps between the richness of the RDD API and the
Dataset API. I'm also migrating from RDD to Dataset and ran into
reduceByKey and join scenarios.
In the spark-dev list, one person was discussing reduceByKey being
sub-optimal at the moment and it spawned this JIRA
https://i
It seems the 2.0 docs are available here for now:
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/spark-2.0.0-SNAPSHOT-2016_06_07_07_01-1e2c931-docs/
// maropu
On Tue, Jun 7, 2016 at 11:40 AM, Bryan Jeffrey
wrote:
> It would also be nice if there was a better example of joining two
It would also be nice if there were a better example of joining two
Datasets. I am looking at the documentation here:
http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems a
little sparse. Is there a better documentation source?
Regards,
Bryan Jeffrey
On Tue, Jun 7, 2016 a
Hello.
I am looking at the option of moving RDD-based operations to Dataset-based
operations. We are calling 'reduceByKey' on some pair RDDs we have. What
would the equivalent be in the Dataset interface? I do not see a simple
reduceByKey replacement.
Regards,
Bryan Jeffrey
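
For reference, a minimal sketch of one way to express a reduceByKey-style
aggregation on a Dataset, assuming the Spark 2.0 Scala API (groupByKey
followed by reduceGroups). The WordCount case class, the object name, and
the sumByWord helper are hypothetical names used only for illustration;
they are not from this thread. The untyped groupBy/agg route mentioned
elsewhere in the thread is shown alongside for comparison.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical record type, used only for illustration.
case class WordCount(word: String, count: Long)

object ReduceByKeyOnDataset {

  // Typed route: roughly what pairRdd.reduceByKey(_ + _) does,
  // written with groupByKey + reduceGroups on a Dataset.
  def sumByWord(spark: SparkSession, rows: Seq[WordCount]): Map[String, Long] = {
    import spark.implicits._ // encoders for case classes and Strings
    val ds = rows.toDS()
    ds.groupByKey((wc: WordCount) => wc.word)
      .reduceGroups((a: WordCount, b: WordCount) => WordCount(a.word, a.count + b.count))
      .collect()                                   // Array[(String, WordCount)]
      .map { case (word, wc) => word -> wc.count } // keep key and reduced count
      .toMap
  }

  // Untyped route: groupBy + agg returns a DataFrame with columns (word, total).
  def sumByWordUntyped(spark: SparkSession, rows: Seq[WordCount]): DataFrame = {
    import spark.implicits._
    rows.toDS().groupBy("word").agg(sum("count").as("total"))
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reduceByKey-on-dataset-sketch")
      .getOrCreate()
    val rows = Seq(WordCount("a", 1L), WordCount("b", 2L), WordCount("a", 3L))
    println(sumByWord(spark, rows))
    sumByWordUntyped(spark, rows).show()
    spark.stop()
  }
}
```

Whether this performs as well as reduceByKey on an RDD is a separate
question; the sketch only shows the shape of the calls.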
Hi Kane-
http://spark.apache.org/docs/latest/tuning.html has excellent information that
may be helpful. In particular increasing the number of tasks may help, as well
as confirming that you don’t have more data than you're expecting landing on a
key.
Also, if you are using Spark < 1.2.0, set
I'm trying to process a large dataset. Mapping/filtering works OK, but
as soon as I try to reduceByKey, I get out-of-memory errors:
http://pastebin.com/70M5d0Bn
Any ideas how I can fix that?
Thanks.
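
One suggestion in this thread is to increase the number of tasks. A
minimal sketch of how that might look on the RDD API, using the
reduceByKey overload that takes a partition count. The object name, the
countByKey helper, and the value passed for numPartitions are
hypothetical; the right partition count depends on the data and cluster.

```scala
import org.apache.spark.SparkContext
// Needed for the pair-RDD implicits on older Spark releases (before 1.3);
// harmless on later ones.
import org.apache.spark.SparkContext._

object ReduceByKeyPartitions {

  // Sums values per key, spreading the reduce side over numPartitions tasks.
  // More partitions means each task holds fewer keys in memory at once.
  def countByKey(sc: SparkContext,
                 pairs: Seq[(String, Int)],
                 numPartitions: Int): Map[String, Int] =
    sc.parallelize(pairs)
      .reduceByKey(_ + _, numPartitions)
      .collect()
      .toMap
}
```

This does not help if a single key carries most of the data, which is
why checking the key distribution is suggested as well.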