Re: Dataset API Question

2017-10-25 Thread Wenchen Fan
It's because of a different API design. *RDD.checkpoint* returns void, which means it mutates the RDD state, so you need an *RDD.isCheckpointed* method to check if this RDD is checkpointed. *Dataset.checkpoint* returns a new Dataset, which means there is no isCheckpointed state in Dataset, and thus
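
The design difference described in this message can be sketched with a toy model in plain Scala. The classes `MutableData` and `ImmutableData` are hypothetical stand-ins invented for illustration; they are not Spark's RDD or Dataset, and nothing is actually written to storage here.

```scala
// Toy model only: hypothetical stand-ins, not Spark's RDD or Dataset.

// Style 1: checkpoint() returns Unit and mutates the receiver,
// so callers need a query method to observe the change.
final class MutableData(val rows: Vector[Int]) {
  private var checkpointed = false
  def checkpoint(): Unit = { checkpointed = true }
  def isCheckpointed: Boolean = checkpointed
}

// Style 2: checkpoint() leaves the receiver untouched and returns a new value.
// The original carries no flag to query; the caller keeps the returned reference.
final class ImmutableData(val rows: Vector[Int]) {
  def checkpoint(): ImmutableData = new ImmutableData(rows)
}

object CheckpointStyles {
  def main(args: Array[String]): Unit = {
    val m = new MutableData(Vector(1, 2, 3))
    m.checkpoint()
    println(m.isCheckpointed)

    val original = new ImmutableData(Vector(1, 2, 3))
    val restored = original.checkpoint()
    println(restored ne original)
  }
}
```

With the second style, the only way to know a value came from a checkpoint is to hold on to what `checkpoint()` returned.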

Re: Dataset API Question

2017-10-25 Thread Bernard Jesop
Actually, I realized keeping the info would not be enough, as I need to find the checkpoint files again to delete them :/ 2017-10-25 19:07 GMT+02:00 Bernard Jesop: > As far as I understand, Dataset.rdd is not the same as InternalRDD. > It is just another RDD representation of the same Dataset and

Re: Dataset API Question

2017-10-25 Thread Bernard Jesop
As far as I understand, Dataset.rdd is not the same as InternalRDD. It is just another RDD representation of the same Dataset and is created on demand (lazy val) when Dataset.rdd is called. This totally explains the observed behavior. But how would it be possible to know that a Dataset have

Push epoch updates to executors on fetch failure to avoid fetch retries for missing executors

2017-10-25 Thread Juan Rodríguez Hortalá
Hi, I opened https://issues.apache.org/jira/browse/SPARK-22339 some days ago, and I would like to get some feedback on that. The idea is pushing epoch updates to the executors after a fetch failure by piggybacking on the executor heartbeat response, in order to fail faster when an executor and the

Re: Dataset API Question

2017-10-25 Thread Reynold Xin
It is a bit more than syntactic sugar, but not much more: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L533 BTW this is basically writing all the data out, and then creating a new Dataset to load them in. On Wed, Oct 25, 2017 at 6:51 AM, Be

Structured Stream equivalent of reduceByKey

2017-10-25 Thread Piyush Mukati
Hi, we are migrating some jobs from DStream to Structured Streaming. Currently, to handle aggregations, we call map and reduceByKey on each RDD, like: rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b)) The final output of each RDD is merged to the sink with support for aggregation at
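
For reference, what the map + reduceByKey step computes for one batch can be sketched on plain Scala collections. This is only an illustration of the semantics, not a streaming job: the `Event` type and the `merge` function below are hypothetical, since the message does not show them. On the Dataset side, `groupByKey(...)` followed by `reduceGroups(...)` has the same shape, though whether that fits the sink described here is not something this sketch tests.

```scala
object ReduceByKeySketch {
  // Hypothetical event type; the original message does not show its fields.
  final case class Event(key: String, count: Long)

  // Hypothetical merge: combine two events that share a key by summing counts.
  def merge(a: Event, b: Event): Event = Event(a.key, a.count + b.count)

  // Same shape as rdd.map(e => (key, e)).reduceByKey(merge), on a local Seq.
  def reduceByKey(events: Seq[Event]): Map[String, Event] =
    events
      .map(e => (e.key, e))   // pair each event with its key
      .groupBy(_._1)          // collect the pairs for each key
      .map { case (k, pairs) => k -> pairs.map(_._2).reduce(merge) } // groups are never empty

  def main(args: Array[String]): Unit = {
    val batch = Seq(Event("a", 1), Event("b", 2), Event("a", 3))
    println(reduceByKey(batch))
  }
}
```

Because `merge` is applied pairwise, it should be associative and commutative for the result not to depend on ordering, which is the same requirement reduceByKey places on its function.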

Dataset API Question

2017-10-25 Thread Bernard Jesop
Hello everyone, I have a question about checkpointing on Datasets. It seems that in 2.1.0 there is a Dataset.checkpoint(); however, unlike RDD, there is no Dataset.isCheckpointed(). I wonder if Dataset.checkpoint is syntactic sugar for Dataset.rdd.checkpoint. When I do: Dataset.checkpoint; Data

Re: Kicking off the process around Spark 2.2.1

2017-10-25 Thread Sean Owen
It would be reasonably consistent with the timing of other x.y.1 releases, and more release managers sounds useful, yeah. Note also that in theory the code freeze for 2.3.0 starts in about 2 weeks. On Wed, Oct 25, 2017 at 12:29 PM Holden Karau wrote: > Now that Spark 2.1.2 is out it seems like

Kicking off the process around Spark 2.2.1

2017-10-25 Thread Holden Karau
Now that Spark 2.1.2 is out it seems like now is a good time to get started on the Spark 2.2.1 release. There are some streaming fixes I’m aware of that would be good to get into a release. Is there anything else people are working on for 2.2.1 we should be tracking? To switch it up I’d like to su

[ANNOUNCE] Apache Spark 2.1.2

2017-10-25 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.2! Apache Spark 2.1.2 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. To download Apache Spark 2.1.2 visit http://spark.apache.org/downloa

Re: CRAN SparkR package removed?

2017-10-25 Thread Holden Karau
Ok, so I’ll say it’s available in the CRAN “archive” and we hope to have it fully available in future releases. On Wed, Oct 25, 2017 at 9:46 AM Felix Cheung wrote: > Yes - unfortunately something was found after it was published and made > available publicly. > > We have a JIRA on this and are w

Re: CRAN SparkR package removed?

2017-10-25 Thread Felix Cheung
Yes - unfortunately something was found after it was published and made available publicly. We have a JIRA on this and are working on the best course of action. From: Holden Karau <hol...@pigscanfly.ca> Sent: Wednesday, October 25, 2017 1:35 AM Subject: CRAN

CRAN SparkR package removed?

2017-10-25 Thread Holden Karau
Looking at https://cran.r-project.org/web/packages/SparkR/ it seems like the package has been removed. Any ideas what's up? (Just asking since I'm working on the release e-mail and it was also mentioned in the keynote just now). -- Twitter: https://twitter.com/holdenkarau