[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...
Github user goungoun commented on the issue: https://github.com/apache/spark/pull/22428 Awesome! @HyukjinKwon , @gatorsmile thanks for the good information. Let me look into it further. By the way, I still hope this conversation stays open to users' voices, not limited to developers' perspectives. For people like me who do data wrangling/engineering every day, it makes things easier. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...
Github user goungoun commented on the issue: https://github.com/apache/spark/pull/22428 @HyukjinKwon , thanks for your review. Actually, that is the reason I opened this pull request. I think it is better to give users a reusable option than to have them repeat so much of the same code in their analysis. In a notebook environment, whenever visualization is required in the middle of an analysis, I had to convert column names rather than use them as they are, so that I could deliver the right messages to the report readers. During that process, I had to repeat withColumnRenamed too many times. So, I researched how other users try to overcome the limitation. It seems that users tend to use foldLeft or a for loop with withColumnRenamed, which can cause performance issues by creating too many DataFrames inside the Spark engine, often without knowing it. The arguments can be found as follows.

StackOverflow
- https://stackoverflow.com/questions/38798567/pyspark-rename-more-than-one-column-using-withcolumnrenamed
- https://stackoverflow.com/questions/35592917/renaming-column-names-of-a-dataframe-in-spark-scala?noredirect=1=1

Spark Issues
- [SPARK-12225] Support adding or replacing multiple columns at once in DataFrame API
- [SPARK-21582] DataFrame.withColumnRenamed cause huge performance overhead

If foldLeft is used, too many columns can cause a performance issue.
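The foldLeft workaround referenced above typically looks like the following sketch (the DataFrame `df` and the column names are hypothetical). Each iteration returns a new DataFrame with its own analyzed plan, which is where the overhead accumulates:

```scala
// Current workaround: fold a rename map over the DataFrame, calling
// withColumnRenamed once per entry. Every call returns a fresh DataFrame,
// so a wide rename map produces a long chain of plans.
val renames = Map("c0" -> "user_id", "c1" -> "event_time", "c2" -> "amount")

val renamed = renames.foldLeft(df) { case (acc, (from, to)) =>
  acc.withColumnRenamed(from, to)
}
```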
[GitHub] spark pull request #22428: [SPARK-25430][SQL] Add map parameter for withColu...
GitHub user goungoun opened a pull request: https://github.com/apache/spark/pull/22428

[SPARK-25430][SQL] Add map parameter for withColumnRenamed

## What changes were proposed in this pull request?

This PR allows withColumnRenamed to take a map input argument.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/goungoun/spark SPARK-25430

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22428.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22428

commit eb0858989952454e2112bf6247d85d33438525e0
Author: Goun Na
Date: 2018-09-15T13:30:45Z

[SPARK-18073][SQL] Add map parameter for withColumnRenamed
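A usage sketch of the overload this PR proposes (the map-taking signature is the PR's proposal, not a released API; the column names are hypothetical):

```scala
// Proposed: rename several columns in one call instead of chaining
// withColumnRenamed once per column.
val renamed = df.withColumnRenamed(Map(
  "c0" -> "user_id",
  "c1" -> "event_time",
  "c2" -> "amount"
))
```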
[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset
Github user goungoun commented on the issue: https://github.com/apache/spark/pull/20800 Thanks!!
[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset
Github user goungoun commented on a diff in the pull request: https://github.com/apache/spark/pull/20800#discussion_r188019259

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
    */
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]

+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()

--- End diff --

In my understanding, @maropu's suggestion is to implement isEmpty in exactly the same way as the count() method, except for the return type.

```
def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}

def isEmpty: Boolean = withAction("isEmpty", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
```

And @cloud-fan's idea is to add limit(1) before doing the count. Is that right?
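Putting the two suggestions together, a sketch of the limit(1)-before-count variant might look like this (an illustration of the idea under discussion, not necessarily the code that was merged):

```scala
// limit(1) bounds the scan before the count aggregation, so the job can
// stop as soon as a single row is found instead of counting everything.
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
```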
[GitHub] spark pull request #19876: [ML][SPARK-23783][SPARK-11239] Add PMML export to...
Github user goungoun commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r178432400

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -710,15 +711,58 @@ class LinearRegressionModel private[ml] (
 }

 /**
- * Returns a [[org.apache.spark.ml.util.MLWriter]] instance for this ML instance.
+ * Returns a [[org.apache.spark.ml.util.GeneralMLWriter]] instance for this ML instance.
  *
  * For [[LinearRegressionModel]], this does NOT currently save the training [[summary]].
  * An option to save [[summary]] may be added in the future.
  *
  * This also does not save the [[parent]] currently.
  */
 @Since("1.6.0")
- override def write: MLWriter = new LinearRegressionModel.LinearRegressionModelWriter(this)
+ override def write: GeneralMLWriter = new GeneralMLWriter(this)
+}
+
+/** A writer for LinearRegression that handles the "internal" (or default) format */
+private class InternalLinearRegressionModelWriter
+  extends MLWriterFormat with MLFormatRegister {
+
+  override def format(): String = "internal"
+  override def stageName(): String = "org.apache.spark.ml.regression.LinearRegressionModel"
+
+  private case class Data(intercept: Double, coefficients: Vector, scale: Double)
+
+  override def write(path: String, sparkSession: SparkSession,
+    optionMap: mutable.Map[String, String], stage: PipelineStage): Unit = {
+    val instance = stage.asInstanceOf[LinearRegressionModel]
+    val sc = sparkSession.sparkContext
+    // Save metadata and Params
+    DefaultParamsWriter.saveMetadata(instance, path, sc)
+    // Save model data: intercept, coefficients, scale
+    val data = Data(instance.intercept, instance.coefficients, instance.scale)
+    val dataPath = new Path(path, "data").toString
+    sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+  }
+}
+
+/** A writer for LinearRegression that handles the "pmml" format */
+private class PMMLLinearRegressionModelWriter
+    extends MLWriterFormat with MLFormatRegister {
--- End diff --

Thanks!
[GitHub] spark pull request #19876: [ML][SPARK-23783][SPARK-11239] Add PMML export to...
Github user goungoun commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r176911201

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -710,15 +711,58 @@ class LinearRegressionModel private[ml] (
 }

 /**
- * Returns a [[org.apache.spark.ml.util.MLWriter]] instance for this ML instance.
+ * Returns a [[org.apache.spark.ml.util.GeneralMLWriter]] instance for this ML instance.
  *
  * For [[LinearRegressionModel]], this does NOT currently save the training [[summary]].
  * An option to save [[summary]] may be added in the future.
  *
  * This also does not save the [[parent]] currently.
  */
 @Since("1.6.0")
- override def write: MLWriter = new LinearRegressionModel.LinearRegressionModelWriter(this)
+ override def write: GeneralMLWriter = new GeneralMLWriter(this)
+}
+
+/** A writer for LinearRegression that handles the "internal" (or default) format */
+private class InternalLinearRegressionModelWriter
+  extends MLWriterFormat with MLFormatRegister {
+
+  override def format(): String = "internal"
+  override def stageName(): String = "org.apache.spark.ml.regression.LinearRegressionModel"
+
+  private case class Data(intercept: Double, coefficients: Vector, scale: Double)
+
+  override def write(path: String, sparkSession: SparkSession,
+    optionMap: mutable.Map[String, String], stage: PipelineStage): Unit = {
+    val instance = stage.asInstanceOf[LinearRegressionModel]
+    val sc = sparkSession.sparkContext
+    // Save metadata and Params
+    DefaultParamsWriter.saveMetadata(instance, path, sc)
+    // Save model data: intercept, coefficients, scale
+    val data = Data(instance.intercept, instance.coefficients, instance.scale)
+    val dataPath = new Path(path, "data").toString
+    sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+  }
+}
+
+/** A writer for LinearRegression that handles the "pmml" format */
+private class PMMLLinearRegressionModelWriter
+    extends MLWriterFormat with MLFormatRegister {
--- End diff --

Should be two-space indentation: `extends MLWriterFormat with MLFormatRegister {`
[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset
Github user goungoun commented on a diff in the pull request: https://github.com/apache/spark/pull/20800#discussion_r176728379

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
    */
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]

+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()

--- End diff --

@gatorsmile, simply running df.rdd.isEmpty in spark-shell was quite responsive even on terabyte-sized tables.
[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset
Github user goungoun commented on the issue: https://github.com/apache/spark/pull/20800 For the additional check that I mentioned: the following code shows that Spark users do not need to add take(1). ds.rdd.take(1).isEmpty is redundant. [RDD.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala)

```
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
```
[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset
Github user goungoun commented on the issue: https://github.com/apache/spark/pull/20800 @rxin, checking for emptiness is likely to be a common step in every ETL batch job, and I think this is the right place to provide that functionality. When a basic function that is supposed to be provided is missing, people spend unnecessary time searching for and creating their own creative workarounds. That does not help us develop clean code or business value either. I added one of the Stack Overflow discussions to SPARK-23627 for your reference. As a follow-up to this issue, I would also like to confirm that rdd.isEmpty is optimized internally.
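For context, the idioms users currently reach for differ sharply in cost (a sketch; `ds` stands for any Dataset):

```scala
// Common emptiness checks in the absence of a built-in isEmpty:
val viaCount = ds.count() == 0      // full scan plus aggregation over all rows
val viaTake  = ds.take(1).isEmpty   // stops after the first row is found
val viaRdd   = ds.rdd.isEmpty()     // also short-circuits via take(1), but
                                    // converting to an RDD bypasses some
                                    // SQL-level optimizations
```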
[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset
Github user goungoun commented on a diff in the pull request: https://github.com/apache/spark/pull/20800#discussion_r174673621

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
    */
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]

+  /**
+   * Returns true if the `Dataset` is empty

--- End diff --

@HyukjinKwon, thanks. I added a period.
[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset
Github user goungoun commented on the issue: https://github.com/apache/spark/pull/20800 @HyukjinKwon, @maropu Just a gentle reminder. Jenkins is waiting for a comment like 'ok to test'.
[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in DataSet
Github user goungoun commented on a diff in the pull request: https://github.com/apache/spark/pull/20800#discussion_r174002184

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,12 @@ class Dataset[T] private[sql](
    */
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]

+  /**
+   * Returns true if the `DataSet` is empty
+   *

--- End diff --

@mgaido91 Thanks, I modified the comment.
[GitHub] spark issue #20782: [SPARK-23627][SQL] Provide isEmpty in DataSet
Github user goungoun commented on the issue: https://github.com/apache/spark/pull/20782 As unnecessary information was included, I closed this pull request. Please refer to #20800 instead of #20782. Sorry for the inconvenience.
[GitHub] spark pull request #20800: isEmpty in Dataset and its testSuite
GitHub user goungoun opened a pull request: https://github.com/apache/spark/pull/20800

isEmpty in Dataset and its testSuite

## What changes were proposed in this pull request?

This PR adds isEmpty() in Dataset.

## How was this patch tested?

Unit tests added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/goungoun/spark SPARK-23627

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20800.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20800

commit d3fc620e49ae07f6ebc0ee20976c12da49f2e0af
Author: Goun Na <gounna@...>
Date: 2018-03-12T02:57:19Z

isEmpty in Dataset and its testSuite
[GitHub] spark pull request #20782: [SPARK-SPARK-23627][SQL] Provide isEmpty in DataS...
Github user goungoun closed the pull request at: https://github.com/apache/spark/pull/20782
[GitHub] spark pull request #20782: [SPARK-SPARK-23627][SQL] Provide isEmpty in DataS...
GitHub user goungoun opened a pull request: https://github.com/apache/spark/pull/20782

[SPARK-SPARK-23627][SQL] Provide isEmpty in DataSet

## What changes were proposed in this pull request?

This PR adds an isEmpty in DataSet.

## How was this patch tested?

Unit tests added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/goungoun/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20782.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20782

commit c1488201b4bfb050402cc91da3c9619a5e4a0a6e
Author: goungoun <gounna@...>
Date: 2018-03-08T14:59:20Z
Update DatasetSuite.scala testsuit add

commit 1c2211807cda14d891dfa1b82f713c3b24c53bab
Author: goungoun <gounna@...>
Date: 2018-03-09T02:31:05Z
Merge pull request #2 from goungoun/goungoun-patch-1 Update DatasetSuite.scala

commit 7c8e87ce42fe2a4b1942f0d95660c93831c92730
Author: Goun Na <gounna@...>
Date: 2018-03-09T06:09:04Z
SPARK-23627 provide isEmpty in DataSet

commit a1debb997336e9a2edda98dca6c0c03c9a0e7c3c
Author: Goun Na <gounna@...>
Date: 2018-03-09T06:10:29Z
SPARK-23627 provide isEmpty in DataSet