[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...

2018-10-02 Thread goungoun
Github user goungoun commented on the issue:

https://github.com/apache/spark/pull/22428
  
Awesome! @HyukjinKwon, @gatorsmile, thanks for the good information. Let me 
look into it further. By the way, I still hope this conversation stays open to 
users' voices, not limited to the developers' perspective. For people like me 
who do data wrangling/engineering every day, this change makes things easier.


---




[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...

2018-09-16 Thread goungoun
Github user goungoun commented on the issue:

https://github.com/apache/spark/pull/22428
  
@HyukjinKwon, thanks for your review. Actually, that is the reason I opened 
this pull request. I think it is better to give users a reusable option than 
to have them repeat the same code throughout their analysis. In a notebook 
environment, whenever visualization is required in the middle of an analysis, 
I have to rename columns rather than use them as-is so that I can deliver the 
right message to report readers. During that process, I had to repeat 
withColumnRenamed too many times.

So I researched how other users try to overcome this limitation. It seems 
that users tend to use foldLeft or a for loop with withColumnRenamed, which 
can unknowingly cause performance issues by creating too many DataFrames 
inside the Spark engine. The arguments can be found in the links below, 
followed by a sketch of both patterns.

StackOverflow
- https://stackoverflow.com/questions/38798567/pyspark-rename-more-than-one-column-using-withcolumnrenamed
- https://stackoverflow.com/questions/35592917/renaming-column-names-of-a-dataframe-in-spark-scala?noredirect=1=1

Spark Issues
- [SPARK-12225] Support adding or replacing multiple columns at once in DataFrame API
- [SPARK-21582] DataFrame.withColumnRenamed cause huge performance overhead
  (if foldLeft is used, too many columns can cause performance issues)
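
For illustration, here is a minimal sketch of the foldLeft workaround versus 
the map-based call proposed in this PR (the DataFrame `df` and the column 
names are hypothetical, and the multi-rename signature is this PR's proposal, 
not an existing API):

```
val renames = Map("sum(sales)" -> "total_sales", "avg(price)" -> "avg_price")

// The common workaround today: each withColumnRenamed call produces a new
// DataFrame with a new analyzed plan, which SPARK-21582 flags as costly
// when many columns are renamed.
val renamedToday = renames.foldLeft(df) { case (acc, (from, to)) =>
  acc.withColumnRenamed(from, to)
}

// With this PR, a single call would handle all the renames at once.
val renamedProposed = df.withColumnRenamed(renames)
```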



---




[GitHub] spark pull request #22428: [SPARK-25430][SQL] Add map parameter for withColu...

2018-09-15 Thread goungoun
GitHub user goungoun opened a pull request:

https://github.com/apache/spark/pull/22428

[SPARK-25430][SQL] Add map parameter for withColumnRenamed

## What changes were proposed in this pull request?
This PR allows withColumnRenamed to accept a map of existing-to-new column names, so multiple columns can be renamed in one call.
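
A minimal usage sketch under the proposed signature (column names hypothetical):

```
df.withColumnRenamed(Map("colA" -> "columnA", "colB" -> "columnB"))
```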

## How was this patch tested?
Unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/goungoun/spark SPARK-25430

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22428.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22428


commit eb0858989952454e2112bf6247d85d33438525e0
Author: Goun Na 
Date:   2018-09-15T13:30:45Z

[SPARK-18073][SQL] Add map parameter for withColumnRenamed




---




[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-05-15 Thread goungoun
Github user goungoun commented on the issue:

https://github.com/apache/spark/pull/20800
  
Thanks!!


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-05-14 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r188019259
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

In my understanding, @maropu's suggestion is to implement isEmpty in exactly 
the same way as the count() method, except for the return type:

```
def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}

def isEmpty: Boolean = withAction("isEmpty", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
```
And @cloud-fan's idea is to add limit(1) before doing the count. Is that right?
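
If so, a minimal sketch of that variant might look like this (my guess at the 
shape, not a confirmed implementation):

```
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
  // limit(1) lets the scan stop after the first row, so the count stays cheap
  plan.executeCollect().head.getLong(0) == 0
}
```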


---




[GitHub] spark pull request #19876: [ML][SPARK-23783][SPARK-11239] Add PMML export to...

2018-03-31 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r178432400
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -710,15 +711,58 @@ class LinearRegressionModel private[ml] (
   }
 
   /**
-   * Returns a [[org.apache.spark.ml.util.MLWriter]] instance for this ML instance.
+   * Returns a [[org.apache.spark.ml.util.GeneralMLWriter]] instance for this ML instance.
    *
    * For [[LinearRegressionModel]], this does NOT currently save the training [[summary]].
    * An option to save [[summary]] may be added in the future.
    *
    * This also does not save the [[parent]] currently.
    */
   @Since("1.6.0")
-  override def write: MLWriter = new LinearRegressionModel.LinearRegressionModelWriter(this)
+  override def write: GeneralMLWriter = new GeneralMLWriter(this)
+}
+
+/** A writer for LinearRegression that handles the "internal" (or default) format */
+private class InternalLinearRegressionModelWriter
+  extends MLWriterFormat with MLFormatRegister {
+
+  override def format(): String = "internal"
+  override def stageName(): String = "org.apache.spark.ml.regression.LinearRegressionModel"
+
+  private case class Data(intercept: Double, coefficients: Vector, scale: Double)
+
+  override def write(path: String, sparkSession: SparkSession,
+    optionMap: mutable.Map[String, String], stage: PipelineStage): Unit = {
+    val instance = stage.asInstanceOf[LinearRegressionModel]
+    val sc = sparkSession.sparkContext
+    // Save metadata and Params
+    DefaultParamsWriter.saveMetadata(instance, path, sc)
+    // Save model data: intercept, coefficients, scale
+    val data = Data(instance.intercept, instance.coefficients, instance.scale)
+    val dataPath = new Path(path, "data").toString
+    sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+  }
+}
+
+/** A writer for LinearRegression that handles the "pmml" format */
+private class PMMLLinearRegressionModelWriter
+    extends MLWriterFormat with MLFormatRegister {
--- End diff --

Thanks!


---




[GitHub] spark pull request #19876: [ML][SPARK-23783][SPARK-11239] Add PMML export to...

2018-03-24 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r176911201
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -710,15 +711,58 @@ class LinearRegressionModel private[ml] (
   }
 
   /**
-   * Returns a [[org.apache.spark.ml.util.MLWriter]] instance for this ML instance.
+   * Returns a [[org.apache.spark.ml.util.GeneralMLWriter]] instance for this ML instance.
    *
    * For [[LinearRegressionModel]], this does NOT currently save the training [[summary]].
    * An option to save [[summary]] may be added in the future.
    *
    * This also does not save the [[parent]] currently.
    */
   @Since("1.6.0")
-  override def write: MLWriter = new LinearRegressionModel.LinearRegressionModelWriter(this)
+  override def write: GeneralMLWriter = new GeneralMLWriter(this)
+}
+
+/** A writer for LinearRegression that handles the "internal" (or default) format */
+private class InternalLinearRegressionModelWriter
+  extends MLWriterFormat with MLFormatRegister {
+
+  override def format(): String = "internal"
+  override def stageName(): String = "org.apache.spark.ml.regression.LinearRegressionModel"
+
+  private case class Data(intercept: Double, coefficients: Vector, scale: Double)
+
+  override def write(path: String, sparkSession: SparkSession,
+    optionMap: mutable.Map[String, String], stage: PipelineStage): Unit = {
+    val instance = stage.asInstanceOf[LinearRegressionModel]
+    val sc = sparkSession.sparkContext
+    // Save metadata and Params
+    DefaultParamsWriter.saveMetadata(instance, path, sc)
+    // Save model data: intercept, coefficients, scale
+    val data = Data(instance.intercept, instance.coefficients, instance.scale)
+    val dataPath = new Path(path, "data").toString
+    sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+  }
+}
+
+/** A writer for LinearRegression that handles the "pmml" format */
+private class PMMLLinearRegressionModelWriter
+    extends MLWriterFormat with MLFormatRegister {
--- End diff --

Should be two-space indentation:
`extends MLWriterFormat with MLFormatRegister {`



---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-23 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r176728379
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty.
+   *
+   * @group basic
+   * @since 2.4.0
+   */
+  def isEmpty: Boolean = rdd.isEmpty()
--- End diff --

@gatorsmile, simply running df.rdd.isEmpty in spark-shell was quite 
responsive even on terabyte-sized tables.


---




[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-15 Thread goungoun
Github user goungoun commented on the issue:

https://github.com/apache/spark/pull/20800
  
For the additional check I mentioned: the following code shows that Spark 
users do not need to add take(1) themselves, so ds.rdd.take(1).isEmpty is 
redundant.

[RDD.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala)

```
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
```
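
So for a Dataset `ds`, the two checks below are equivalent, and the outer 
take(1) adds nothing (a small sketch to illustrate the point):

```
val viaIsEmpty = ds.rdd.isEmpty          // take(1) already happens internally
val viaTake = ds.rdd.take(1).isEmpty     // redundant wrapper around the same logic
```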



---




[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-14 Thread goungoun
Github user goungoun commented on the issue:

https://github.com/apache/spark/pull/20800
  
@rxin, checking for emptiness is likely a common step in every ETL batch 
job, and I think this is the right place to provide that functionality. When 
a basic function that should already be provided is missing, people spend 
unnecessary time searching for and inventing their own creative workarounds, 
which helps neither clean code nor business value. I added one of the Stack 
Overflow discussions to SPARK-23627 for your reference. As a follow-up to 
this issue, I would also like to confirm that rdd.isEmpty is optimized 
internally.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-14 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r174673621
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `Dataset` is empty
--- End diff --

@HyukjinKwon, thanks. I added period.


---




[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-14 Thread goungoun
Github user goungoun commented on the issue:

https://github.com/apache/spark/pull/20800
  
@HyukjinKwon, @maropu 
Just a gentle reminder. Jenkins is waiting for a comment like 'ok to test'.


---




[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in DataSet

2018-03-12 Thread goungoun
Github user goungoun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r174002184
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,12 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `DataSet` is empty
+   *
--- End diff --

@mgaido91 Thanks, I modified the comment.


---




[GitHub] spark issue #20782: [SPARK-23627][SQL] Provide isEmpty in DataSet

2018-03-11 Thread goungoun
Github user goungoun commented on the issue:

https://github.com/apache/spark/pull/20782
  
Since unnecessary information was included, I closed this pull request. Please 
refer to #20800 instead of #20782. I am sorry for the inconvenience.


---




[GitHub] spark pull request #20800: isEmpty in Dataset and its testSuite

2018-03-11 Thread goungoun
GitHub user goungoun opened a pull request:

https://github.com/apache/spark/pull/20800

isEmpty in Dataset and its testSuite

## What changes were proposed in this pull request?

This PR adds isEmpty() to Dataset

## How was this patch tested?

Unit tests added
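
A minimal sketch of what such a test might look like in DatasetSuite 
(hypothetical, not the exact contents of the added suite; assumes the suite's 
implicits for toDS()):

```
test("isEmpty") {
  assert(spark.emptyDataset[Int].isEmpty)
  assert(!Seq(1, 2, 3).toDS().isEmpty)
}
```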

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/goungoun/spark SPARK-23627

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20800.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20800


commit d3fc620e49ae07f6ebc0ee20976c12da49f2e0af
Author: Goun Na <gounna@...>
Date:   2018-03-12T02:57:19Z

isEmpty in Dataset and its testSuite




---




[GitHub] spark pull request #20782: [SPARK-SPARK-23627][SQL] Provide isEmpty in DataS...

2018-03-08 Thread goungoun
Github user goungoun closed the pull request at:

https://github.com/apache/spark/pull/20782


---




[GitHub] spark pull request #20782: [SPARK-SPARK-23627][SQL] Provide isEmpty in DataS...

2018-03-08 Thread goungoun
GitHub user goungoun opened a pull request:

https://github.com/apache/spark/pull/20782

[SPARK-SPARK-23627][SQL] Provide isEmpty in DataSet

## What changes were proposed in this pull request?

This PR adds an isEmpty method to Dataset

## How was this patch tested?

Unit tests added.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/goungoun/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20782.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20782


commit c1488201b4bfb050402cc91da3c9619a5e4a0a6e
Author: goungoun <gounna@...>
Date:   2018-03-08T14:59:20Z

Update DatasetSuite.scala

testsuit add

commit 1c2211807cda14d891dfa1b82f713c3b24c53bab
Author: goungoun <gounna@...>
Date:   2018-03-09T02:31:05Z

Merge pull request #2 from goungoun/goungoun-patch-1

Update DatasetSuite.scala

commit 7c8e87ce42fe2a4b1942f0d95660c93831c92730
Author: Goun Na <gounna@...>
Date:   2018-03-09T06:09:04Z

SPARK-23627 provide isEmpty in DataSet

commit a1debb997336e9a2edda98dca6c0c03c9a0e7c3c
Author: Goun Na <gounna@...>
Date:   2018-03-09T06:10:29Z

SPARK-23627 provide isEmpty in DataSet




---
