[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation

2018-01-03 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20070
  
@srowen Requesting a review when you get a chance.


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2018-01-02 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r159192570
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala
 ---
@@ -45,7 +45,7 @@ object QuantileDiscretizerExample {
   .setNumBuckets(3)
 
 val result = discretizer.fit(df).transform(df)
-result.show()
+result.show(false)
--- End diff --

@srowen Correct, it works either way; for example, examples/ml/LDAExamples.scala does the same.
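
For reference, a minimal sketch of the difference (assuming a DataFrame named result; this snippet is illustrative, not from the PR diff):

    result.show()      // truncates cell values longer than 20 characters
    result.show(false) // prints full cell values, useful for wide rows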


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2018-01-01 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r159152519
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/LatentDirichletAllocationExample.scala
 ---
@@ -46,7 +46,10 @@ object LatentDirichletAllocationExample {
 val topics = ldaModel.topicsMatrix
 for (topic <- Range(0, 3)) {
   print(s"Topic $topic :")
-  for (word <- Range(0, ldaModel.vocabSize)) { print(s" ${topics(word, topic)}") }
+  for (word <- Range(0, ldaModel.vocabSize))
+  {
--- End diff --

@srowen Sure, done.


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-30 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r159134529
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/LatentDirichletAllocationExample.scala
 ---
@@ -42,11 +42,11 @@ object LatentDirichletAllocationExample {
 val ldaModel = new LDA().setK(3).run(corpus)
 
 // Output topics. Each is a distribution over words (matching word 
count vectors)
-println("Learned topics (as distributions over vocab of " + 
ldaModel.vocabSize + " words):")
+println(s"Learned topics (as distributions over vocab of 
${ldaModel.vocabSize} words):")
 val topics = ldaModel.topicsMatrix
 for (topic <- Range(0, 3)) {
-  print("Topic " + topic + ":")
-  for (word <- Range(0, ldaModel.vocabSize)) { print(" " + topics(word, topic)); }
+  print(s"Topic $topic :")
+  for (word <- Range(0, ldaModel.vocabSize)) { print(s" ${topics(word, topic)}") }
--- End diff --

@srowen Thanks for the suggestion; it has been addressed.


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-30 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r159134489
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/HypothesisTestingExample.scala
 ---
@@ -68,7 +68,7 @@ object HypothesisTestingExample {
 // against the label.
 val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
 featureTestResults.zipWithIndex.foreach { case (k, v) =>
-  println("Column " + (v + 1).toString + ":")
+  println(s"Column ${(v + 1).toString} :")
--- End diff --

@srowen Thanks, the changes have been addressed.


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-30 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r159134414
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala
 ---
@@ -45,7 +45,7 @@ object QuantileDiscretizerExample {
   .setNumBuckets(3)
 
 val result = discretizer.fit(df).transform(df)
-result.show()
+result.show(false)
--- End diff --

We're following the same style in the other examples, so it is good to do.


---




[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation

2017-12-30 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20070
  
@srowen Okay, the current status looks good.


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-29 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r159084968
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/VectorIndexerExample.scala 
---
@@ -41,8 +41,8 @@ object VectorIndexerExample {
 val indexerModel = indexer.fit(data)
 
 val categoricalFeatures: Set[Int] = indexerModel.categoryMaps.keys.toSet
-println(s"Chose ${categoricalFeatures.size} categorical features: " +
-  categoricalFeatures.mkString(", "))
+println(s"Chose ${categoricalFeatures.size} " +
+  s"categorical features: {$categoricalFeatures.mkString(", ")}")
--- End diff --

I have fixed this. Can you please give me a checklist of steps for testing before I commit?
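
For clarity, the corrected interpolation needs the dollar sign before the opening brace (a sketch mirroring the original lines above):

    println(s"Chose ${categoricalFeatures.size} " +
      s"categorical features: ${categoricalFeatures.mkString(", ")}")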


---




[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation

2017-12-28 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20070
  
@srowen Please re-run the build.


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-28 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r158978102
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/AssociationRulesExample.scala
 ---
@@ -42,14 +42,13 @@ object AssociationRulesExample {
 val results = ar.run(freqItemsets)
 
 results.collect().foreach { rule =>
-  println("[" + rule.antecedent.mkString(",")
-+ "=>"
-+ rule.consequent.mkString(",") + "]," + rule.confidence)
+println(s"[${rule.antecedent.mkString(",")}=>${rule.consequent.mkString(",")} ]" +
+s" ${rule.confidence}")
 }
 // $example off$
 
 sc.stop()
   }
 
 }
-// scalastyle:on println
+// scalastyle:on println
--- End diff --

Done, addressed.


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-28 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r158978648
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala
 ---
@@ -82,9 +82,9 @@ class CustomReceiver(host: String, port: Int)
var socket: Socket = null
var userInput: String = null
try {
- logInfo("Connecting to " + host + ":" + port)
+ logInfo(s"Connecting to $host $port")
--- End diff --

Done, addressed.
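
Note that the original message used a colon between host and port; the separator can be kept inside the interpolated string (a sketch, not the committed line):

    logInfo(s"Connecting to $host:$port")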


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-28 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r158976171
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/DeveloperApiExample.scala 
---
@@ -169,10 +169,10 @@ private class MyLogisticRegressionModel(
 Vectors.dense(-margin, margin)
   }
 
-  /** Number of classes the label can take. 2 indicates binary classification. */
+  // Number of classes the label can take. 2 indicates binary classification.
--- End diff --

+1


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-28 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r158976219
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala
 ---
@@ -31,12 +31,11 @@ object QuantileDiscretizerExample {
 
 // $example on$
 val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
-val df = spark.createDataFrame(data).toDF("id", "hour")
+val df = spark.createDataFrame(data).toDF("id", "hour").repartition(1)
--- End diff --

ok


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-28 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r158976155
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/CorrelationExample.scala 
---
@@ -51,10 +51,10 @@ object CorrelationExample {
 
 val df = data.map(Tuple1.apply).toDF("features")
 val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
-println("Pearson correlation matrix:\n" + coeff1.toString)
+println(s"Pearson correlation matrix:\n ${coeff1.toString}")
--- End diff --

Addressed.


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-28 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r158975980
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/ChiSquareTestExample.scala 
---
@@ -52,9 +52,9 @@ object ChiSquareTestExample {
 
 val df = data.toDF("label", "features")
 val chi = ChiSquareTest.test(df, "features", "label").head
-println("pValues = " + chi.getAs[Vector](0))
-println("degreesOfFreedom = " + chi.getSeq[Int](1).mkString("[", ",", 
"]"))
-println("statistics = " + chi.getAs[Vector](2))
+println(s"pValues = ${chi.getAs[Vector](0)}")
--- End diff --

Ok


---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-28 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20070#discussion_r158974779
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/graphx/Analytics.scala ---
@@ -145,9 +145,9 @@ object Analytics extends Logging {
   // TriangleCount requires the graph to be partitioned
   .partitionBy(partitionStrategy.getOrElse(RandomVertexCut)).cache()
 val triangles = TriangleCount.run(graph)
-println("Triangles: " + triangles.vertices.map {
+println(s"Triangles: ${triangles.vertices.map {
--- End diff --

sure


---




[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation

2017-12-26 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20070
  
@srowen I rechecked all the Scala examples, and this is the cumulative PR for them.


---




[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation

2017-12-26 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20070
  
You're correct - I missed other packages. I will re-confirm soon. Thanks.


---




[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation

2017-12-26 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20070
  
In Scala? I don't think so. I am iterating over them again and double-checking.


---




[GitHub] spark pull request #20071: SPARK-22896 Improvement in String interpolation |...

2017-12-26 Thread chetkhatri
Github user chetkhatri closed the pull request at:

https://github.com/apache/spark/pull/20071


---




[GitHub] spark issue #20071: SPARK-22896 Improvement in String interpolation | Graphx

2017-12-26 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20071
  
@srowen Thanks; addressed, and the changes are done.


---




[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation

2017-12-26 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20070
  
@srowen Also, I merged another similar GraphX PR into this one, so just FYI, we are good.


---




[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation

2017-12-26 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20070
  
@srowen Absolutely correct, this is all in one shot. I made the changes in all of them.


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-26 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20081
  
@cloud-fan @srowen I am good with the proposed changes; please merge.


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20081
  
@cloud-fan spark.sql.files.maxRecordsPerFile didn't work out when I was working with my 30 TB Spark Hive
workload, whereas repartition and coalesce made sense; a sketch of the comparison follows.
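
For context, a minimal sketch of the two approaches being compared (the numbers and path are illustrative, not from the 30 TB workload):

    // Cap the rows written per file; Spark splits files that exceed the limit
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000000L)

    // Explicitly fix the number of output files instead
    df.repartition(8).write.mode("overwrite").parquet("/tmp/records")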


---




[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

2017-12-25 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20081
  
@cloud-fan Thanks for the PR.
4. spark.sql.parquet.writeLegacyFormat - if you don't use this configuration, a Hive external table won't be able to access the parquet data.
5. repartition and coalesce are the most common ways in industry to control the number of files under a directory when partitioning data (see the sketch below),
i.e. if the data volume is very large, every partition would otherwise have many small files, which may harm
downstream query performance due to file I/O, bandwidth I/O, network I/O, and disk I/O.
Otherwise I am good with your approach.
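
A short sketch of the pattern in point 5 (names taken from the example under review; the coalesce count is illustrative):

    // One output file per value of "key" under each partition directory
    hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
      .partitionBy("key").parquet(hiveExternalTableLocation)

    // Or cap the file count without a full shuffle
    hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
      .partitionBy("key").parquet(hiveExternalTableLocation)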


---




[GitHub] spark pull request #20071: SPARK-22896 Improvement in String interpolation |...

2017-12-24 Thread chetkhatri
GitHub user chetkhatri opened a pull request:

https://github.com/apache/spark/pull/20071

SPARK-22896 Improvement in String interpolation | Graphx

## What changes were proposed in this pull request?
* String interpolation corrected to Scala style.
## How was this patch tested?
* Manually tested.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chetkhatri/spark graphx-contrib

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20071.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20071


commit 9916fd1f67234b1fa5608231181bdf3b08718981
Author: chetkhatri <ckhatrimanjal@...>
Date:   2017-12-24T08:33:49Z

SPARK-22896 Improvement in String interpolation

commit 162ac276cfb5aa3215e3e0bdf2723f3e7aacf7d5
Author: chetkhatri <ckhatrimanjal@...>
Date:   2017-12-24T08:37:25Z

SPARK-22896 Improvement in String interpolation - fixed typo

commit aa2de00b62f920c8691c81a085402533f76c036d
Author: chetkhatri <ckhatrimanjal@...>
Date:   2017-12-24T08:54:43Z

Merge branch 'master' of https://github.com/apache/spark into 
mllib-chetan-contrib

commit 8186a34178b108f71bea4f7b21080a2b527b445e
Author: chetkhatri <ckhatrimanjal@...>
Date:   2017-12-24T11:25:56Z

SPARK-22896 Improvement in String interpolation




---




[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation

2017-12-24 Thread chetkhatri
GitHub user chetkhatri opened a pull request:

https://github.com/apache/spark/pull/20070

SPARK-22896 Improvement in String interpolation

## What changes were proposed in this pull request?

* String interpolation in the ML pipeline examples has been corrected per the Scala standard.

## How was this patch tested?
* Manually tested.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chetkhatri/spark mllib-chetan-contrib

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20070.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20070


commit 9916fd1f67234b1fa5608231181bdf3b08718981
Author: chetkhatri <ckhatrimanjal@...>
Date:   2017-12-24T08:33:49Z

SPARK-22896 Improvement in String interpolation




---




[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

2017-12-23 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20018
  
Thanks @HyukjinKwon @wangyum 


---




[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20018
  
@srowen Apologies, I was not aware that PMC members get automatic notifications for this.


---




[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20018
  
@HyukjinKwon @srowen Kindly review now; if it looks good, please merge. Thanks.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454282
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make the Hive parquet format compatible with the Spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created according to the volume of data under the given directory.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on the flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in the Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition
--- End diff --

@HyukjinKwon Thanks for highlighting this; I have improved it.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454291
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make the Hive parquet format compatible with the Spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created according to the volume of data under the given directory.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on the flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in the Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+// Control number of files in each partition by coalesce
--- End diff --

@HyukjinKwon Thanks for highlighting this; I have improved it.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454275
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make the Hive parquet format compatible with the Spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created according to the volume of data under the given directory.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
--- End diff --

@HyukjinKwon Thanks for highlighting this; I have improved it.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454240
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make the Hive parquet format compatible with the Spark parquet format
--- End diff --

@HyukjinKwon Thanks for highlighting this; I have improved it.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454252
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make the Hive parquet format compatible with the Spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created according to the volume of data under the given directory.
--- End diff --

@HyukjinKwon Thanks for highlighting this; I have improved it.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454265
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make the Hive parquet format compatible with the Spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created according to the volume of data under the given directory.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
--- End diff --

@HyukjinKwon Thanks for highlighting this; I have improved it.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454218
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
--- End diff --

@HyukjinKwon Thanks for highlighting this; I have improved it.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454228
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
--- End diff --

@HyukjinKwon Thanks for highlighting this; I have improved it.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158370581
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create a Hive database/schema, with an HDFS location if you want to set it explicitly; otherwise the default
+ * warehouse location will be used to store the Hive table data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to give a location explicitly for each table; every table under the specified schema will be located
+ * at the location given while creating the schema.
+ * 2. Create a Hive managed table with 'Parquet' as the storage format
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create a Hive external table with parquet as the storage format.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing the hive database location, it automatically takes the default warehouse
+ * location given to 'spark.sql.warehouse.dir' while creating the SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as the Hive warehouse location. It will create schema
+ * directories under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make the Hive parquet format compatible with the Spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created according to the volume of data under the given directory.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on the flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in the Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If the data volume is very large, then every partition would have many small files, which may harm
+downstream query performance due to file I/O, bandwidth I/O, network I/O, and disk I/O.
+To improve performance you can create a single parquet file under each partition directory using 'repartition'
+on the partitioned key for the Hive table. When you add partitioning to the table, the table DDL changes.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+/*
+ You can also use coalesce to control the number of files under each partition; repartition does a full shuffle and
+ equal data distribution to all partitions, while coalesce can reduce the number of files to the given 'Int' argument without
--- End diff --

@srowen done.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158370509
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create a Hive database/schema, with an HDFS location if you want to set it explicitly; otherwise the default
+ * warehouse location will be used to store the Hive table data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to give a location explicitly for each table; every table under the specified schema will be located
+ * at the location given while creating the schema.
+ * 2. Create a Hive managed table with 'Parquet' as the storage format
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create a Hive external table with parquet as the storage format.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing the hive database location, it automatically takes the default warehouse
+ * location given to 'spark.sql.warehouse.dir' while creating the SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as the Hive warehouse location. It will create schema
+ * directories under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make the Hive parquet format compatible with the Spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created according to the volume of data under the given directory.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on the flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in the Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If the data volume is very large, then every partition would have many small files, which may harm
+downstream query performance due to file I/O, bandwidth I/O, network I/O, and disk I/O.
+To improve performance you can create a single parquet file under each partition directory using 'repartition'
+on the partitioned key for the Hive table. When you add partitioning to the table, the table DDL changes.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
--- End diff --

@cloud-fan Removed all the comments; as discussed with @srowen, it really makes sense to have this in the docs, with the inconsistency removed.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158370168
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create a Hive database/schema, with an HDFS location if you want to set it explicitly; otherwise the default
+ * warehouse location will be used to store the Hive table data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to give a location explicitly for each table; every table under the specified schema will be located
+ * at the location given while creating the schema.
+ * 2. Create a Hive managed table with 'Parquet' as the storage format
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create a Hive external table with parquet as the storage format.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing the hive database location, it automatically takes the default warehouse
+ * location given to 'spark.sql.warehouse.dir' while creating the SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as the Hive warehouse location. It will create schema
+ * directories under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make the Hive parquet format compatible with the Spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created according to the volume of data under the given directory.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on the flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in the Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If the data volume is very large, then every partition would have many small files, which may harm
--- End diff --

@srowen I totally agree with you. I will rephrase the content for the docs; I have removed it from here for now. Please check and do the needful.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158368719
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create a Hive database/schema, with an HDFS location if you want to set it explicitly; otherwise the default
+ * warehouse location will be used to store the Hive table data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to give a location explicitly for each table; every table under the specified schema will be located
+ * at the location given while creating the schema.
+ * 2. Create a Hive managed table with 'Parquet' as the storage format
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create a Hive external table with parquet as the storage format.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
--- End diff --

@cloud-fan We'll keep all the descriptive comments in the documentation, in user-friendly language. I have also added the location.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158368554
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create a Hive database/schema, with an HDFS location if you want to set it explicitly; otherwise the default
+ * warehouse location will be used to store the Hive table data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to give a location explicitly for each table; every table under the specified schema will be located
+ * at the location given while creating the schema.
+ * 2. Create a Hive managed table with 'Parquet' as the storage format
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
--- End diff --

@srowen Done; removed toDF(). cc\ @cloud-fan


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158366994
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
--- End diff --

@srowen Done, changes addressed.


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
@srowen I think we can merge this now.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158113948
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

@srowen I misunderstood your first comment. I have reverted it as suggested; please check now.


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
@srowen Thank you for the re-run; it now passes everything.


---




[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20018
  
Adding other contributors of the same file for review. cc\
@cloud-fan 
@aokolnychyi
@liancheng 
@HyukjinKwon


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157941778
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/HdfsTest.scala 
---
@@ -39,7 +39,7 @@ object HdfsTest {
   val start = System.currentTimeMillis()
   for (x <- mapped) { x + 2 }
   val end = System.currentTimeMillis()
-  println("Iteration " + iter + " took " + (end-start) + " ms")
+  println(s"Iteration ${iter} took ${(end-start)} ms")
--- End diff --

@HyukjinKwon $end-start won't work, since end and start are two different variables; the expression needs braces, as shown below. I made the changes.
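
That is, interpolating an expression needs braces (a one-line sketch):

    println(s"Iteration $iter took ${end - start} ms")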


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157941813
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala 
---
@@ -100,7 +100,7 @@ object SparkALS {
 ITERATIONS = iters.getOrElse("5").toInt
 slices = slices_.getOrElse("2").toInt
   case _ =>
-System.err.println("Usage: SparkALS [M] [U] [F] [iters] 
[partitions]")
+System.err.println(s"Usage: SparkALS [M] [U] [F] [iters] 
[partitions]")
--- End diff --

@HyukjinKwon Addressed! Kindly review.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157816044
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala 
---
@@ -95,7 +95,7 @@ object LocalALS {
 
   def showWarning() {
 System.err.println(
-  """WARN: This is a naive implementation of ALS and is given as an 
example!
+  s"""WARN: This is a naive implementation of ALS and is given as an 
example!
--- End diff --

@mgaido91 Thank you for the feedback; the changes are addressed.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157794109
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala ---
@@ -127,11 +125,11 @@ object DFSReadWriteTest {
 spark.stop()
 
 if (localWordCount == dfsWordCount) {
-  println(s"Success! Local Word Count ($localWordCount) " +
-s"and DFS Word Count ($dfsWordCount) agree.")
+  println(s"Success! Local Word Count ($localWordCount) 
+and DFS Word Count ($dfsWordCount) agree.")
--- End diff --

@srowen Thanks for the review; I have addressed the changes. Please take another look.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157941808
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala 
---
@@ -80,7 +80,7 @@ object SparkALS {
 
   def showWarning() {
 System.err.println(
-  """WARN: This is a naive implementation of ALS and is given as an 
example!
+  s"""WARN: This is a naive implementation of ALS and is given as an 
example!
--- End diff --

@HyukjinKwon Addressed! Kindly review.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157816026
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala ---
@@ -49,12 +49,10 @@ object DFSReadWriteTest {
   }
 
   private def printUsage(): Unit = {
-val usage: String = "DFS Read-Write Test\n" +
-"\n" +
-"Usage: localFile dfsDir\n" +
-"\n" +
-"localFile - (string) local file to use in test\n" +
-"dfsDir - (string) DFS directory for read/write tests\n"
+val usage = s"""DFS Read-Write Test 
--- End diff --

@mgaido91 Thank you for the feedback; the changes are addressed.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157794196
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/LocalFileLR.scala ---
@@ -58,10 +58,10 @@ object LocalFileLR {
 
 // Initialize w to a random value
 val w = DenseVector.fill(D) {2 * rand.nextDouble - 1}
-println("Initial w: " + w)
+println(s"Initial w: ${w}")
--- End diff --

@srowen Thanks for the review; I have addressed the changes. Please take another look.


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
@srowen Thanks for the response. Correct - I went through the Jenkins error log
(https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4018/console),
fixed it as I understood it, and committed, so please take a look; if it is not correct, please suggest changes.
Thank you


---




[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20018
  
@srowen Can you please review, and if everything seems correct, run the test build?


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
@srowen Why is only the most recent commit going to be merged; can't we get a "squash merge"?
Please re-run the test build, and let me know if something still seems wrong.


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
@HyukjinKwon @mgaido91 @srowen All the changes are addressed and committed; please review and do the needful.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r157973588
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

@srowen I have updated the DDL for storing partitioned data in Hive.
cc @HyukjinKwon @mgaido91 @markgrover @markhamstra
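For context, a minimal sketch of the partitioned-write pattern under 
discussion, assuming a Hive-enabled Spark build; the table and column names 
are hypothetical and this is not the PR's exact code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive partitioning sketch")
  .enableHiveSupport() // requires Spark built with Hive support
  .getOrCreate()
import spark.implicits._

val records = Seq((1, "a"), (2, "b")).toDF("key", "value")

// Hive-managed table partitioned by `key`, stored as Parquet:
records.write
  .partitionBy("key")
  .format("parquet")
  .saveAsTable("hive_part_tbl")
```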


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-19 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r157942866
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

@srowen Can you please review this? cc @holdenk @sameeragarwal


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-19 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r157796580
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

@srowen Thank you for the valuable review feedback; I have added that so it 
can help other developers.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-19 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157793283
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala ---
@@ -97,22 +95,22 @@ object DFSReadWriteTest {
   def main(args: Array[String]): Unit = {
 parseArgs(args)
 
-println("Performing local word count")
+println(s"Performing local word count")
--- End diff --

@srowen Thanks for the review; I have addressed the changes. Please take another look.
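The companion point here, sketched below with an invented value, is that the 
`s` prefix is a no-op when nothing is interpolated, so a plain literal is 
preferred:

```scala
println("Performing local word count")         // plain literal: no `s` needed
val localWordCount = 42                        // hypothetical value, for illustration
println(s"Local word count: $localWordCount")  // `s` earns its keep here
```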


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-19 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20016#discussion_r157792884
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala ---
@@ -42,7 +42,7 @@ object BroadcastTest {
 val arr1 = (0 until num).toArray
 
 for (i <- 0 until 3) {
-  println("Iteration " + i)
+  println(s"Iteration ${i}")
--- End diff --

@markhamstra Thank you for the valuable suggestion; I have addressed it and 
pushed a new commit.


---




[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples

2017-12-19 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20018
  
@holdenk @sameeragarwal Please review when you get a chance.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-19 Thread chetkhatri
GitHub user chetkhatri opened a pull request:

https://github.com/apache/spark/pull/20018

SPARK-22833 [Improvement] in SparkHive Scala Examples

## What changes were proposed in this pull request?

Improvements made to the SparkHive Scala examples:
* Writing a DataFrame / Dataset to Hive managed and Hive external tables using 
different storage formats.
* Implementation of partitioning, repartition, and coalesce, each with an 
appropriate example (see the sketch after this list).
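As a reference for the bullets above, here is an editor's sketch (not the PR's 
actual code) of the three patterns, assuming a Hive-enabled SparkSession 
`spark` and a DataFrame `df` with hypothetical columns `key` and `value`:

```scala
// Managed table in a non-default storage format:
df.write.format("orc").saveAsTable("managed_orc_tbl")

// External table: write files to a location, then declare a table over it.
df.write.format("parquet").save("hdfs:///tmp/external_parquet")
spark.sql(
  """CREATE EXTERNAL TABLE ext_tbl (key INT, value STRING)
    |STORED AS PARQUET
    |LOCATION 'hdfs:///tmp/external_parquet'""".stripMargin)

// Controlling the number of output files:
df.repartition(8).write.parquet("/tmp/eight_files") // full shuffle into 8 partitions
df.coalesce(1).write.parquet("/tmp/one_file")       // merge down without a full shuffle
```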

## How was this patch tested?
* The patch has been tested manually and by running ./dev/run-tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chetkhatri/spark scala-sparkhive-examples

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20018


commit 9d9b42bb49997ce7d308fbf50072e5f5e0eccaa2
Author: chetkhatri <ckhatriman...@gmail.com>
Date:   2017-12-19T11:33:47Z

SPARK-22833 [Improvement] in SparkHive Scala Examples




---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-18 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
+ @holdenk @sameeragarwal 


---




[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...

2017-12-18 Thread chetkhatri
Github user chetkhatri commented on the issue:

https://github.com/apache/spark/pull/20016
  
@HyukjinKwon I agree with you! But since Spark is a Scala project, a lot of 
developers refer to the examples available here, and if they are Java 
developers they might not realize that this is the right way to do it in 
Scala! I had the same discussion with developers at Scala Days, and I think it 
makes sense here too.


---




[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...

2017-12-18 Thread chetkhatri
GitHub user chetkhatri opened a pull request:

https://github.com/apache/spark/pull/20016

SPARK-22830 Scala Coding style has been improved in Spark Examples

## What changes were proposed in this pull request?

* In the Spark Scala examples, some of the code was written the Java way; it 
has been rewritten per the Scala style guide.
* Most of the changes affect println() statements (see the sketch below).
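To make the intent concrete, a one-line sketch of the kind of rewrite this PR 
applies (the values are illustrative):

```scala
val i = 3                   // hypothetical loop variable, for illustration
println("Iteration " + i)   // before: Java-style concatenation
println(s"Iteration $i")    // after: Scala string interpolation
```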

## How was this patch tested?

Since all the proposed changes rewrite println statements in the Scala way, 
manual runs were used to verify the printed output.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chetkhatri/spark scala-style-spark-examples

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20016.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20016


commit 4ac1cb1c2aa6f72eee339e8b8b647647e879d91f
Author: chetkhatri <ckhatriman...@gmail.com>
Date:   2017-12-19T07:17:37Z

SPARK-22830 Scala Coding style has been improved in Spark Examples




---
