[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20070 @srowen Requesting your review when you get a chance. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r159192570 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala --- @@ -45,7 +45,7 @@ object QuantileDiscretizerExample { .setNumBuckets(3) val result = discretizer.fit(df).transform(df) -result.show() +result.show(false) --- End diff -- @srowen Correct, it works either way; see e.g. examples/ml/LDAExamples.scala.
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r159152519 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/LatentDirichletAllocationExample.scala --- @@ -46,7 +46,10 @@ object LatentDirichletAllocationExample { val topics = ldaModel.topicsMatrix for (topic <- Range(0, 3)) { print(s"Topic $topic :") - for (word <- Range(0, ldaModel.vocabSize)) { print(s" ${topics(word, topic)}") } + for (word <- Range(0, ldaModel.vocabSize)) + { --- End diff -- @srowen Sure, done.
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r159134529 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/LatentDirichletAllocationExample.scala --- @@ -42,11 +42,11 @@ object LatentDirichletAllocationExample { val ldaModel = new LDA().setK(3).run(corpus) // Output topics. Each is a distribution over words (matching word count vectors) -println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):") +println(s"Learned topics (as distributions over vocab of ${ldaModel.vocabSize} words):") val topics = ldaModel.topicsMatrix for (topic <- Range(0, 3)) { - print("Topic " + topic + ":") - for (word <- Range(0, ldaModel.vocabSize)) { print(" " + topics(word, topic)); } + print(s"Topic $topic :") + for (word <- Range(0, ldaModel.vocabSize)) { print(s" ${topics(word, topic)}") } --- End diff -- @srowen Thanks for the suggestion; it has been addressed.
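The hunk above replaces `+` concatenation with Scala's `s` interpolator. As a minimal, self-contained illustration of the two styles (plain Scala, no Spark required; the vocab size is a made-up value):

```scala
object InterpolationDemo extends App {
  val vocabSize = 1000

  // Old style: concatenation with +
  val oldStyle = "Learned topics (as distributions over vocab of " + vocabSize + " words):"

  // New style: the s interpolator; $name for simple values, ${expr} for expressions
  val newStyle = s"Learned topics (as distributions over vocab of $vocabSize words):"

  assert(oldStyle == newStyle)
  println(newStyle)
}
```

Both forms produce the same string; the interpolated form is simply easier to read, which is the style the examples are being converged on.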
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r159134489 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/HypothesisTestingExample.scala --- @@ -68,7 +68,7 @@ object HypothesisTestingExample { // against the label. val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs) featureTestResults.zipWithIndex.foreach { case (k, v) => - println("Column " + (v + 1).toString + ":") + println(s"Column ${(v + 1).toString} :") --- End diff -- @srowen Thanks, changes addressed.
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r159134414 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala --- @@ -45,7 +45,7 @@ object QuantileDiscretizerExample { .setNumBuckets(3) val result = discretizer.fit(df).transform(df) -result.show() +result.show(false) --- End diff -- We're following the same style in the other examples, so it is good to do.
[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20070 @srowen Okay, the current status looks good.
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r159084968 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/VectorIndexerExample.scala --- @@ -41,8 +41,8 @@ object VectorIndexerExample { val indexerModel = indexer.fit(data) val categoricalFeatures: Set[Int] = indexerModel.categoryMaps.keys.toSet -println(s"Chose ${categoricalFeatures.size} categorical features: " + - categoricalFeatures.mkString(", ")) +println(s"Chose ${categoricalFeatures.size} " + + s"categorical features: {$categoricalFeatures.mkString(", ")}") --- End diff -- I have fixed this. Could you give me a checklist of steps to test before committing?
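The second added line of the hunk above, `s"categorical features: {$categoricalFeatures.mkString(", ")}"`, contains an interpolation bug worth calling out: only `$categoricalFeatures` is interpolated, the braces are literal text, and the quotes inside `mkString` break the string literal. The method call has to sit inside `${...}`. A small illustrative sketch (using a hypothetical `List` of feature indices in place of the real `Set`):

```scala
object InterpolationFix extends App {
  val categoricalFeatures = List(0, 3, 7)

  // Correct form: the whole expression, including the method call, goes inside ${...}
  val msg = s"Chose ${categoricalFeatures.size} categorical features: " +
    s"${categoricalFeatures.mkString(", ")}"

  println(msg)
  assert(msg == "Chose 3 categorical features: 0, 3, 7")
}
```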
[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20070 @srowen Please re-run the build.
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r158978102 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/AssociationRulesExample.scala --- @@ -42,14 +42,13 @@ object AssociationRulesExample { val results = ar.run(freqItemsets) results.collect().foreach { rule => - println("[" + rule.antecedent.mkString(",") -+ "=>" -+ rule.consequent.mkString(",") + "]," + rule.confidence) + println(s"[${rule.antecedent.mkString(",")}=>${rule.consequent.mkString(",")} ]" + +s" ${rule.confidence}") } // $example off$ sc.stop() } } -// scalastyle:on println +// scalastyle:on println --- End diff -- Done, addressed.
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r158978648 --- Diff: examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala --- @@ -82,9 +82,9 @@ class CustomReceiver(host: String, port: Int) var socket: Socket = null var userInput: String = null try { - logInfo("Connecting to " + host + ":" + port) + logInfo(s"Connecting to $host $port") --- End diff -- Done, addressed.
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r158976171 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/DeveloperApiExample.scala --- @@ -169,10 +169,10 @@ private class MyLogisticRegressionModel( Vectors.dense(-margin, margin) } - /** Number of classes the label can take. 2 indicates binary classification. */ + // Number of classes the label can take. 2 indicates binary classification. --- End diff -- +1
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r158976219 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala --- @@ -31,12 +31,11 @@ object QuantileDiscretizerExample { // $example on$ val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)) -val df = spark.createDataFrame(data).toDF("id", "hour") +val df = spark.createDataFrame(data).toDF("id", "hour").repartition(1) --- End diff -- ok
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r158976155 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/CorrelationExample.scala --- @@ -51,10 +51,10 @@ object CorrelationExample { val df = data.map(Tuple1.apply).toDF("features") val Row(coeff1: Matrix) = Correlation.corr(df, "features").head -println("Pearson correlation matrix:\n" + coeff1.toString) +println(s"Pearson correlation matrix:\n ${coeff1.toString}") --- End diff -- Addressed.
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r158975980 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ChiSquareTestExample.scala --- @@ -52,9 +52,9 @@ object ChiSquareTestExample { val df = data.toDF("label", "features") val chi = ChiSquareTest.test(df, "features", "label").head -println("pValues = " + chi.getAs[Vector](0)) -println("degreesOfFreedom = " + chi.getSeq[Int](1).mkString("[", ",", "]")) -println("statistics = " + chi.getAs[Vector](2)) +println(s"pValues = ${chi.getAs[Vector](0)}") --- End diff -- Ok
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20070#discussion_r158974779 --- Diff: examples/src/main/scala/org/apache/spark/examples/graphx/Analytics.scala --- @@ -145,9 +145,9 @@ object Analytics extends Logging { // TriangleCount requires the graph to be partitioned .partitionBy(partitionStrategy.getOrElse(RandomVertexCut)).cache() val triangles = TriangleCount.run(graph) -println("Triangles: " + triangles.vertices.map { +println(s"Triangles: ${triangles.vertices.map { --- End diff -- sure
[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20070 @srowen I rechecked all the Scala examples; this is the cumulative PR for them.
[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20070 You're correct, I missed the other packages. I will re-confirm soon. Thanks.
[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20070 In Scala? I don't think so. I am iterating through them again and double-checking.
[GitHub] spark pull request #20071: SPARK-22896 Improvement in String interpolation |...
Github user chetkhatri closed the pull request at: https://github.com/apache/spark/pull/20071
[GitHub] spark issue #20071: SPARK-22896 Improvement in String interpolation | Graphx
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20071 @srowen Thanks, addressed; the changes are done.
[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20070 @srowen Also, I merged another similar GraphX PR into this one, so just FYI, we are good.
[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20070 @srowen Absolutely correct, this is all in one shot. I made the changes in all of them.
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20081 @cloud-fan @srowen I am good with the changes proposed; please merge.
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20081 @cloud-fan spark.sql.files.maxRecordsPerFile didn't work out when I was working with my 30 TB Spark Hive workload, whereas repartition and coalesce made sense.
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20081 @cloud-fan Thanks for the PR. 4. spark.sql.parquet.writeLegacyFormat: if you don't use this configuration, the Hive external table won't be able to read the Parquet data. 5. repartition and coalesce are the most common way in industry to control the number of files under a directory when partitioning data; i.e. if the data volume is very large, every partition would otherwise contain many small files, which can hurt downstream query performance due to file, bandwidth, network, and disk I/O. Otherwise, I am good with your approach.
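The repartition/coalesce trade-off described above can be sketched as follows. This is an illustrative snippet, not code from the PR; it assumes a live `SparkSession` named `spark`, an existing `records` table, and a writable `outputPath`, so it is not runnable standalone:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionFileCountSketch {
  def write(spark: SparkSession, outputPath: String): Unit = {
    import spark.implicits._
    val df = spark.sql("SELECT key, value FROM records")

    // repartition by the partition column: a full shuffle that yields roughly
    // one output file per partition-column value
    df.repartition($"key")
      .write.mode(SaveMode.Overwrite)
      .partitionBy("key")
      .parquet(outputPath)

    // coalesce(n): merges existing partitions down to n without a full shuffle;
    // cheaper, but it cannot increase parallelism and file sizes may be skewed
    df.coalesce(10)
      .write.mode(SaveMode.Overwrite)
      .partitionBy("key")
      .parquet(outputPath)
  }
}
```

In practice you would use one of the two writes, not both; `spark.sql.files.maxRecordsPerFile`, mentioned earlier in the thread, is the configuration-based alternative for bounding output file sizes.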
[GitHub] spark pull request #20071: SPARK-22896 Improvement in String interpolation |...
GitHub user chetkhatri opened a pull request: https://github.com/apache/spark/pull/20071 SPARK-22896 Improvement in String interpolation | Graphx ## What changes were proposed in this pull request? * String interpolation corrected to Scala style. ## How was this patch tested? * Manually tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chetkhatri/spark graphx-contrib Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20071.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20071 commit 9916fd1f67234b1fa5608231181bdf3b08718981 Author: chetkhatri <ckhatrimanjal@...> Date: 2017-12-24T08:33:49Z SPARK-22896 Improvement in String interpolation commit 162ac276cfb5aa3215e3e0bdf2723f3e7aacf7d5 Author: chetkhatri <ckhatrimanjal@...> Date: 2017-12-24T08:37:25Z SPARK-22896 Improvement in String interpolation - fixed typo commit aa2de00b62f920c8691c81a085402533f76c036d Author: chetkhatri <ckhatrimanjal@...> Date: 2017-12-24T08:54:43Z Merge branch 'master' of https://github.com/apache/spark into mllib-chetan-contrib commit 8186a34178b108f71bea4f7b21080a2b527b445e Author: chetkhatri <ckhatrimanjal@...> Date: 2017-12-24T11:25:56Z SPARK-22896 Improvement in String interpolation
[GitHub] spark pull request #20070: SPARK-22896 Improvement in String interpolation
GitHub user chetkhatri opened a pull request: https://github.com/apache/spark/pull/20070 SPARK-22896 Improvement in String interpolation ## What changes were proposed in this pull request? * String interpolation in the ML pipeline examples has been corrected per the Scala standard. ## How was this patch tested? * Manually tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chetkhatri/spark mllib-chetan-contrib Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20070.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20070 commit 9916fd1f67234b1fa5608231181bdf3b08718981 Author: chetkhatri <ckhatrimanjal@...> Date: 2017-12-24T08:33:49Z SPARK-22896 Improvement in String interpolation
[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20018 Thanks @HyukjinKwon @wangyum
[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20018 @srowen Apologies, I was not aware that PMC members get auto-notifications for the same.
[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20018 @HyukjinKwon @srowen Kindly review now; if it looks good, please merge. Thanks.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454282 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,41 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +// Create Hive managed table with parquet +sql("CREATE TABLE records(key int, value string) STORED AS PARQUET") +// Save DataFrame to Hive Managed table as Parquet format +val hiveTableDF = sql("SELECT * FROM records") + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") +// Create External Hive table with parquet +sql("CREATE EXTERNAL TABLE records(key int, value string) " + + "STORED AS PARQUET LOCATION '/user/hive/warehouse/'") +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") + +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records" + +// Save DataFrame to Hive External table as compatible parquet format + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning +spark.sqlContext.setConf("hive.exec.dynamic.partition", "true") +spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") + +// You can create partitions in Hive table, so downstream queries run much faster. +hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key") + .parquet(hiveExternalTableLocation) + +// reduce number of files for each partition by repartition --- End diff -- @HyukjinKwon Thanks for highlighting this; improved.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454291 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,41 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +// Create Hive managed table with parquet +sql("CREATE TABLE records(key int, value string) STORED AS PARQUET") +// Save DataFrame to Hive Managed table as Parquet format +val hiveTableDF = sql("SELECT * FROM records") + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") +// Create External Hive table with parquet +sql("CREATE EXTERNAL TABLE records(key int, value string) " + + "STORED AS PARQUET LOCATION '/user/hive/warehouse/'") +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") + +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records" + +// Save DataFrame to Hive External table as compatible parquet format + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning +spark.sqlContext.setConf("hive.exec.dynamic.partition", "true") +spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") + +// You can create partitions in Hive table, so downstream queries run much faster. +hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key") + .parquet(hiveExternalTableLocation) + +// reduce number of files for each partition by repartition +hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite) + .partitionBy("key").parquet(hiveExternalTableLocation) + +// Control number of files in each partition by coalesce --- End diff -- @HyukjinKwon Thanks for highlighting this; improved.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454275 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,41 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +// Create Hive managed table with parquet +sql("CREATE TABLE records(key int, value string) STORED AS PARQUET") +// Save DataFrame to Hive Managed table as Parquet format +val hiveTableDF = sql("SELECT * FROM records") + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") +// Create External Hive table with parquet +sql("CREATE EXTERNAL TABLE records(key int, value string) " + + "STORED AS PARQUET LOCATION '/user/hive/warehouse/'") +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") + +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records" + +// Save DataFrame to Hive External table as compatible parquet format + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning --- End diff -- @HyukjinKwon Thanks for highlighting this; improved.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454240 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,41 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +// Create Hive managed table with parquet +sql("CREATE TABLE records(key int, value string) STORED AS PARQUET") +// Save DataFrame to Hive Managed table as Parquet format +val hiveTableDF = sql("SELECT * FROM records") + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") +// Create External Hive table with parquet +sql("CREATE EXTERNAL TABLE records(key int, value string) " + + "STORED AS PARQUET LOCATION '/user/hive/warehouse/'") +// to make Hive parquet format compatible with spark parquet format --- End diff -- @HyukjinKwon Thanks for highlighting this; improved.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454252 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,41 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +// Create Hive managed table with parquet +sql("CREATE TABLE records(key int, value string) STORED AS PARQUET") +// Save DataFrame to Hive Managed table as Parquet format +val hiveTableDF = sql("SELECT * FROM records") + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") +// Create External Hive table with parquet +sql("CREATE EXTERNAL TABLE records(key int, value string) " + + "STORED AS PARQUET LOCATION '/user/hive/warehouse/'") +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") + +// Multiple parquet files could be created accordingly to volume of data under directory given. --- End diff -- @HyukjinKwon Thanks for highlighting this; improved.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454265 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,41 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +// Create Hive managed table with parquet +sql("CREATE TABLE records(key int, value string) STORED AS PARQUET") +// Save DataFrame to Hive Managed table as Parquet format +val hiveTableDF = sql("SELECT * FROM records") + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") +// Create External Hive table with parquet +sql("CREATE EXTERNAL TABLE records(key int, value string) " + + "STORED AS PARQUET LOCATION '/user/hive/warehouse/'") +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") + +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records" + +// Save DataFrame to Hive External table as compatible parquet format --- End diff -- @HyukjinKwon Thanks for highlighting this; improved.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454218 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,41 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +// Create Hive managed table with parquet --- End diff -- @HyukjinKwon Thanks for highlighting this; improved.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454228 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,41 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +// Create Hive managed table with parquet +sql("CREATE TABLE records(key int, value string) STORED AS PARQUET") +// Save DataFrame to Hive Managed table as Parquet format --- End diff -- @HyukjinKwon Thanks for highlighting this; improved.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158370581 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") + +/* + * Save DataFrame to Hive External table as compatible parquet format. + * 1. Create Hive External table with storage format as parquet. + * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; + * Since we are not explicitly providing hive database location, it automatically takes default warehouse location + * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport(). + * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories + * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'. 
+ */ + +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records" + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning +spark.sqlContext.setConf("hive.exec.dynamic.partition", "true") +spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") +// You can create partitions in Hive table, so downstream queries run much faster. +hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key") + .parquet(hiveExternalTableLocation) +/* +If Data volume is very huge, then every partitions would have many small-small files which may harm +downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O. +To improve performance you can create single parquet file under each partition directory using 'repartition' +on partitioned key for Hive table. When you add partition to table, there will be change in table DDL. +Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET; + */ +hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite) + .partitionBy("key").parquet(hiveExternalTableLocation) + +/* + You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal + data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without --- End diff -- @srowen done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158370509 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") + +/* + * Save DataFrame to Hive External table as compatible parquet format. + * 1. Create Hive External table with storage format as parquet. + * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; + * Since we are not explicitly providing hive database location, it automatically takes default warehouse location + * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport(). + * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories + * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'. 
+ */ + +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records" + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning +spark.sqlContext.setConf("hive.exec.dynamic.partition", "true") +spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") +// You can create partitions in Hive table, so downstream queries run much faster. +hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key") + .parquet(hiveExternalTableLocation) +/* +If Data volume is very huge, then every partitions would have many small-small files which may harm +downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O. +To improve performance you can create single parquet file under each partition directory using 'repartition' +on partitioned key for Hive table. When you add partition to table, there will be change in table DDL. +Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET; + */ +hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite) --- End diff -- @cloud-fan Removed all comments; as discussed with @srowen, it really does make sense to have them in the docs, with the inconsistency removed. ---
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158370168 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") + +/* + * Save DataFrame to Hive External table as compatible parquet format. + * 1. Create Hive External table with storage format as parquet. + * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; + * Since we are not explicitly providing hive database location, it automatically takes default warehouse location + * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport(). + * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories + * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'. 
+ */ + +// to make Hive parquet format compatible with spark parquet format +spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") +// Multiple parquet files could be created accordingly to volume of data under directory given. +val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records" + hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation) + +// turn on flag for Dynamic Partitioning +spark.sqlContext.setConf("hive.exec.dynamic.partition", "true") +spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") +// You can create partitions in Hive table, so downstream queries run much faster. +hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key") + .parquet(hiveExternalTableLocation) +/* +If Data volume is very huge, then every partitions would have many small-small files which may harm --- End diff -- @srowen I totally agree with you. I will rephrase the content for the docs; I have removed it from here for now. Please check and let me know if anything else is needed. ---
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158368719 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() + hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records") + +/* + * Save DataFrame to Hive External table as compatible parquet format. + * 1. Create Hive External table with storage format as parquet. + * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET; --- End diff -- @cloud-fan We'll keep the full comment descriptions in the documentation, written in user-friendly language. I have also added the location. ---
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158368554 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* + * Save DataFrame to Hive Managed table as Parquet format + * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default + * warehouse location will be used to store Hive table Data. + * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path; + * You don't have to explicitly give location for each table, every tables under specified schema will be located at + * location given while creating schema. + * 2. Create Hive Managed table with storage format as 'Parquet' + * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET; + */ +val hiveTableDF = sql("SELECT * FROM records").toDF() --- End diff -- @srowen Done. cc @cloud-fan: removed toDF(). ---
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158366994 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -102,8 +101,63 @@ object SparkHiveExample { // | 4| val_4| 4| val_4| // | 5| val_5| 5| val_5| // ... -// $example off:spark_hive$ +/* --- End diff -- @srowen Done; changes addressed. ---
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @srowen I think we can merge this now. ---
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158113948 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -104,6 +103,60 @@ object SparkHiveExample { // ... // $example off:spark_hive$ --- End diff -- @srowen I misunderstood your first comment and have reverted it as suggested. Please check now. ---
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @srowen Thank you for the re-run; all tests pass now. ---
[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20018 Adding other contributors to the same file for review. cc @cloud-fan @aokolnychyi @liancheng @HyukjinKwon ---
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157941778 --- Diff: examples/src/main/scala/org/apache/spark/examples/HdfsTest.scala --- @@ -39,7 +39,7 @@ object HdfsTest { val start = System.currentTimeMillis() for (x <- mapped) { x + 2 } val end = System.currentTimeMillis() - println("Iteration " + iter + " took " + (end-start) + " ms") + println(s"Iteration ${iter} took ${(end-start)} ms") --- End diff -- @HyukjinKwon `$end-start` won't work here, since `end` and `start` are two separate variables; the expression has to be wrapped in braces as `${end - start}`. I made the changes. ---
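The point under discussion: `$name` in a Scala interpolated string only captures a single identifier, so an expression like `end - start` needs the `${ ... }` form. A minimal plain-Scala sketch (variable values are made up for illustration):

```scala
// $identifier interpolates one variable; expressions need ${ ... }.
val start = 100L
val end = 350L

// Without braces, s"took $end-start ms" would interpolate only `end` and
// leave "-start" as literal text. With braces the subtraction is evaluated.
val msg = s"Iteration took ${end - start} ms"
println(msg)  // prints: Iteration took 250 ms
```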
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157941813 --- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala --- @@ -100,7 +100,7 @@ object SparkALS { ITERATIONS = iters.getOrElse("5").toInt slices = slices_.getOrElse("2").toInt case _ => -System.err.println("Usage: SparkALS [M] [U] [F] [iters] [partitions]") +System.err.println(s"Usage: SparkALS [M] [U] [F] [iters] [partitions]") --- End diff -- @HyukjinKwon Addressed. Kindly review. ---
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157816044 --- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala --- @@ -95,7 +95,7 @@ object LocalALS { def showWarning() { System.err.println( - """WARN: This is a naive implementation of ALS and is given as an example! + s"""WARN: This is a naive implementation of ALS and is given as an example! --- End diff -- @mgaido91 Thank you for the feedback; the change has been addressed. ---
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157794109 --- Diff: examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala --- @@ -127,11 +125,11 @@ object DFSReadWriteTest { spark.stop() if (localWordCount == dfsWordCount) { - println(s"Success! Local Word Count ($localWordCount) " + -s"and DFS Word Count ($dfsWordCount) agree.") + println(s"Success! Local Word Count ($localWordCount) +and DFS Word Count ($dfsWordCount) agree.") --- End diff -- @srowen Thanks for the review; I have addressed the changes. Please take another look. ---
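The diff above breaks a single `s"..."` literal across source lines, which embeds the newline and indentation in the printed output. Two idiomatic alternatives, sketched in plain Scala with made-up word counts:

```scala
val localWordCount = 42
val dfsWordCount = 42

// Option 1: concatenate two single-line interpolated strings
// (the pre-change form); output stays on one line.
println(s"Success! Local Word Count ($localWordCount) " +
  s"and DFS Word Count ($dfsWordCount) agree.")

// Option 2: a triple-quoted string with stripMargin keeps the source
// readable without leaking indentation; note it prints on two lines.
println(
  s"""Success! Local Word Count ($localWordCount)
     |and DFS Word Count ($dfsWordCount) agree.""".stripMargin)
```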
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157941808 --- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala --- @@ -80,7 +80,7 @@ object SparkALS { def showWarning() { System.err.println( - """WARN: This is a naive implementation of ALS and is given as an example! + s"""WARN: This is a naive implementation of ALS and is given as an example! --- End diff -- @HyukjinKwon Addressed. Kindly review. ---
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157816026 --- Diff: examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala --- @@ -49,12 +49,10 @@ object DFSReadWriteTest { } private def printUsage(): Unit = { -val usage: String = "DFS Read-Write Test\n" + -"\n" + -"Usage: localFile dfsDir\n" + -"\n" + -"localFile - (string) local file to use in test\n" + -"dfsDir - (string) DFS directory for read/write tests\n" +val usage = s"""DFS Read-Write Test --- End diff -- @mgaido91 Thank you for the feedback; the change has been addressed. ---
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157794196 --- Diff: examples/src/main/scala/org/apache/spark/examples/LocalFileLR.scala --- @@ -58,10 +58,10 @@ object LocalFileLR { // Initialize w to a random value val w = DenseVector.fill(D) {2 * rand.nextDouble - 1} -println("Initial w: " + w) +println(s"Initial w: ${w}") --- End diff -- @srowen Thanks for the review; I have addressed the changes. Please take another look. ---
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @srowen Thanks for the response; correct. I went through the Jenkins error log online [https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4018/console], fixed the error accordingly, and committed, so please take a look; if it is still not correct, please suggest a fix. Thank you ---
[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20018 @srowen Could you please review, and if everything looks correct, run the test build? ---
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @srowen Why is only the most recent commit being merged? Can't we do a "squash merge"? Please re-run the test build and let me know if anything still seems wrong. ---
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @HyukjinKwon @mgaido91 @srowen All the changes have been addressed and committed; please review. ---
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r157973588 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -104,6 +103,60 @@ object SparkHiveExample { // ... // $example off:spark_hive$ --- End diff -- @srowen I have updated the DDL for storing data with partitioning in Hive. cc @HyukjinKwon @mgaido91 @markgrover @markhamstra ---
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r157942866 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -104,6 +103,60 @@ object SparkHiveExample { // ... // $example off:spark_hive$ --- End diff -- @srowen Can you please review this? cc @holdenk @sameeragarwal ---
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r157796580 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala --- @@ -104,6 +103,60 @@ object SparkHiveExample { // ... // $example off:spark_hive$ --- End diff -- @srowen Thank you for the valuable review feedback; I have added that so it can help other developers. ---
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157793283 --- Diff: examples/src/main/scala/org/apache/spark/examples/DFSReadWriteTest.scala --- @@ -97,22 +95,22 @@ object DFSReadWriteTest { def main(args: Array[String]): Unit = { parseArgs(args) -println("Performing local word count") +println(s"Performing local word count") --- End diff -- @srowen Thanks for the review; I have addressed the changes. Please take another look. ---
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20016#discussion_r157792884 --- Diff: examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala --- @@ -42,7 +42,7 @@ object BroadcastTest { val arr1 = (0 until num).toArray for (i <- 0 until 3) { - println("Iteration " + i) + println(s"Iteration ${i}") --- End diff -- @markhamstra Thank you for the valuable suggestion; I have addressed it in a new commit. ---
[GitHub] spark issue #20018: SPARK-22833 [Improvement] in SparkHive Scala Examples
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20018 @holdenk @sameeragarwal Please review when you get a chance. ---
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
GitHub user chetkhatri opened a pull request: https://github.com/apache/spark/pull/20018 SPARK-22833 [Improvement] in SparkHive Scala Examples ## What changes were proposed in this pull request? Improvements made to the SparkHive Scala examples: * Writing a DataFrame / Dataset to Hive managed and Hive external tables using different storage formats. * Examples of partitionBy, repartition, and coalesce where appropriate. ## How was this patch tested? * The patch has been tested manually and by running ./dev/run-tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chetkhatri/spark scala-sparkhive-examples Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20018.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20018 commit 9d9b42bb49997ce7d308fbf50072e5f5e0eccaa2 Author: chetkhatri <ckhatriman...@gmail.com> Date: 2017-12-19T11:33:47Z SPARK-22833 [Improvement] in SparkHive Scala Examples ---
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 + @holdenk @sameeragarwal --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20016: SPARK-22830 Scala Coding style has been improved in Spar...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20016 @HyukjinKwon I agree with you! But since Spark is a Scala project, many developers refer to the examples available here, and if they are Java developers they might not realize that this is the idiomatic way to do it in Scala. I had the same discussion with developers at Scala Days, and I think it applies here too. ---
[GitHub] spark pull request #20016: SPARK-22830 Scala Coding style has been improved ...
GitHub user chetkhatri opened a pull request: https://github.com/apache/spark/pull/20016 SPARK-22830 Scala Coding style has been improved in Spark Examples ## What changes were proposed in this pull request? * Under the Spark Scala examples, some of the code was written in a Java style; it has been rewritten per the Scala style guide. * Most of the changes concern println() statements. ## How was this patch tested? Since all proposed changes rewrite println statements in the Scala way, a manual run was used to test the println output. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chetkhatri/spark scala-style-spark-examples Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20016.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20016 commit 4ac1cb1c2aa6f72eee339e8b8b647647e879d91f Author: chetkhatri <ckhatriman...@gmail.com> Date: 2017-12-19T07:17:37Z SPARK-22830 Scala Coding style has been improved in Spark Examples ---
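The before/after style this PR targets, sketched in plain Scala; the variable names and values here are hypothetical, not taken from any specific example file:

```scala
val iter = 3
val elapsedMs = 250

// Java-style concatenation (the form the PR replaces):
println("Iteration " + iter + " took " + elapsedMs + " ms")

// Scala-style s-interpolation (the form the PR introduces);
// both lines print the same text.
println(s"Iteration $iter took $elapsedMs ms")
```

Note that per the review comments above, the interpolator buys nothing for literals with no embedded variables, so `s"..."` on a constant string is itself a style smell.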