[GitHub] spark pull request #20134: [SPARK-22613][SQL] Make UNCACHE TABLE behaviour c...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/20134 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/20947
[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/22171
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/22171 @viirya , The current issue occurs only in the case of 0 values; non-zero values with higher scale are still saved in non-scientific notation.
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/22171 @gatorsmile @HyukjinKwon @viirya , I rechecked the customer scenario. The dataframe is saved as a CSV file, and Netezza then loads the CSV data into a Netezza table. In the CSV output, 0 values with scale higher than 6 are stored in scientific notation, and due to this [limitation](http://www-01.ibm.com/support/docview.wss?crawler=1=swg21570795) of Netezza, it fails to load the data. If the 0 values in the CSV are in non-scientific notation, Netezza loads the data.
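The CSV behaviour described above comes straight from `java.math.BigDecimal.toString()`, which switches to scientific notation once the adjusted exponent drops below -6 — i.e. exactly for a zero with scale greater than 6, while `toPlainString()` does not. A minimal Java demonstration (class name is illustrative):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalZeroDemo {
    public static void main(String[] args) {
        // Zero with scale 6: adjusted exponent is -6, so plain notation is used.
        BigDecimal zeroScale6 = new BigDecimal(BigInteger.ZERO, 6);
        System.out.println(zeroScale6);                 // 0.000000

        // Zero with scale 7: adjusted exponent is -7 (< -6), so toString()
        // switches to scientific notation, which the Netezza CSV loader rejects.
        BigDecimal zeroScale7 = new BigDecimal(BigInteger.ZERO, 7);
        System.out.println(zeroScale7);                 // 0E-7

        // toPlainString() always yields the non-scientific form.
        System.out.println(zeroScale7.toPlainString()); // 0.0000000
    }
}
```

This is the boundary the PR's `_scale > 6` check targets.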
[GitHub] spark issue #22307: [SPARK-25301][SQL] When a view uses an UDF from a non de...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/22307 @HyukjinKwon , I'll close this PR
[GitHub] spark pull request #22307: [SPARK-25301][SQL] When a view uses an UDF from a...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/22307
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/22171 retest this please
[GitHub] spark issue #22307: [SPARK-25301][SQL] When a view uses an UDF from a non de...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/22307 @HyukjinKwon , even with this
```
create function d100.udf100 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper';
```
we can simulate this issue. I've updated the PR description.
[GitHub] spark pull request #22307: [SPARK-25301][SQL] When a view uses an UDF from a...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/22307 [SPARK-25301][SQL] When a view uses a UDF from a non-default database, the Spark analyser throws AnalysisException

## What changes were proposed in this pull request?

When a Hive view uses a UDF from a non-default database, the Spark analyser throws an AnalysisException.

Steps to simulate this issue in Hive:
1) CREATE DATABASE d100;
2) ADD JAR /usr/udf/masking.jar  // masking.jar has a custom UDF class 'com.uzx.udf.Masking'
3) create function d100.udf100 as "com.uzx.udf.Masking";  // Note: udf100 is created in d100
4) create view d100.v100 as select d100.udf100(name) from default.emp;  // Note: table default.emp has two columns 'name', 'address'
5) select * from d100.v100;  // query on view d100.v100 gives the correct result

In Spark:
1) spark.sql("select * from d100.v100").show throws
```
org.apache.spark.sql.AnalysisException: Undefined function: 'd100.udf100'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'
```
This is because, while parsing the SQL statement of the view
```
select `d100.udf100`(`emp`.`name`) from `default`.`emp`
```
the Spark parser fails to split the database name and the UDF name, so the Spark function registry tries to load the UDF 'd100.udf100' from the 'default' database. To solve this issue, before creating the 'FunctionIdentifier', get the actual database name and then create the FunctionIdentifier using that database name and the function name.

## How was this patch tested?
Added 1 unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_fix_view_with_udf_issue Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22307.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22307 commit 60cc1c9c66dade490dc0501622f8ac6b554b7ff4 Author: Vinod KC Date: 2018-08-31T16:57:00Z fix issue with non default udf in hive view
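The heart of the fix described above is splitting the qualified name before building the FunctionIdentifier. A hedged sketch of that split in plain Java — the helper name `splitQualifiedFunctionName` and the class are illustrative, not Spark's actual code:

```java
public class FunctionNameSplit {
    /**
     * Splits "db.func" into {db, func}; a bare "func" yields {null, func}.
     * Illustrative only; Spark resolves names through its catalog instead.
     */
    static String[] splitQualifiedFunctionName(String name) {
        int dot = name.indexOf('.');
        if (dot < 0) {
            return new String[] {null, name};
        }
        return new String[] {name.substring(0, dot), name.substring(dot + 1)};
    }

    public static void main(String[] args) {
        String[] parts = splitQualifiedFunctionName("d100.udf100");
        // Without such a split, the whole string "d100.udf100" is looked up
        // as a function name in the 'default' database, which fails.
        System.out.println(parts[0] + " / " + parts[1]); // d100 / udf100
    }
}
```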
[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/22171#discussion_r212251852
--- Diff: sql/core/src/test/resources/sql-tests/results/literals.sql.out ---
```
@@ -197,7 +197,7 @@ select .e3
 -- !query 20
 select 1E309, -1E309
 -- !query 20 schema
-struct<1E+309:decimal(1,-309),-1E+309:decimal(1,-309)>
+struct<10:decimal(1,-309),-10:decimal(1,-309)>
```
--- End diff --
@viirya This schema is auto-generated. The actual issue is only with the 0 value when the scale is higher than 6. If we need to reduce the scope of impact, can we add this condition?
```
override def toString: String = if (decimalVal == 0 && _scale > 6) {
  toBigDecimal.bigDecimal.toPlainString()
} else {
  toBigDecimal.toString()
}
```
[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/22171#discussion_r212243480
--- Diff: sql/core/src/test/resources/sql-tests/results/literals.sql.out ---
```
@@ -197,7 +197,7 @@ select .e3
 -- !query 20
 select 1E309, -1E309
 -- !query 20 schema
-struct<1E+309:decimal(1,-309),-1E+309:decimal(1,-309)>
+struct<10:decimal(1,-309),-10:decimal(1,-309)>
```
--- End diff --
Result in PostgreSQL:
```
CREATE TABLE TestdecBig (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8), d DECIMAL(310,309));
INSERT INTO TestdecBig VALUES (1,1,1,1);
INSERT INTO TestdecBig VALUES (0,0,0,0);

select * from TestdecBig;
     a     |    b     |     c      |   d
-----------+----------+------------+--------
 1.0000000 | 1.000000 | 1.00000000 | 1.000…
 0.0000000 | 0.000000 | 0.00000000 | 0.000…
(2 rows)
```
(the d column, with scale 309, is shown truncated)
[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/22171#discussion_r212243149 --- Diff: sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out --- @@ -201,6 +201,7 @@ struct<> -- !query 20 output + --- End diff -- Golden file generator automatically added this new line
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/22171 @viirya, This issue is not only related to Dataset.show but also to dataset write operations. External databases like Netezza fail to save the result due to the scientific notation on "0" values.
[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/22171 [SPARK-25177][SQL] When a dataframe decimal type column has scale higher than 6, 0 values are shown in scientific notation

## What changes were proposed in this pull request?

If the scale of a decimal type is > 6, a 0 value will be shown in scientific notation, and hence, when the dataframe output is saved to an external database, the save fails due to the scientific notation on "0" values. In java.math.BigDecimal, if the scale is > 6, 0 is shown in scientific notation. In PostgreSQL, a 0 decimal value is shown in non-scientific notation (a plain string); this PR makes the Spark SQL result consistent with PostgreSQL.

## How was this patch tested?

Added 2 unit tests

You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_fix_precision_zero Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22171.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22171 commit 1ebeae518f44439af7ceff2ce5fb80caf44f1d45 Author: Vinod KC Date: 2018-08-21T15:10:47Z Fix precision issue with zero when decimal type scale > 6
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/22130 @dongjoon-hyun , Thanks for taking a look at this PR, I've added Mac OS version in the PR description, IMO, an update of ncurses is causing this issue Reference : https://github.com/jline/jline2/issues/281
[GitHub] spark pull request #22130: [SPARK-25137][Spark Shell] NumberFormatException`...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/22130 [SPARK-25137][Spark Shell] NumberFormatException when starting spark-shell from Mac terminal

## What changes were proposed in this pull request?

When starting spark-shell from a Mac terminal, the following exception occurs:
```
[ERROR] Failed to construct terminal; falling back to unsupported
java.lang.NumberFormatException: For input string: "0x100"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)
	at java.lang.Integer.valueOf(Integer.java:766)
	at jline.internal.InfoCmp.parseInfoCmp(InfoCmp.java:59)
	at jline.UnixTerminal.parseInfoCmp(UnixTerminal.java:242)
	at jline.UnixTerminal.<init>(UnixTerminal.java:65)
	at jline.UnixTerminal.<init>(UnixTerminal.java:50)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at jline.TerminalFactory.getFlavor(TerminalFactory.java:211)
```
This issue is due to a jline defect: https://github.com/jline/jline2/issues/281, which is fixed in JLine 2.14.4; bumping the JLine version in Spark to a version >= 2.14.4 will fix the issue.

## How was this patch tested?
No new UT/automation test added; after the upgrade to the latest JLine version 2.14.6, manually tested spark-shell features You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_UpgradeJLineVersion Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22130.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22130 commit d00929f28b2523869252d67fefc04297aadc5af6 Author: Vinod KC Date: 2018-08-17T04:10:18Z Upgrade JLine to 2.14.6
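The failing frame is `Integer.parseInt("0x100")`: newer ncurses terminfo entries emit `0x`-prefixed capability values, and `parseInt` does not understand that prefix. A prefix-aware parser such as `Integer.decode` accepts them — shown here as an analogous stdlib call to illustrate the failure mode, not as JLine's actual fixed code:

```java
public class HexParseDemo {
    public static void main(String[] args) {
        // This is the exact call that throws inside jline.internal.InfoCmp
        // when the terminfo database contains hex-formatted values.
        try {
            Integer.parseInt("0x100");
        } catch (NumberFormatException e) {
            System.out.println("parseInt failed: " + e.getMessage());
        }

        // Integer.decode recognizes the 0x / 0 / # prefixes.
        System.out.println(Integer.decode("0x100")); // 256
    }
}
```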
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/20611#discussion_r180993462
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
```
@@ -370,26 +339,35 @@ case class LoadDataCommand(
         throw new AnalysisException(
           s"LOAD DATA: URI scheme is required for non-local input paths: '$path'")
       }
-      // Follow Hive's behavior:
       // If LOCAL is not specified, and the path is relative,
       // then the path is interpreted relative to "/user/"
       val uriPath = uri.getPath()
       val absolutePath = if (uriPath != null && uriPath.startsWith("/")) {
         uriPath
       } else {
-        s"/user/${System.getProperty("user.name")}/$uriPath"
+        s"/user/${ System.getProperty("user.name") }/$uriPath"
       }
       new URI(scheme, authority, absolutePath, uri.getQuery(), uri.getFragment())
     }
-    val hadoopConf = sparkSession.sessionState.newHadoopConf()
-    val srcPath = new Path(hdfsUri)
-    val fs = srcPath.getFileSystem(hadoopConf)
-    if (!fs.exists(srcPath)) {
-      throw new AnalysisException(s"LOAD DATA input path does not exist: $path")
-    }
-    hdfsUri
   }
+    val srcPath = new Path(loadPath)
+    val fs = srcPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
+    // This handling is because while reoslving the invalid urls starting with file:///
+    // system throws IllegalArgumentException from globStatus api,so inorder to handle
+    // such scenarios this code is added in try catch block and after catching the
+    // run time exception a generic error will be displayed to the user.
+    try {
+      if (null == fs.globStatus(srcPath) || fs.globStatus(srcPath).isEmpty) {
+        throw new AnalysisException(s"LOAD DATA input path does not exist: $path")
+      }
+    }
+    catch {
+      case e: Exception =>
```
--- End diff --
Avoid catching a generic exception; catch IllegalArgumentException.
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/20611#discussion_r180993068
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
(same LoadDataCommand hunk as quoted in the previous comment; the discussion is on this change:)
```
-        s"/user/${System.getProperty("user.name")}/$uriPath"
+        s"/user/${ System.getProperty("user.name") }/$uriPath"
```
--- End diff --
nit: Please remove space
[GitHub] spark pull request #20947: [SPARK-23705][SQL]Handle non-distinct columns in ...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/20947 [SPARK-23705][SQL] Handle non-distinct columns in DataSet.groupBy

## What changes were proposed in this pull request?

If the input columns to DataSet.groupBy contain non-unique columns, remove the duplicate columns.

## How was this patch tested?

Added unit test

You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_FIX_SPARK-23705 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20947.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20947 commit eb93f0590f47227a16055f7eea6bd1e906dec3c9 Author: vinodkc <vinod.kc.in@...> Date: 2018-03-30T15:53:49Z Handle non-distinct columns in groupBy
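The dedup idea in the PR — drop repeated grouping columns while keeping the first occurrence of each — can be sketched with an order-preserving set in plain Java. This is a hedged illustration of the technique, not Spark's implementation, and the helper name is invented:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupColumns {
    // LinkedHashSet removes duplicates while preserving first-seen order,
    // so the grouping key order stays stable.
    static List<String> distinctColumns(List<String> cols) {
        return new ArrayList<>(new LinkedHashSet<>(cols));
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("id", "name", "id");
        System.out.println(distinctColumns(cols)); // [id, name]
    }
}
```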
[GitHub] spark pull request #20917: [SPARK-23705][SQL]Handle non-distinct columns in ...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/20917
[GitHub] spark pull request #20917: [SPARK-23705][SQL]Handle non-distinct columns in ...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/20917 [SPARK-23705][SQL] Handle non-distinct columns in DataSet.groupBy

## What changes were proposed in this pull request?

If the input columns to DataSet.groupBy contain non-unique columns, remove the duplicate columns.

## How was this patch tested?

Added unit test

You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_FIX_SPARK-23705 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20917.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20917
[GitHub] spark pull request #20134: [SPARK-22613][SQL] Make UNCACHE TABLE behaviour c...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/20134 [SPARK-22613][SQL] Make UNCACHE TABLE behaviour consistent with CACHE TABLE

## What changes were proposed in this pull request?

Added a LAZY option for UNCACHE TABLE, e.g.:
```
UNCACHE LAZY TABLE tableName
```
This uncaches the table lazily instead of blocking until all blocks are deleted.

## How was this patch tested?

Added test cases

You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_fix_SPARK-22613 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20134.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20134 commit ecc9352b2888bc25fa28273abf72b86bb8688350 Author: vinodkc <vinod.kc.in@...> Date: 2018-01-02T12:49:24Z Support UNCACHE LAZY TABLE
[GitHub] spark issue #19809: [SPARK-17920][SQL] [FOLLOWUP] Backport PR 19779 to branc...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/19809 Thank you.
[GitHub] spark pull request #19809: [SPARK-17920][SQL] [FOLLOWUP] Backport PR 19779 t...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/19809
[GitHub] spark pull request #19809: [SPARK-17920][SQL] [FOLLOWUP] Backport PR 19779 t...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19809#discussion_r152922206 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala --- @@ -700,12 +700,7 @@ class VersionsSuite extends SparkFunSuite with Logging { test(s"$version: SPARK-17920: Insert into/overwrite avro table") { --- End diff -- In master, this testcase uses an avro file and a decimal schema from https://github.com/apache/spark/pull/19003, but PR 19003 is not backported to branch-2.2. So I had to change the testcase in branch-2.2 to avoid using the schema and data added by PR 19003. If we are fine with backporting PR 19003 to branch-2.2, the same testcase from master can be used in branch-2.2 too. Please give your suggestions.
[GitHub] spark issue #19809: [SPARK-17920][SQL] [FOLLOWUP] Backport PR 19779 to branc...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/19809 ping @cloud-fan
[GitHub] spark pull request #19809: [SPARK-17920][SQL] [FOLLOWUP] Backport PR 19779 t...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/19809 [SPARK-17920][SQL][FOLLOWUP] Backport PR 19779 to branch-2.2

## What changes were proposed in this pull request?

A followup of https://github.com/apache/spark/pull/19795, to simplify the file creation.

## How was this patch tested?

Only the test case is updated.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_FollowupSPARK-17920_branch-2.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19809.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19809 commit 9cd03d38500f04d8d1ebf8771e79b1ba82d1f79b Author: vinodkc <vinod.kc...@gmail.com> Date: 2017-11-24T05:59:30Z simplify the schema file creation in test
[GitHub] spark issue #19795: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Backport PR...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/19795 Thank you
[GitHub] spark pull request #19795: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Back...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/19795
[GitHub] spark pull request #19795: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Back...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/19795 [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Backport PR 19779 to branch-2.2 - Support writing to Hive table which uses Avro schema url 'avro.schema.url'

## What changes were proposed in this pull request?

Backport https://github.com/apache/spark/pull/19779 to branch-2.2:
- SPARK-19580 Support for avro.schema.url while writing to hive table
- SPARK-19878 Add hive configuration when initializing hive serde in InsertIntoHiveTable.scala
- SPARK-17920 HiveWriterContainer passes a null configuration to serde.initialize, causing a NullPointerException in AvroSerde when using avro.schema.url

Support writing to a Hive table which uses the Avro schema url 'avro.schema.url'. For example:
```
create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
insert overwrite table avro_out select * from avro_in; // fails with java.lang.NullPointerException
```
```
WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem
java.lang.NullPointerException
	at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
```

## Changes proposed in this fix

Currently a 'null' value is passed to the serializer, which causes an NPE during the insert operation; instead, pass the Hadoop configuration object.

## How was this patch tested?
Added new test case in VersionsSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_Fix_SPARK-17920_branch-2.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19795.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19795 commit 63e40e866e8ad3307b91ea430c29938a0050e6f7 Author: vinodkc <vinod.kc...@gmail.com> Date: 2017-11-22T12:47:47Z pass hadoop Configuration to serializer
[GitHub] spark issue #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Support wri...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/19779 @gatorsmile , @cloud-fan and @dongjoon-hyun Thanks for the review comments and guidance. Sure, I'll submit a separate PR for backporting it to 2.2.
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19779#discussion_r152474528
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
```
@@ -841,6 +841,76 @@ class VersionsSuite extends SparkFunSuite with Logging {
     }
   }

+    test(s"$version: SPARK-17920: Insert into/overwrite avro table") {
+      withTempDir { dir =>
+        val path = dir.getAbsolutePath
+        val schemaPath = s"""$path${File.separator}avroschemadir"""
+
+        new File(schemaPath).mkdir()
+        val avroSchema =
+          """{
+            |  "name": "test_record",
+            |  "type": "record",
+            |  "fields": [ {
+            |    "name": "f0",
+            |    "type": [
+            |      "null",
+            |      {
+            |        "precision": 38,
+            |        "scale": 2,
+            |        "type": "bytes",
+            |        "logicalType": "decimal"
+            |      }
+            |    ]
+            |  } ]
+            |}
+          """.stripMargin
+        val schemaUrl = s"""$schemaPath${File.separator}avroDecimal.avsc"""
+        val schemaFile = new File(schemaPath, "avroDecimal.avsc")
+        val writer = new PrintWriter(schemaFile)
+        writer.write(avroSchema)
+        writer.close()
+
+        val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
+        val srcLocation = new File(url.getFile)
+        val destTableName = "tab1"
+        val srcTableName = "tab2"
+
+        withTable(srcTableName, destTableName) {
+          versionSpark.sql(
+            s"""
+               |CREATE EXTERNAL TABLE $srcTableName
+               |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
+               |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true')
+               |STORED AS
+               |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
+               |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
+               |LOCATION '$srcLocation'
+               |TBLPROPERTIES ('avro.schema.url' = '$schemaUrl')
+             """.stripMargin
+          )
+
+          versionSpark.sql(
+            s"""
+               |CREATE TABLE $destTableName
+               |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
+               |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true')
+               |STORED AS
+               |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
+               |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
+               |TBLPROPERTIES ('avro.schema.url' = '$schemaUrl')
+             """.stripMargin
+          )
+          versionSpark.sql(
+            s"""INSERT OVERWRITE TABLE $destTableName SELECT * FROM $srcTableName""".stripMargin)
+          val result = versionSpark.table(srcTableName).collect()
+          assert(versionSpark.table(destTableName).collect() === result)
+          versionSpark.sql(
+            s"""INSERT INTO TABLE $destTableName SELECT * FROM $srcTableName""".stripMargin)
```
--- End diff --
Updated
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19779#discussion_r152473900
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
(same SPARK-17920 test hunk as quoted in the comment above; the discussion is on this line:)
```
+          versionSpark.sql(
+            s"""INSERT OVERWRITE TABLE $destTableName SELECT * FROM $srcTableName""".stripMargin)
```
--- End diff --
Sure, I'll update it
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19779#discussion_r152473845 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala --- @@ -841,6 +841,76 @@ class VersionsSuite extends SparkFunSuite with Logging { } } +test(s"$version: SPARK-17920: Insert into/overwrite avro table") { + withTempDir { dir => +val path = dir.getAbsolutePath +val schemaPath = s"""$path${File.separator}avroschemadir""" + +new File(schemaPath).mkdir() +val avroSchema = + """{ +| "name": "test_record", +| "type": "record", +| "fields": [ { +|"name": "f0", +|"type": [ +| "null", +| { +|"precision": 38, +|"scale": 2, +|"type": "bytes", +|"logicalType": "decimal" +| } +|] +| } ] +|} + """.stripMargin +val schemaUrl = s"""$schemaPath${File.separator}avroDecimal.avsc""" +val schemaFile = new File(schemaPath, "avroDecimal.avsc") +val writer = new PrintWriter(schemaFile) +writer.write(avroSchema) +writer.close() + +val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal") +val srcLocation = new File(url.getFile) +val destTableName = "tab1" +val srcTableName = "tab2" + +withTable(srcTableName, destTableName) { + versionSpark.sql( +s""" + |CREATE EXTERNAL TABLE $srcTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$srcLocation' + |TBLPROPERTIES ('avro.schema.url' = '$schemaUrl') + """.stripMargin + ) + + versionSpark.sql( +s""" + |CREATE TABLE $destTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |TBLPROPERTIES ('avro.schema.url' = '$schemaUrl') + """.stripMargin + ) + versionSpark.sql( +s"""INSERT OVERWRITE TABLE $destTableName SELECT * FROM $srcTableName""".stripMargin) --- End diff -- @gatorsmile , I tried to remove 'stripMargin', but I get org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '|' expecting {'(', 'SELECT', 'FROM', 'ADD',..} --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
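The parse error reported in the comment above comes from leftover margin characters: Scala's stripMargin deletes each line's leading whitespace up to and including a `|`, so removing it from a multi-line string leaves the `|` markers in the SQL text and the parser rejects them. A minimal sketch, using an illustrative Python re-implementation of stripMargin (not Spark code):

```python
import re

def strip_margin(text: str) -> str:
    # Rough analogue of Scala's String.stripMargin: drop each line's
    # leading whitespace up to and including a '|' marker.
    return re.sub(r"(?m)^\s*\|", "", text)

raw = """
    |INSERT OVERWRITE TABLE tab1
    |SELECT * FROM tab2
    """

assert "|" in raw                    # margins still present: the SQL parser would choke
assert "|" not in strip_margin(raw)  # margins removed: clean SQL text
```

A single-line string with no `|` markers, by contrast, is unaffected by stripMargin, which is why the call is redundant (but harmless) on the one-line INSERT above.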
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19779#discussion_r152464029 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala --- @@ -841,6 +841,75 @@ class VersionsSuite extends SparkFunSuite with Logging { } } +test(s"$version: SPARK-17920: Insert into/overwrite external avro table") { + withTempDir { dir => +val path = dir.getAbsolutePath +val schemaPath = s"""$path${File.separator}avroschemadir""" + +new File(schemaPath).mkdir() +val avroSchema = + """{ +| "name": "test_record", +| "type": "record", +| "fields": [ { +|"name": "f0", +|"type": [ +| "null", +| { +|"precision": 38, +|"scale": 2, +|"type": "bytes", +|"logicalType": "decimal" +| } +|] +| } ] +|} + """.stripMargin +val schemaurl = s"""$schemaPath${File.separator}avroDecimal.avsc""" +new java.io.PrintWriter(schemaurl) { write(avroSchema); close() } +val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal") +val srcLocation = new File(url.getFile) +val destTableName = "tab1" +val srcTableName = "tab2" + +withTable(srcTableName, destTableName) { + versionSpark.sql( +s""" + |CREATE TABLE $srcTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$srcLocation' + |TBLPROPERTIES ('avro.schema.url' = '$schemaurl') + """.stripMargin + ) + val destLocation = s"""$path${File.separator}destTableLocation""" + new File(destLocation).mkdir() + + versionSpark.sql( +s""" + |CREATE TABLE $destTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$destLocation' --- End diff -- Thanks, I've updated the test case to test only managed tables and avoided creating a temp directory. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19779#discussion_r152446382 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala --- @@ -841,6 +841,75 @@ class VersionsSuite extends SparkFunSuite with Logging { } } +test(s"$version: SPARK-17920: Insert into/overwrite external avro table") { + withTempDir { dir => +val path = dir.getAbsolutePath +val schemaPath = s"""$path${File.separator}avroschemadir""" + +new File(schemaPath).mkdir() +val avroSchema = + """{ +| "name": "test_record", +| "type": "record", +| "fields": [ { +|"name": "f0", +|"type": [ +| "null", +| { +|"precision": 38, +|"scale": 2, +|"type": "bytes", +|"logicalType": "decimal" +| } +|] +| } ] +|} + """.stripMargin +val schemaurl = s"""$schemaPath${File.separator}avroDecimal.avsc""" +new java.io.PrintWriter(schemaurl) { write(avroSchema); close() } +val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal") +val srcLocation = new File(url.getFile) +val destTableName = "tab1" +val srcTableName = "tab2" + +withTable(srcTableName, destTableName) { + versionSpark.sql( +s""" + |CREATE TABLE $srcTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$srcLocation' + |TBLPROPERTIES ('avro.schema.url' = '$schemaurl') + """.stripMargin + ) + val destLocation = s"""$path${File.separator}destTableLocation""" + new File(destLocation).mkdir() + + versionSpark.sql( +s""" + |CREATE TABLE $destTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$destLocation' --- End diff -- @cloud-fan , This bug affects both external and managed tables. I've added a new test case for managed tables too. However, to avoid code duplication, should I include both tests inside the same test method? Please suggest. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19779#discussion_r152349156 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala --- @@ -841,6 +841,75 @@ class VersionsSuite extends SparkFunSuite with Logging { } } +test(s"$version: SPARK-17920: Insert into/overwrite external avro table") { + withTempDir { dir => +val path = dir.getAbsolutePath +val schemaPath = s"""$path${File.separator}avroschemadir""" + +new File(schemaPath).mkdir() +val avroSchema = + """{ +| "name": "test_record", +| "type": "record", +| "fields": [ { +|"name": "f0", +|"type": [ +| "null", +| { +|"precision": 38, +|"scale": 2, +|"type": "bytes", +|"logicalType": "decimal" +| } +|] +| } ] +|} + """.stripMargin +val schemaurl = s"""$schemaPath${File.separator}avroDecimal.avsc""" +new java.io.PrintWriter(schemaurl) { write(avroSchema); close() } +val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal") +val srcLocation = new File(url.getFile) +val destTableName = "tab1" +val srcTableName = "tab2" + +withTable(srcTableName, destTableName) { + versionSpark.sql( +s""" + |CREATE TABLE $srcTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$srcLocation' + |TBLPROPERTIES ('avro.schema.url' = '$schemaurl') + """.stripMargin + ) + val destLocation = s"""$path${File.separator}destTableLocation""" + new File(destLocation).mkdir() + + versionSpark.sql( +s""" + |CREATE TABLE $destTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$destLocation' --- End diff -- Will change to 'CREATE EXTERNAL TABLE' --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19779#discussion_r152286208 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala --- @@ -800,7 +800,7 @@ class VersionsSuite extends SparkFunSuite with Logging { } } -test(s"$version: read avro file containing decimal") { +test(s"$version: SPARK-17920: read avro file containing decimal") { --- End diff -- @cloud-fan , Yes --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19779#discussion_r152067290 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala --- @@ -841,6 +841,76 @@ class VersionsSuite extends SparkFunSuite with Logging { } } +test(s"$version: Insert into/overwrite external avro table") { + withTempDir { dir => +val path = dir.getAbsolutePath +val schemaPath = s"""$path${File.separator}avroschemadir""" + +new File(schemaPath).mkdir() +val avroSchema = + """{ +| "name": "test_record", +| "type": "record", +| "fields": [ { +|"name": "f0", +|"type": [ +| "null", +| { +|"precision": 38, +|"scale": 2, +|"type": "bytes", +|"logicalType": "decimal" +| } +|] +| } ] +|} + """.stripMargin +val schemaurl = s"""$schemaPath${File.separator}avroDecimal.avsc""" +new java.io.PrintWriter(schemaurl) { write(avroSchema); close() } +val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal") +val srcLocation = new File(url.getFile) +val destTableName = "tab1" +val srcTableName = "tab2" + +withTable(srcTableName, destTableName) { + versionSpark.sql( +s""" + |CREATE TABLE $srcTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$srcLocation' + |TBLPROPERTIES ('avro.schema.url' = '$schemaurl') + """.stripMargin + ) + val destLocation = s"""$path${File.separator}destTableLocation""" + new File(destLocation).mkdir() + + versionSpark.sql( +s""" + |CREATE TABLE $destTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true') + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$destLocation' + |TBLPROPERTIES ('avro.schema.url' = '$schemaurl') + """.stripMargin + ) + versionSpark.sql( +s"""insert overwrite table $destTableName select * from $srcTableName""".stripMargin) --- End diff -- Thank you for your review comments; I'll address all of them --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19779#discussion_r151902033 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -89,6 +90,8 @@ class HiveFileFormat(fileSinkConf: FileSinkDesc) val fileSinkConfSer = fileSinkConf new OutputWriterFactory { private val jobConf = new SerializableJobConf(new JobConf(conf)) + private val broadcastHadoopConf = sparkSession.sparkContext.broadcast( --- End diff -- Thanks for the comment; I'll change the code to use jobConf --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19779: [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Supp...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/19779 [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Support writing to Hive table which uses Avro schema url 'avro.schema.url' ## What changes were proposed in this pull request? Support writing to a Hive table which uses the Avro schema URL 'avro.schema.url'. For example: create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc'); create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc'); insert overwrite table avro_out select * from avro_in; // fails with java.lang.NullPointerException WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem java.lang.NullPointerException at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174) ## Changes proposed in this fix Currently a 'null' value is passed to the serializer, which causes an NPE during the insert operation; instead, pass the Hadoop configuration object. ## How was this patch tested? Added a new test case in VersionsSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_Fix_SPARK-17920 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19779.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19779 commit 034b2466d073c008b71eae072ee98353df56cbf2 Author: vinodkc <vinod.kc...@gmail.com> Date: 2017-11-18T07:52:59Z pass hadoopConfiguration to Serializer --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
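The failure mode described in the PR can be sketched abstractly: the serde needs a configuration object to resolve 'avro.schema.url', and handing it null fails. The class and fields below are invented purely for illustration, not the actual Hive AvroSerDe or Spark code:

```python
# Hypothetical stand-in for a serializer that resolves its schema from a
# configuration object; names are illustrative only.
class HypotheticalSerializer:
    def initialize(self, conf):
        # Before the fix: conf arrived as None and this lookup blew up,
        # analogous to the reported NullPointerException.
        return conf["avro.schema.url"]

ser = HypotheticalSerializer()
try:
    ser.initialize(None)  # old behaviour: crashes
    crashed = False
except TypeError:
    crashed = True

assert crashed
# After the fix: the real configuration is passed instead of null.
assert ser.initialize({"avro.schema.url": "/avro-schema/avro.avsc"}) == "/avro-schema/avro.avsc"
```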
[GitHub] spark issue #19008: [SPARK-21756][SQL]Add JSON option to allow unquoted cont...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/19008 @rxin , Sure, I'll update it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19008: [SPARK-21756][SQL]Add JSON option to allow unquot...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19008#discussion_r134227248 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala --- @@ -72,6 +72,21 @@ class JsonParsingOptionsSuite extends QueryTest with SharedSQLContext { assert(df.first().getString(0) == "Reynold Xin") } + test("allowUnquotedControlChars off") { +val str = """{"name" : " + "a\tb"}""" --- End diff -- Updated the test case. Thanks for the review comment --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19008: [SPARK-21756][SQL]Add JSON option to allow unquot...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/19008#discussion_r134227247 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala --- @@ -72,6 +72,21 @@ class JsonParsingOptionsSuite extends QueryTest with SharedSQLContext { assert(df.first().getString(0) == "Reynold Xin") } + test("allowUnquotedControlChars off") { +val str = """{"name" : " + "a\tb"}""" --- End diff -- Updated the test case. Thanks for the review comment --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19008: [SPARK-21756][SQL]Add JSON option to allow unquot...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/19008 [SPARK-21756][SQL]Add JSON option to allow unquoted control characters ## What changes were proposed in this pull request? This patch adds an allowUnquotedControlChars option to the JSON data source to allow JSON strings to contain unquoted control characters (ASCII characters with values less than 32, including tab and line feed characters). ## How was this patch tested? Added new test cases You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_fix_SPARK-21756 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19008.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19008 commit 6f009579687e11f34b26bb2f21883377b88f5b35 Author: vinodkc <vinod.kc...@gmail.com> Date: 2017-08-20T17:39:27Z Add JSON option to allow unquoted control characters --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
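The behaviour this option enables has a close analogue in Python's standard json module, which illustrates the difference between strict and lenient parsing of raw control characters (this is a sketch of the concept, not Spark's Jackson-based implementation):

```python
import json

payload = '{"name": "a\tb"}'  # a raw (unquoted) tab character inside the string value

# Default strict parsing rejects unquoted control characters.
try:
    json.loads(payload)
    rejected = False
except json.JSONDecodeError:
    rejected = True

assert rejected
# With strict=False, the raw tab is accepted inside the string value.
assert json.loads(payload, strict=False)["name"] == "a\tb"
```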
[GitHub] spark pull request #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/19007 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by defaul...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/19007 OK, I'm closing my PR. Nowadays, Spark JIRA is not showing PR status; that is why I missed your PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/19007 [SPARK-21783][SQL]Turn on ORC filter push-down by default ## What changes were proposed in this pull request? Turned on the ORC filter push-down option (spark.sql.orc.filterPushdown) by default; it had been off by default from the beginning. ## How was this patch tested? Updated existing test cases You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_Fix_SPARK-21783 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19007.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19007 commit 8756a541f2bdb3c37211d17452f55a930955416d Author: vinodkc <vinod.kc...@gmail.com> Date: 2017-08-20T15:10:13Z Turn on ORC filter push-down by default --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18880: [SPARK-21665][Core]Need to close resources after use
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/18880 retest this --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18880: [SPARK-21665][Core]Need to close resources after ...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/18880 [SPARK-21665][Core]Need to close resources after use ## What changes were proposed in this pull request? Resources in core (SparkSubmitArguments.scala), the Spark launcher (AbstractCommandBuilder.java), and the YARN resource manager (Client.scala) are now released after use. ## How was this patch tested? No new test cases added; existing unit tests pass. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_fixresouceleak Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18880.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18880 commit ca6ca5b3a30eb82fdf83188746255cd12c118646 Author: vinodkc <vinod.kc...@gmail.com> Date: 2017-08-08T03:45:19Z Release resources --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18852: [SPARK-21588][SQL] SQLContext.getConf(key, null) ...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/18852#discussion_r131519921 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -808,6 +808,12 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { Row("1")) } + test("SPARK-21588 SQLContext.getConf(key, null) should return null") { --- End diff -- Thanks for the review comment. Moved the test to SQLConfSuite --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18852: [SPARK-21588][SQL] SQLContext.getConf(key, null) ...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/18852 [SPARK-21588][SQL] SQLContext.getConf(key, null) should return null ## What changes were proposed in this pull request? SQLContext.getConf(key, null), for a key that is not defined in the conf and doesn't have a default value defined, throws an NPE. This happens only when the conf entry has a value converter. Added a null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue). ## How was this patch tested? Added a unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_Fix_SPARK-21588 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18852.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18852 commit 53b73ed21a028ea1915df8a9a03d6d9368e53c8a Author: vinodkc <vinod.kc...@gmail.com> Date: 2017-08-05T06:39:04Z SQLContext.getConf(key, null) should return null --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
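The null check described in the PR can be sketched as follows; the names below are illustrative, not Spark's actual SQLConf internals. The idea is to run the value converter only when a non-null default was supplied, so that getConf(key, null) returns null instead of crashing:

```python
# Hedged sketch of the guard: skip the converter when no default is given.
def get_conf_string(settings, key, default, converter=int):
    if key in settings:
        return settings[key]
    if default is None:  # the added check: never call converter(None)
        return None
    return str(converter(default))

assert get_conf_string({}, "spark.some.key", None) is None       # no crash, returns None
assert get_conf_string({}, "spark.some.key", "42") == "42"       # default goes through the converter
assert get_conf_string({"spark.some.key": "7"}, "spark.some.key", None) == "7"
```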
[GitHub] spark pull request #18771: [SPARK-21478][SQL] Avoid unpersisting related Dat...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/18771 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18771: [SPARK-21478][SQL] Avoid unpersisting related Dat...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/18771#discussion_r130234632 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala --- @@ -114,13 +114,13 @@ class CacheManager extends Logging { } /** - * Un-cache all the cache entries that refer to the given plan. + * Un-cache the cache entry that refers to the given plan. */ def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock { val it = cachedData.iterator() while (it.hasNext) { val cd = it.next() - if (cd.plan.find(_.sameResult(plan)).isDefined) { + if (plan.sameResult(cd.plan)) { --- End diff -- @gatorsmile Thanks for the update; I'll close this PR --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18771: [SPARK-21478][SQL] Avoid unpersisting related Dat...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/18771 [SPARK-21478][SQL] Avoid unpersisting related Datasets ## What changes were proposed in this pull request? While unpersisting a dataset, only unpersist and remove that dataset's plan from CacheManager's cachedData. ## How was this patch tested? Added unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_SPARK-21478 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18771.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18771 commit e187aeb67a13493c6f5a9e540779a677d3502b04 Author: vinodkc <vinod.kc...@gmail.com> Date: 2017-07-29T15:10:24Z Fixed unpersisting related DFs commit 857b3dd5804355331d3bef4eb8136e6604232758 Author: vinodkc <vinod.kc...@gmail.com> Date: 2017-07-29T18:01:01Z Updated test cases and condition for unpersist --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
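The behaviour change proposed here can be illustrated abstractly. Plans are modelled as plain strings below (the real code compares logical plans with sameResult); before the change, any cache entry whose plan merely contained the given plan was dropped, while afterwards only an entry matching it exactly is dropped:

```python
# A derived cached plan contains the base plan as a subtree.
cached = ["scan(t1)", "filter(scan(t1))"]

def uncache_old(cached_plans, plan):
    # Old behaviour: drop every entry whose plan contains the given plan.
    return [p for p in cached_plans if plan not in p]

def uncache_exact(cached_plans, plan):
    # Proposed behaviour: drop only the entry for the exact plan.
    return [p for p in cached_plans if p != plan]

assert uncache_old(cached, "scan(t1)") == []                        # related entry also lost
assert uncache_exact(cached, "scan(t1)") == ["filter(scan(t1))"]    # related entry survives
```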
[GitHub] spark pull request: [SPARK-10631][Documentation, MLlib, PySpark]Ad...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/8834 [SPARK-10631][Documentation, MLlib, PySpark]Added documentation for few APIs There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_SPARK-10631 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8834.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8834 commit 3ee70548760c28bfc96e17a42450ac6f356de923 Author: vinodkc <vinod.kc...@gmail.com> Date: 2015-09-19T06:00:26Z Added documentation for few APIs --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10516][ MLlib]Added values property in ...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/8682#issuecomment-140645936 Sure, I'll work on SPARK-10631. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10575][Spark Core]Wrapped RDD.takeSampl...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/8730#issuecomment-139956635 Can we retest? It is failing in an unrelated module --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10575][Spark Core]Wrapped RDD.takeSampl...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/8730#issuecomment-139957047 jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10575][Spark Core]Wrapped RDD.takeSampl...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/8730 [SPARK-10575][Spark Core]Wrapped RDD.takeSample with Scope You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_takesample_return Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8730.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8730 commit cf2c4665ed695a72818ca5c674f4c791013d4f2e Author: vinodkc <vinod.kc...@gmail.com> Date: 2015-09-12T07:56:19Z wrapped in scope --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10016][ML]Stored broadcast var in a pri...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/8233 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10516][ MLlib]Added values property in ...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/8682 [SPARK-10516][ MLlib]Added values property in DenseVector You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_SPARK-10516 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8682.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8682 commit b8e1e6e6dbf213bb16db29d0cae632f3c4e60e92 Author: Vinod K C <vinod...@huawei.com> Date: 2015-09-10T07:26:21Z Added values property --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10199][SPARK-10200][SPARK-10201][SPARK-...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/8507 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10199][SPARK-10200][SPARK-10201][SPARK-...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/8507#issuecomment-138789894 Closing this PR --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10468][ MLlib ]Verify schema before Dat...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/8636 [SPARK-10468][ MLlib ]Verify schema before Dataframe select API call Loader.checkSchema was called to verify the schema after dataframe.select(...). Schema verification should be done before dataframe.select(...) You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_GaussianMixtureModel_load_verification Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8636.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8636 commit cc99ea99756b72c39041cb95c11628a78eaa1457 Author: Vinod K C <vinod...@huawei.com> Date: 2015-09-07T07:02:23Z Changed schema verification order --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10414][MLlib]Fix hashcode issues in MLL...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/8565#issuecomment-138167561 Closing this PR, as there is already an existing JIRA & PR on the same issue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10414][MLlib]Fix hashcode issues in MLL...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/8565 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10199][SPARK-10200][SPARK-10201][SPARK-...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/8507#issuecomment-137686626 @feynmanliang, I've removed the case classes used for schema inference. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10199][SPARK-10200][SPARK-10201][SPARK-...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/8507#discussion_r38733907

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
@@ -125,16 +126,19 @@ object KMeansModel extends Loader[KMeansModel] {
   def save(sc: SparkContext, model: KMeansModel, path: String): Unit = {
     val sqlContext = new SQLContext(sc)
-    import sqlContext.implicits._
     val metadata = compact(render(
       ("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~ ("k" -> model.k)))
     sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
-    val dataRDD = sc.parallelize(model.clusterCenters.zipWithIndex).map { case (point, id) =>
-      Cluster(id, point)
--- End diff --

Removed the case classes except NodeData, SplitData, and PredictData; these classes simplify data extraction.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10414][MLlib]Fix hashcode issues in MLL...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/8565#discussion_r38507995

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala ---
@@ -278,7 +278,8 @@ class DenseMatrix @Since("1.3.0") (
   }
   override def hashCode: Int = {
-    com.google.common.base.Objects.hashCode(numRows : Integer, numCols: Integer, toArray)
+    val state = Seq(numRows, numCols, Arrays.hashCode(values), isTransposed.hashCode)
--- End diff --

I tried with 'toBreeze.hashCode', but the hashCode of BM[Double] is not consistent.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10414][MLlib]Fix hashcode issues in MLL...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/8565 [SPARK-10414][MLlib]Fix hashcode issues in MLLib Added/Updated hashcode methods in 3 classes You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_ML_hashcode_issue Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8565.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8565 commit a4d256b37b0e4c8acdc9d9bca43588cd90cf1bcd Author: Vinod K C <vinod...@huawei.com> Date: 2015-09-02T07:01:32Z Added hashcode method commit 7878c42d568d4c08a64717e66176b0ecf7ee27ab Author: Vinod K C <vinod...@huawei.com> Date: 2015-09-02T07:09:31Z Removed blank lines commit 28737427ff16c2e4d84540361968c9edef3a79dc Author: Vinod K C <vinod...@huawei.com> Date: 2015-09-02T09:22:58Z Updated testcase --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10199][SPARK-10200][SPARK-10201][SPARK-...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/8507#issuecomment-137312813 @feynmanliang, I've fixed your review comments. Could you please verify? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10430][SPARK-10430]Added hashCode metho...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/8581 [SPARK-10430][SPARK-10430]Added hashCode methods in AccumulableInfo and RDDOperationScope You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_RDDOperationScope_Hashcode Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8581.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8581 commit 25c0f7b471356b829d3611972741b4a40982cc2f Author: Vinod K C <vinod...@huawei.com> Date: 2015-09-03T07:03:40Z Added hashCode methods --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10414][MLlib]Fix hashcode issues in MLL...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/8565#discussion_r38609166

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala ---
@@ -278,7 +278,8 @@ class DenseMatrix @Since("1.3.0") (
   }
   override def hashCode: Int = {
-    com.google.common.base.Objects.hashCode(numRows : Integer, numCols: Integer, toArray)
+    val state = Seq(numRows, numCols, Arrays.hashCode(values), isTransposed.hashCode)
--- End diff --

Got confirmation from the Breeze mailing list that it is an issue; they suggested I raise an issue in Breeze => https://github.com/scalanlp/breeze/issues/440 Can you please suggest whether we should wait for the Breeze fix or change the equals method of CSCMatrix & DenseMatrix to match my hashCode implementation?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10414][MLlib]Fix hashcode issues in MLL...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/8565#discussion_r38522537

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala ---
@@ -278,7 +278,8 @@ class DenseMatrix @Since("1.3.0") (
   }
   override def hashCode: Int = {
-    com.google.common.base.Objects.hashCode(numRows : Integer, numCols: Integer, toArray)
+    val state = Seq(numRows, numCols, Arrays.hashCode(values), isTransposed.hashCode)
--- End diff --

I've asked on the Breeze mailing list and am waiting for a reply. I also checked Breeze's CSCMatrix '==' implementation; it seems '==' uses the input data we passed from SparseMatrix, and only hashCode is not implemented. I think this guarantees that my hashCode implementation will be consistent with Breeze '=='. All MLlib equals/hashCode test cases are passing now.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10414][MLlib]Fix hashcode issues in MLL...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/8565#discussion_r38514784

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala ---
@@ -278,7 +278,8 @@ class DenseMatrix @Since("1.3.0") (
   }
   override def hashCode: Int = {
-    com.google.common.base.Objects.hashCode(numRows : Integer, numCols: Integer, toArray)
+    val state = Seq(numRows, numCols, Arrays.hashCode(values), isTransposed.hashCode)
--- End diff --

To verify that, I ran this test:

val bm1 = new BSM[Double](values, numRows, numCols, colPtrs, rowIndices)
val bm2 = new BSM[Double](values, numRows, numCols, colPtrs, rowIndices)
if (bm1 == bm2) {
  println("[==] passed")
  if (bm1.hashCode == bm2.hashCode) {
    println("[hashCode] passed")
  } else {
    println("[hashCode] failed")
  }
} else {
  println("[==] failed")
}

and got this output:

[==] passed
[hashCode] failed

On further investigation, I noticed this third-party class implements only 'equals', not 'hashCode'. In that case, shall I change the 'toBreeze == m.toBreeze' comparison in equals?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
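The failure reported in the comment above ('==' passes while hashCode differs) is a violation of the equals/hashCode contract. As an illustrative sketch only (plain Scala; `SimpleMatrix` is a hypothetical stand-in, not Spark's actual `DenseMatrix` or Breeze's `CSCMatrix`), deriving the hash from the same fields that `equals` compares keeps the two consistent:

```scala
import java.util.Arrays

// Illustrative sketch: a simplified matrix whose equals and hashCode are both
// derived from the same fields, so equal instances always hash alike.
class SimpleMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  override def equals(other: Any): Boolean = other match {
    case m: SimpleMatrix =>
      numRows == m.numRows && numCols == m.numCols && Arrays.equals(values, m.values)
    case _ => false
  }

  // Mix exactly the state that equals inspects; Arrays.hashCode gives a
  // content-based hash for the array (Array's own hashCode is identity-based).
  override def hashCode: Int =
    Seq(numRows, numCols, Arrays.hashCode(values))
      .foldLeft(0)((acc, field) => 31 * acc + field)
}

object HashDemo extends App {
  val m1 = new SimpleMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))
  val m2 = new SimpleMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))
  assert(m1 == m2)
  assert(m1.hashCode == m2.hashCode)
  println("equals and hashCode are consistent")
}
```

The same principle underlies the PR's fix: once `hashCode` is built from the constructor inputs rather than a third-party object's hash, it stays consistent with an `equals` that compares those same inputs.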
[GitHub] spark pull request: [SPARK-10199][MLlib]Avoid using reflections fo...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/8507#discussion_r38410341

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
@@ -159,19 +159,25 @@ object GaussianMixtureModel extends Loader[GaussianMixtureModel] {
       // Create Parquet data.
       val dataArray = Array.tabulate(weights.length) { i =>
-        Data(weights(i), gaussians(i).mu, gaussians(i).sigma)
+        Row(weights(i), gaussians(i).mu, gaussians(i).sigma)
       }
-      sc.parallelize(dataArray, 1).toDF().write.parquet(Loader.dataPath(path))
+      val dataRDD: RDD[Row] = sc.parallelize(dataArray, 1)
+      sqlContext.createDataFrame(dataRDD, schema).write.parquet(Loader.dataPath(path))
     }
+    private val schema = StructType(
+      StructField("weight", DoubleType, nullable = false) ::
+      StructField("mu", new VectorUDT, nullable = false) ::
+      StructField("sigma", new MatrixUDT, nullable = false) :: Nil)
     def load(sc: SparkContext, path: String): GaussianMixtureModel = {
       val dataPath = Loader.dataPath(path)
       val sqlContext = new SQLContext(sc)
       val dataFrame = sqlContext.read.parquet(dataPath)
-      val dataArray = dataFrame.select("weight", "mu", "sigma").collect()
--- End diff --

Loader.checkSchema was called to verify the schema after dataframe.select(...). Schema verification should be done before dataframe.select(...); that's why I changed the order.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
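The ordering argument in the comment above (validate the schema before selecting columns, so a malformed file fails with a clear schema error rather than an opaque select failure) can be illustrated without Spark. This is a minimal sketch under stated assumptions: `checkSchema`, `loadColumns`, and the `Map`-based rows are hypothetical simplifications, not the actual `Loader` API:

```scala
// Hypothetical simplification: rows are Maps, a schema is just field names.
// The point is ordering: validate first, then select, so a missing field
// produces a clear schema error instead of a failure inside the selection.
def checkSchema(actual: Set[String], expected: Set[String]): Unit = {
  val missing = expected -- actual
  require(missing.isEmpty, s"Schema is missing fields: ${missing.mkString(", ")}")
}

def loadColumns(rows: Seq[Map[String, Any]],
                expected: Set[String]): Seq[Seq[Any]] = {
  val actual = rows.headOption.map(_.keySet).getOrElse(Set.empty)
  checkSchema(actual, expected)               // verify first...
  rows.map(r => expected.toSeq.sorted.map(r)) // ...then "select" the columns
}
```

With validation done up front, a file lacking the `sigma` column fails with "Schema is missing fields: sigma" rather than a lookup error mid-extraction, which mirrors why the PR moves `Loader.checkSchema` ahead of `dataFrame.select(...)`.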
[GitHub] spark pull request: [SPARK-10199][MLlib]Avoid using reflections fo...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/8507 [SPARK-10199][MLlib]Avoid using reflections for parquet model save You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_SPARK-10200 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8507.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8507 commit 233ebc8571d6a5b3284783a7b3a69f054f45389d Author: Vinod K C vinod...@huawei.com Date: 2015-08-28T13:28:23Z Avoid using reflections for parquet model save commit c6b2091b4f019407d2f35ec41038ea5fc925d1eb Author: Vinod K C vinod...@huawei.com Date: 2015-08-28T13:45:18Z Removed blank lines commit 6cc24458908c1858e1a8cb6628e6bffee12513ad Author: Vinod K C vinod...@huawei.com Date: 2015-08-28T14:04:14Z Removed blank lines and orderer imports commit 0dc71324e8e4d2ff70f7cc2183e61c10f4e4f6d8 Author: Vinod K C vinod...@huawei.com Date: 2015-08-28T14:33:58Z Reordered imports --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10016][ML]Stored broadcast var in a pri...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/8233#issuecomment-134471497 Sure, I'll update the code --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10016][ML]Stored broadcast var in a pri...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/8233 [SPARK-10016][ML]Stored broadcast var in a private var You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_SPARK-10016 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8233.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8233 commit 377f2a1b66d374971ece6c7efd5e92bea8cdf0e1 Author: vinodkc vinod.kc...@gmail.com Date: 2015-08-16T18:09:55Z Stored broadcast var in a private var --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8919][Documentation, MLlib]Added @since...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/7325#issuecomment-125820807 @mengxr, I've added a message on the JIRA page from my JIRA account. Thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8761][Deploy]Added synchronization in r...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/7364 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8761][Deploy]Added synchronization in r...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/7364#issuecomment-121303647 Closing this incomplete PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8636][SQL] Fix equalNullSafe comparison
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/7040 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8761][Deploy]Added synchronization in r...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/7364#discussion_r34534208

--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -727,40 +727,44 @@ private[master] class Master(
   def removeApplication(app: ApplicationInfo, state: ApplicationState.Value) {
     if (apps.contains(app)) {
-      logInfo("Removing app " + app.id)
-      apps -= app
-      idToApp -= app.id
-      endpointToApp -= app.driver
-      addressToApp -= app.driver.address
-      if (completedApps.size >= RETAINED_APPLICATIONS) {
-        val toRemove = math.max(RETAINED_APPLICATIONS / 10, 1)
-        completedApps.take(toRemove).foreach( a => {
-          appIdToUI.remove(a.id).foreach { ui => webUi.detachSparkUI(ui) }
-          applicationMetricsSystem.removeSource(a.appSource)
-        })
-        completedApps.trimStart(toRemove)
-      }
-      completedApps += app // Remember it in our history
-      waitingApps -= app
+      synchronized {
--- End diff --

I've added a duplicate condition to avoid synchronization in case `app` doesn't exist in `apps`; if that is not required, I'll remove the double checking. Yes, `removeApplication` can be called concurrently with the same `app` instance from the message loop in Master.scala and also from handleAppKillRequest in MasterPage.scala. This fix is not for handling concurrent modification of `app`.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8991][ML]Update SharedParamsCodeGen's G...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/7367 [SPARK-8991][ML]Update SharedParamsCodeGen's Generated Documentation Removed private[ml] from Generated documentation You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_sharedparmascodegen Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7367.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7367 commit 7e19025450c55ccbcef45dd8d460bc8dfdeef1c2 Author: Vinod K C vinod...@huawei.com Date: 2015-07-13T10:27:33Z Removed private[ml] commit 4fa3c8f880f761d4d4113ac747d6c2688c8dfd3b Author: Vinod K C vinod...@huawei.com Date: 2015-07-13T10:45:59Z Adding auto generated code --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8761][Deploy]Added synchronization in r...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/7364 [SPARK-8761][Deploy]Added synchronization in removeApplication Master.removeApplication is not thread-safe, but it is called both in the message loop of Master and in MasterPage.handleAppKillRequest, which runs in threads of the web server. To avoid parallel calls from the message loop of Master and the web server, this adds synchronization and a condition to reconfirm the existence of the ApplicationInfo object 'app' in the Set 'apps'. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_sync_removeApplication Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7364.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7364 commit 7b04d8f1527868770f6bcab96537320596190933 Author: Vinod K C vinod...@huawei.com Date: 2015-07-13T07:20:42Z Added synchronization in removeApplication --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
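The pattern described in the PR body above (take the lock, then re-check membership before mutating shared state) can be sketched in plain Scala. This is an illustrative stand-in only: `AppRegistry` and its `Set[String]` are hypothetical simplifications of the Master's bookkeeping, not Spark's actual `Master` internals:

```scala
import scala.collection.mutable

// Illustrative stand-in for the Master's application bookkeeping.
class AppRegistry {
  private val apps = mutable.Set[String]()

  def add(appId: String): Unit = synchronized { apps += appId }

  // Two callers may race (e.g. a message-loop thread and a web-server thread
  // handling a kill request). Re-checking membership inside the lock ensures
  // the removal logic runs at most once per application.
  def remove(appId: String): Boolean = synchronized {
    if (apps.contains(appId)) {
      apps -= appId
      true  // this caller performed the removal
    } else {
      false // another caller already removed it
    }
  }
}
```

Without the in-lock re-check, both racing callers could observe the app as present and each run the cleanup side effects (detaching the UI, trimming history), which is exactly what the PR's added condition guards against.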
[GitHub] spark pull request: [SPARK-8919][Documentation, MLlib]Added @since...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/7325 [SPARK-8919][Documentation, MLlib]Added @since tags to mllib.recommendation You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark add_since_mllib.recommendation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7325.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7325 commit c41335072cf370a971e0f79eeb85ebfaa8e85ca7 Author: vinodkc vinod.kc...@gmail.com Date: 2015-07-09T17:08:28Z Added @since --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8636][SQL] Fix equalNullSafe comparison
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/7040#discussion_r34328058 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionals.scala --- @@ -296,9 +295,7 @@ case class CaseKeyWhen(key: Expression, branches: Seq[Expression]) extends CaseW } private def equalNullSafe(l: Any, r: Any) = { --- End diff -- @marmbrus Shall I rename the function private def equalNullSafe(l: Any, r: Any) to private def equal(l: Any, r: Any)?
[GitHub] spark pull request: [SPARK-8787][SQL]Changed parameter order of @d...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/7183 [SPARK-8787][SQL] Changed parameter order of @deprecated in package object sql You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_deprecated_param_order Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7183.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7183 commit 700911c2efac05cc095edf2f95f4137e236d2fe2 Author: Vinod K C vinod...@huawei.com Date: 2015-07-02T10:35:39Z Changed order of parameters
[GitHub] spark pull request: [SPARK-8787][SQL]Changed parameter order of @d...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/7183#discussion_r33763973
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/package.scala ---
@@ -46,6 +46,6 @@ package object sql {
 * Type alias for [[DataFrame]]. Kept here for backward source compatibility for Scala.
 * @deprecated As of 1.3.0, replaced by `DataFrame`.
 */
- @deprecated("1.3.0", "use DataFrame")
+ @deprecated("use DataFrame instead", "1.3.0")
--- End diff --
Updated the message to "use DataFrame instead"
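For context, Scala's `@deprecated` annotation takes its arguments in the order (message, since), so the original code had them swapped: the version string was being shown as the deprecation message. A minimal self-contained sketch (`DataFrame` here is a placeholder standing in for Spark's class, and `sqlCompat` a hypothetical stand-in for the package object):

```scala
class DataFrame  // placeholder standing in for org.apache.spark.sql.DataFrame

object sqlCompat {
  /**
   * Type alias kept for backward source compatibility.
   * @deprecated As of 1.3.0, replaced by `DataFrame`.
   */
  @deprecated("use DataFrame instead", "1.3.0")  // (message, since), not (since, message)
  type SchemaRDD = DataFrame
}
```

With the arguments in the correct order, the compiler's deprecation warning tells the user what to migrate to and in which release the deprecation happened.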
[GitHub] spark pull request: [SPARK-8787][SQL]Changed parameter order of @d...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/7183#issuecomment-117957848 Added description
[GitHub] spark pull request: [SPARK-8628][SQL]Race condition in AbstractSpa...
Github user vinodkc commented on a diff in the pull request: https://github.com/apache/spark/pull/7015#discussion_r33558094
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/AbstractSparkSQLParser.scala ---
@@ -30,12 +30,14 @@ private[sql] abstract class AbstractSparkSQLParser
 def parse(input: String): LogicalPlan = {
 // Initialize the Keywords.
-lexical.initialize(reservedWords)
+initLexical
 phrase(start)(new lexical.Scanner(input)) match {
 case Success(plan, _) => plan
 case failureOrError => sys.error(failureOrError.toString)
 }
 }
+ /* One-time initialization of lexical. This avoids reinitialization of lexical in the parse method */
+ protected lazy val initLexical: Unit = lexical.initialize(reservedWords)
--- End diff --
'reservedWords' is generated using reflection. So during a non-lazy 'lexical' initialization in the constructor, the call to collect reservedWords would fail because object construction is not yet complete. Could you suggest any other approach?
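The lazy-val pattern discussed above can be illustrated in isolation. This is a hypothetical sketch, not the real parser: a `lazy val` of type Unit evaluates its right-hand side exactly once, on first access, after construction completes, which both avoids re-initialization on every parse() call and sidesteps the constructor-ordering problem with the reflectively collected reservedWords.

```scala
// Hypothetical sketch of the one-time-initialization pattern from the diff.
abstract class AbstractParser {
  // In the real parser this is collected via reflection, so it cannot
  // safely be accessed while the subclass constructor is still running.
  protected def reservedWords: Seq[String]

  var initCount = 0  // instrumentation to observe how often init runs

  // A lazy val of type Unit: the body runs once, on first access;
  // every later access is a no-op. Scala also synchronizes lazy val
  // initialization, which addresses the race condition in the JIRA.
  protected lazy val initLexical: Unit = {
    initCount += 1
    require(reservedWords.nonEmpty)  // safe: object is fully constructed by now
    // lexical.initialize(reservedWords) in the real code
  }

  def parse(input: String): String = {
    initLexical  // one-time initialization
    input.toUpperCase
  }
}
```

Calling parse repeatedly leaves initCount at 1, confirming the keyword table is built only once.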
[GitHub] spark pull request: [SPARK-8628][SQL]Race condition in AbstractSpa...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/7015#issuecomment-115572040 Python test failure in unrelated module mllib: File /home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/classification.py, line 106, in __main__.LogisticRegressionModel Failed example: lrm = LogisticRegressionWithSGD.train(sc.parallelize(data), iterations=10) Expected nothing Got: Exception happened during processing of request from ('127.0.0.1', 53982)
[GitHub] spark pull request: [SPARK-8636][SQL] Fix equalNullSafe comparison
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/7040#issuecomment-115654691 Ok, I'll update that test suite. @smola, in predicates.scala, the eval method of case class EqualNullSafe has a similar if (l == null && r == null) check; shall I modify that condition too?
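The null-safe equality semantics under discussion (what Spark SQL exposes as the <=> operator) can be written out as a standalone function. This is a sketch of the semantics, not the actual Spark implementation: two nulls compare equal, a single null compares unequal, and otherwise ordinary equality applies.

```scala
// Null-safe equality: unlike plain =, it never yields "unknown" for nulls.
def equalNullSafe(l: Any, r: Any): Boolean = {
  if (l == null && r == null) true        // both null: equal
  else if (l == null || r == null) false  // exactly one null: not equal
  else l == r                             // neither null: ordinary equality
}
```

This is the behavior the CaseKeyWhen key comparison needs: a null key should match a null branch value rather than propagating null through the comparison.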
[GitHub] spark pull request: [SPARK-8636][SQL] Fix equalNullSafe comparison
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/7040 [SPARK-8636][SQL] Fix equalNullSafe comparison You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark fix_CaseKeyWhen_equalNullSafe Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7040.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7040 commit f2d0b53ebd0930f30f05fda2b00c9c0e835b3b79 Author: Vinod K C vinod...@huawei.com Date: 2015-06-26T13:48:08Z Fix equalNullSafe comparison
[GitHub] spark pull request: [SPARK-8628][SQL]Race condition in AbstractSpa...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/7015 [SPARK-8628][SQL] Race condition in AbstractSparkSQLParser.parse Added synchronization in 'initialize' method You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark handle_lexical_initialize_schronization Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7015.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7015 commit e9fc49a3dab3086953b4b8d5bb5a679bbeab0038 Author: Vinod K C vinod...@huawei.com Date: 2015-06-25T14:14:55Z handle synchronization in SqlLexical.initialize commit ef4f60f20bfcbe3c9611b20a3456b671d8239a35 Author: Vinod K C vinod...@huawei.com Date: 2015-06-25T14:27:21Z Reverted import order
[GitHub] spark pull request: [SPARK-8628][SQL]Race condition in AbstractSpa...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/7015#issuecomment-115492255 @smola, I've updated the code. Can you please check whether that fix solves the issue?
[GitHub] spark pull request: [SPARK-7470] [SQL] Spark shell SQLContext cras...
Github user vinodkc commented on the pull request: https://github.com/apache/spark/pull/5997#issuecomment-100585781 @marmbrus, PR https://github.com/apache/spark/pull/6013 handled it