[GitHub] [spark] HyukjinKwon closed pull request #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
HyukjinKwon closed pull request #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. URL: https://github.com/apache/spark/pull/24724 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
HyukjinKwon commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
URL: https://github.com/apache/spark/pull/24724#issuecomment-496375843

There's virtually no diff:

```scala
case class Person(name: String, age: Long)
val df = spark.createDataFrame[Person]("/tmp/csv")
```

vs

```scala
case class Person(name: String, age: Long)
spark.read.schema("name string, age long").csv("/tmp/csv").as[Person]
```

and it's super confusing that `createDataFrame` takes CSV. How about JSON and other formats?
[GitHub] [spark] HyukjinKwon commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
HyukjinKwon commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
URL: https://github.com/apache/spark/pull/24724#issuecomment-496375267

The API itself is two lines. Whether it's a one-liner or a two-liner, the workaround is easy. I don't think we need this, and I would like to avoid introducing other variants like this.
[GitHub] [spark] dongjoon-hyun commented on issue #24711: [SPARK-27859][SS] Use efficient sorting instead of `.sorted.reverse` sequence
dongjoon-hyun commented on issue #24711: [SPARK-27859][SS] Use efficient sorting instead of `.sorted.reverse` sequence URL: https://github.com/apache/spark/pull/24711#issuecomment-496374107 You're welcome, @wenxuanguan.
[GitHub] [spark] dongjoon-hyun edited a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
dongjoon-hyun edited a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
URL: https://github.com/apache/spark/pull/24724#issuecomment-496373474

First of all, the following are the most frequent use cases. (And, the recommended way.)

1. HEADER and INFERSCHEMA

```scala
scala> spark.read.option("header", true).option("inferSchema", true).csv("/tmp/csv").as[Person]
res0: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
```

2. USER-DEFINED SCHEMA or Hive MetaStore

```scala
scala> case class Person(name: String, age: Long)
scala> spark.read.schema("name string, age long").csv("/tmp/csv").as[Person]
res0: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
```

I believe the above two are more natural. Anyway, cc @HyukjinKwon and @MaxGekk
[GitHub] [spark] dongjoon-hyun commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
dongjoon-hyun commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
URL: https://github.com/apache/spark/pull/24724#issuecomment-496373474

First of all, the following are the most frequent use cases.

1. HEADER and INFERSCHEMA

```
scala> spark.read.option("header", true).option("inferSchema", true).csv("/tmp/csv").as[Person]
res0: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
```

2. USER-DEFINED SCHEMA or Hive MetaStore

```
scala> case class Person(name: String, age: Long)
scala> spark.read.schema("name string, age long").csv("/tmp/csv").as[Person]
res0: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
```

I believe the above two are more natural. Anyway, cc @HyukjinKwon and @MaxGekk
[GitHub] [spark] dongjoon-hyun edited a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
dongjoon-hyun edited a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
URL: https://github.com/apache/spark/pull/24724#issuecomment-496373474

First of all, the following are the most frequent use cases.

1. HEADER and INFERSCHEMA

```scala
scala> spark.read.option("header", true).option("inferSchema", true).csv("/tmp/csv").as[Person]
res0: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
```

2. USER-DEFINED SCHEMA or Hive MetaStore

```scala
scala> case class Person(name: String, age: Long)
scala> spark.read.schema("name string, age long").csv("/tmp/csv").as[Person]
res0: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
```

I believe the above two are more natural. Anyway, cc @HyukjinKwon and @MaxGekk
[GitHub] [spark] HyukjinKwon closed pull request #24716: [SPARK-27848][R][BUILD] AppVeyor change to latest R version (3.6.0)
HyukjinKwon closed pull request #24716: [SPARK-27848][R][BUILD] AppVeyor change to latest R version (3.6.0) URL: https://github.com/apache/spark/pull/24716
[GitHub] [spark] HyukjinKwon commented on issue #24716: [SPARK-25944][R][BUILD] AppVeyor change to latest R version (3.6.0)
HyukjinKwon commented on issue #24716: [SPARK-25944][R][BUILD] AppVeyor change to latest R version (3.6.0) URL: https://github.com/apache/spark/pull/24716#issuecomment-496372569 Merged to master.
[GitHub] [spark] wenxuanguan commented on issue #24711: [SPARK-27859][SS] Use efficient sorting instead of `.sorted.reverse` sequence
wenxuanguan commented on issue #24711: [SPARK-27859][SS] Use efficient sorting instead of `.sorted.reverse` sequence URL: https://github.com/apache/spark/pull/24711#issuecomment-496371722 @srowen @dongjoon-hyun Thank you for the review.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC
AmplabJenkins removed a comment on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC URL: https://github.com/apache/spark/pull/24043#issuecomment-496369660 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC
AmplabJenkins removed a comment on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC URL: https://github.com/apache/spark/pull/24043#issuecomment-496369663 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105853/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC
AmplabJenkins commented on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC URL: https://github.com/apache/spark/pull/24043#issuecomment-496369660 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC
AmplabJenkins commented on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC URL: https://github.com/apache/spark/pull/24043#issuecomment-496369663 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105853/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC
SparkQA removed a comment on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC URL: https://github.com/apache/spark/pull/24043#issuecomment-496340435 **[Test build #105853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105853/testReport)** for PR 24043 at commit [`7d833b0`](https://github.com/apache/spark/commit/7d833b0d37c3cb646810d723651e9ceaa96da1fb).
[GitHub] [spark] SparkQA commented on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC
SparkQA commented on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC
URL: https://github.com/apache/spark/pull/24043#issuecomment-496369350

**[Test build #105853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105853/testReport)** for PR 24043 at commit [`7d833b0`](https://github.com/apache/spark/commit/7d833b0d37c3cb646810d723651e9ceaa96da1fb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24382: [SPARK-27330][SS] support task abort in foreach writer
AmplabJenkins removed a comment on issue #24382: [SPARK-27330][SS] support task abort in foreach writer URL: https://github.com/apache/spark/pull/24382#issuecomment-483678508 Can one of the admins verify this patch?
[GitHub] [spark] HeartSaVioR commented on issue #24382: [SPARK-27330][SS] support task abort in foreach writer
HeartSaVioR commented on issue #24382: [SPARK-27330][SS] support task abort in foreach writer URL: https://github.com/apache/spark/pull/24382#issuecomment-496369272 test this please
[GitHub] [spark] swapnilushinde edited a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
swapnilushinde edited a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
URL: https://github.com/apache/spark/pull/24724#issuecomment-496367606

Hi, @dongjoon-hyun. Thanks for the reply. Yes, I use this API sometimes as well. Passing the schema as a DDL string is a one-liner, but creating a Dataset still requires defining a case class. So creating a Dataset requires defining the schema twice: once as a DDL string and once as a case class. For instance,

```
case class A(id: Int, name: String, subject: String, marks: Int, result: Boolean)
val df = spark.read.schema("id int, name string, subject string, marks int, result boolean").load("/tmp/csv")
val ds = df.as[A]
```

With the above change, the schema would need to be defined just once, as a Product class, and Datasets/DataFrames could be created easily. Furthermore, this API is in line with the other similar APIs for creating Datasets/DataFrames.
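For reference, the duplication described in this comment can also be avoided with existing API. The following is a minimal sketch, not something proposed in the thread: it assumes Spark SQL is on the classpath, reuses the case class `A` from the comment, and the names `SchemaFromCaseClass` and `readCsv` are hypothetical. It relies on `Encoders.product`, whose encoder exposes the `StructType` derived from the case class.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.types.StructType

// Same case class as in the comment above.
case class A(id: Int, name: String, subject: String, marks: Int, result: Boolean)

// Hypothetical helper object, for illustration only.
object SchemaFromCaseClass {
  // The encoder derived from the case class carries the matching StructType,
  // so the field list is written only once.
  val encoder: Encoder[A] = Encoders.product[A]
  val schema: StructType = encoder.schema

  // Read a CSV directory with the derived schema and return a typed Dataset.
  def readCsv(spark: SparkSession, path: String): Dataset[A] =
    spark.read.schema(schema).csv(path).as[A](encoder)
}
```

With a running session, `SchemaFromCaseClass.readCsv(spark, "/tmp/csv")` would stand in for the DDL-string variant; whether that is preferable to spelling out the DDL string is a matter of taste.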
[GitHub] [spark] HyukjinKwon commented on issue #24716: [SPARK-25944][R][BUILD] AppVeyor change to latest R version (3.6.0)
HyukjinKwon commented on issue #24716: [SPARK-25944][R][BUILD] AppVeyor change to latest R version (3.6.0) URL: https://github.com/apache/spark/pull/24716#issuecomment-496368610 Oops, thanks
[GitHub] [spark] swapnilushinde commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
swapnilushinde commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
URL: https://github.com/apache/spark/pull/24724#issuecomment-496367606

Hi, @dongjoon-hyun. Thanks for the reply. Yes, I use this API sometimes as well. Passing the schema as a DDL string is a one-liner, but creating a Dataset still requires defining a case class. So creating a Dataset requires defining the schema twice: once as a DDL string and once as a case class. With the above change, the schema would need to be defined just once, as a Product class, and Datasets/DataFrames could be created easily. Furthermore, this API is in line with the other similar APIs for creating Datasets/DataFrames.
[GitHub] [spark] dongjoon-hyun commented on issue #24716: [SPARK-25944][R][BUILD] AppVeyor change to latest R version (3.6.0)
dongjoon-hyun commented on issue #24716: [SPARK-25944][R][BUILD] AppVeyor change to latest R version (3.6.0)
URL: https://github.com/apache/spark/pull/24716#issuecomment-496366021

BTW, @HyukjinKwon. Could you fix the PR description?

> R 3.5.1 is released 2019-04-26.

It seems to be a typo of `3.6.0` because
- R version 3.6.0 (Planting of a Tree) has been released on 2019-04-26.
- R version 3.5.3 (Great Truth) has been released on 2019-03-11.
[GitHub] [spark] dongjoon-hyun commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
dongjoon-hyun commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
URL: https://github.com/apache/spark/pull/24724#issuecomment-496365550

Hi, @swapnilushinde. Thank you for making a PR, but do you know the following? It's a one-liner.

```scala
scala> spark.version
res0: String = 2.4.3

scala> spark.read.schema("id int, name string, subject string, marks int, result boolean").load("/tmp/csv").printSchema
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- marks: integer (nullable = true)
 |-- result: boolean (nullable = true)
```
[GitHub] [spark] dongjoon-hyun closed pull request #24711: [SPARK-27859][SS] Use efficient sorting instead of `.sorted.reverse` sequence
dongjoon-hyun closed pull request #24711: [SPARK-27859][SS] Use efficient sorting instead of `.sorted.reverse` sequence URL: https://github.com/apache/spark/pull/24711
[GitHub] [spark] dongjoon-hyun commented on issue #24711: [SPARK-27859][SS] Use efficient sorting instead of `.sorted.reverse` sequence
dongjoon-hyun commented on issue #24711: [SPARK-27859][SS] Use efficient sorting instead of `.sorted.reverse` sequence URL: https://github.com/apache/spark/pull/24711#issuecomment-496364119 Merged to master.
[GitHub] [spark] dongjoon-hyun commented on issue #24711: [Minor][SS] Use efficient sorting instead of `.sorted.reverse` sequence
dongjoon-hyun commented on issue #24711: [Minor][SS] Use efficient sorting instead of `.sorted.reverse` sequence URL: https://github.com/apache/spark/pull/24711#issuecomment-496363920 I'll create for you.
[GitHub] [spark] dongjoon-hyun commented on issue #24711: [Minor][SS]avoid inefficient sort when getLatest in HDFSMetadataLog
dongjoon-hyun commented on issue #24711: [Minor][SS]avoid inefficient sort when getLatest in HDFSMetadataLog URL: https://github.com/apache/spark/pull/24711#issuecomment-496363597 Also, please update the PR title and description. You didn't include the changes in `streaming/ui/BatchPage.scala`.
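The PR's diff is not part of this thread, so the following is only a sketch of the general idea named in its title: sorting once with a reversed `Ordering` instead of sorting ascending and then reversing the result. The object name, method names, and the `batchIds` parameter are hypothetical and not taken from `HDFSMetadataLog` or `BatchPage.scala`.

```scala
// Hypothetical illustration; not the code changed by the PR.
object DescendingSort {
  // Two steps: sort ascending, then build a second, reversed collection.
  def sortedThenReversed(batchIds: Seq[Long]): Seq[Long] =
    batchIds.sorted.reverse

  // One step: sort directly in descending order with a reversed Ordering.
  def sortedDescending(batchIds: Seq[Long]): Seq[Long] =
    batchIds.sorted(Ordering[Long].reverse)
}
```

Both methods return the same sequence for `Long` keys; the second avoids materializing the intermediate ascending collection.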
[GitHub] [spark] dongjoon-hyun edited a comment on issue #24711: [Minor][SS]avoid inefficient sort when getLatest in HDFSMetadataLog
dongjoon-hyun edited a comment on issue #24711: [Minor][SS]avoid inefficient sort when getLatest in HDFSMetadataLog URL: https://github.com/apache/spark/pull/24711#issuecomment-496362843 Thank you for pinging me, @wenxuanguan. Please make a JIRA issue and use the ID in the PR title. This is trivial, but worth it.
[GitHub] [spark] dongjoon-hyun commented on issue #24711: [Minor][SS]avoid inefficient sort when getLatest in HDFSMetadataLog
dongjoon-hyun commented on issue #24711: [Minor][SS]avoid inefficient sort when getLatest in HDFSMetadataLog URL: https://github.com/apache/spark/pull/24711#issuecomment-496362843 Thank you for pinging me, @wenxuanguan. Please make a JIRA issue and use the ID. This is trivial, but worth it.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
AmplabJenkins removed a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. URL: https://github.com/apache/spark/pull/24724#issuecomment-496360880 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
AmplabJenkins commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. URL: https://github.com/apache/spark/pull/24724#issuecomment-496361177 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
AmplabJenkins removed a comment on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. URL: https://github.com/apache/spark/pull/24724#issuecomment-496360804 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-496360724 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105854/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-496360721 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
AmplabJenkins commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. URL: https://github.com/apache/spark/pull/24724#issuecomment-496360880 Can one of the admins verify this patch?
[GitHub] [spark] wenxuanguan commented on issue #24711: [Minor][SS]avoid inefficient sort when getLatest in HDFSMetadataLog
wenxuanguan commented on issue #24711: [Minor][SS]avoid inefficient sort when getLatest in HDFSMetadataLog URL: https://github.com/apache/spark/pull/24711#issuecomment-496360902 @dongjoon-hyun @HyukjinKwon Can you please have a look?
[GitHub] [spark] AmplabJenkins commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-496360721 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
AmplabJenkins commented on issue #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. URL: https://github.com/apache/spark/pull/24724#issuecomment-496360804 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-496360724 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105854/ Test PASSed.
[GitHub] [spark] swapnilushinde opened a new pull request #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.
swapnilushinde opened a new pull request #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. URL: https://github.com/apache/spark/pull/24724

## What changes were proposed in this pull request?

Many users frequently load structured data from CSV datasources. With the current APIs, it is very common to load a CSV file as a DataFrame, which requires the schema to be defined as a StructType object. Many users then convert the DataFrame to a Dataset of Product objects (case classes). This makes loading CSV files more complex than it needs to be. This change makes working with CSV files more user friendly.

**Input -**
```
csv file with five columns - {id: Int, name: String, subject: String, marks: Int, result: Boolean}
```
**Current approach -**
```
val schema = StructType(Seq(
  StructField("id", IntegerType, false),
  StructField("name", StringType, false),
  StructField("subject", StringType, false),
  StructField("marks", IntegerType, false),
  StructField("result", BooleanType, false)))
val df = spark.read.schema(schema).csv()
case class A(id: Int, name: String, subject: String, marks: Int, result: Boolean)
val ds = df.as[A]
```
**Proposed change -**
```
case class A (id: Int, name: String, subject: String, marks: Int, result: Boolean)
val df = spark.createDataframe[A](optionsMap, )
val ds = spark.createDataset[A](optionsMap, )
```
- No explicit schema definition with StructType is needed, as it can be resolved from Product classes.
- Redundant application code that defines a verbose StructType can be avoided with this change.
- The proposed APIs are similar to the current APIs, so they are easy to use. All current and future CSV options can be used as is, with no changes needed. (Exception: inferSchema is internally disabled, as it is useless/confusing with this API.)
- Similar to the current createDataset/createDataframe APIs, it would make loading CSV files for debugging purposes more convenient.

## How was this patch tested?

This change is manually tested. I didn't see similar createDataset/createDataframe unit test cases. Please let me know the best place to add unit test cases for this and existing similar APIs.
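For reference, the hand-written StructType in the "current approach" can already be derived from the case class with the existing `Encoders.product` API. The following is only a minimal sketch under that assumption, not the code from this pull request; the object name `CsvSchemaSketch`, the helper `readCsv`, and its `path` and `options` parameters are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.types.StructType

// Same shape as the case class in the PR description.
case class A(id: Int, name: String, subject: String, marks: Int, result: Boolean)

object CsvSchemaSketch {
  // Derive the StructType from the case class instead of spelling it out by hand.
  val schema: StructType = Encoders.product[A].schema

  // Hypothetical helper (not part of Spark): the caller supplies the path and CSV options.
  def readCsv(spark: SparkSession, path: String, options: Map[String, String]): Dataset[A] = {
    import spark.implicits._ // brings the implicit Encoder[A] into scope for .as[A]
    spark.read.schema(schema).options(options).csv(path).as[A]
  }
}
```

With a derived schema the column names and order come from the case class fields, so the CSV columns are expected to appear in that same order.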
[GitHub] [spark] SparkQA commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
SparkQA commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-496360449 **[Test build #105854 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105854/testReport)** for PR 24671 at commit [`1f31fc6`](https://github.com/apache/spark/commit/1f31fc6aac694889f1b1450be4f30773deb51ad5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
SparkQA removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-496341756 **[Test build #105854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105854/testReport)** for PR 24671 at commit [`1f31fc6`](https://github.com/apache/spark/commit/1f31fc6aac694889f1b1450be4f30773deb51ad5).
[GitHub] [spark] AmplabJenkins removed a comment on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax
AmplabJenkins removed a comment on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax URL: https://github.com/apache/spark/pull/24472#issuecomment-496360440 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax
AmplabJenkins removed a comment on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax URL: https://github.com/apache/spark/pull/24472#issuecomment-496360443 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105852/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax
AmplabJenkins commented on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax URL: https://github.com/apache/spark/pull/24472#issuecomment-496360440 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax
AmplabJenkins commented on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax URL: https://github.com/apache/spark/pull/24472#issuecomment-496360443 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105852/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax
SparkQA removed a comment on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax URL: https://github.com/apache/spark/pull/24472#issuecomment-496340414 **[Test build #105852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105852/testReport)** for PR 24472 at commit [`29fcc08`](https://github.com/apache/spark/commit/29fcc087fbd10ce4188f228c7ccf11337912f225).
[GitHub] [spark] SparkQA commented on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax
SparkQA commented on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax URL: https://github.com/apache/spark/pull/24472#issuecomment-496360207 **[Test build #105852 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105852/testReport)** for PR 24472 at commit [`29fcc08`](https://github.com/apache/spark/commit/29fcc087fbd10ce4188f228c7ccf11337912f225). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] dongjoon-hyun edited a comment on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax
dongjoon-hyun edited a comment on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax URL: https://github.com/apache/spark/pull/24472#issuecomment-496358371

Hi, @gatorsmile and @cloud-fan. Could you give us some directional advice, please?

- First, this PR wants to support `INTERVAL ... HOUR TO SECOND` in addition to `INTERVAL ... DAY TO SECOND`, like Presto/Teradata. It looks reasonable to me, too.
- Second, this PR originally added a new pattern and a new function (similar to the existing one). To avoid maintaining two similar functions, I recommended extending the existing pattern and handling `DAY` and `HOUR` with the same function.

To sum up, we will additionally support 2~4.

1. SELECT INTERVAL '0 23:59:59.155' DAY TO SECOND (Current Spark)
2. SELECT INTERVAL '23:59:59.155' HOUR TO SECOND
3. SELECT INTERVAL '23:59:59.155' DAY TO SECOND
4. SELECT INTERVAL '1 23:59:59.155' HOUR TO SECOND

If you think these are okay, I want to merge this PR. What do you think about this?
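To make the "one pattern, one function" idea concrete, here is a minimal self-contained sketch of a single regular expression with an optional day part that accepts all four literals listed above. It is an illustration only and is not the code from the PR; the object name `IntervalSketch`, the function `parseDayTime`, and the choice of returning total microseconds are assumptions made for this example.

```scala
object IntervalSketch {
  // Optional sign, optional day part, then hours:minutes:seconds with an
  // optional fractional part of up to nine digits.
  private val DayTime =
    """^\s*([+-])?(?:(\d+)\s+)?(\d+):(\d+):(\d+)(?:\.(\d{1,9}))?\s*$""".r

  /** Returns the interval in microseconds, or None if the text does not match. */
  def parseDayTime(text: String): Option[Long] = text match {
    case DayTime(sign, days, hours, minutes, seconds, fraction) =>
      // Unmatched optional groups come back as null, so wrap them in Option.
      val d = Option(days).map(_.toLong).getOrElse(0L)
      // Pad the fraction to six digits and truncate anything finer than a microsecond.
      val micros = Option(fraction).map(f => (f + "000000").take(6).toLong).getOrElse(0L)
      val wholeSeconds = ((d * 24 + hours.toLong) * 60 + minutes.toLong) * 60 + seconds.toLong
      val total = wholeSeconds * 1000000L + micros
      Some(if (sign == "-") -total else total)
    case _ => None
  }
}
```

Because the day part is optional, `'0 23:59:59.155'` and `'23:59:59.155'` parse to the same value, and the `DAY`/`HOUR` keyword no longer needs its own parsing function in this sketch.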
[GitHub] [spark] SparkQA commented on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined
SparkQA commented on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined URL: https://github.com/apache/spark/pull/24689#issuecomment-496359179 **[Test build #105857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105857/testReport)** for PR 24689 at commit [`0f4d9aa`](https://github.com/apache/spark/commit/0f4d9aa403d61790c88cdea027352294abcf340d).
[GitHub] [spark] AmplabJenkins removed a comment on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations
AmplabJenkins removed a comment on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations URL: https://github.com/apache/spark/pull/24700#issuecomment-496358953 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations
AmplabJenkins removed a comment on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations URL: https://github.com/apache/spark/pull/24700#issuecomment-496358955 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105851/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined
AmplabJenkins removed a comment on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined URL: https://github.com/apache/spark/pull/24689#issuecomment-496358913 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined
AmplabJenkins removed a comment on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined URL: https://github.com/apache/spark/pull/24689#issuecomment-496358914 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations
AmplabJenkins commented on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations URL: https://github.com/apache/spark/pull/24700#issuecomment-496358955 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105851/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations
AmplabJenkins commented on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations URL: https://github.com/apache/spark/pull/24700#issuecomment-496358953 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined
AmplabJenkins commented on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined URL: https://github.com/apache/spark/pull/24689#issuecomment-496358913 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined
AmplabJenkins commented on issue #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined URL: https://github.com/apache/spark/pull/24689#issuecomment-496358914 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations
SparkQA removed a comment on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations URL: https://github.com/apache/spark/pull/24700#issuecomment-496328630 **[Test build #105851 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105851/testReport)** for PR 24700 at commit [`9fbc9e1`](https://github.com/apache/spark/commit/9fbc9e12840a44f40cd750b0e841eb2aaab7f67d).
[GitHub] [spark] SparkQA commented on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations
SparkQA commented on issue #24700: [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations URL: https://github.com/apache/spark/pull/24700#issuecomment-496358643 **[Test build #105851 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105851/testReport)** for PR 24700 at commit [`9fbc9e1`](https://github.com/apache/spark/commit/9fbc9e12840a44f40cd750b0e841eb2aaab7f67d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] dongjoon-hyun commented on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax
dongjoon-hyun commented on issue #24472: [SPARK-27578][SQL] Support INTERVAL ... HOUR TO SECOND syntax URL: https://github.com/apache/spark/pull/24472#issuecomment-496358371

Hi, @gatorsmile and @cloud-fan. Could you give us some directional advice, please?

- First, this PR wants to support `INTERVAL ... HOUR TO SECOND` in addition to `INTERVAL ... DAY TO SECOND`, like Presto/Teradata. It looks reasonable to me, too.
- Second, this PR originally added a new pattern and a new function (similar to the existing one). To avoid maintaining two similar functions, I recommended extending the existing pattern and handling `DAY` and `HOUR` with the same function.

So, we will additionally support 2~4.

1. SELECT INTERVAL '0 23:59:59.155' DAY TO SECOND (Current Spark)
2. SELECT INTERVAL '23:59:59.155' HOUR TO SECOND
3. SELECT INTERVAL '23:59:59.155' DAY TO SECOND
4. SELECT INTERVAL '1 23:59:59.155' HOUR TO SECOND

If you think these are okay, I want to merge this PR. What do you think about this?
[GitHub] [spark] AmplabJenkins removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
AmplabJenkins removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496356858 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
AmplabJenkins removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496356864 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105856/ Test PASSed.
[GitHub] [spark] cloud-fan closed pull request #24569: [SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens
cloud-fan closed pull request #24569: [SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens URL: https://github.com/apache/spark/pull/24569
[GitHub] [spark] jzhuge commented on a change in pull request #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined
jzhuge commented on a change in pull request #24689: [SPARK-26946][SQL][FOLLOWUP] Handle lookupCatalog function not defined URL: https://github.com/apache/spark/pull/24689#discussion_r287922975

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala

@@ -26,27 +28,31 @@ import org.apache.spark.sql.catalyst.TableIdentifier
 @Experimental
 trait LookupCatalog {

-  def lookupCatalog: Option[(String) => CatalogPlugin] = None
+  def lookupCatalog: Option[String => CatalogPlugin] = None

   type CatalogObjectIdentifier = (Option[CatalogPlugin], Identifier)

   /**
    * Extract catalog plugin and identifier from a multi-part identifier.
    */
   object CatalogObjectIdentifier {
-    def unapply(parts: Seq[String]): Option[CatalogObjectIdentifier] = lookupCatalog.map { lookup =>
-      parts match {
-        case Seq(name) =>
-          (None, Identifier.of(Array.empty, name))
-        case Seq(catalogName, tail @ _*) =>
-          try {
-            val catalog = lookup(catalogName)
-            (Some(catalog), Identifier.of(tail.init.toArray, tail.last))
-          } catch {
-            case _: CatalogNotFoundException =>
-              (None, Identifier.of(parts.init.toArray, parts.last))
-          }
-      }
+    def unapply(parts: Seq[String]): Option[CatalogObjectIdentifier] = parts match {
+      case Seq(name) =>
+        Some((None, Identifier.of(Array.empty, name)))
+      case Seq(catalogName, tail @ _*) =>
+        lookupCatalog match {
+          case Some(lookup) =>
+            Try(lookup(catalogName)) match {

Review comment: Thanks @HyukjinKwon for pointing out the style guide. Back to try/catch.
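The quoted diff is cut off at the `Try(...)` line, so the shape of the final code is not visible here. As a rough sketch of what "back to try/catch" could look like when `lookupCatalog` may be undefined, the example below uses hypothetical stand-ins (`CatalogPlugin`, `Identifier`, `CatalogNotFoundException` are simplified local types, not Spark's classes), and the branches for a missing lookup function and for an empty identifier are assumptions rather than the PR's actual behavior.

```scala
object LookupCatalogSketch {
  // Simplified, hypothetical stand-ins for the Spark types named in the diff.
  final case class CatalogPlugin(name: String)
  final case class Identifier(namespace: Seq[String], name: String)
  final class CatalogNotFoundException(msg: String) extends Exception(msg)

  type CatalogObjectIdentifier = (Option[CatalogPlugin], Identifier)

  /** Splits a multi-part name into an optional catalog and an identifier. */
  def resolve(
      lookupCatalog: Option[String => CatalogPlugin],
      parts: Seq[String]): Option[CatalogObjectIdentifier] = parts match {
    case Seq(name) =>
      // A single part can never carry a catalog name.
      Some((None, Identifier(Seq.empty, name)))
    case Seq(catalogName, tail @ _*) =>
      lookupCatalog match {
        case Some(lookup) =>
          // Plain try/catch instead of scala.util.Try, catching only the expected exception.
          try {
            val catalog = lookup(catalogName)
            Some((Some(catalog), Identifier(tail.init, tail.last)))
          } catch {
            case _: CatalogNotFoundException =>
              Some((None, Identifier(parts.init, parts.last)))
          }
        case None =>
          // Assumption: with no lookup function, treat every part as namespace + name.
          Some((None, Identifier(parts.init, parts.last)))
      }
    case _ =>
      // Assumption: an empty identifier does not match.
      None
  }
}
```

Catching only `CatalogNotFoundException` keeps unrelated failures visible, whereas wrapping the call in `Try` would also capture them unless each case is matched explicitly.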
[GitHub] [spark] AmplabJenkins commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
AmplabJenkins commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496356858 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
AmplabJenkins commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496356864 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105856/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
SparkQA removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496346720 **[Test build #105856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105856/testReport)** for PR 24717 at commit [`9261f16`](https://github.com/apache/spark/commit/9261f16ff3ded7e10fc69c50df8131be589cde49).
[GitHub] [spark] SparkQA commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
SparkQA commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496356711 **[Test build #105856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105856/testReport)** for PR 24717 at commit [`9261f16`](https://github.com/apache/spark/commit/9261f16ff3ded7e10fc69c50df8131be589cde49). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] cloud-fan commented on issue #24569: [SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens
cloud-fan commented on issue #24569: [SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens URL: https://github.com/apache/spark/pull/24569#issuecomment-496356630 thanks, merging to master!
[GitHub] [spark] cloud-fan commented on issue #24696: [SPARK-27832][SQL] Don't decompress and create column batch when the task is completed
cloud-fan commented on issue #24696: [SPARK-27832][SQL] Don't decompress and create column batch when the task is completed URL: https://github.com/apache/spark/pull/24696#issuecomment-496355838
> At the moment, the returned batch is also immediately closed

I'm a little lost here. Can you give a sequence of events that can cause the error?
[GitHub] [spark] dongjoon-hyun edited a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
dongjoon-hyun edited a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496351652 @gcmerz . What is your ID in Apache JIRA? If you don't have one, please create one. Then I can assign that issue to you. - https://issues.apache.org/jira/browse/SPARK-27858 And, FYI, in your GitHub personal settings, you can additionally add your Palantir email (used in this PR). Then your commits made with the Palantir ID will also show your GitHub profile. - https://github.com/apache/spark/commits/master
[GitHub] [spark] dongjoon-hyun commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
dongjoon-hyun commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496351652 @gcmerz . What is your ID in Apache JIRA? If you don't have one, please create one. Then I can assign that issue to you. - https://issues.apache.org/jira/browse/SPARK-27858
[GitHub] [spark] zhengruifeng commented on issue #14325: [SPARK-16692] [ML] Add multi label classification evaluator, DataFrame
zhengruifeng commented on issue #14325: [SPARK-16692] [ML] Add multi label classification evaluator, DataFrame URL: https://github.com/apache/spark/pull/14325#issuecomment-496350955 What's the progress now? @liwzhi @WeichenXu123 @srowen If @liwzhi is not working on this, can I take it over?
[GitHub] [spark] dongjoon-hyun closed pull request #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
dongjoon-hyun closed pull request #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722
[GitHub] [spark] AmplabJenkins removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
AmplabJenkins removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496349036 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
AmplabJenkins removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496349037 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105855/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
AmplabJenkins commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496349037 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105855/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
AmplabJenkins commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496349036 Merged build finished. Test PASSed.
[GitHub] [spark] dongjoon-hyun commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
dongjoon-hyun commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496348939 Merged to `master` and `branch-2.4`.
[GitHub] [spark] SparkQA removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
SparkQA removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496344329 **[Test build #105855 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105855/testReport)** for PR 24722 at commit [`4edbe09`](https://github.com/apache/spark/commit/4edbe093b1a4e6369fe7327675e2f49de62cb934).
[GitHub] [spark] SparkQA commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
SparkQA commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496348853 **[Test build #105855 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105855/testReport)** for PR 24722 at commit [`4edbe09`](https://github.com/apache/spark/commit/4edbe093b1a4e6369fe7327675e2f49de62cb934). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] dongjoon-hyun commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
dongjoon-hyun commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496348246 You're welcome. Thank you for the swift update.
[GitHub] [spark] dongjoon-hyun commented on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC
dongjoon-hyun commented on issue #24043: [SPARK-11412][SQL] Support merge schema for ORC URL: https://github.com/apache/spark/pull/24043#issuecomment-496347909 Lastly, it would be great if you could add some performance comparisons between Parquet and ORC merge schema to the PR description. This PR aims to add new features for ORC/Parquet feature parity, so if the new code is significantly slower, that is not desirable.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
AmplabJenkins removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496347648 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
AmplabJenkins removed a comment on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496347641 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
AmplabJenkins commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496347648 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
AmplabJenkins commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496347641 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics
SparkQA commented on issue #24717: [SPARK-27847][ML] One-Pass MultilabelMetrics & MulticlassMetrics URL: https://github.com/apache/spark/pull/24717#issuecomment-496346720 **[Test build #105856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105856/testReport)** for PR 24717 at commit [`9261f16`](https://github.com/apache/spark/commit/9261f16ff3ded7e10fc69c50df8131be589cde49).
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24043: [SPARK-11412][SQL] Support merge schema for ORC
dongjoon-hyun commented on a change in pull request #24043: [SPARK-11412][SQL] Support merge schema for ORC
URL: https://github.com/apache/spark/pull/24043#discussion_r287913435

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala ##

@@ -101,6 +101,19 @@ private[hive] object OrcFileOperator extends Logging {
     }
   }

+  /**
+   * Read single ORC file schema using Hive ORC library
+   */
+  def singleFileSchemaReader(file: String, conf: Configuration, ignoreCorruptFiles: Boolean)
+      : Option[StructType] = {

Review comment: ditto. 2-space.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24043: [SPARK-11412][SQL] Support merge schema for ORC
dongjoon-hyun commented on a change in pull request #24043: [SPARK-11412][SQL] Support merge schema for ORC
URL: https://github.com/apache/spark/pull/24043#discussion_r287912775

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ##

@@ -82,14 +83,95 @@ object OrcUtils extends Logging {
       : Option[StructType] = {
     val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
     val conf = sparkSession.sessionState.newHadoopConf()
-    // TODO: We need to support merge schema. Please see SPARK-11412.
     files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {
       case Some(schema) =>
         logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
         CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
     }
   }

+  /**
+   * Read single ORC file schema using native version of ORC
+   */
+  def singleFileSchemaReader(file: String, conf: Configuration, ignoreCorruptFiles: Boolean)
+      : Option[StructType] = {
+    OrcUtils.readSchema(new Path(file), conf, ignoreCorruptFiles)
+      .map(s => CatalystSqlParser.parseDataType(s.toString).asInstanceOf[StructType])
+  }
+
+  /**
+   * Figures out a merged ORC schema with a distributed Spark job.
+   */
+  def mergeSchemasInParallel(
+      sparkSession: SparkSession,
+      files: Seq[FileStatus],
+      singleFileSchemaReader: (String, Configuration, Boolean) => Option[StructType])
+      : Option[StructType] = {

Review comment: ditto. 2-space.
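For readers unfamiliar with what "merge schema" means in the diff above, the following is a rough sketch of the idea, assuming a toy schema model. It is not Spark's `StructType` merging and not the implementation in this pull request: `Schema`, `merge`, and `mergeAll` are hypothetical names, the schema is just an ordered list of (field name, type name) pairs, and the merge runs locally rather than as a distributed job.

```scala
object SchemaMergeSketch {
  // Toy schema: ordered (fieldName, typeName) pairs. Not Spark's StructType.
  type Schema = Vector[(String, String)]

  // Merge two schemas: keep the left fields in order, append fields that
  // only appear on the right, and reject fields whose types disagree.
  def merge(left: Schema, right: Schema): Schema = {
    val leftTypes = left.toMap
    right.foreach { case (name, tpe) =>
      leftTypes.get(name).foreach { existing =>
        require(existing == tpe, s"Conflicting types for '$name': $existing vs $tpe")
      }
    }
    left ++ right.filterNot { case (name, _) => leftTypes.contains(name) }
  }

  // Each file may or may not yield a schema (for example, a corrupt file
  // that is ignored yields None). The result is None when no file yields one.
  def mergeAll(perFile: Seq[Option[Schema]]): Option[Schema] =
    perFile.flatten.reduceOption(merge)
}
```

Because `merge` is associative for compatible schemas, the per-file results could in principle be combined in partial groups first and the partial results merged afterwards, which is what makes a parallel version feasible.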
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24043: [SPARK-11412][SQL] Support merge schema for ORC
dongjoon-hyun commented on a change in pull request #24043: [SPARK-11412][SQL] Support merge schema for ORC
URL: https://github.com/apache/spark/pull/24043#discussion_r287912742

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ##

@@ -82,14 +83,95 @@ object OrcUtils extends Logging {
       : Option[StructType] = {
     val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
     val conf = sparkSession.sessionState.newHadoopConf()
-    // TODO: We need to support merge schema. Please see SPARK-11412.
     files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {
       case Some(schema) =>
         logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
         CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
     }
   }

+  /**
+   * Read single ORC file schema using native version of ORC
+   */
+  def singleFileSchemaReader(file: String, conf: Configuration, ignoreCorruptFiles: Boolean)
+      : Option[StructType] = {

Review comment: Unfortunately, the existing code around here uses the wrong indentation. Let's use the correct indentation at least in new code. `: Option[StructType]` should have 2-space indentation instead of 4-space.

```scala
   def singleFileSchemaReader(file: String, conf: Configuration, ignoreCorruptFiles: Boolean)
-      : Option[StructType] = {
+    : Option[StructType] = {
```
[GitHub] [spark] SparkQA commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
SparkQA commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496344329 **[Test build #105855 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105855/testReport)** for PR 24722 at commit [`4edbe09`](https://github.com/apache/spark/commit/4edbe093b1a4e6369fe7327675e2f49de62cb934).
[GitHub] [spark] gcmerz commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
gcmerz commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496344189 Applied the tweaks--thank you so much for the quick review!
[GitHub] [spark] AmplabJenkins removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
AmplabJenkins removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496344048 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
AmplabJenkins removed a comment on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496344050 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1/ Test PASSed.
[GitHub] [spark] gcmerz commented on a change in pull request #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
gcmerz commented on a change in pull request #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
URL: https://github.com/apache/spark/pull/24722#discussion_r287912404

## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ##

@@ -247,6 +247,32 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  test("Union type: More than one non-null type") {

Review comment: Also done!
[GitHub] [spark] gcmerz commented on a change in pull request #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
gcmerz commented on a change in pull request #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
URL: https://github.com/apache/spark/pull/24722#discussion_r287912367

## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ##

@@ -247,6 +247,32 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  test("Union type: More than one non-null type") {
+    withTempDir { dir =>
+      val complexNullUnionType = Schema.createUnion(
+        List(Schema.create(Type.INT), Schema.create(Type.NULL), Schema.create(Type.STRING)).asJava)
+      val fields = Seq(
+        new Field("field1", complexNullUnionType, "doc", null.asInstanceOf[AnyVal])).asJava
+      val schema = Schema.createRecord("name", "docs", "namespace", false)
+      schema.setFields(fields)
+      val datumWriter = new GenericDatumWriter[GenericRecord](schema)
+      val dataFileWriter = new DataFileWriter[GenericRecord](datumWriter)
+      dataFileWriter.create(schema, new File(s"$dir.avro"))
+      val avroRec = new GenericData.Record(schema)
+      avroRec.put("field1", 42)
+      dataFileWriter.append(avroRec)
+      val avroRec2 = new GenericData.Record(schema)
+      avroRec2.put("field1", "Alice")
+      dataFileWriter.append(avroRec2)
+      dataFileWriter.flush()
+      dataFileWriter.close()
+
+      val df = spark.read.format("avro").load(s"$dir.avro")
+      assertResult(42)(df.selectExpr("field1.member0").take(1)(0).get(0))
+      assertResult("Alice")(df.selectExpr("field1.member1").take(2).drop(1)(0).get(0))

Review comment: Done! Agreed this is much cleaner/more robust
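The assertions in the test above read `field1.member0` and `field1.member1` from a union of `INT`, `NULL`, and `STRING`. The naming that those assertions rely on can be illustrated with a toy model: drop the null branch, then number the remaining branches in order. This is only a sketch of the naming scheme as the test suggests it; `memberFields` is a hypothetical helper that works on type names as plain strings and is not Spark's Avro schema converter.

```scala
object UnionMemberSketch {
  // Toy model: union branches are given by type name only.
  // The null branch is dropped first, so it does not consume an index.
  def memberFields(branches: Seq[String]): Seq[(String, String)] =
    branches
      .filterNot(_ == "null")
      .zipWithIndex
      .map { case (tpe, i) => (s"member$i", tpe) }
}
```

Under this model, the union `[int, null, string]` from the test yields `member0` for the int branch and `member1` for the string branch, which matches the two columns the test selects.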
[GitHub] [spark] AmplabJenkins commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
AmplabJenkins commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496344050 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types
AmplabJenkins commented on issue #24722: [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types URL: https://github.com/apache/spark/pull/24722#issuecomment-496344048 Merged build finished. Test PASSed.