[jira] [Resolved] (SPARK-29870) Unify the logic of multi-units interval string to CalendarInterval
[ https://issues.apache.org/jira/browse/SPARK-29870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29870. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26491 [https://github.com/apache/spark/pull/26491] > Unify the logic of multi-units interval string to CalendarInterval > -- > > Key: SPARK-29870 > URL: https://issues.apache.org/jira/browse/SPARK-29870 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > We now have two different implementations for multi-units interval strings to > CalendarInterval type values. > One is used to convert interval string literals to CalendarInterval. This > approach re-delegates the interval string to the Spark parser, which handles > the string as a `singleInterval` -> `multiUnitsInterval` and eventually calls > `IntervalUtils.fromUnitStrings`. > The other is used in `Cast`, which eventually calls > `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the > other. > We should unify these two for better performance and simpler logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
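For readers following along, here is a minimal spark-shell sketch of how each of the two paths above is reached (illustrative only; the ~10x figure is the ticket's claim, not re-measured here):

{code}
// Path 1: an interval literal is handled by the SQL parser
// (singleInterval -> multiUnitsInterval -> IntervalUtils.fromUnitStrings).
spark.sql("SELECT INTERVAL '1 year 2 days 3 hours'").show()

// Path 2: casting a string to interval goes through Cast and ends up in
// IntervalUtils.stringToInterval, the faster of the two implementations.
spark.sql("SELECT CAST('1 year 2 days 3 hours' AS interval)").show()
{code}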
[jira] [Assigned] (SPARK-29870) Unify the logic of multi-units interval string to CalendarInterval
[ https://issues.apache.org/jira/browse/SPARK-29870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29870: --- Assignee: Kent Yao > Unify the logic of multi-units interval string to CalendarInterval > -- > > Key: SPARK-29870 > URL: https://issues.apache.org/jira/browse/SPARK-29870 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > We now have two different implementations for multi-units interval strings to > CalendarInterval type values. > One is used to convert interval string literals to CalendarInterval. This > approach re-delegates the interval string to the Spark parser, which handles > the string as a `singleInterval` -> `multiUnitsInterval` and eventually calls > `IntervalUtils.fromUnitStrings`. > The other is used in `Cast`, which eventually calls > `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the > other. > We should unify these two for better performance and simpler logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29940) Whether the parameter "spark.yarn.historyServer.address" contains a schema
[ https://issues.apache.org/jira/browse/SPARK-29940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hehuiyuan updated SPARK-29940: -- Description: !image-2019-11-18-15-44-10-358.png|width=815,height=156! !image-2019-11-18-15-45-33-295.png|width=673,height=273! was: !image-2019-11-18-15-44-10-358.png|width=815,height=156! > Whether the parameter "spark.yarn.historyServer.address" contains a schema > -- > > Key: SPARK-29940 > URL: https://issues.apache.org/jira/browse/SPARK-29940 > Project: Spark > Issue Type: Wish > Components: Documentation >Affects Versions: 3.0.0 >Reporter: hehuiyuan >Priority: Minor > Attachments: image-2019-11-18-15-44-10-358.png, > image-2019-11-18-15-45-33-295.png > > > > !image-2019-11-18-15-44-10-358.png|width=815,height=156! > > !image-2019-11-18-15-45-33-295.png|width=673,height=273! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29940) Whether the parameter "spark.yarn.historyServer.address" contains a schema
[ https://issues.apache.org/jira/browse/SPARK-29940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hehuiyuan updated SPARK-29940: -- Attachment: image-2019-11-18-15-45-33-295.png > Whether the parameter "spark.yarn.historyServer.address" contains a schema > -- > > Key: SPARK-29940 > URL: https://issues.apache.org/jira/browse/SPARK-29940 > Project: Spark > Issue Type: Wish > Components: Documentation >Affects Versions: 3.0.0 >Reporter: hehuiyuan >Priority: Minor > Attachments: image-2019-11-18-15-44-10-358.png, > image-2019-11-18-15-45-33-295.png > > > > !image-2019-11-18-15-44-10-358.png|width=815,height=156! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29940) Whether the parameter "spark.yarn.historyServer.address" contains a schema
[ https://issues.apache.org/jira/browse/SPARK-29940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hehuiyuan updated SPARK-29940: -- Description: !image-2019-11-18-15-44-10-358.png|width=815,height=156! was: !image-2019-11-18-15-37-20-628.png! !image-2019-11-18-15-38-21-515.png! > Whether the parameter "spark.yarn.historyServer.address" contains a schema > -- > > Key: SPARK-29940 > URL: https://issues.apache.org/jira/browse/SPARK-29940 > Project: Spark > Issue Type: Wish > Components: Documentation >Affects Versions: 3.0.0 >Reporter: hehuiyuan >Priority: Minor > Attachments: image-2019-11-18-15-44-10-358.png > > > > !image-2019-11-18-15-44-10-358.png|width=815,height=156! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29940) Whether the parameter "spark.yarn.historyServer.address" contains a schema
[ https://issues.apache.org/jira/browse/SPARK-29940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hehuiyuan updated SPARK-29940: -- Attachment: image-2019-11-18-15-44-10-358.png > Whether the parameter "spark.yarn.historyServer.address" contains a schema > -- > > Key: SPARK-29940 > URL: https://issues.apache.org/jira/browse/SPARK-29940 > Project: Spark > Issue Type: Wish > Components: Documentation >Affects Versions: 3.0.0 >Reporter: hehuiyuan >Priority: Minor > Attachments: image-2019-11-18-15-44-10-358.png > > > !image-2019-11-18-15-37-20-628.png! > > !image-2019-11-18-15-38-21-515.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29783) Support SQL Standard output style for interval type
[ https://issues.apache.org/jira/browse/SPARK-29783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29783. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26418 [https://github.com/apache/spark/pull/26418] > Support SQL Standard output style for interval type > --- > > Key: SPARK-29783 > URL: https://issues.apache.org/jira/browse/SPARK-29783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Support sql standard interval-style for output. > > ||Style ||conf||Year-Month Interval||Day-Time Interval||Mixed Interval|| > |{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06| > |{{spark's current}}|ansi disabled|1 year 2 mons|1 days 2 hours 3 minutes > 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
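To make the table above concrete, a hedged spark-shell sketch of the two styles (the exact conf wiring is per the PR; the outputs in comments just echo the table):

{code}
// A year-month interval, as in the table's first column:
spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH").show(truncate = false)
// sql_standard style (ANSI enabled):     1-2
// Spark's current style (ANSI disabled): 1 year 2 mons
{code}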
[jira] [Assigned] (SPARK-29783) Support SQL Standard output style for interval type
[ https://issues.apache.org/jira/browse/SPARK-29783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29783: --- Assignee: Kent Yao > Support SQL Standard output style for interval type > --- > > Key: SPARK-29783 > URL: https://issues.apache.org/jira/browse/SPARK-29783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > Support sql standard interval-style for output. > > ||Style ||conf||Year-Month Interval||Day-Time Interval||Mixed Interval|| > |{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06| > |{{spark's current}}|ansi disabled|1 year 2 mons|1 days 2 hours 3 minutes > 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29940) Whether the parameter "spark.yarn.historyServer.address" contains a schema
hehuiyuan created SPARK-29940: - Summary: Whether the parameter "spark.yarn.historyServer.address" contains a schema Key: SPARK-29940 URL: https://issues.apache.org/jira/browse/SPARK-29940 Project: Spark Issue Type: Wish Components: Documentation Affects Versions: 3.0.0 Reporter: hehuiyuan !image-2019-11-18-15-37-20-628.png! !image-2019-11-18-15-38-21-515.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25694: - Assignee: Zhou Jiang (was: Dongjoon Hyun) > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.4, 3.0.0 >Reporter: Bo Yang >Assignee: Zhou Jiang >Priority: Minor > Fix For: 3.0.0 > > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > to return an FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause an exception when using some third-party http > libraries (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Here is an example exception when using scalaj.http in Spark: > {code} > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > {code} > > One option to fix the issue is to return null in > URLStreamHandlerFactory.createURLStreamHandler when the protocol is > http/https, so it will use the default behavior and be compatible with > scalaj.http. Following is the code example: > {code} > class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with > Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use system default URLStreamHandler > null > } else { > handler > } > } > } > {code} > I would like to get some discussion here before submitting a pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29939) Add a conf for CompressionCodec for Ser/Deser of MapOutputStatus
Xiao Li created SPARK-29939: --- Summary: Add a conf for CompressionCodec for Ser/Deser of MapOutputStatus Key: SPARK-29939 URL: https://issues.apache.org/jira/browse/SPARK-29939 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiao Li Assignee: wuyi All the other compression use cases have a conf. Could we do it for this one too? See the examples: https://github.com/apache/spark/blob/1b575ef5d1b8e3e672b2fca5c354d6678bd78bd1/core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala#L67-L73 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
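For illustration, a sketch of what such an entry could look like, following the ConfigBuilder pattern in the linked SerializerManager code — the key name and default below are assumptions, not a settled choice:

{code}
// Hypothetical config entry, modeled on the existing compression confs:
private[spark] val MAP_STATUS_COMPRESSION_CODEC =
  ConfigBuilder("spark.shuffle.mapStatus.compression.codec")
    .doc("The codec used to compress MapOutputStatus. By default, Spark " +
      "provides four codecs: lz4, lzf, snappy, and zstd.")
    .stringConf
    .createWithDefault("zstd")
{code}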
[jira] [Resolved] (SPARK-29020) Unifying behaviour between array_sort and sort_array
[ https://issues.apache.org/jira/browse/SPARK-29020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29020. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25728 [https://github.com/apache/spark/pull/25728] > Unifying behaviour between array_sort and sort_array > > > Key: SPARK-29020 > URL: https://issues.apache.org/jira/browse/SPARK-29020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Assignee: German Schiavon Matteo >Priority: Major > Fix For: 3.0.0 > > > I've noticed that there are two functions to sort arrays, *sort_array* and > *array_sort*. > *sort_array* is from 1.5.0 and it has the possibility of ordering both > ascending and descending. > *array_sort* is from 2.4.0 and it only has the possibility of ordering in > ascending order. > Basically I just added the possibility of ordering either ascending or > descending using *array_sort*. > I think it would be good to have unified behaviours. > > This is the link to the [PR|https://github.com/apache/spark/pull/25728] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
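As a quick illustration of the asymmetry being unified (spark-shell sketch; results as documented for these functions):

{code}
// sort_array (since 1.5.0) accepts an ascending flag:
spark.sql("SELECT sort_array(array(3, 1, 2), false)").show()  // [3, 2, 1]

// array_sort (since 2.4.0) could only sort ascending before this change:
spark.sql("SELECT array_sort(array(3, 1, 2))").show()         // [1, 2, 3]
{code}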
[jira] [Assigned] (SPARK-29020) Unifying behaviour between array_sort and sort_array
[ https://issues.apache.org/jira/browse/SPARK-29020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29020: Assignee: German Schiavon Matteo > Unifying behaviour between array_sort and sort_array > > > Key: SPARK-29020 > URL: https://issues.apache.org/jira/browse/SPARK-29020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Assignee: German Schiavon Matteo >Priority: Major > > I've noticed that there are two functions to sort arrays, *sort_array* and > *array_sort*. > *sort_array* is from 1.5.0 and it has the possibility of ordering both > ascending and descending. > *array_sort* is from 2.4.0 and it only has the possibility of ordering in > ascending order. > Basically I just added the possibility of ordering either ascending or > descending using *array_sort*. > I think it would be good to have unified behaviours. > > This is the link to the [PR|https://github.com/apache/spark/pull/25728] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29600) array_contains built in function is not backward compatible in 3.0
[ https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976319#comment-16976319 ] Udbhav Agrawal edited comment on SPARK-29600 at 11/18/19 6:56 AM: -- Hi [~hyukjin.kwon], this failure is because we no longer cast the literal to the array's element type after the above behavior change. For example: array(0.1,0.2,0.33) has element type decimal(2,2), and the literals 0.1 and 0.2 inside it are also changed to decimal(2,2); but if we check for 0.2, which is actually of type decimal(1,1), the query fails because its data type doesn't match the array's element type. was (Author: udbhav agrawal): Hi [~hyukjin.kwon], this failure is because, after the above behavior change, Spark doesn't cast the literal to the array's element type. For example: > array_contains built in function is not backward compatible in 3.0 > -- > > Key: SPARK-29600 > URL: https://issues.apache.org/jira/browse/SPARK-29600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws an > exception in 3.0 whereas in 2.3.2 it works fine. > Spark 3.0 output: > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), > CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS > DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS > DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function > array_contains should have been array followed by a value with same element > type, but it's [array, decimal(1,1)].; line 1 pos 7; > 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), > cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as > decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), > cast(0.033 as decimal(13,3))), 0.2), None)] > Spark 2.3.2 output > 0: jdbc:hive2://10.18.18.214:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > |array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), > CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS > DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), > CAST(0.2 AS DECIMAL(13,3)))| > |true| > 1 row selected (0.18 seconds) > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
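To make the failure and a possible workaround concrete, a hedged spark-shell sketch (the explicit cast mirrors what 2.3.2 did implicitly per the output above; not independently verified here):

{code}
// Fails in 3.0: the literal .2 stays decimal(1,1) and is no longer cast
// to the array's element type decimal(13,3):
spark.sql("SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2)")

// Workaround sketch: cast the literal explicitly so both sides match:
spark.sql(
  "SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), " +
  "CAST(.2 AS DECIMAL(13,3)))")
{code}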
[jira] [Commented] (SPARK-29600) array_contains built in function is not backward compatible in 3.0
[ https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976319#comment-16976319 ] Udbhav Agrawal commented on SPARK-29600: Hi [~hyukjin.kwon], this failure is because, after the above behavior change, Spark doesn't cast the literal to the array's element type. For example: > array_contains built in function is not backward compatible in 3.0 > -- > > Key: SPARK-29600 > URL: https://issues.apache.org/jira/browse/SPARK-29600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws an > exception in 3.0 whereas in 2.3.2 it works fine. > Spark 3.0 output: > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), > CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS > DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS > DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function > array_contains should have been array followed by a value with same element > type, but it's [array, decimal(1,1)].; line 1 pos 7; > 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), > cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as > decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), > cast(0.033 as decimal(13,3))), 0.2), None)] > Spark 2.3.2 output > 0: jdbc:hive2://10.18.18.214:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > |array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), > CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS > DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), > CAST(0.2 AS DECIMAL(13,3)))| > |true| > 1 row selected (0.18 seconds) > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29938) Add batching in alter table add partition flow
[ https://issues.apache.org/jira/browse/SPARK-29938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakhar Jain updated SPARK-29938: - Description: When a lot of new partitions are added by an Insert query on a partitioned datasource table, sometimes the query fails with - {noformat} An error was encountered: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) at org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798) at org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137) {noformat} This happens because adding thousands of partitions in a single call takes a lot of time and the client eventually times out. Also, adding a lot of partitions can lead to OOM in Hive Metastore (a similar issue in the [recover partition flow|https://github.com/apache/spark/pull/14607] was fixed). Steps to reproduce - {noformat} case class Partition(data: Int, partition_key: Int) val df = sc.parallelize(1 to 15000, 15000).map(x => Partition(x,x)).toDF df.registerTempTable("temp_table") spark.sql("""CREATE TABLE `test_table` (`data` INT, `partition_key` INT) USING parquet PARTITIONED BY (partition_key) """) spark.sql("INSERT OVERWRITE TABLE test_table select * from temp_table").collect() {noformat} was: When a lot of new partitions are added by an Insert query on a partitioned datasource table, sometimes the query fails with - {noformat} An error was encountered: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) at org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798) at org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137) {noformat} This happens because adding thousands of partitions in a single call takes a lot of time and the client eventually times out. Also, adding a lot of partitions can lead to OOM in Hive Metastore (a similar issue in the [recover partition flow|https://github.com/apache/spark/pull/14607] was fixed). Steps to reproduce - {noformat} case class Partition(data: Int, partition_key: Int) val df = sc.parallelize(1 to 15000, 15000).map(x => Partition(x,x)).toDF df.registerTempTable("temp_table") spark.sql("""CREATE TABLE `test_table` (`data` INT, `partition_key` INT) USING parquet PARTITIONED BY (partition_key) """) spark.sql("INSERT OVERWRITE TABLE test_table select * from temp_table").collect() {noformat} > Add batching in alter table add partition flow > -- > > Key: SPARK-29938 > URL: https://issues.apache.org/jira/browse/SPARK-29938 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4 >Reporter: Prakhar Jain >Priority: Major > > When a lot of new partitions are added by an Insert query on a partitioned > datasource table, sometimes the query fails with - > {noformat} > An error was encountered: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.thrift.transport.TTransportException: > java.net.SocketTimeoutException: Read timed out; at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798) > at > org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137) > {noformat} > This happens because adding thousands of partitions in a single call takes a lot > of time and the client eventually times out. > Also, adding a lot of partitions can lead to OOM in Hive Metastore (a similar > issue in the [recover partition flow|https://github.com/apache/spark/pull/14607] was fixed).
[jira] [Updated] (SPARK-29938) Add batching in alter table add partition flow
[ https://issues.apache.org/jira/browse/SPARK-29938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakhar Jain updated SPARK-29938: - Description: When a lot of new partitions are added by an Insert query on a partitioned datasource table, sometimes the query fails with - {noformat} An error was encountered: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) at org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798) at org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137) {noformat} This happens because adding thousands of partitions in a single call takes a lot of time and the client eventually times out. Also, adding a lot of partitions can lead to OOM in Hive Metastore (a similar issue in the [recover partition flow|https://github.com/apache/spark/pull/14607] was fixed). Steps to reproduce - {noformat} case class Partition(data: Int, partition_key: Int) val df = sc.parallelize(1 to 15000, 15000).map(x => Partition(x,x)).toDF df.registerTempTable("temp_table") spark.sql("""CREATE TABLE `test_table` (`data` INT, `partition_key` INT) USING parquet PARTITIONED BY (partition_key) """) spark.sql("INSERT OVERWRITE TABLE test_table select * from temp_table").collect() {noformat} was: When a lot of new partitions are added by an Insert query on a partitioned datasource table, sometimes the query fails with - {noformat} An error was encountered: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) at org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798) at org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137) {noformat} This happens because adding thousands of partitions in a single call takes a lot of time and the client eventually times out. Also, adding a lot of partitions can lead to OOM in Hive Metastore (a similar issue in the [recover partition flow|https://github.com/apache/spark/pull/14607] was fixed). Steps to reproduce - {noformat} case class Partition(data: Int, partition_key: Int) val df = sc.parallelize(1 to 15000, 15000).map(x => Partition(x,x)).toDF df.registerTempTable("temp_table") spark.sql("""CREATE TABLE `test_table` (`data` INT, `partition_key` INT) USING parquet PARTITIONED BY (partition_key) """) spark.sql("INSERT OVERWRITE TABLE test_table select * from temp_table").collect() {noformat} > Add batching in alter table add partition flow > -- > > Key: SPARK-29938 > URL: https://issues.apache.org/jira/browse/SPARK-29938 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4 >Reporter: Prakhar Jain >Priority: Major > > When a lot of new partitions are added by an Insert query on a partitioned > datasource table, sometimes the query fails with - > {noformat} > An error was encountered: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.thrift.transport.TTransportException: > java.net.SocketTimeoutException: Read timed out; at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798) > at > org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137) > {noformat} > This happens because adding thousands of partitions in a single call takes a lot > of time and the client eventually times out. > Also, adding a lot of partitions can lead to OOM in Hive Metastore (a similar > issue in the [recover partition flow|https://github.com/apache/spark/pull/14607] was fixed).
[jira] [Created] (SPARK-29938) Add batching in alter table add partition flow
Prakhar Jain created SPARK-29938: Summary: Add batching in alter table add partition flow Key: SPARK-29938 URL: https://issues.apache.org/jira/browse/SPARK-29938 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4, 2.3.4 Reporter: Prakhar Jain When a lot of new partitions are added by an Insert query on a partitioned datasource table, sometimes the query fails with - {noformat} An error was encountered: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) at org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798) at org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137) {noformat} This happens because adding thousands of partitions in a single call takes a lot of time and the client eventually times out. Also, adding a lot of partitions can lead to OOM in Hive Metastore (a similar issue in the [recover partition flow|https://github.com/apache/spark/pull/14607] was fixed). Steps to reproduce - {noformat} case class Partition(data: Int, partition_key: Int) val df = sc.parallelize(1 to 15000, 15000).map(x => Partition(x,x)).toDF df.registerTempTable("temp_table") spark.sql("""CREATE TABLE `test_table` (`data` INT, `partition_key` INT) USING parquet PARTITIONED BY (partition_key) """) spark.sql("INSERT OVERWRITE TABLE test_table select * from temp_table").collect() {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
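As a reader's aid, a minimal sketch of the batching idea (the batch size and variable names are assumptions; `parts` stands for the Seq[CatalogTablePartition] the command builds, and the real change would live in AlterTableAddPartitionCommand):

{code}
// Hypothetical batching: add partitions to the metastore in fixed-size
// groups instead of one huge createPartitions call, to avoid client
// timeouts and metastore OOM. batchSize is an assumed value.
val batchSize = 100
parts.grouped(batchSize).foreach { batch =>
  sparkSession.sessionState.catalog.createPartitions(
    table.identifier, batch, ignoreIfExists = true)
}
{code}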
[jira] [Issue Comment Deleted] (SPARK-29587) Real data type is not supported in Spark SQL which is supported in PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-29587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh updated SPARK-29587: Comment: was deleted (was: I will analyse this issue) > Real data type is not supported in Spark SQL which is supported in PostgreSQL > -- > > Key: SPARK-29587 > URL: https://issues.apache.org/jira/browse/SPARK-29587 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > Real data type is not supported in Spark SQL which is supported in > PostgreSQL. > +*In PostgreSQL the query succeeds*+ > CREATE TABLE weather2(prcp real); > insert into weather2 values(2.5); > select * from weather2; > > || ||prcp|| > |1|2,5| > +*In Spark SQL we get an error*+ > spark-sql> CREATE TABLE weather2(prcp real); > Error in query: > DataType real is not supported.(line 1, pos 27) > == SQL == > CREATE TABLE weather2(prcp real) > --- > It would be better to add the "real" data type support in Spark SQL as well > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-25694. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26530 [https://github.com/apache/spark/pull/26530] > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.4, 3.0.0 >Reporter: Bo Yang >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > to return an FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause an exception when using some third-party http > libraries (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Here is an example exception when using scalaj.http in Spark: > {code} > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > {code} > > One option to fix the issue is to return null in > URLStreamHandlerFactory.createURLStreamHandler when the protocol is > http/https, so it will use the default behavior and be compatible with > scalaj.http. Following is the code example: > {code} > class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with > Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use system default URLStreamHandler > null > } else { > handler > } > } > } > {code} > I would like to get some discussion here before submitting a pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-25694: --- Assignee: Dongjoon Hyun > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.4, 3.0.0 >Reporter: Bo Yang >Assignee: Dongjoon Hyun >Priority: Minor > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > to return an FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause an exception when using some third-party http > libraries (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Here is an example exception when using scalaj.http in Spark: > {code} > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > {code} > > One option to fix the issue is to return null in > URLStreamHandlerFactory.createURLStreamHandler when the protocol is > http/https, so it will use the default behavior and be compatible with > scalaj.http. Following is the code example: > {code} > class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with > Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use system default URLStreamHandler > null > } else { > handler > } > } > } > {code} > I would like to get some discussion here before submitting a pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29929) Allow V2 Datasources to require a data distribution
[ https://issues.apache.org/jira/browse/SPARK-29929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976294#comment-16976294 ] Jungtaek Lim commented on SPARK-29929: -- Possibly a duplicate of SPARK-23889, though no one is working on SPARK-23889 as of now. SPARK-23889 has broader requirements. > Allow V2 Datasources to require a data distribution > --- > > Key: SPARK-29929 > URL: https://issues.apache.org/jira/browse/SPARK-29929 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Andrew K Long >Priority: Major > > Currently users are unable to specify that their v2 Datasource requires a > particular Distribution before inserting data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
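For concreteness, a purely hypothetical sketch of what such a hook could look like on the write path — none of these names exist in Spark; they only illustrate the request:

{code}
// Hypothetical trait a v2 source could mix into its write builder so
// Spark repartitions the input before invoking the writer.
trait SupportsRequiredDistribution {
  // Columns Spark would be asked to cluster/partition the input rows by.
  def requiredClustering(): Array[String]
}
{code}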
[jira] [Resolved] (SPARK-29936) Fix SparkR lint errors and add lint-r GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-29936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29936. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26564 [https://github.com/apache/spark/pull/26564] > Fix SparkR lint errors and add lint-r GitHub Action > --- > > Key: SPARK-29936 > URL: https://issues.apache.org/jira/browse/SPARK-29936 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29936) Fix SparkR lint errors and add lint-r GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-29936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29936: - Assignee: Dongjoon Hyun > Fix SparkR lint errors and add lint-r GitHub Action > --- > > Key: SPARK-29936 > URL: https://issues.apache.org/jira/browse/SPARK-29936 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29907) Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte.
[ https://issues.apache.org/jira/browse/SPARK-29907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29907: --- Assignee: Xianyin Xin > Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte. > - > > Key: SPARK-29907 > URL: https://issues.apache.org/jira/browse/SPARK-29907 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Assignee: Xianyin Xin >Priority: Major > > SPARK-27444 introduced `dmlStatementNoWith` so that any dml that needs cte > support can leverage it. It would be better if we move DELETE/UPDATE/MERGE rules to > `dmlStatementNoWith`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29907) Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte.
[ https://issues.apache.org/jira/browse/SPARK-29907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29907. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26536 [https://github.com/apache/spark/pull/26536] > Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte. > - > > Key: SPARK-29907 > URL: https://issues.apache.org/jira/browse/SPARK-29907 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Assignee: Xianyin Xin >Priority: Major > Fix For: 3.0.0 > > > SPARK-27444 introduced `dmlStatementNoWith` so that any dml that needs cte > support can leverage it. It would be better if we move DELETE/UPDATE/MERGE rules to > `dmlStatementNoWith`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29903) Add documentation for recursiveFileLookup
[ https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976252#comment-16976252 ] Nicholas Chammas commented on SPARK-29903: -- Happy to do that. Going to wait for [this PR|https://github.com/apache/spark/pull/26525] to be completed before writing any docs though, so I can address both the DataFrame and SQL APIs in one go. > Add documentation for recursiveFileLookup > - > > Key: SPARK-29903 > URL: https://issues.apache.org/jira/browse/SPARK-29903 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively > loading data from a source directory. There is currently no documentation for > this option. > We should document this both for the DataFrame API as well as for SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
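Until those docs land, a short usage sketch for the DataFrame side (the option name comes from SPARK-27990; the format and path here are just examples):

{code}
// Recursively load all files under the base directory, ignoring
// partition discovery.
val df = spark.read
  .format("json")
  .option("recursiveFileLookup", "true")
  .load("/path/to/base/dir")  // illustrative path
{code}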
[jira] [Resolved] (SPARK-29807) Rename "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled"
[ https://issues.apache.org/jira/browse/SPARK-29807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29807. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26444 [https://github.com/apache/spark/pull/26444] > Rename "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" > - > > Key: SPARK-29807 > URL: https://issues.apache.org/jira/browse/SPARK-29807 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > The relation between "spark.sql.ansi.enabled" and "spark.sql.dialect" is > confusing, since the "PostgreSQL" dialect should contain the features of > "spark.sql.ansi.enabled". > To make things clearer, we can rename the "spark.sql.ansi.enabled" to > "spark.sql.dialect.spark.ansi.enabled", thus the option > "spark.sql.dialect.spark.ansi.enabled" is only for the Spark dialect. > For the casting and arithmetic operations, runtime exceptions should be > thrown if "spark.sql.dialect" is "spark" and > "spark.sql.dialect.spark.ansi.enabled" is true or "spark.sql.dialect" is > PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
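A brief sketch of the renamed conf in use, as described above (the runtime-exception behavior is the description's claim, shown here only for illustration):

{code}
// Spark dialect with ANSI behavior opted in via the renamed conf:
spark.conf.set("spark.sql.dialect", "spark")
spark.conf.set("spark.sql.dialect.spark.ansi.enabled", "true")

// Per the description, an invalid cast should now throw at runtime
// instead of returning null:
spark.sql("SELECT CAST('abc' AS int)").show()
{code}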
[jira] [Resolved] (SPARK-16872) Impl Gaussian Naive Bayes Classifier
[ https://issues.apache.org/jira/browse/SPARK-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-16872. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26413 [https://github.com/apache/spark/pull/26413] > Impl Gaussian Naive Bayes Classifier > > > Key: SPARK-16872 > URL: https://issues.apache.org/jira/browse/SPARK-16872 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.0 > > > I implemented Gaussian NB according to scikit-learn's {{GaussianNB}}. > In GaussianNB model, the {{theta}} matrix is used to store means and there is > an extra {{sigma}} matrix storing the variance of each feature. > GaussianNB in spark > {code} > scala> import org.apache.spark.ml.classification.GaussianNaiveBayes > import org.apache.spark.ml.classification.GaussianNaiveBayes > scala> val path = > "/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > path: String = > /Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt > scala> val data = spark.read.format("libsvm").load(path).persist() > data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: > double, features: vector] > scala> val gnb = new GaussianNaiveBayes() > gnb: org.apache.spark.ml.classification.GaussianNaiveBayes = gnb_54c50467306c > scala> val model = gnb.fit(data) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training: numPartitions=1 > storageLevel=StorageLevel(1 replicas) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numFeatures":4} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numClasses":3} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training finished > model: org.apache.spark.ml.classification.GaussianNaiveBayesModel = > GaussianNaiveBayesModel (uid=gnb_54c50467306c) with 3 classes > scala> model.pi > res0: org.apache.spark.ml.linalg.Vector = > [-1.0986122886681098,-1.0986122886681098,-1.0986122886681098] > scala> model.pi.toArray.map(math.exp) > res1: Array[Double] = Array(0., 0., > 0.)
> scala> model.theta > res2: org.apache.spark.ml.linalg.Matrix = > 0.270067018001 -0.188540006 0.543050720001 0.60546 > -0.60779998 0.18172 -0.842711740006 > -0.88139998 > -0.091425964 -0.35858001 0.105084738 > 0.021666701507102017 > scala> model.sigma > res3: org.apache.spark.ml.linalg.Matrix = > 0.1223012510889361 0.07078051983960698 0.0343595243976 > 0.051336071297393815 > 0.03758145300924998 0.09880280046403413 0.003390296940069426 > 0.007822241779598893 > 0.08058763609659315 0.06701386661293329 0.024866409227781675 > 0.02661391644759426 > scala> model.transform(data).select("probability").take(10) > [rdd_68_0] > res4: Array[org.apache.spark.sql.Row] = > Array([[1.0627410543476422E-21,0.9938,6.2765233965353945E-15]], > [[7.254521422345374E-26,1.0,1.3849442153180895E-18]], > [[1.9629244119173135E-24,0.9998,1.9424765181237926E-16]], > [[6.061218297948492E-22,0.9902,9.853216073401884E-15]], > [[0.9972225671942837,8.844241161578932E-165,0.002777432805716399]], > [[5.361683970373604E-26,1.0,2.3004604508982183E-18]], > [[0.01062850630038623,3.3102617689978775E-100,0.9893714936996136]], > [[1.9297314618271785E-4,2.124922209137708E-71,0.9998070268538172]], > [[3.118816393732361E-27,1.0,6.5310299615983584E-21]], > [[0.926009854522,8.734773657627494E-206,7.399014547943611E-6]]) > scala> model.transform(data).select("prediction").take(10) > [rdd_68_0] > res5: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], > [0.0], [1.0], [2.0], [2.0], [1.0], [0.0]) > {code} > GaussianNB in scikit-learn > {code} > import numpy as np > from sklearn.naive_bayes import GaussianNB > from sklearn.datasets import load_svmlight_file > path = > '/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt' > X, y = load_svmlight_file(path) > X = X.toarray() > clf = GaussianNB() > clf.fit(X, y) > >>> clf.class_prior_ > array([ 0., 0., 0.]) > >>> clf.theta_ > array([[ 0.2701, -0.1885, 0.54305072, 0.6055], >[-0.6078, 0.1817, -0.84271174, -0.8814], >[-0.0914, -0.3586, 0.10508474, 0.0216667 ]]) > > >>>
[jira] [Updated] (SPARK-29936) Fix SparkR lint errors and add lint-r GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-29936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29936: -- Issue Type: Bug (was: Task) > Fix SparkR lint errors and add lint-r GitHub Action > --- > > Key: SPARK-29936 > URL: https://issues.apache.org/jira/browse/SPARK-29936 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29936) Fix SparkR lint errors and add lint-r GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-29936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29936: -- Summary: Fix SparkR lint errors and add lint-r GitHub Action (was: Add `lint-r` GitHub Action) > Fix SparkR lint errors and add lint-r GitHub Action > --- > > Key: SPARK-29936 > URL: https://issues.apache.org/jira/browse/SPARK-29936 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29936) Add `lint-r` GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-29936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29936: -- Component/s: (was: Tests) SparkR > Add `lint-r` GitHub Action > -- > > Key: SPARK-29936 > URL: https://issues.apache.org/jira/browse/SPARK-29936 > Project: Spark > Issue Type: Task > Components: SparkR >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29937) Make FileSourceScanExec class fields lazy
ulysses you created SPARK-29937: --- Summary: Make FileSourceScanExec class fields lazy Key: SPARK-29937 URL: https://issues.apache.org/jira/browse/SPARK-29937 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: ulysses you Since SPARK-28346 (PR [25111|https://github.com/apache/spark/pull/25111]), QueryExecution will copy all nodes stage-by-stage. This makes almost every node be instantiated twice. So we should make all class fields lazy to avoid creating more unexpected objects. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
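A tiny stand-in illustration of the proposed shape (the class and field here are made up; FileSourceScanExec's real fields are the actual targets):

{code}
// An eager val pays its construction cost on every copy of the node;
// a lazy val defers the cost until (and unless) the field is read.
class NodeLike {
  // before: val metadata: Map[String, String] = expensiveBuild()
  lazy val metadata: Map[String, String] = expensiveBuild()

  private def expensiveBuild(): Map[String, String] = Map("k" -> "v")
}
{code}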
[jira] [Resolved] (SPARK-29581) Enable cleanup old event log files
[ https://issues.apache.org/jira/browse/SPARK-29581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-29581. -- Resolution: Invalid We took a different approach: see SPARK-29779 > Enable cleanup old event log files > --- > > Key: SPARK-29581 > URL: https://issues.apache.org/jira/browse/SPARK-29581 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue can be started only once SPARK-29579 is addressed properly. > After SPARK-29579, Spark would guarantee strong compatibility on both live > entities and snapshots, which means a snapshot file could replace older origin > event log files. This issue tracks the efforts on automatically cleaning up > old event logs if a snapshot file can replace them, which keeps the overall size of > the event log for a streaming query manageable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
[ https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976209#comment-16976209 ] Dongjoon Hyun commented on SPARK-29935: --- SPARK-29936 will recover Lint-R in GitHub Action. > Remove `Spark QA Compile` Jenkins Dashboard (and jobs) > -- > > Key: SPARK-29935 > URL: https://issues.apache.org/jira/browse/SPARK-29935 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > The following dashboard has 6 jobs. > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ > Those 6 jobs are a subset of GitHub Action now. So, we can save our Jenkins > computing resources and reduce our maintenance efforts. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
[ https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976208#comment-16976208 ] Dongjoon Hyun edited comment on SPARK-29935 at 11/18/19 12:19 AM: -- Yes, it's much better now. Also, we can re-trigger the failed task indefinitely. I've been monitoring it, and it still fails sometimes due to Maven downloads (we didn't cache everything). In addition to that, for now, 2 of the above jobs are broken in Jenkins. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ was (Author: dongjoon): Yes, it's much better now. Also, we can re-trigger the failed task indefinitely. I've been monitoring it, and it still fails sometimes due to Maven downloads (we didn't cache everything). For now, 2 of the above jobs are broken in Jenkins. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ > Remove `Spark QA Compile` Jenkins Dashboard (and jobs) > -- > > Key: SPARK-29935 > URL: https://issues.apache.org/jira/browse/SPARK-29935 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > The following dashboard has 6 jobs. > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ > Those 6 jobs are a subset of GitHub Action now. So, we can save our Jenkins > computing resources and reduce our maintenance efforts. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
[ https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976208#comment-16976208 ] Dongjoon Hyun commented on SPARK-29935: --- Yes, it's much better now. Also, we can re-trigger the failed task indefinitely. I've been monitoring it, and it still fails sometimes due to Maven downloads (we didn't cache everything). For now, 2 of the above jobs are broken in Jenkins. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ > Remove `Spark QA Compile` Jenkins Dashboard (and jobs) > -- > > Key: SPARK-29935 > URL: https://issues.apache.org/jira/browse/SPARK-29935 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > The following dashboard has 6 jobs. > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ > Those 6 jobs are a subset of GitHub Action now. So, we can save our Jenkins > computing resources and reduce our maintenance efforts. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
[ https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976197#comment-16976197 ] Sean R. Owen commented on SPARK-29935: -- Do the GitHub Actions work reliably now? I haven't watched them in a while. > Remove `Spark QA Compile` Jenkins Dashboard (and jobs) > -- > > Key: SPARK-29935 > URL: https://issues.apache.org/jira/browse/SPARK-29935 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > The following dashboard has 6 jobs. > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ > Those 6 jobs are a subset of GitHub Action now. So, we can save our Jenkins > computing resources and reduce our maintenance efforts. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29936) Add `lint-r` GitHub Action
Dongjoon Hyun created SPARK-29936: - Summary: Add `lint-r` GitHub Action Key: SPARK-29936 URL: https://issues.apache.org/jira/browse/SPARK-29936 Project: Spark Issue Type: Task Components: Tests Affects Versions: 2.4.5, 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
[ https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29935: -- Priority: Minor (was: Major) > Remove `Spark QA Compile` Jenkins Dashboard (and jobs) > -- > > Key: SPARK-29935 > URL: https://issues.apache.org/jira/browse/SPARK-29935 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > The following dashboard has 6 jobs. > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ > Those 6 jobs are a subset of GitHub Action now. So, we can save our Jenkins > computing resources and reduce our maintenance efforts. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
[ https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976153#comment-16976153 ] Dongjoon Hyun commented on SPARK-29935: --- cc [~shaneknapp], [~srowen] > Remove `Spark QA Compile` Jenkins Dashboard (and jobs) > -- > > Key: SPARK-29935 > URL: https://issues.apache.org/jira/browse/SPARK-29935 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > The following dashboard has 6 jobs. > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ > Those 6 jobs are a subset of GitHub Action now. So, we can save our Jenkins > computing resources and reduce our maintenance efforts. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
Dongjoon Hyun created SPARK-29935: - Summary: Remove `Spark QA Compile` Jenkins Dashboard (and jobs) Key: SPARK-29935 URL: https://issues.apache.org/jira/browse/SPARK-29935 Project: Spark Issue Type: Task Components: Project Infra Affects Versions: 2.4.5, 3.0.0 Reporter: Dongjoon Hyun The following dashboard has 6 jobs. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ Those 6 jobs are a subset of GitHub Action now. So, we can save our Jenkins computing resources and reduce our maintenance efforts. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29934) Dataset support GraphX
[ https://issues.apache.org/jira/browse/SPARK-29934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976151#comment-16976151 ] Dongjoon Hyun commented on SPARK-29934: --- Hi, [~darion]. Given your context, the issue type should be `Improvement` or `New Feature` instead of `Bug`. In addition, in that case, `Affects Version/s` should be `3.0.0`. BTW, JIRA is not for Q&A. For questions, please ask on the dev mailing list first. > Dataset support GraphX > -- > > Key: SPARK-29934 > URL: https://issues.apache.org/jira/browse/SPARK-29934 > Project: Spark > Issue Type: Bug > Components: Graph, GraphX, Spark Core >Affects Versions: 2.4.4 >Reporter: darion yaphet >Priority: Minor > > Do we have any plan to support GraphX with Dataset? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29931) Declare all SQL legacy configs as will be removed in Spark 4.0
[ https://issues.apache.org/jira/browse/SPARK-29931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976134#comment-16976134 ] Reynold Xin commented on SPARK-29931: - You can say "This config will be removed in Spark 4.0 or a later release." > Declare all SQL legacy configs as will be removed in Spark 4.0 > -- > > Key: SPARK-29931 > URL: https://issues.apache.org/jira/browse/SPARK-29931 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Add the sentence to the descriptions of all legacy SQL configs that existed before > Spark 3.0: "This config will be removed in Spark 4.0." Here is the list of > such configs: > * spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName > * spark.sql.legacy.literal.pickMinimumPrecision > * spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation > * spark.sql.legacy.sizeOfNull > * spark.sql.legacy.replaceDatabricksSparkAvro.enabled > * spark.sql.legacy.setopsPrecedence.enabled > * spark.sql.legacy.integralDivide.returnBigint > * spark.sql.legacy.bucketedTableScan.outputOrdering > * spark.sql.legacy.parser.havingWithoutGroupByAsWhere > * spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue > * spark.sql.legacy.setCommandRejectsSparkCoreConfs > * spark.sql.legacy.utcTimestampFunc.enabled > * spark.sql.legacy.typeCoercion.datetimeToString > * spark.sql.legacy.looseUpcast > * spark.sql.legacy.ctePrecedence.enabled > * spark.sql.legacy.arrayExistsFollowsThreeValuedLogic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
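To make the suggested wording concrete, here is a self-contained Scala sketch; the ConfigEntry/legacyConf names are illustrative stand-ins, not Spark's real internal SQLConf builder API.

{code}
// Illustrative only: a tiny stand-in for Spark's internal config builder.
final case class ConfigEntry(key: String, default: Boolean, doc: String)

// Appends the agreed deprecation sentence to every legacy config's doc text.
def legacyConf(key: String, default: Boolean, doc: String): ConfigEntry =
  ConfigEntry(key, default,
    doc + " This config will be removed in Spark 4.0 or a later release.")

val sizeOfNull = legacyConf(
  "spark.sql.legacy.sizeOfNull",
  default = true,
  doc = "If true, size(null) returns -1 instead of null.")
{code}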
[jira] [Comment Edited] (SPARK-29758) json_tuple truncates fields
[ https://issues.apache.org/jira/browse/SPARK-29758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976106#comment-16976106 ] Maxim Gekk edited comment on SPARK-29758 at 11/17/19 6:17 PM: -- Another solution is to disable this optimization: [https://github.com/apache/spark/blob/v2.4.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L475-L478] was (Author: maxgekk): Another solution is to remove this optimization: https://github.com/apache/spark/blob/v2.4.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L475-L478 > json_tuple truncates fields > --- > > Key: SPARK-29758 > URL: https://issues.apache.org/jira/browse/SPARK-29758 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4 > Environment: EMR 5.15.0 (Spark 2.3.0) And MacBook Pro (Mojave > 10.14.3, Spark 2.4.4) > Jdk 8, Scala 2.11.12 >Reporter: Stanislav >Priority: Major > > `json_tuple` has inconsistent behaviour with `from_json` - but only if json > string is longer than 2700 characters or so. > This can be reproduced in spark-shell and on cluster, but not in scalatest, > for some reason. > {code} > import org.apache.spark.sql.functions.{from_json, json_tuple} > import org.apache.spark.sql.types._ > val counterstring = > "*3*5*7*9*12*15*18*21*24*27*30*33*36*39*42*45*48*51*54*57*60*63*66*69*72*75*78*81*84*87*90*93*96*99*103*107*111*115*119*123*127*131*135*139*143*147*151*155*159*163*167*171*175*179*183*187*191*195*199*203*207*211*215*219*223*227*231*235*239*243*247*251*255*259*263*267*271*275*279*283*287*291*295*299*303*307*311*315*319*323*327*331*335*339*343*347*351*355*359*363*367*371*375*379*383*387*391*395*399*403*407*411*415*419*423*427*431*435*439*443*447*451*455*459*463*467*471*475*479*483*487*491*495*499*503*507*511*515*519*523*527*531*535*539*543*547*551*555*559*563*567*571*575*579*583*587*591*595*599*603*607*611*615*619*623*627*631*635*639*643*647*651*655*659*663*667*671*675*679*683*687*691*695*699*703*707*711*715*719*723*727*731*735*739*743*747*751*755*759*763*767*771*775*779*783*787*791*795*799*803*807*811*815*819*823*827*831*835*839*843*847*851*855*859*863*867*871*875*879*883*887*891*895*899*903*907*911*915*919*923*927*931*935*939*943*947*951*955*959*963*967*971*975*979*983*987*991*995*1000*1005*1010*1015*1020*1025*1030*1035*1040*1045*1050*1055*1060*1065*1070*1075*1080*1085*1090*1095*1100*1105*1110*1115*1120*1125*1130*1135*1140*1145*1150*1155*1160*1165*1170*1175*1180*1185*1190*1195*1200*1205*1210*1215*1220*1225*1230*1235*1240*1245*1250*1255*1260*1265*1270*1275*1280*1285*1290*1295*1300*1305*1310*1315*1320*1325*1330*1335*1340*1345*1350*1355*1360*1365*1370*1375*1380*1385*1390*1395*1400*1405*1410*1415*1420*1425*1430*1435*1440*1445*1450*1455*1460*1465*1470*1475*1480*1485*1490*1495*1500*1505*1510*1515*1520*1525*1530*1535*1540*1545*1550*1555*1560*1565*1570*1575*1580*1585*1590*1595*1600*1605*1610*1615*1620*1625*1630*1635*1640*1645*1650*1655*1660*1665*1670*1675*1680*1685*1690*1695*1700*1705*1710*1715*1720*1725*1730*1735*1740*1745*1750*1755*1760*1765*1770*1775*1780*1785*1790*1795*1800*1805*1810*1815*1820*1825*1830*1835*1840*1845*1850*1855*1860*1865*1870*1875*1880*1885*1890*1895*1900*1905*1910*1915*1920*1925*1930*1935*1940*1945*1950*1955*1960*1965*1970*1975*1980*1985*1990*1995*2000*2005*2010*2015*2020*2025*2030*2035*2040*2045*2050*2055*2060*2065*2070*2075*2080*2085*2090*2095*2100*2105*2110*2115*2120*2125*2130*2135*2140*2145*2150*2155*2160*2165*2170*2175*2180*2185*2190*2195*2200*2205*2210*2215*2
220*2225*2230*2235*2240*2245*2250*2255*2260*2265*2270*2275*2280*2285*2290*2295*2300*2305*2310*2315*2320*2325*2330*2335*2340*2345*2350*2355*2360*2365*2370*2375*2380*2385*2390*2395*2400*2405*2410*2415*2420*2425*2430*2435*2440*2445*2450*2455*2460*2465*2470*2475*2480*2485*2490*2495*2500*2505*2510*2515*2520*2525*2530*2535*2540*2545*2550*2555*2560*2565*2570*2575*2580*2585*2590*2595*2600*2605*2610*2615*2620*2625*2630*2635*2640*2645*2650*2655*2660*2665*2670*2675*2680*2685*2690*2695*2700*2705*2710*2715*2720*2725*2730*2735*2740*2745*2750*2755*2760*2765*2770*2775*2780*2785*2790*2795*2800*" > val json_tuple_result = Seq(s"""{"test":"$counterstring"}""").toDF("json") > .withColumn("result", json_tuple('json, "test")) > .select('result) > .as[String].head.length > val from_json_result = Seq(s"""{"test":"$counterstring"}""").toDF("json") > .withColumn("parsed", from_json('json, StructType(Seq(StructField("test", > StringType) > .withColumn("result", $"parsed.test") > .select('result) > .as[String].head.length > scala> json_tuple_result > res62: Int = 2791 > scala> from_json_result > res63: Int = 2800 > {code} > Result is influenced by the total length of the json string at the moment of > parsing: > {code} > val
[jira] [Assigned] (SPARK-29930) Remove SQL configs declared to be removed in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-29930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29930: - Assignee: Maxim Gekk > Remove SQL configs declared to be removed in Spark 3.0 > -- > > Key: SPARK-29930 > URL: https://issues.apache.org/jira/browse/SPARK-29930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Need to remove the following SQL configs: > * spark.sql.fromJsonForceNullableSchema > * spark.sql.legacy.compareDateTimestampInTimestamp -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29930) Remove SQL configs declared to be removed in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-29930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29930. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26559 [https://github.com/apache/spark/pull/26559] > Remove SQL configs declared to be removed in Spark 3.0 > -- > > Key: SPARK-29930 > URL: https://issues.apache.org/jira/browse/SPARK-29930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Need to remove the following SQL configs: > * spark.sql.fromJsonForceNullableSchema > * spark.sql.legacy.compareDateTimestampInTimestamp -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29932) lint-r should do non-zero exit in case of errors
[ https://issues.apache.org/jira/browse/SPARK-29932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29932. --- Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 26561 [https://github.com/apache/spark/pull/26561] > lint-r should do non-zero exit in case of errors > > > Key: SPARK-29932 > URL: https://issues.apache.org/jira/browse/SPARK-29932 > Project: Spark > Issue Type: Bug > Components: SparkR, Tests >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29932) lint-r should do non-zero exit in case of errors
[ https://issues.apache.org/jira/browse/SPARK-29932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29932: - Assignee: Dongjoon Hyun > lint-r should do non-zero exit in case of errors > > > Key: SPARK-29932 > URL: https://issues.apache.org/jira/browse/SPARK-29932 > Project: Spark > Issue Type: Bug > Components: SparkR, Tests >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29758) json_tuple truncates fields
[ https://issues.apache.org/jira/browse/SPARK-29758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976106#comment-16976106 ] Maxim Gekk commented on SPARK-29758: Another solution is to remove this optimization: https://github.com/apache/spark/blob/v2.4.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L475-L478 > json_tuple truncates fields > --- > > Key: SPARK-29758 > URL: https://issues.apache.org/jira/browse/SPARK-29758 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4 > Environment: EMR 5.15.0 (Spark 2.3.0) And MacBook Pro (Mojave > 10.14.3, Spark 2.4.4) > Jdk 8, Scala 2.11.12 >Reporter: Stanislav >Priority: Major > > `json_tuple` has inconsistent behaviour with `from_json` - but only if json > string is longer than 2700 characters or so. > This can be reproduced in spark-shell and on cluster, but not in scalatest, > for some reason. > {code} > import org.apache.spark.sql.functions.{from_json, json_tuple} > import org.apache.spark.sql.types._ > val counterstring = > "*3*5*7*9*12*15*18*21*24*27*30*33*36*39*42*45*48*51*54*57*60*63*66*69*72*75*78*81*84*87*90*93*96*99*103*107*111*115*119*123*127*131*135*139*143*147*151*155*159*163*167*171*175*179*183*187*191*195*199*203*207*211*215*219*223*227*231*235*239*243*247*251*255*259*263*267*271*275*279*283*287*291*295*299*303*307*311*315*319*323*327*331*335*339*343*347*351*355*359*363*367*371*375*379*383*387*391*395*399*403*407*411*415*419*423*427*431*435*439*443*447*451*455*459*463*467*471*475*479*483*487*491*495*499*503*507*511*515*519*523*527*531*535*539*543*547*551*555*559*563*567*571*575*579*583*587*591*595*599*603*607*611*615*619*623*627*631*635*639*643*647*651*655*659*663*667*671*675*679*683*687*691*695*699*703*707*711*715*719*723*727*731*735*739*743*747*751*755*759*763*767*771*775*779*783*787*791*795*799*803*807*811*815*819*823*827*831*835*839*843*847*851*855*859*863*867*871*875*879*883*887*891*895*899*903*907*911*915*919*923*927*931*935*939*943*947*951*955*959*963*967*971*975*979*983*987*991*995*1000*1005*1010*1015*1020*1025*1030*1035*1040*1045*1050*1055*1060*1065*1070*1075*1080*1085*1090*1095*1100*1105*1110*1115*1120*1125*1130*1135*1140*1145*1150*1155*1160*1165*1170*1175*1180*1185*1190*1195*1200*1205*1210*1215*1220*1225*1230*1235*1240*1245*1250*1255*1260*1265*1270*1275*1280*1285*1290*1295*1300*1305*1310*1315*1320*1325*1330*1335*1340*1345*1350*1355*1360*1365*1370*1375*1380*1385*1390*1395*1400*1405*1410*1415*1420*1425*1430*1435*1440*1445*1450*1455*1460*1465*1470*1475*1480*1485*1490*1495*1500*1505*1510*1515*1520*1525*1530*1535*1540*1545*1550*1555*1560*1565*1570*1575*1580*1585*1590*1595*1600*1605*1610*1615*1620*1625*1630*1635*1640*1645*1650*1655*1660*1665*1670*1675*1680*1685*1690*1695*1700*1705*1710*1715*1720*1725*1730*1735*1740*1745*1750*1755*1760*1765*1770*1775*1780*1785*1790*1795*1800*1805*1810*1815*1820*1825*1830*1835*1840*1845*1850*1855*1860*1865*1870*1875*1880*1885*1890*1895*1900*1905*1910*1915*1920*1925*1930*1935*1940*1945*1950*1955*1960*1965*1970*1975*1980*1985*1990*1995*2000*2005*2010*2015*2020*2025*2030*2035*2040*2045*2050*2055*2060*2065*2070*2075*2080*2085*2090*2095*2100*2105*2110*2115*2120*2125*2130*2135*2140*2145*2150*2155*2160*2165*2170*2175*2180*2185*2190*2195*2200*2205*2210*2215*2220*2225*2230*2235*2240*2245*2250*2255*2260*2265*2270*2275*2280*2285*2290*2295*2300*2305*2310*2315*2320*2325*2330*2335*2340*2345*2350*2355*2360*2365*2370*2375*2380*2385*2390*2395*2400*2405*2410*2415*2420*2425*2430*2435*2440*2445*2450*2455*2460*2465*
2470*2475*2480*2485*2490*2495*2500*2505*2510*2515*2520*2525*2530*2535*2540*2545*2550*2555*2560*2565*2570*2575*2580*2585*2590*2595*2600*2605*2610*2615*2620*2625*2630*2635*2640*2645*2650*2655*2660*2665*2670*2675*2680*2685*2690*2695*2700*2705*2710*2715*2720*2725*2730*2735*2740*2745*2750*2755*2760*2765*2770*2775*2780*2785*2790*2795*2800*" > val json_tuple_result = Seq(s"""{"test":"$counterstring"}""").toDF("json") > .withColumn("result", json_tuple('json, "test")) > .select('result) > .as[String].head.length > val from_json_result = Seq(s"""{"test":"$counterstring"}""").toDF("json") > .withColumn("parsed", from_json('json, StructType(Seq(StructField("test", > StringType) > .withColumn("result", $"parsed.test") > .select('result) > .as[String].head.length > scala> json_tuple_result > res62: Int = 2791 > scala> from_json_result > res63: Int = 2800 > {code} > Result is influenced by the total length of the json string at the moment of > parsing: > {code} > val json_tuple_result_with_prefix = Seq(s"""{"prefix": "dummy", > "test":"$counterstring"}""").toDF("json") > .withColumn("result", json_tuple('json, "test")) > .select('result) > .as[String].head.length > scala> json_tuple_result_with_prefix > res64: Int = 2772 > {code}
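Until a fix lands, a possible user-side workaround (a sketch, not an official recommendation) is to parse with {{from_json}}, which is unaffected by the truncation, instead of {{json_tuple}}:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("workaround").getOrCreate()
import spark.implicits._

// Extract the field through from_json rather than json_tuple; the parsed
// value keeps its full length regardless of the JSON string's size.
val df = Seq("""{"test":"some very long value"}""").toDF("json")
val schema = StructType(Seq(StructField("test", StringType)))
val result = df
  .withColumn("parsed", from_json($"json", schema))
  .select($"parsed.test".as("result"))
result.show(truncate = false)
{code}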
[jira] [Commented] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976102#comment-16976102 ] Maxim Gekk commented on SPARK-29575: This is intentional behavior. The user's schema is forcibly set as nullable; see SPARK-23173 > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for PySpark this > bug seems to still be there. > The issue appears when using {{from_json}} to parse a column in a Spark > dataframe. It seems like {{from_json}} ignores whether the schema provided > has any {{nullable:False}} property. > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
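A small Scala demonstration of that behavior (the PySpark result is the same); this is a sketch of the observed semantics, not a new API:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("nullability").getOrCreate()
import spark.implicits._

// The user schema marks both fields as non-nullable...
val schema = new StructType()
  .add(StructField("id", LongType, nullable = false))
  .add(StructField("name", StringType, nullable = false))

val df = Seq("""{"name":"jane"}""").toDF("user")
  .withColumn("details", from_json($"user", schema))

// ...but from_json forces the schema to nullable (SPARK-23173), so the
// missing id comes back as null instead of failing the query.
df.select($"details.id").printSchema() // reports nullable = true
df.select($"details.id").show()        // id is null for the record missing it
{code}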
[jira] [Commented] (SPARK-29758) json_tuple truncates fields
[ https://issues.apache.org/jira/browse/SPARK-29758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976099#comment-16976099 ] Maxim Gekk commented on SPARK-29758: I have reproduced the issue on 2.4. The problem is in Jackson core 2.6.7. It was fixed by https://github.com/FasterXML/jackson-core/commit/554f8db0f940b2a53f974852a2af194739d65200#diff-7990edc67621822770cdc62e12d933d4R647-R650 in the version 2.7.7. We could try to back port this https://github.com/apache/spark/pull/21596 on 2.4. [~hyukjin.kwon] WDYT? > json_tuple truncates fields > --- > > Key: SPARK-29758 > URL: https://issues.apache.org/jira/browse/SPARK-29758 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4 > Environment: EMR 5.15.0 (Spark 2.3.0) And MacBook Pro (Mojave > 10.14.3, Spark 2.4.4) > Jdk 8, Scala 2.11.12 >Reporter: Stanislav >Priority: Major > > `json_tuple` has inconsistent behaviour with `from_json` - but only if json > string is longer than 2700 characters or so. > This can be reproduced in spark-shell and on cluster, but not in scalatest, > for some reason. > {code} > import org.apache.spark.sql.functions.{from_json, json_tuple} > import org.apache.spark.sql.types._ > val counterstring = > "*3*5*7*9*12*15*18*21*24*27*30*33*36*39*42*45*48*51*54*57*60*63*66*69*72*75*78*81*84*87*90*93*96*99*103*107*111*115*119*123*127*131*135*139*143*147*151*155*159*163*167*171*175*179*183*187*191*195*199*203*207*211*215*219*223*227*231*235*239*243*247*251*255*259*263*267*271*275*279*283*287*291*295*299*303*307*311*315*319*323*327*331*335*339*343*347*351*355*359*363*367*371*375*379*383*387*391*395*399*403*407*411*415*419*423*427*431*435*439*443*447*451*455*459*463*467*471*475*479*483*487*491*495*499*503*507*511*515*519*523*527*531*535*539*543*547*551*555*559*563*567*571*575*579*583*587*591*595*599*603*607*611*615*619*623*627*631*635*639*643*647*651*655*659*663*667*671*675*679*683*687*691*695*699*703*707*711*715*719*723*727*731*735*739*743*747*751*755*759*763*767*771*775*779*783*787*791*795*799*803*807*811*815*819*823*827*831*835*839*843*847*851*855*859*863*867*871*875*879*883*887*891*895*899*903*907*911*915*919*923*927*931*935*939*943*947*951*955*959*963*967*971*975*979*983*987*991*995*1000*1005*1010*1015*1020*1025*1030*1035*1040*1045*1050*1055*1060*1065*1070*1075*1080*1085*1090*1095*1100*1105*1110*1115*1120*1125*1130*1135*1140*1145*1150*1155*1160*1165*1170*1175*1180*1185*1190*1195*1200*1205*1210*1215*1220*1225*1230*1235*1240*1245*1250*1255*1260*1265*1270*1275*1280*1285*1290*1295*1300*1305*1310*1315*1320*1325*1330*1335*1340*1345*1350*1355*1360*1365*1370*1375*1380*1385*1390*1395*1400*1405*1410*1415*1420*1425*1430*1435*1440*1445*1450*1455*1460*1465*1470*1475*1480*1485*1490*1495*1500*1505*1510*1515*1520*1525*1530*1535*1540*1545*1550*1555*1560*1565*1570*1575*1580*1585*1590*1595*1600*1605*1610*1615*1620*1625*1630*1635*1640*1645*1650*1655*1660*1665*1670*1675*1680*1685*1690*1695*1700*1705*1710*1715*1720*1725*1730*1735*1740*1745*1750*1755*1760*1765*1770*1775*1780*1785*1790*1795*1800*1805*1810*1815*1820*1825*1830*1835*1840*1845*1850*1855*1860*1865*1870*1875*1880*1885*1890*1895*1900*1905*1910*1915*1920*1925*1930*1935*1940*1945*1950*1955*1960*1965*1970*1975*1980*1985*1990*1995*2000*2005*2010*2015*2020*2025*2030*2035*2040*2045*2050*2055*2060*2065*2070*2075*2080*2085*2090*2095*2100*2105*2110*2115*2120*2125*2130*2135*2140*2145*2150*2155*2160*2165*2170*2175*2180*2185*2190*2195*2200*2205*2210*2215*2220*2225*2230*2235*2240*2245*2250*2255*2260*2265*2270*2275*2280*2285*2290*2295*2300*2305*23
10*2315*2320*2325*2330*2335*2340*2345*2350*2355*2360*2365*2370*2375*2380*2385*2390*2395*2400*2405*2410*2415*2420*2425*2430*2435*2440*2445*2450*2455*2460*2465*2470*2475*2480*2485*2490*2495*2500*2505*2510*2515*2520*2525*2530*2535*2540*2545*2550*2555*2560*2565*2570*2575*2580*2585*2590*2595*2600*2605*2610*2615*2620*2625*2630*2635*2640*2645*2650*2655*2660*2665*2670*2675*2680*2685*2690*2695*2700*2705*2710*2715*2720*2725*2730*2735*2740*2745*2750*2755*2760*2765*2770*2775*2780*2785*2790*2795*2800*" > val json_tuple_result = Seq(s"""{"test":"$counterstring"}""").toDF("json") > .withColumn("result", json_tuple('json, "test")) > .select('result) > .as[String].head.length > val from_json_result = Seq(s"""{"test":"$counterstring"}""").toDF("json") > .withColumn("parsed", from_json('json, StructType(Seq(StructField("test", > StringType) > .withColumn("result", $"parsed.test") > .select('result) > .as[String].head.length > scala> json_tuple_result > res62: Int = 2791 > scala> from_json_result > res63: Int = 2800 > {code} > Result is influenced by the total length of the json string at the moment of > parsing: > {code} > val json_tuple_result_with_prefix = Seq(s"""{"prefix": "dummy", > "test":"$counterstring"}""").toDF("json") >
[jira] [Created] (SPARK-29934) Dataset support GraphX
darion yaphet created SPARK-29934: - Summary: Dataset support GraphX Key: SPARK-29934 URL: https://issues.apache.org/jira/browse/SPARK-29934 Project: Spark Issue Type: Bug Components: Graph, GraphX, Spark Core Affects Versions: 2.4.4 Reporter: darion yaphet Do we have any plan to support GraphX with Dataset? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29644) Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils
[ https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29644: - Fix Version/s: 2.4.5 > Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils > - > > Key: SPARK-29644 > URL: https://issues.apache.org/jira/browse/SPARK-29644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shiv Prashant Sood >Assignee: Shiv Prashant Sood >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > @maropu pointed out this issue during the [PR > 25344|https://github.com/apache/spark/pull/25344] review discussion. > In > [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala] > at line 547: > case ShortType => > (stmt: PreparedStatement, row: Row, pos: Int) => > stmt.setInt(pos + 1, row.getShort(pos)) > I don't see any reproducible issue, but this is clearly a problem that must be > fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
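For context, a self-contained sketch (simplified from, and not identical to, the real JdbcUtils code) of a type-faithful setter mapping; the fix is to bind through {{setShort}}/{{setByte}} rather than widening to {{setInt}}:

{code}
import java.sql.PreparedStatement
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Simplified sketch of a JDBC setter factory: each Catalyst type binds
// through the matching JDBC setter so the driver sees SMALLINT/TINYINT values.
def makeSetter(dt: DataType): (PreparedStatement, Row, Int) => Unit = dt match {
  case ShortType   => (stmt, row, pos) => stmt.setShort(pos + 1, row.getShort(pos)) // was setInt
  case ByteType    => (stmt, row, pos) => stmt.setByte(pos + 1, row.getByte(pos))   // was setInt
  case IntegerType => (stmt, row, pos) => stmt.setInt(pos + 1, row.getInt(pos))
  case other       => throw new IllegalArgumentException(s"No setter sketched for $other")
}
{code}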
[jira] [Resolved] (SPARK-29456) Add tooltip information for Session Statistics Table column in JDBC/ODBC Server Tab
[ https://issues.apache.org/jira/browse/SPARK-29456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29456. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26138 [https://github.com/apache/spark/pull/26138] > Add tooltip information for Session Statistics Table column in JDBC/ODBC > Server Tab > > > Key: SPARK-29456 > URL: https://issues.apache.org/jira/browse/SPARK-29456 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: pavithra ramachandran >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29456) Add tooltip information for Session Statistics Table column in JDBC/ODBC Server Tab
[ https://issues.apache.org/jira/browse/SPARK-29456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29456: - Fix Version/s: (was: 3.1.0) 3.0.0 > Add tooltip information for Session Statistics Table column in JDBC/ODBC > Server Tab > > > Key: SPARK-29456 > URL: https://issues.apache.org/jira/browse/SPARK-29456 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: pavithra ramachandran >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29456) Add tooltip information for Session Statistics Table column in JDBC/ODBC Server Tab
[ https://issues.apache.org/jira/browse/SPARK-29456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29456: Assignee: pavithra ramachandran > Add tooltip information for Session Statistics Table column in JDBC/ODBC > Server Tab > > > Key: SPARK-29456 > URL: https://issues.apache.org/jira/browse/SPARK-29456 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: pavithra ramachandran >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29456) Add tooltip information for Session Statistics Table column in JDBC/ODBC Server Tab
[ https://issues.apache.org/jira/browse/SPARK-29456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29456: - Priority: Minor (was: Major) > Add tooltip information for Session Statistics Table column in JDBC/ODBC > Server Tab > > > Key: SPARK-29456 > URL: https://issues.apache.org/jira/browse/SPARK-29456 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29933) ThriftServerQueryTestSuite runs tests with wrong settings
[ https://issues.apache.org/jira/browse/SPARK-29933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-29933: --- Attachment: filter_tests.patch > ThriftServerQueryTestSuite runs tests with wrong settings > - > > Key: SPARK-29933 > URL: https://issues.apache.org/jira/browse/SPARK-29933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > Attachments: filter_tests.patch > > > ThriftServerQueryTestSuite must run ANSI tests in the Spark dialect, but it > keeps settings from previous runs; in fact, it runs `ansi/interval.sql` in > the PostgreSQL dialect. See > https://github.com/apache/spark/pull/26473#issuecomment-554510643 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29933) ThriftServerQueryTestSuite runs tests with wrong settings
Maxim Gekk created SPARK-29933: -- Summary: ThriftServerQueryTestSuite runs tests with wrong settings Key: SPARK-29933 URL: https://issues.apache.org/jira/browse/SPARK-29933 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk ThriftServerQueryTestSuite must run ANSI tests in the Spark dialect, but it keeps settings from previous runs; in fact, it runs `ansi/interval.sql` in the PostgreSQL dialect. See https://github.com/apache/spark/pull/26473#issuecomment-554510643 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
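A minimal sketch of the kind of per-test isolation the suite needs (illustrative; this withSQLConf is written from scratch, not the suite's actual helper): save a conf before overriding it and restore it afterwards, so one test's dialect cannot leak into the next.

{code}
import org.apache.spark.sql.SparkSession

// Runs `body` with `key` set to `value`, then restores the previous setting
// (or unsets the key if it had none), even if the body throws.
def withSQLConf[T](spark: SparkSession)(key: String, value: String)(body: => T): T = {
  val previous = spark.conf.getOption(key)
  spark.conf.set(key, value)
  try body
  finally previous match {
    case Some(v) => spark.conf.set(key, v)
    case None    => spark.conf.unset(key)
  }
}
{code}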
[jira] [Commented] (SPARK-29931) Declare all SQL legacy configs as will be removed in Spark 4.0
[ https://issues.apache.org/jira/browse/SPARK-29931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975944#comment-16975944 ] Maxim Gekk commented on SPARK-29931: > It's conceivable there could be a reason to do it later, or sooner. Later is not a problem, but what about sooner? Most of the configs were added for Spark 3.0. If you decide to remove one of them in a minor release between 3.0 and 4.0, you can break user apps, which I believe is unacceptable for minor releases. > Declare all SQL legacy configs as will be removed in Spark 4.0 > -- > > Key: SPARK-29931 > URL: https://issues.apache.org/jira/browse/SPARK-29931 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Add the sentence to the descriptions of all legacy SQL configs that existed before > Spark 3.0: "This config will be removed in Spark 4.0." Here is the list of > such configs: > * spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName > * spark.sql.legacy.literal.pickMinimumPrecision > * spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation > * spark.sql.legacy.sizeOfNull > * spark.sql.legacy.replaceDatabricksSparkAvro.enabled > * spark.sql.legacy.setopsPrecedence.enabled > * spark.sql.legacy.integralDivide.returnBigint > * spark.sql.legacy.bucketedTableScan.outputOrdering > * spark.sql.legacy.parser.havingWithoutGroupByAsWhere > * spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue > * spark.sql.legacy.setCommandRejectsSparkCoreConfs > * spark.sql.legacy.utcTimestampFunc.enabled > * spark.sql.legacy.typeCoercion.datetimeToString > * spark.sql.legacy.looseUpcast > * spark.sql.legacy.ctePrecedence.enabled > * spark.sql.legacy.arrayExistsFollowsThreeValuedLogic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org