[jira] [Updated] (SPARK-28589) Introduce a new type coercion mode by following PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28589: Description: The type coercion rules in Spark SQL are not consistent with any system. Sometimes this is confusing to end users. We can introduce a new type coercion mode so that Spark can behave like a popular DBMS, for example PostgreSQL. was: The type conversion in current Spark can be improved by learning from PostgreSQL: 1. Table insertion should disallow certain type conversions (string -> int), see https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit#heading=h.82w8qxfl2uwl 2. Spark should return a null result when casting an out-of-range value to an integral type, instead of returning a value with only the lower bits. 3. We can have a new optional mode: throw runtime exceptions on casting failures 4. improve the implicit casting etc. > Introduce a new type coercion mode by following PostgreSQL > -- > > Key: SPARK-28589 > URL: https://issues.apache.org/jira/browse/SPARK-28589 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: postgreSQL_type_conversion.txt > > > The type coercion rules in Spark SQL are not consistent with any system. > Sometimes this is confusing to end users. > We can introduce a new type coercion mode so that Spark can behave like a > popular DBMS, for example PostgreSQL. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28589) Introduce a new type coercion mode by following PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28589: Summary: Introduce a new type coercion mode by following PostgreSQL (was: Follow the type conversion rules of PostgreSQL) > Introduce a new type coercion mode by following PostgreSQL > -- > > Key: SPARK-28589 > URL: https://issues.apache.org/jira/browse/SPARK-28589 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: postgreSQL_type_conversion.txt > > > The type conversion in current Spark can be improved by learning from > PostgreSQL: > 1. Table insertion should disallow certain type conversions (string -> > int), see > https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit#heading=h.82w8qxfl2uwl > 2. Spark should return a null result when casting an out-of-range value to an > integral type, instead of returning a value with only the lower bits. > 3. We can have a new optional mode: throw runtime exceptions on casting > failures > 4. improve the implicit casting > etc.
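Point 2 of the list above can be illustrated in plain Python (a hedged sketch of the semantics, not Spark code; both function names are invented here for illustration): today an out-of-range cast to int keeps only the lower 32 bits, while the proposed PostgreSQL-style mode described in the ticket would return null instead.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def cast_lower_bits(value: int) -> int:
    """Current Spark-like behavior: keep the lower 32 bits (two's complement)."""
    v = value & 0xFFFFFFFF
    return v - 2**32 if v > INT_MAX else v

def cast_null_on_overflow(value: int):
    """Proposed behavior per the ticket: out-of-range casts yield null (None)."""
    return value if INT_MIN <= value <= INT_MAX else None

# 2**31 + 100 does not fit in an int: keeping the lower bits silently
# produces an unrelated negative number, while the new mode returns None.
print(cast_lower_bits(2**31 + 100))        # -2147483548
print(cast_null_on_overflow(2**31 + 100))  # None
print(cast_lower_bits(42))                 # 42 (in-range values are unchanged)
```

In-range values behave identically under both modes; they only diverge on overflow.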
[jira] [Commented] (SPARK-25280) Add support for USING syntax for DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-25280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899536#comment-16899536 ] Hyukjin Kwon commented on SPARK-25280: -- Hey, thanks for the heads up. Yea, let's leave it resolved. But do you mind if I ask for some related tickets? It doesn't have to be all the related tickets. I just want to leave some links against this JIRA for trackability. > Add support for USING syntax for DataSourceV2 > - > > Key: SPARK-25280 > URL: https://issues.apache.org/jira/browse/SPARK-25280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > class SourcesTest extends SparkFunSuite { > val spark = SparkSession.builder().master("local").getOrCreate() > test("Test CREATE TABLE ... USING - v1") { > spark.read.format(classOf[SimpleDataSourceV1].getCanonicalName).load() > } > test("Test DataFrameReader - v1") { > spark.sql(s"CREATE TABLE tableA USING > ${classOf[SimpleDataSourceV1].getCanonicalName}") > } > test("Test CREATE TABLE ... 
USING - v2") { > spark.read.format(classOf[SimpleDataSourceV2].getCanonicalName).load() > } > test("Test DataFrameReader - v2") { > spark.sql(s"CREATE TABLE tableB USING > ${classOf[SimpleDataSourceV2].getCanonicalName}") > } > } > {code} > {code} > org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL > Data Source.; > org.apache.spark.sql.AnalysisException: > org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL > Data Source.; > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3296) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3295) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641) > at > org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45) > at > org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45) > at 
org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at
[jira] [Created] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2
Ryan Blue created SPARK-28612: - Summary: DataSourceV2: Add new DataFrameWriter API for v2 Key: SPARK-28612 URL: https://issues.apache.org/jira/browse/SPARK-28612 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue This tracks adding an API like the one proposed in SPARK-23521: {code:lang=scala} df.writeTo("catalog.db.table").append() // AppendData df.writeTo("catalog.db.table").overwriteDynamic() // OverwritePartitionsDynamic df.writeTo("catalog.db.table").overwrite($"date" === "2019-01-01") // OverwriteByExpression df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS df.writeTo("catalog.db.table").replace() // RTAS {code}
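The mapping from each fluent call above to a logical write operation can be sketched with a toy builder in Python (purely illustrative; the class, method names, and returned tuples mirror the ticket's proposal but are not Spark's actual API):

```python
class WriteBuilder:
    """Toy fluent write builder: each terminal method names the logical plan it would produce."""
    def __init__(self, table):
        self.table = table
        self.partitioning = []

    def partitionBy(self, *cols):
        # Non-terminal call: record partitioning and return self for chaining.
        self.partitioning = list(cols)
        return self

    def append(self):
        return ("AppendData", self.table)

    def overwriteDynamic(self):
        return ("OverwritePartitionsDynamic", self.table)

    def create(self):
        # CTAS carries the requested partitioning along with the table name.
        return ("CTAS", self.table, self.partitioning)

def writeTo(table):
    return WriteBuilder(table)

print(writeTo("catalog.db.table").append())
# ('AppendData', 'catalog.db.table')
print(writeTo("catalog.db.table").partitionBy("type", "date").create())
# ('CTAS', 'catalog.db.table', ['type', 'date'])
```

The point of the builder pattern here is that the terminal method, not the source of the DataFrame, picks the logical plan, which removes the ambiguity of the v1 `save`/`saveAsTable` modes.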
[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899534#comment-16899534 ] Tomas Bartalos commented on SPARK-4502: --- I'm fighting with the same problem; a few findings that are not obvious after reading the issue: * To use this feature, set _spark.sql.optimizer.nestedSchemaPruning.enabled=true_ * According to my tests, the fix works only for 1 level of nesting. For example _event.amount_ is ok, while _event.spent.amount_ reads the whole event structure :( The only way to optimise reads of 2+ levels of nesting is to specify a projected schema at read time (as suggested by [~aeroevan]): _val df = spark.read.format("parquet").schema(projectedSchema).load()_ *This workaround can't be used for queries from the Thrift server and connected BI tools. Needless to say, having this feature work at any level of nesting would be just wonderful.* This is a real performance killer for big, deeply nested structures. My test summing one top-level field vs. a nested one shows a difference of 2.5 min vs. 4 seconds. > Spark SQL reads unnecessary nested fields from Parquet > -- > > Key: SPARK-4502 > URL: https://issues.apache.org/jira/browse/SPARK-4502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Liwen Sun >Assignee: Michael Allman >Priority: Critical > Fix For: 2.4.0 > > > When reading a field of a nested column from Parquet, SparkSQL reads and > assembles all the fields of that nested column. This is unnecessary, as > Parquet supports fine-grained field reads out of a nested column. This may > degrade the performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following > query: > {{SELECT User.contributors_enabled from Tweets;}} > User is a nested structure that has 38 primitive fields (for Tweets schema, > see: https://dev.twitter.com/overview/api/tweets), here is the log message: > {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 > cell/ms}} > For comparison, I also ran: > {{SELECT User FROM Tweets;}} > And here is the log message: > {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}} > So both queries load 38 columns from Parquet, while the first query only > needs 1 column. I also measured the bytes read within Parquet. In these two > cases, the same number of bytes (99365194 bytes) were read. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
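The column-count difference reported above (38 leaf columns read for a single projected field) comes down to schema pruning. A toy model in plain Python (the miniature schema and both helpers are invented here for illustration; this is not Spark's implementation):

```python
# Toy schema: a struct is a dict of field name -> subtype; leaves are type-name strings.
tweet_schema = {
    "id": "long",
    "user": {"contributors_enabled": "boolean", "name": "string", "followers": "int"},
}

def leaf_paths(schema, prefix=""):
    """All leaf column paths under a (possibly nested) schema."""
    paths = []
    for name, sub in schema.items():
        full = f"{prefix}{name}"
        if isinstance(sub, dict):
            paths += leaf_paths(sub, full + ".")
        else:
            paths.append(full)
    return paths

def pruned_leaves(schema, projection):
    """Leaves actually needed for a projected path: what nested pruning should read."""
    return [p for p in leaf_paths(schema)
            if p == projection or p.startswith(projection + ".")]

# Without pruning, selecting user.contributors_enabled still materializes every
# leaf of `user`; with pruning only the one requested leaf is read.
print(len(pruned_leaves(tweet_schema, "user")))                       # 3
print(pruned_leaves(tweet_schema, "user.contributors_enabled"))       # ['user.contributors_enabled']
```

With the real 38-field Tweets `User` struct, the same logic is the difference between reading 38 columns and reading 1.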
[jira] [Resolved] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-23204. --- Resolution: Fixed Fix Version/s: 3.0.0 I'm closing this because it is implemented by SPARK-28178 and SPARK-28565. > DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter > > > Key: SPARK-23204 > URL: https://issues.apache.org/jira/browse/SPARK-23204 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > DataSourceV2 is currently only configured with a path, passed in options as > {{path}}. For many data sources, like JDBC, a table name is more appropriate. > I propose testing the "location" passed to load(String) and save(String) to > see if it is a path and if not, parsing it as a table name and passing > "database" and "table" options to readers and writers. > This also creates a way to pass the table identifier when using DataSourceV2 > tables from SQL. For example, {{SELECT * FROM db.table}} creates an > {{UnresolvedRelation(db,table)}} that could be resolved using the default > source, passing the db and table name using the same options. Similarly, we > can add a table property for the datasource implementation to metastore > tables and add a rule to convert them to DataSourceV2 relations. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25280) Add support for USING syntax for DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-25280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899532#comment-16899532 ] Ryan Blue commented on SPARK-25280: --- [~hyukjin.kwon], is there anything left to do for this? I think that most of the functionality has been added at this point. > Add support for USING syntax for DataSourceV2 > - > > Key: SPARK-25280 > URL: https://issues.apache.org/jira/browse/SPARK-25280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > class SourcesTest extends SparkFunSuite { > val spark = SparkSession.builder().master("local").getOrCreate() > test("Test CREATE TABLE ... USING - v1") { > spark.read.format(classOf[SimpleDataSourceV1].getCanonicalName).load() > } > test("Test DataFrameReader - v1") { > spark.sql(s"CREATE TABLE tableA USING > ${classOf[SimpleDataSourceV1].getCanonicalName}") > } > test("Test CREATE TABLE ... USING - v2") { > spark.read.format(classOf[SimpleDataSourceV2].getCanonicalName).load() > } > test("Test DataFrameReader - v2") { > spark.sql(s"CREATE TABLE tableB USING > ${classOf[SimpleDataSourceV2].getCanonicalName}") > } > } > {code} > {code} > org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL > Data Source.; > org.apache.spark.sql.AnalysisException: > org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL > Data Source.; > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at 
org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3296) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3295) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641) > at > org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45) > at > org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > 
at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at
[jira] [Commented] (SPARK-28543) Document Spark Jobs page
[ https://issues.apache.org/jira/browse/SPARK-28543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899482#comment-16899482 ] Pablo Langa Blanco commented on SPARK-28543: Hi! I would like to contribute to this. > Document Spark Jobs page > > > Key: SPARK-28543 > URL: https://issues.apache.org/jira/browse/SPARK-28543 > Project: Spark > Issue Type: Sub-task > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major >
[jira] [Commented] (SPARK-28611) Histogram's height is different
[ https://issues.apache.org/jira/browse/SPARK-28611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899470#comment-16899470 ] Yuming Wang commented on SPARK-28611: - cc [~mgaido] > Histogram's height is different > -- > > Key: SPARK-28611 > URL: https://issues.apache.org/jira/browse/SPARK-28611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET; > -- Test output for histogram statistics > SET spark.sql.statistics.histogram.enabled=true; > SET spark.sql.statistics.histogram.numBins=2; > INSERT INTO desc_col_table values 1, 2, 3, 4; > ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key; > DESC EXTENDED desc_col_table key; > {code} > {noformat} > spark-sql> DESC EXTENDED desc_col_table key; > col_name key > data_type int > comment column_comment > min 1 > max 4 > num_nulls 0 > distinct_count 4 > avg_col_len 4 > max_col_len 4 > histogram height: 4.0, num_of_bins: 2 > bin_0 lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2 > bin_1 lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2 > {noformat} > But our result is: > https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out#L231-L242
[jira] [Created] (SPARK-28611) Histogram's height is different
Yuming Wang created SPARK-28611: --- Summary: Histogram's height is different Key: SPARK-28611 URL: https://issues.apache.org/jira/browse/SPARK-28611 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Yuming Wang {code:sql} CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET; -- Test output for histogram statistics SET spark.sql.statistics.histogram.enabled=true; SET spark.sql.statistics.histogram.numBins=2; INSERT INTO desc_col_table values 1, 2, 3, 4; ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key; DESC EXTENDED desc_col_table key; {code} {noformat} spark-sql> DESC EXTENDED desc_col_table key; col_name key data_type int comment column_comment min 1 max 4 num_nulls 0 distinct_count 4 avg_col_len 4 max_col_len 4 histogram height: 4.0, num_of_bins: 2 bin_0 lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2 bin_1 lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2 {noformat} But our result is: https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out#L231-L242
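For reference, an equi-height histogram over the four inserted values can be sketched in plain Python (a simplified model, not Spark's implementation; under the usual definition, height = numRows / numBins, which for 4 rows and 2 bins is 2.0 rather than the 4.0 shown above, and that difference appears to be what this ticket is about):

```python
def equi_height_histogram(values, num_bins):
    """Simplified equi-height histogram: every bin covers the same number of
    rows, so height = len(values) / num_bins. Unlike Spark's version, adjacent
    bins here do not share a boundary value."""
    vals = sorted(values)
    height = len(vals) / num_bins
    bins = []
    for i in range(num_bins):
        chunk = vals[int(i * height): int((i + 1) * height)]
        bins.append({
            "lower_bound": float(chunk[0]),
            "upper_bound": float(chunk[-1]),
            "distinct_count": len(set(chunk)),
        })
    return height, bins

height, bins = equi_height_histogram([1, 2, 3, 4], 2)
print(height)  # 2.0 -- two rows per bin
print(bins)
```

Each bin still reports distinct_count 2, matching the output above; only the height value differs.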
[jira] [Closed] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan
[ https://issues.apache.org/jira/browse/SPARK-28585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco closed SPARK-28585. -- > Improve WebUI DAG information: Add extra info to rdd from spark plan > > > Key: SPARK-28585 > URL: https://issues.apache.org/jira/browse/SPARK-28585 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > The main improvement I want to achieve is to help developers explore > the DAG information in the Web UI for complex flows. > Sometimes it is very difficult to know which part of your Spark plan corresponds > to the DAG you are looking at. > > This is an initial improvement for only one simple Spark plan type > (UnionExec). > If you consider it a good idea, I want to extend it to other Spark plans to > improve the visualization iteratively. > > More info in the pull request
[jira] [Resolved] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan
[ https://issues.apache.org/jira/browse/SPARK-28585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco resolved SPARK-28585. Resolution: Won't Fix > Improve WebUI DAG information: Add extra info to rdd from spark plan > > > Key: SPARK-28585 > URL: https://issues.apache.org/jira/browse/SPARK-28585 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > The main improvement I want to achieve is to help developers explore > the DAG information in the Web UI for complex flows. > Sometimes it is very difficult to know which part of your Spark plan corresponds > to the DAG you are looking at. > > This is an initial improvement for only one simple Spark plan type > (UnionExec). > If you consider it a good idea, I want to extend it to other Spark plans to > improve the visualization iteratively. > > More info in the pull request
[jira] [Commented] (SPARK-28587) JDBC data source's partition whereClause should support jdbc dialect
[ https://issues.apache.org/jira/browse/SPARK-28587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899428#comment-16899428 ] Takeshi Yamamuro commented on SPARK-28587: -- ok, I'll work on it. > JDBC data source's partition whereClause should support jdbc dialect > > > Key: SPARK-28587 > URL: https://issues.apache.org/jira/browse/SPARK-28587 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: wyp >Priority: Minor > > When we use the JDBC data source to read data from Phoenix with a timestamp > column as the partitionColumn, e.g. > {code:java} > val url = "jdbc:phoenix:thin:url=localhost:8765;serialization=PROTOBUF" > val driver = "org.apache.phoenix.queryserver.client.Driver" > val df = spark.read.format("jdbc") > .option("url", url) > .option("driver", driver) > .option("fetchsize", "1000") > .option("numPartitions", "6") > .option("partitionColumn", "times") > .option("lowerBound", "2019-07-31 00:00:00") > .option("upperBound", "2019-08-01 00:00:00") > .option("dbtable", "test") > .load().select("id") > println(df.count()) > {code} > Phoenix throws an AvaticaSqlException: > {code:java} > org.apache.calcite.avatica.AvaticaSqlException: Error -1 (0) : while > preparing SQL: SELECT 1 FROM search_info_test WHERE "TIMES" < '2019-07-31 > 04:00:00' or "TIMES" is null > at org.apache.calcite.avatica.Helper.createException(Helper.java:54) > at org.apache.calcite.avatica.Helper.createException(Helper.java:41) > at > org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:368) > at > org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:299) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:300) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > java.lang.RuntimeException: org.apache.phoenix.schema.TypeMismatchException: > ERROR 203 (22005): Type mismatch. 
TIMESTAMP and VARCHAR for "TIMES" < > '2019-07-31 04:00:00' > at org.apache.calcite.avatica.jdbc.JdbcMeta.propagate(JdbcMeta.java:700) > at > org.apache.calcite.avatica.jdbc.PhoenixJdbcMeta.prepare(PhoenixJdbcMeta.java:67) > at > org.apache.calcite.avatica.remote.LocalService.apply(LocalService.java:195) > at > org.apache.calcite.avatica.remote.Service$PrepareRequest.accept(Service.java:1215) > at > org.apache.calcite.avatica.remote.Service$PrepareRequest.accept(Service.java:1186) > at > org.apache.calcite.avatica.remote.AbstractHandler.apply(AbstractHandler.java:94) > at > org.apache.calcite.avatica.remote.ProtobufHandler.apply(ProtobufHandler.java:46) > at > org.apache.calcite.avatica.server.AvaticaProtobufHandler.handle(AvaticaProtobufHandler.java:127) > at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at >
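How the JDBC source slices [lowerBound, upperBound) into per-partition WHERE clauses, and why the un-dialected timestamp literal trips Phoenix, can be sketched in plain Python (the helper function and the `TO_TIMESTAMP` rendering are assumptions for illustration, not Spark's or Phoenix's actual API):

```python
from datetime import datetime

def partition_predicates(column, lower, upper, num_partitions, render):
    """Sketch: split [lower, upper) into num_partitions WHERE clauses.
    `render` is a (hypothetical) dialect hook for formatting bound literals."""
    step = (upper - lower) / num_partitions
    preds = []
    for i in range(num_partitions):
        lo, hi = lower + i * step, lower + (i + 1) * step
        if i == 0:
            # First partition also picks up NULLs, as in the error above.
            preds.append(f'"{column}" < {render(hi)} or "{column}" is null')
        elif i == num_partitions - 1:
            preds.append(f'"{column}" >= {render(lo)}')
        else:
            preds.append(f'"{column}" >= {render(lo)} AND "{column}" < {render(hi)}')
    return preds

plain = lambda ts: f"'{ts:%Y-%m-%d %H:%M:%S}'"                  # bare string literal
phoenix = lambda ts: f"TO_TIMESTAMP('{ts:%Y-%m-%d %H:%M:%S}')"  # dialect-aware rendering

lower, upper = datetime(2019, 7, 31), datetime(2019, 8, 1)
print(partition_predicates("TIMES", lower, upper, 6, plain)[0])
# "TIMES" < '2019-07-31 04:00:00' or "TIMES" is null  -> Phoenix: TIMESTAMP vs VARCHAR mismatch
print(partition_predicates("TIMES", lower, upper, 6, phoenix)[0])
```

The first predicate with the plain renderer is exactly the clause from the stack trace above; letting the JDBC dialect render the bound literal is the fix the ticket proposes.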
[jira] [Created] (SPARK-28610) Support larger buffer for sum of long
Marco Gaido created SPARK-28610: --- Summary: Support larger buffer for sum of long Key: SPARK-28610 URL: https://issues.apache.org/jira/browse/SPARK-28610 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Marco Gaido The sum of a long field currently uses a buffer of type long. When the flag for throwing exceptions on overflow of arithmetic operations is turned on, this is a problem when there are intermediate overflows that are later cancelled out by other rows. Indeed, in such a case, we throw an exception even though the result is representable in a long value. An example of this issue can be seen running: {code} val df = sc.parallelize(Seq(100L, Long.MaxValue, -1000L)).toDF("a") df.select(sum($"a")).show() {code} According to [~cloud_fan]'s suggestion in https://github.com/apache/spark/pull/21599, we should introduce a config that lets users choose a wider data type for the sum buffer, so that the above issue can be fixed.
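The failure mode can be reproduced in plain Python by emulating Java's wrapping 64-bit arithmetic on the same three values as the ticket's example (a sketch of the semantics, not Spark code; `wrap64` is a helper invented here):

```python
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def wrap64(x: int) -> int:
    """Wrap an unbounded int to signed 64-bit two's complement (Java long semantics)."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x > LONG_MAX else x

values = [100, LONG_MAX, -1000]

# With a wrapping long buffer the intermediate sum overflows (100 + Long.MaxValue
# wraps negative), but the wrap cancels out and the final value is correct.
# A *checked* long buffer would already have raised on that intermediate step,
# even though the true sum fits in a long -- which is the problem described above.
acc = 0
for v in values:
    acc = wrap64(acc + v)
print(acc)                 # 9223372036854774907
print(acc == sum(values))  # True: the true sum is representable in a long
```

A wider buffer (e.g. decimal, as the proposed config would allow) avoids the intermediate overflow entirely instead of relying on wraparound cancelling out.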
[jira] [Commented] (SPARK-28527) Build a Test Framework for Thriftserver
[ https://issues.apache.org/jira/browse/SPARK-28527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899368#comment-16899368 ] Yuming Wang commented on SPARK-28527: - While building this test framework, we found one inconsistent behavior: [SPARK-28461] Thrift Server pad decimal numbers with trailing zeros to the scale of the column: !image-2019-08-03-13-58-06-106.png! > Build a Test Framework for Thriftserver > --- > > Key: SPARK-28527 > URL: https://issues.apache.org/jira/browse/SPARK-28527 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > Attachments: image-2019-08-03-13-58-06-106.png > > > https://issues.apache.org/jira/browse/SPARK-28463 shows we need to improve > the overall test coverage for Thrift-server. > We can build a test framework that can directly re-run all the tests in > SQLQueryTestSuite via Thrift Server. This can help our test coverage of BI > use cases.