[jira] [Updated] (SPARK-28589) Introduce a new type coercion mode by following PostgreSQL

2019-08-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28589:

Description: 
The type coercion rules in Spark SQL are not consistent with those of any other 
system, which is sometimes confusing to end users.

We can introduce a new type coercion mode so that Spark behaves like a popular 
DBMS, for example PostgreSQL.

  was:
Type conversion in current Spark can be improved by learning from PostgreSQL:
1. Table insertion should disallow certain type conversions (e.g. string -> int); see 
https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit#heading=h.82w8qxfl2uwl
2. Spark should return a null result when casting an out-of-range value to an integral type, instead of returning a value built from the lower bits (see the sketch below).
3. We can have a new optional mode that throws runtime exceptions on casting failures.
4. Improve implicit casting.
etc.
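
As a concrete illustration of item 2, a minimal sketch (assuming a spark-shell session; the behavior shown is today's default, whereas PostgreSQL rejects the same cast with "integer out of range"):

{code:lang=scala}
// 2147483648 does not fit into a 32-bit int; the default cast keeps the
// lower 32 bits and yields -2147483648 instead of failing.
spark.sql("SELECT CAST(2147483648 AS INT)").show()

// A non-numeric string also degrades silently to NULL rather than erroring.
spark.sql("SELECT CAST('abc' AS INT)").show()
{code}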


> Introduce a new type coercion mode by following PostgreSQL
> --
>
> Key: SPARK-28589
> URL: https://issues.apache.org/jira/browse/SPARK-28589
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: postgreSQL_type_conversion.txt
>
>
> The type coercion rules in Spark SQL are not consistent with those of any other system, which is sometimes confusing to end users. 
> We can introduce a new type coercion mode so that Spark behaves like a popular DBMS, for example PostgreSQL.






[jira] [Updated] (SPARK-28589) Introduce a new type coercion mode by following PostgreSQL

2019-08-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28589:

Summary: Introduce a new type coercion mode by following PostgreSQL  (was: 
Follow the type conversion rules of PostgreSQL)

> Introduce a new type coercion mode by following PostgreSQL
> --
>
> Key: SPARK-28589
> URL: https://issues.apache.org/jira/browse/SPARK-28589
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: postgreSQL_type_conversion.txt
>
>
> The type conversion in current Spark can be improved by learning from 
> PostgreSQL:
> 1. Table insertion should disallow certain type conversions (e.g. string -> int); see 
> https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit#heading=h.82w8qxfl2uwl
> 2. Spark should return a null result when casting an out-of-range value to an integral type, instead of returning a value built from the lower bits.
> 3. We can have a new optional mode that throws runtime exceptions on casting failures.
> 4. Improve implicit casting.
> etc.






[jira] [Commented] (SPARK-25280) Add support for USING syntax for DataSourceV2

2019-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899536#comment-16899536
 ] 

Hyukjin Kwon commented on SPARK-25280:
--

Hey, thanks for the heads-up. Yeah, let's leave it resolved.
But do you mind if I ask for some related tickets? It doesn't have to be all of 
the related tickets; I just want to leave some links against this JIRA for 
traceability.

> Add support for USING syntax for DataSourceV2
> -
>
> Key: SPARK-25280
> URL: https://issues.apache.org/jira/browse/SPARK-25280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> class SourcesTest extends SparkFunSuite {
>   val spark = SparkSession.builder().master("local").getOrCreate()
>   test("Test CREATE TABLE ... USING - v1") {
> spark.read.format(classOf[SimpleDataSourceV1].getCanonicalName).load()
>   }
>   test("Test DataFrameReader - v1") {
> spark.sql(s"CREATE TABLE tableA USING 
> ${classOf[SimpleDataSourceV1].getCanonicalName}")
>   }
>   test("Test CREATE TABLE ... USING - v2") {
> spark.read.format(classOf[SimpleDataSourceV2].getCanonicalName).load()
>   }
>   test("Test DataFrameReader - v2") {
> spark.sql(s"CREATE TABLE tableB USING 
> ${classOf[SimpleDataSourceV2].getCanonicalName}")
>   }
> }
> {code}
> {code}
> org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL 
> Data Source.;
> org.apache.spark.sql.AnalysisException: 
> org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL 
> Data Source.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3296)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3295)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
>   at 
> org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45)
>   at 
> org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at 

[jira] [Created] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2

2019-08-03 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-28612:
-

 Summary: DataSourceV2: Add new DataFrameWriter API for v2
 Key: SPARK-28612
 URL: https://issues.apache.org/jira/browse/SPARK-28612
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


This tracks adding an API like the one proposed in SPARK-23521:

{code:lang=scala}
df.writeTo("catalog.db.table").append() // AppendData
df.writeTo("catalog.db.table").overwriteDynamic() // OverwritePartiitonsDynamic
df.writeTo("catalog.db.table").overwrite($"date" === '2019-01-01') // 
OverwriteByExpression
df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
df.writeTo("catalog.db.table").replace() // RTAS
{code}






[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet

2019-08-03 Thread Tomas Bartalos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899534#comment-16899534
 ] 

Tomas Bartalos commented on SPARK-4502:
---

I'm fighting the same problem; a few findings that are not obvious after 
reading the issue:
 * To use this feature, set 
_spark.sql.optimizer.nestedSchemaPruning.enabled=true_
 * According to my tests, the fix works only for one level of nesting. For 
example, _event.amount_ is fine, while _event.spent.amount_ reads the whole 
event structure :(

The only way to optimise reads of 2+ levels of nesting is to specify a 
projected schema at read time (as suggested by [~aeroevan]; a fuller sketch 
follows at the end of this comment):

_val df = spark.read.format("parquet").schema(projectedSchema).load()_

*This workaround can't be used for queries from the Thrift server and connected 
BI tools. Needless to say, having this feature work at any level of nesting 
would be wonderful.*

This is a real performance killer for big, deeply nested structures. My test of 
summing one top-level field vs. a nested one shows a difference of 2.5 min vs. 
4 seconds.
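
For reference, a fleshed-out version of the read-time workaround above; the schema, field names, and path are placeholders, and a SparkSession named {{spark}} is assumed to be in scope:

{code:lang=scala}
import org.apache.spark.sql.types._

// Projected schema keeping only the one deeply nested field we need.
val projectedSchema = StructType(Seq(
  StructField("event", StructType(Seq(
    StructField("spent", StructType(Seq(
      StructField("amount", DoubleType)
    )))
  )))
))

// Enable the pruning rule (helps for a single level of nesting)...
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// ...and pass the projected schema so Parquet never assembles the full struct,
// even where automatic pruning does not kick in for 2+ levels of nesting.
val df = spark.read.format("parquet").schema(projectedSchema).load("/path/to/events")
df.selectExpr("sum(event.spent.amount)").show()
{code}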

 

> Spark SQL reads unnecessary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Assignee: Michael Allman
>Priority: Critical
> Fix For: 2.4.0
>
>
> When reading a field of a nested column from Parquet, Spark SQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded JSON tweets data into Spark SQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 






[jira] [Resolved] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter

2019-08-03 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-23204.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

I'm closing this because it is implemented by SPARK-28178 and SPARK-28565.

> DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
> 
>
> Key: SPARK-23204
> URL: https://issues.apache.org/jira/browse/SPARK-23204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> DataSourceV2 is currently only configured with a path, passed in options as 
> {{path}}. For many data sources, like JDBC, a table name is more appropriate. 
> I propose testing the "location" passed to load(String) and save(String) to 
> see if it is a path and if not, parsing it as a table name and passing 
> "database" and "table" options to readers and writers.
> This also creates a way to pass the table identifier when using DataSourceV2 
> tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
> {{UnresolvedRelation(db,table)}} that could be resolved using the default 
> source, passing the db and table name using the same options. Similarly, we 
> can add a table property for the datasource implementation to metastore 
> tables and add a rule to convert them to DataSourceV2 relations.
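
The resolution proposed above could be sketched roughly like this (illustrative only, not the actual Spark implementation; the helper name is made up):

{code:lang=scala}
// Treat the load()/save() argument as a path if it looks like one,
// otherwise parse it as [database.]table and pass the pieces as options.
def resolveLocation(location: String): Map[String, String] = {
  if (location.contains("/")) {
    Map("path" -> location)
  } else {
    location.split('.') match {
      case Array(db, table) => Map("database" -> db, "table" -> table)
      case Array(table)     => Map("table" -> table)
      case _                => Map("path" -> location) // fall back to treating it as a path
    }
  }
}

// resolveLocation("hdfs://nn/warehouse/t")  -> Map(path -> hdfs://nn/warehouse/t)
// resolveLocation("db.table")               -> Map(database -> db, table -> table)
{code}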






[jira] [Commented] (SPARK-25280) Add support for USING syntax for DataSourceV2

2019-08-03 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899532#comment-16899532
 ] 

Ryan Blue commented on SPARK-25280:
---

[~hyukjin.kwon], is there anything left to do for this? I think that most of 
the functionality has been added at this point.

> Add support for USING syntax for DataSourceV2
> -
>
> Key: SPARK-25280
> URL: https://issues.apache.org/jira/browse/SPARK-25280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> class SourcesTest extends SparkFunSuite {
>   val spark = SparkSession.builder().master("local").getOrCreate()
>   test("Test CREATE TABLE ... USING - v1") {
> spark.read.format(classOf[SimpleDataSourceV1].getCanonicalName).load()
>   }
>   test("Test DataFrameReader - v1") {
> spark.sql(s"CREATE TABLE tableA USING 
> ${classOf[SimpleDataSourceV1].getCanonicalName}")
>   }
>   test("Test CREATE TABLE ... USING - v2") {
> spark.read.format(classOf[SimpleDataSourceV2].getCanonicalName).load()
>   }
>   test("Test DataFrameReader - v2") {
> spark.sql(s"CREATE TABLE tableB USING 
> ${classOf[SimpleDataSourceV2].getCanonicalName}")
>   }
> }
> {code}
> {code}
> org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL 
> Data Source.;
> org.apache.spark.sql.AnalysisException: 
> org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL 
> Data Source.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3296)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3295)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
>   at 
> org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45)
>   at 
> org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at 

[jira] [Commented] (SPARK-28543) Document Spark Jobs page

2019-08-03 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899482#comment-16899482
 ] 

Pablo Langa Blanco commented on SPARK-28543:


Hi! I want to contribute to this.

> Document Spark Jobs page
> 
>
> Key: SPARK-28543
> URL: https://issues.apache.org/jira/browse/SPARK-28543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>







[jira] [Commented] (SPARK-28611) Histogram's height is different

2019-08-03 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899470#comment-16899470
 ] 

Yuming Wang commented on SPARK-28611:
-

cc [~mgaido]

> Histogram's height is different
> --
>
> Key: SPARK-28611
> URL: https://issues.apache.org/jira/browse/SPARK-28611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET;
> -- Test output for histogram statistics
> SET spark.sql.statistics.histogram.enabled=true;
> SET spark.sql.statistics.histogram.numBins=2;
> INSERT INTO desc_col_table values 1, 2, 3, 4;
> ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
> DESC EXTENDED desc_col_table key;
> {code}
> {noformat}
> spark-sql> DESC EXTENDED desc_col_table key;
> col_name  key
> data_type int
> comment   column_comment
> min   1
> max   4
> num_nulls 0
> distinct_count4
> avg_col_len   4
> max_col_len   4
> histogram height: 4.0, num_of_bins: 2
> bin_0 lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2
> bin_1 lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2
> {noformat}
> But the result in our golden file is:
> https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out#L231-L242






[jira] [Created] (SPARK-28611) Histogram's height is different

2019-08-03 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28611:
---

 Summary: Histogram's height is different
 Key: SPARK-28611
 URL: https://issues.apache.org/jira/browse/SPARK-28611
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Yuming Wang


{code:sql}
CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET;
-- Test output for histogram statistics
SET spark.sql.statistics.histogram.enabled=true;
SET spark.sql.statistics.histogram.numBins=2;

INSERT INTO desc_col_table values 1, 2, 3, 4;

ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;

DESC EXTENDED desc_col_table key;
{code}

{noformat}
spark-sql> DESC EXTENDED desc_col_table key;
col_name  key
data_type   int
comment column_comment
min 1
max 4
num_nulls   0
distinct_count  4
avg_col_len 4
max_col_len 4
histogram   height: 4.0, num_of_bins: 2
bin_0   lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2
bin_1   lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2
{noformat}

But the result in our golden file is:
https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out#L231-L242







[jira] [Closed] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan

2019-08-03 Thread Pablo Langa Blanco (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco closed SPARK-28585.
--

> Improve WebUI DAG information: Add extra info to rdd from spark plan
> 
>
> Key: SPARK-28585
> URL: https://issues.apache.org/jira/browse/SPARK-28585
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> The main improvement I want to achieve is to help developers explore the DAG 
> information in the Web UI for complex flows. 
> Sometimes it is very difficult to know which part of your Spark plan 
> corresponds to the DAG you are looking at.
>  
> This is an initial improvement covering only one simple Spark plan type 
> (UnionExec). 
> If you consider it a good idea, I want to extend it to other Spark plans to 
> improve the visualization iteratively.
>  
> More info is in the pull request.






[jira] [Resolved] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan

2019-08-03 Thread Pablo Langa Blanco (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco resolved SPARK-28585.

Resolution: Won't Fix

> Improve WebUI DAG information: Add extra info to rdd from spark plan
> 
>
> Key: SPARK-28585
> URL: https://issues.apache.org/jira/browse/SPARK-28585
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> The main improvement I want to achieve is to help developers explore the DAG 
> information in the Web UI for complex flows. 
> Sometimes it is very difficult to know which part of your Spark plan 
> corresponds to the DAG you are looking at.
>  
> This is an initial improvement covering only one simple Spark plan type 
> (UnionExec). 
> If you consider it a good idea, I want to extend it to other Spark plans to 
> improve the visualization iteratively.
>  
> More info is in the pull request.






[jira] [Commented] (SPARK-28587) JDBC data source's partition whereClause should support jdbc dialect

2019-08-03 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899428#comment-16899428
 ] 

Takeshi Yamamuro commented on SPARK-28587:
--

ok, I'll work on it.

> JDBC data source's partition whereClause should support jdbc dialect
> 
>
> Key: SPARK-28587
> URL: https://issues.apache.org/jira/browse/SPARK-28587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: wyp
>Priority: Minor
>
> When we use the JDBC data source to read data from Phoenix and use a 
> timestamp column as the partitionColumn, e.g.
> {code:java}
> val url = "jdbc:phoenix:thin:url=localhost:8765;serialization=PROTOBUF"
> val driver = "org.apache.phoenix.queryserver.client.Driver"
> val df = spark.read.format("jdbc")
> .option("url", url)
> .option("driver", driver)
> .option("fetchsize", "1000")
> .option("numPartitions", "6")
> .option("partitionColumn", "times")
> .option("lowerBound", "2019-07-31 00:00:00")
> .option("upperBound", "2019-08-01 00:00:00")
> .option("dbtable", "test")
> .load().select("id")
> println(df.count())
> {code}
> Phoenix throws an AvaticaSqlException:
> {code:java}
> org.apache.calcite.avatica.AvaticaSqlException: Error -1 (0) : while 
> preparing SQL: SELECT 1 FROM search_info_test WHERE "TIMES" < '2019-07-31 
> 04:00:00' or "TIMES" is null
>   at org.apache.calcite.avatica.Helper.createException(Helper.java:54)
>   at org.apache.calcite.avatica.Helper.createException(Helper.java:41)
>   at 
> org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:368)
>   at 
> org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:299)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:300)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> java.lang.RuntimeException: org.apache.phoenix.schema.TypeMismatchException: 
> ERROR 203 (22005): Type mismatch. TIMESTAMP and VARCHAR for "TIMES" < 
> '2019-07-31 04:00:00'
>   at org.apache.calcite.avatica.jdbc.JdbcMeta.propagate(JdbcMeta.java:700)
>   at 
> org.apache.calcite.avatica.jdbc.PhoenixJdbcMeta.prepare(PhoenixJdbcMeta.java:67)
>   at 
> org.apache.calcite.avatica.remote.LocalService.apply(LocalService.java:195)
>   at 
> org.apache.calcite.avatica.remote.Service$PrepareRequest.accept(Service.java:1215)
>   at 
> org.apache.calcite.avatica.remote.Service$PrepareRequest.accept(Service.java:1186)
>   at 
> org.apache.calcite.avatica.remote.AbstractHandler.apply(AbstractHandler.java:94)
>   at 
> org.apache.calcite.avatica.remote.ProtobufHandler.apply(ProtobufHandler.java:46)
>   at 
> org.apache.calcite.avatica.server.AvaticaProtobufHandler.handle(AvaticaProtobufHandler.java:127)
>   at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.eclipse.jetty.server.Server.handle(Server.java:534)
>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>   at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> 
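
One possible shape of the fix named in the title, as a hypothetical sketch (these traits and objects are not existing Spark or Phoenix APIs): let the JDBC dialect decide how a timestamp partition bound is rendered, instead of always emitting a bare string literal.

{code:lang=scala}
import java.sql.Timestamp

// Hypothetical hook: how a partition bound becomes SQL text.
trait PartitionLiteral {
  def timestampLiteral(ts: Timestamp): String
}

// Today's behavior, which Phoenix rejects as a VARCHAR vs. TIMESTAMP mismatch.
object DefaultLiteral extends PartitionLiteral {
  def timestampLiteral(ts: Timestamp): String = s"'$ts'"
}

// A Phoenix-aware dialect could wrap the value so it parses as a timestamp.
object PhoenixLiteral extends PartitionLiteral {
  def timestampLiteral(ts: Timestamp): String = s"TO_TIMESTAMP('$ts')"
}

val bound = Timestamp.valueOf("2019-07-31 04:00:00")
val whereClause =
  s""""TIMES" < ${PhoenixLiteral.timestampLiteral(bound)} or "TIMES" is null"""
// e.g. "TIMES" < TO_TIMESTAMP('2019-07-31 04:00:00.0') or "TIMES" is null
{code}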

[jira] [Created] (SPARK-28610) Support larger buffer for sum of long

2019-08-03 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28610:
---

 Summary: Support larger buffer for sum of long
 Key: SPARK-28610
 URL: https://issues.apache.org/jira/browse/SPARK-28610
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


The sum of a long field currently uses a buffer of type long.

When the flag for throwing exceptions on overflow of arithmetic operations is 
turned on, this is a problem when there are intermediate overflows that are 
later cancelled out by other rows. In such a case we throw an exception even 
though the final result is representable as a long. An example of this issue 
can be seen by running:

{code}
val df = sc.parallelize(Seq(100L, Long.MaxValue, -1000L)).toDF("a")
df.select(sum($"a")).show()
{code}

Following [~cloud_fan]'s suggestion in 
https://github.com/apache/spark/pull/21599, we should introduce a config flag 
that lets users choose a wider data type for the sum buffer, so that the above 
issue can be fixed.
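
Until such a config exists, a hedged workaround for the example above is to widen the input explicitly so the intermediate sum is carried in a decimal buffer (assuming a spark-shell session where {{sc}}, {{spark}} and the implicits are in scope):

{code:lang=scala}
import org.apache.spark.sql.functions.sum

val df = sc.parallelize(Seq(100L, Long.MaxValue, -1000L)).toDF("a")

// decimal(38, 0) absorbs the intermediate overflow; the final value,
// Long.MaxValue - 900, still fits back into a long.
df.select(sum($"a".cast("decimal(38, 0)")).cast("long")).show()
{code}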






[jira] [Commented] (SPARK-28527) Build a Test Framework for Thriftserver

2019-08-03 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899368#comment-16899368
 ] 

Yuming Wang commented on SPARK-28527:
-

While building this test framework, I found one inconsistent behavior:

[SPARK-28461] Thrift Server pads decimal numbers with trailing zeros to the 
scale of the column:
!image-2019-08-03-13-58-06-106.png!

> Build a Test Framework for Thriftserver
> ---
>
> Key: SPARK-28527
> URL: https://issues.apache.org/jira/browse/SPARK-28527
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
> Attachments: image-2019-08-03-13-58-06-106.png
>
>
> https://issues.apache.org/jira/browse/SPARK-28463 shows we need to improve 
> the overall test coverage for the Thrift server. 
> We can build a test framework that directly re-runs all the tests in 
> SQLQueryTestSuite via the Thrift Server. This can help our test coverage of 
> BI use cases.
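
A minimal sketch of the replay idea, assuming a Thrift server is already running on the default port 10000 (connection details and the query are placeholders):

{code:lang=scala}
import java.sql.DriverManager

// The Hive JDBC driver is what BI tools use to talk to the Thrift server.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
try {
  val stmt = conn.createStatement()
  // Re-running a SQLQueryTestSuite-style statement over JDBC surfaces
  // differences such as the decimal trailing-zero padding noted above.
  val rs = stmt.executeQuery("SELECT CAST(1 AS DECIMAL(10, 3)) AS d")
  while (rs.next()) {
    println(rs.getString("d"))
  }
} finally {
  conn.close()
}
{code}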


