[jira] [Resolved] (SPARK-28205) useV1SourceList configuration should be for all data sources

2019-06-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28205.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25004
[https://github.com/apache/spark/pull/25004]

> useV1SourceList configuration should be for all data sources
> 
>
> Key: SPARK-28205
> URL: https://issues.apache.org/jira/browse/SPARK-28205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the migration PR of Kafka V2: 
> https://github.com/apache/spark/pull/24738/files/ac16c9a9ef1c68db5aeda6c7001ae9abe96a358a#r298470645
> We found that the useV1SourceList configuration 
> (spark.sql.sources.read.useV1SourceList and 
> spark.sql.sources.write.useV1SourceList) should apply to all data sources, 
> not only to file source V2.
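
A minimal spark-shell sketch of how such a fallback list might be set; the source names used here are illustrative assumptions, not a statement of defaults:

{code:scala}
// Minimal sketch, assuming a spark-shell session where `spark` is the SparkSession.
// The source names ("kafka", "csv") are examples only.
spark.conf.set("spark.sql.sources.read.useV1SourceList", "kafka,csv")
spark.conf.set("spark.sql.sources.write.useV1SourceList", "kafka,csv")

// With the change proposed in this ticket, the lists above would make any listed
// data source fall back to its V1 implementation, not only file source V2.
{code}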



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28205) useV1SourceList configuration should be for all data sources

2019-06-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28205:
---

Assignee: Gengliang Wang

> useV1SourceList configuration should be for all data sources
> 
>
> Key: SPARK-28205
> URL: https://issues.apache.org/jira/browse/SPARK-28205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> In the migration PR of Kafka V2: 
> https://github.com/apache/spark/pull/24738/files/ac16c9a9ef1c68db5aeda6c7001ae9abe96a358a#r298470645
> We found that the useV1SourceList configuration 
> (spark.sql.sources.read.useV1SourceList and 
> spark.sql.sources.write.useV1SourceList) should apply to all data sources, 
> not only to file source V2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28220) join foldable condition not pushed down when parent filter is totally pushed down

2019-06-30 Thread liupengcheng (JIRA)
liupengcheng created SPARK-28220:


 Summary: join foldable condition not pushed down when parent 
filter is totally pushed down
 Key: SPARK-28220
 URL: https://issues.apache.org/jira/browse/SPARK-28220
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 3.0.0
Reporter: liupengcheng


We encountered an issue where join conditions were not pushed down when running a 
Spark application on Spark 2.3. After carefully looking into the code and debugging, 
we found that it's caused by a bug in the rule `PushPredicateThroughJoin`:

The rule tries to push the parent filter down through the join. However, when the 
parent filter is wholly pushed down through the join, the join becomes the top node, 
and the `transform` method then skips applying the rule to the join.

 

Suppose we have two tables: table1 and table2:

table1: (a: string, b: string, c: string)

table2: (d: string)

The SQL is:

 
{code:java}
select * from table1 left join (select d, 'w1' as r from table2) on a = d and r 
= 'w2' where b = 2{code}
 

Let's focus on the following optimizer rules:

PushPredicateThroughJoin

FoldablePropagation

BooleanSimplification

PruneFilters

 

In the above case, on the first iteration of these rules:

PushPredicateThroughJoin -> 
{code:java}
select * from table1 where b=2 left join (select d, 'w1' as r from table2) on a 
= d and r = 'w2'
{code}
FoldablePropagation ->
{code:java}
select * from table1 where b=2 left join (select d, 'w1' as r from table2) on a 
= d and 'w1' = 'w2'{code}
BooleanSimplification ->
{code:java}
select * from table1 where b=2 left join (select d, 'w1' as r from table2) on 
false{code}
PruneFilters -> no effect

 

After several iterations of these rules, the join condition is still never pushed to 
the right-hand side of the left join. Thus, in some cases (e.g. a large right table), 
the `BroadcastNestedLoopJoin` may be slow or OOM.
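
A minimal spark-shell reproduction sketch of the setup above (the sample rows are illustrative, and a subquery alias is added so the statement parses):

{code:scala}
// Reproduction sketch only; assumes a spark-shell session where `spark` is the SparkSession.
import spark.implicits._

Seq(("a1", "2", "c1"), ("a2", "3", "c2")).toDF("a", "b", "c").createOrReplaceTempView("table1")
Seq("a1", "a2").toDF("d").createOrReplaceTempView("table2")

val q = spark.sql(
  """SELECT * FROM table1
    |LEFT JOIN (SELECT d, 'w1' AS r FROM table2) t2
    |ON a = d AND r = 'w2'
    |WHERE b = 2""".stripMargin)

// Inspect the optimized plan: per the report above, the foldable condition is not
// pushed to the right-hand side, so the join is not pruned away.
q.explain(true)
{code}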



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27802) SparkUI throws NoSuchElementException when inconsistency appears between `ExecutorStageSummaryWrapper`s and `ExecutorSummaryWrapper`s

2019-06-30 Thread liupengcheng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875945#comment-16875945
 ] 

liupengcheng commented on SPARK-27802:
--

[~shahid] Yes, but I checked the master branch and found that this logic was removed 
in 3.0.0, so I'm not sure whether we can fix it only in 2.3; that's why I haven't 
opened a PR for it.

You can follow these steps to reproduce the issue (a configuration sketch follows the 
list):
 # set spark.ui.retainedDeadExecutors=0 and spark.ui.retainedStages=1000
 # set spark.dynamicAllocation.enabled=true
 # run a Spark app, wait for it to complete, and let the executors go idle.
 # check the stage UI.
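
A configuration sketch for steps 1-3 (the job itself is just a placeholder; any small job that allocates executors and then lets them idle should do):

{code:scala}
// Configuration sketch only; assumes a cluster manager where dynamic allocation works
// (e.g. YARN with the external shuffle service enabled).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-27802-repro")
  .config("spark.ui.retainedDeadExecutors", "0")
  .config("spark.ui.retainedStages", "1000")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // typically required for dynamic allocation
  .getOrCreate()

// Run any small job, then let the executors idle out before opening the stage page.
spark.range(0, 1000000).selectExpr("sum(id)").collect()
{code}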

> SparkUI throws NoSuchElementException when inconsistency appears between 
> `ExecutorStageSummaryWrapper`s and `ExecutorSummaryWrapper`s
> -
>
> Key: SPARK-27802
> URL: https://issues.apache.org/jira/browse/SPARK-27802
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we hit this issue when testing Spark 2.3. It reports the following 
> error message when clicking on the stage UI link.
> We added more logs to print the executorId (here it is 10) to debug, and finally 
> found out that it's caused by an inconsistency between the list of 
> `ExecutorStageSummaryWrapper`s and the list of `ExecutorSummaryWrapper`s in the 
> KVStore. The number of dead executors may exceed the threshold, so an executor is 
> removed from the list of `ExecutorSummaryWrapper`s while still being kept in the 
> list of `ExecutorStageSummaryWrapper`s in the store.
> {code:java}
> HTTP ERROR 500
> Problem accessing /stages/stage/. Reason:
> Server Error
> Caused by:
> java.util.NoSuchElementException: 10
>   at 
> org.apache.spark.util.kvstore.InMemoryStore.read(InMemoryStore.java:83)
>   at 
> org.apache.spark.status.ElementTrackingStore.read(ElementTrackingStore.scala:95)
>   at 
> org.apache.spark.status.AppStatusStore.executorSummary(AppStatusStore.scala:70)
>   at 
> org.apache.spark.ui.jobs.ExecutorTable$$anonfun$createExecutorTable$2.apply(ExecutorTable.scala:99)
>   at 
> org.apache.spark.ui.jobs.ExecutorTable$$anonfun$createExecutorTable$2.apply(ExecutorTable.scala:92)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ui.jobs.ExecutorTable.createExecutorTable(ExecutorTable.scala:92)
>   at 
> org.apache.spark.ui.jobs.ExecutorTable.toNodeSeq(ExecutorTable.scala:75)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:478)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>   at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:539)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:33

[jira] [Assigned] (SPARK-28195) CheckAnalysis not working for Command and report misleading error message

2019-06-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28195:


Assignee: Apache Spark

> CheckAnalysis not working for Command and report misleading error message
> -
>
> Key: SPARK-28195
> URL: https://issues.apache.org/jira/browse/SPARK-28195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: liupengcheng
>Assignee: Apache Spark
>Priority: Major
>
> Recently, we encountered an issue when executing 
> `InsertIntoDataSourceDirCommand`: its query relied on a non-existent table or 
> view, but we got a misleading error message:
> {code:java}
> Caused by: org.apache.spark.sql.catalyst.analysis.UnresolvedException: 
> Invalid call to dataType on unresolved object, tree: 'kr.objective_id
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:440)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:159)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:159)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:544)
> at 
> org.apache.spark.sql.execution.command.InsertIntoDataSourceDirCommand.run(InsertIntoDataSourceDirCommand.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:246)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3277)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3276)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:277)
> ... 11 more
> {code}
> After looking into the code, I found that it's because we have supported the 
> `runSQLOnFiles` feature since 2.3: if the table does not exist and it's not a 
> temporary table, it will be treated as running directly on files.
> The `ResolveSQLOnFile` rule will analyze it and return an `UnresolvedRelation` on 
> resolution failure (it's actually not SQL on files, so it fails to resolve). Because 
> a Command has no children, `CheckAnalysis` skips checking the `UnresolvedRelation`, 
> and we finally get the above misleading error message when executing this command.
> I think maybe we should run checkAnalysis on a command's query plan? Or is there a 
> reason for not checking analysis for commands?
> This issue seems to still exist in the master branch.
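
For illustration, one way to hit this code path (the table name and output directory are hypothetical placeholders) is an INSERT OVERWRITE DIRECTORY whose query references a table that does not exist:

{code:scala}
// Hypothetical reproduction sketch; `some_missing_table` and the path are placeholders.
spark.sql("""
  INSERT OVERWRITE DIRECTORY '/tmp/spark-28195-out'
  USING parquet
  SELECT kr.objective_id FROM some_missing_table kr
""")
// Per the report above, instead of a clear "table or view not found" error, this
// surfaces the UnresolvedException on 'kr.objective_id shown in the stack trace.
{code}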



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28215) as_tibble was removed from Arrow R API

2019-06-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28215.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25012
[https://github.com/apache/spark/pull/25012]

> as_tibble was removed from Arrow R API
> --
>
> Key: SPARK-28215
> URL: https://issues.apache.org/jira/browse/SPARK-28215
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> The new R API of Arrow has removed `as_tibble`. Arrow-optimized collect in R 
> doesn't work now due to this change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28215) as_tibble was removed from Arrow R API

2019-06-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28215:


Assignee: Liang-Chi Hsieh

> as_tibble was removed from Arrow R API
> --
>
> Key: SPARK-28215
> URL: https://issues.apache.org/jira/browse/SPARK-28215
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> The new R API of Arrow has removed `as_tibble`. Arrow-optimized collect in R 
> doesn't work now due to this change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28195) CheckAnalysis not working for Command and report misleading error message

2019-06-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28195:


Assignee: (was: Apache Spark)

> CheckAnalysis not working for Command and report misleading error message
> -
>
> Key: SPARK-28195
> URL: https://issues.apache.org/jira/browse/SPARK-28195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we encountered an issue when executing 
> `InsertIntoDataSourceDirCommand`: its query relied on a non-existent table or 
> view, but we got a misleading error message:
> {code:java}
> Caused by: org.apache.spark.sql.catalyst.analysis.UnresolvedException: 
> Invalid call to dataType on unresolved object, tree: 'kr.objective_id
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:440)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:159)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:159)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:544)
> at 
> org.apache.spark.sql.execution.command.InsertIntoDataSourceDirCommand.run(InsertIntoDataSourceDirCommand.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> at 
> org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:246)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3277)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3276)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:277)
> ... 11 more
> {code}
> After looking into the code, I found that it's because we have supported the 
> `runSQLOnFiles` feature since 2.3: if the table does not exist and it's not a 
> temporary table, it will be treated as running directly on files.
> The `ResolveSQLOnFile` rule will analyze it and return an `UnresolvedRelation` on 
> resolution failure (it's actually not SQL on files, so it fails to resolve). Because 
> a Command has no children, `CheckAnalysis` skips checking the `UnresolvedRelation`, 
> and we finally get the above misleading error message when executing this command.
> I think maybe we should run checkAnalysis on a command's query plan? Or is there a 
> reason for not checking analysis for commands?
> This issue seems to still exist in the master branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28201) Revisit MakeDecimal behavior on overflow

2019-06-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28201:
---

Assignee: Marco Gaido

> Revisit MakeDecimal behavior on overflow
> 
>
> Key: SPARK-28201
> URL: https://issues.apache.org/jira/browse/SPARK-28201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
>
> As pointed out in 
> https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special 
> cases of decimal aggregation we use the `MakeDecimal` operator.
> This operator's behavior on overflow is not well defined; what it does currently is:
>  - if codegen is enabled, it returns null;
>  - in interpreted mode, it throws an `IllegalArgumentException`.
> We should make its behavior uniform with other similar cases and, in particular, 
> honor the value of the conf introduced in SPARK-23179 and behave accordingly, i.e.:
>  - return null if the flag is true;
>  - throw an `ArithmeticException` if the flag is false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28201) Revisit MakeDecimal behavior on overflow

2019-06-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28201.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25010
[https://github.com/apache/spark/pull/25010]

> Revisit MakeDecimal behavior on overflow
> 
>
> Key: SPARK-28201
> URL: https://issues.apache.org/jira/browse/SPARK-28201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> As pointed out in 
> https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special 
> cases of decimal aggregation we use the `MakeDecimal` operator.
> This operator's behavior on overflow is not well defined; what it does currently is:
>  - if codegen is enabled, it returns null;
>  - in interpreted mode, it throws an `IllegalArgumentException`.
> We should make its behavior uniform with other similar cases and, in particular, 
> honor the value of the conf introduced in SPARK-23179 and behave accordingly, i.e.:
>  - return null if the flag is true;
>  - throw an `ArithmeticException` if the flag is false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28170) DenseVector .toArray() and .values documentation do not specify they are aliases

2019-06-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28170.
--
   Resolution: Fixed
Fix Version/s: 2.4.4
   3.0.0

Issue resolved by pull request 25011
[https://github.com/apache/spark/pull/25011]

> DenseVector .toArray() and .values documentation do not specify they are 
> aliases
> 
>
> Key: SPARK-28170
> URL: https://issues.apache.org/jira/browse/SPARK-28170
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.3
>Reporter: Sivam Pasupathipillai
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 3.0.0, 2.4.4
>
>
> The documentation of the *toArray()* method and the *values* property in 
> pyspark.ml.linalg.DenseVector is confusing.
> *toArray():* Returns a numpy.ndarray
> *values:* Returns a list of values
> However, they are actually aliases and both return a numpy.ndarray.
> FIX: either change the documentation or change the *values* property to 
> return a Python list.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28170) DenseVector .toArray() and .values documentation do not specify they are aliases

2019-06-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28170:


Assignee: Marco Gaido

> DenseVector .toArray() and .values documentation do not specify they are 
> aliases
> 
>
> Key: SPARK-28170
> URL: https://issues.apache.org/jira/browse/SPARK-28170
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.3
>Reporter: Sivam Pasupathipillai
>Assignee: Marco Gaido
>Priority: Trivial
>
> The documentation of the *toArray()* method and the *values* property in 
> pyspark.ml.linalg.DenseVector is confusing.
> *toArray():* Returns a numpy.ndarray
> *values:* Returns a list of values
> However, they are actually aliases and both return a numpy.ndarray.
> FIX: either change the documentation or change the *values* property to 
> return a Python list.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26879) Inconsistency in default column names for functions like inline and stack

2019-06-30 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26879:
-
Affects Version/s: (was: 2.4.0)
   3.0.0

> Inconsistency in default column names for functions like inline and stack
> -
>
> Key: SPARK-26879
> URL: https://issues.apache.org/jira/browse/SPARK-26879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jash Gala
>Priority: Minor
>
> In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 
> 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed 
> columns).
> {code:title=spark-shell|borderStyle=solid}
> scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
> +----+----+
> |col0|col1|
> +----+----+
> |   1|   2|
> |   3|null|
> +----+----+
> scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
> +----+----+
> |col1|col2|
> +----+----+
> |   1|   a|
> |   2|   b|
> +----+----+
> {code}
> This feels like an issue with consistency. As discussed on [PR 
> #23748|https://github.com/apache/spark/pull/23748], it might be a good idea 
> to standardize this to something specific (like zero-based indexing) for 
> these and other similar functions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25390) data source V2 API refactoring

2019-06-30 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875893#comment-16875893
 ] 

Wenchen Fan commented on SPARK-25390:
-

Yes, we should have a user guide for data source v2 in Spark 3.0. I've created 
a blocker ticket for it: https://issues.apache.org/jira/browse/SPARK-28219. 
Also cc [~rdblue]

> data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently it's not very clear how we should abstract the data source V2 API. The 
> abstraction should be unified between batch and streaming, or similar but with a 
> well-defined difference between batch and streaming. The abstraction should also 
> include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source V2 API according to this abstraction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28219) Data source v2 user guide

2019-06-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28219:

Priority: Blocker  (was: Major)

> Data source v2 user guide
> -
>
> Key: SPARK-28219
> URL: https://issues.apache.org/jira/browse/SPARK-28219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28219) Data source v2 user guide

2019-06-30 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-28219:
---

 Summary: Data source v2 user guide
 Key: SPARK-28219
 URL: https://issues.apache.org/jira/browse/SPARK-28219
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2019-06-30 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875858#comment-16875858
 ] 

koert kuipers edited comment on SPARK-27780 at 6/30/19 8:29 PM:


FWIW, I just ran into this since I am building and deploying Spark from master 
(which includes SPARK-27665), but my shuffle service is Spark 2.4.x and I cannot 
easily upgrade it.

So I now get:
{code}
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: 
Unknown message type: 9
at 
org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:71)
{code}

I am aware of the workarounds; just wanted to let you know.


was (Author: koert):
FWIW, I just ran into this since I am building and deploying Spark from master 
(which includes SPARK-27665), but my shuffle service is Spark 2.4.x and I cannot 
easily upgrade it.

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But, may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.
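
To make option 3 concrete, here is a rough, non-authoritative sketch of what a per-connection version exchange could look like; the case classes and negotiation logic are hypothetical illustrations, not Spark's actual protocol classes:

{code:scala}
// Illustrative sketch only: these types are hypothetical, not part of Spark's
// network-common module.
case class ProtocolVersion(major: Int, minor: Int)

case class Hello(version: ProtocolVersion, supportedMessageTypes: Set[Byte])

// Each side sends a Hello first; both then restrict themselves to the lower common
// version and the intersection of supported message types, so e.g. a 3.0 client can
// degrade gracefully when talking to a 2.4 shuffle service.
def negotiate(client: Hello, server: Hello): (ProtocolVersion, Set[Byte]) = {
  val version = Seq(client.version, server.version).minBy(v => (v.major, v.minor))
  (version, client.supportedMessageTypes intersect server.supportedMessageTypes)
}
{code}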



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2019-06-30 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875858#comment-16875858
 ] 

koert kuipers commented on SPARK-27780:
---

FWIW, I just ran into this since I am building and deploying Spark from master 
(which includes SPARK-27665), but my shuffle service is Spark 2.4.x and I cannot 
easily upgrade it.

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But, may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28160) TransportClient.sendRpcSync may hang forever

2019-06-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28160:
-

Assignee: Lantao Jin

> TransportClient.sendRpcSync may hang forever
> 
>
> Key: SPARK-28160
> URL: https://issues.apache.org/jira/browse/SPARK-28160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>
> This is very similar to 
> [SPARK-26665|https://issues.apache.org/jira/browse/SPARK-26665].
> `ByteBuffer.allocate` may throw an OutOfMemoryError when the response is large 
> but not enough memory is available. However, when this happens, 
> TransportClient.sendRpcSync will just hang forever if the timeout is set to 
> unlimited.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28160) TransportClient.sendRpcSync may hang forever

2019-06-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28160.
---
   Resolution: Fixed
Fix Version/s: 2.4.4
   2.3.4
   3.0.0

Issue resolved by pull request 24964
[https://github.com/apache/spark/pull/24964]

> TransportClient.sendRpcSync may hang forever
> 
>
> Key: SPARK-28160
> URL: https://issues.apache.org/jira/browse/SPARK-28160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
>
> This is very similar to 
> [SPARK-26665|https://issues.apache.org/jira/browse/SPARK-26665].
> `ByteBuffer.allocate` may throw an OutOfMemoryError when the response is large 
> but not enough memory is available. However, when this happens, 
> TransportClient.sendRpcSync will just hang forever if the timeout is set to 
> unlimited.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11412) Support merge schema for ORC

2019-06-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-11412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-11412:
---

Assignee: EdisonWang

> Support merge schema for ORC
> 
>
> Key: SPARK-11412
> URL: https://issues.apache.org/jira/browse/SPARK-11412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.0, 2.1.1, 2.2.0
>Reporter: Dave
>Assignee: EdisonWang
>Priority: Major
> Fix For: 3.0.0
>
>
> When I tried to load partitioned ORC files with a slight difference in a 
> nested column, say the column:
> |-- request: struct (nullable = true)
> |    |-- datetime: string (nullable = true)
> |    |-- host: string (nullable = true)
> |    |-- ip: string (nullable = true)
> |    |-- referer: string (nullable = true)
> |    |-- request_uri: string (nullable = true)
> |    |-- uri: string (nullable = true)
> |    |-- useragent: string (nullable = true)
> And then there's a page_url_lists attribute in the later partitions.
> I tried to use
> val s = sqlContext.read.format("orc").option("mergeSchema", 
> "true").load("/data/warehouse/") to load the data.
> But the schema doesn't show request.page_url_lists.
> I am wondering if schema merge doesn't work for ORC?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23153) Support application dependencies in submission client's local file system

2019-06-30 Thread Eric Joel Blanco-Hermida (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875778#comment-16875778
 ] 

Eric Joel Blanco-Hermida commented on SPARK-23153:
--

Has this been fixed for Spark 2.4.X too? 

> Support application dependencies in submission client's local file system
> -
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently local dependencies are not supported with Spark on K8S i.e. if the 
> user has code or dependencies only on the client where they run 
> {{spark-submit}} then the current implementation has no way to make those 
> visible to the Spark application running inside the K8S pods that get 
> launched.  This limits users to only running applications where the code and 
> dependencies are either baked into the Docker images used or where those are 
> available via some external and globally accessible file system e.g. HDFS 
> which are not viable options for many users and environments



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28218) Migrate Avro to File source V2

2019-06-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28218:


Assignee: Apache Spark

> Migrate Avro to File source V2
> --
>
> Key: SPARK-28218
> URL: https://issues.apache.org/jira/browse/SPARK-28218
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28218) Migrate Avro to File source V2

2019-06-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28218:


Assignee: (was: Apache Spark)

> Migrate Avro to File source V2
> --
>
> Key: SPARK-28218
> URL: https://issues.apache.org/jira/browse/SPARK-28218
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28218) Migrate Avro to File source V2

2019-06-30 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-28218:
--

 Summary: Migrate Avro to File source V2
 Key: SPARK-28218
 URL: https://issues.apache.org/jira/browse/SPARK-28218
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org