[jira] [Resolved] (SPARK-24934) Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases

2018-07-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24934.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.4.0
   2.3.2

> Complex type and binary type in in-memory partition pruning does not work due 
> to missing upper/lower bounds cases
> -
>
> Key: SPARK-24934
> URL: https://issues.apache.org/jira/browse/SPARK-24934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> For example, if an array column is used (where the lower and upper bounds for 
> its column batch are {{null}}), it wrongly filters all data out:
> {code}
> scala> import org.apache.spark.sql.functions
> import org.apache.spark.sql.functions
> scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
> df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>]
> scala> df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), 
> functions.lit("b")))).show()
> ++
> |arrayCol|
> ++
> |  [a, b]|
> ++
> scala> df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"),
>  functions.lit("b")))).show()
> ++
> |arrayCol|
> ++
> ++
> {code}
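
As a point of reference (not part of the original report), one possible workaround on 
affected versions is to disable stats-based pruning of cached batches via the internal 
{{spark.sql.inMemoryColumnarStorage.partitionPruning}} conf; a minimal sketch, assuming 
the same spark-shell session as above:

{code}
scala> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruning", false)

scala> df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), functions.lit("b")))).show()
// With pruning disabled, every cached batch is scanned, so the [a, b] row
// should be returned again (at the cost of skipping the pruning optimization).
{code}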






[jira] [Updated] (SPARK-24966) Fix the precedence rule for set operations.

2018-07-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24966:

Target Version/s: 2.4.0

> Fix the precedence rule for set operations.
> ---
>
> Key: SPARK-24966
> URL: https://issues.apache.org/jira/browse/SPARK-24966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Major
>
> Currently the set operations INTERSECT, UNION and EXCEPT are assigned the 
> same precedence. We need to change this so that INTERSECT is given higher 
> precedence than UNION and EXCEPT. UNION and EXCEPT should be evaluated in the 
> order they appear in the query, from left to right. 
> Given that this will result in a change in behavior, we need to keep it under 
> a config.
> Here is a reference:
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017






[jira] [Updated] (SPARK-24955) spark continuing to execute on a task despite not reading all data from a downed machine

2018-07-29 Thread San Tung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

San Tung updated SPARK-24955:
-
Description: 
We've recently run into a few instances where a downed node has led to 
incomplete data, causing correctness issues, which we can reproduce some of the 
time.

*Setup:*
 - we're currently on spark 2.3.0
 - we allow retries on failed tasks and stages
 - we use PySpark to perform these operations

*Stages:*

Simplistically, the job does the following:
 - Stage 1/2: computes a number of `(sha256 hash, 0, 1)` tuples partitioned into 65536 
partitions
 - Stage 3/4: computes a number of `(sha256 hash, 1, 0)` tuples partitioned into 6408 
partitions (one hash may exist in multiple partitions)
 - Stage 5 (see the sketch after this list):
 - repartitions stage 2 and stage 4 by the first 2 bytes of each hash, and finds 
which ones are not in common (stage 2 hashes - stage 4 hashes).
 - stores this partition into a persistent data source.
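
A minimal Scala sketch of the stage-5 shape described above (the reporters use 
PySpark; the input paths, the hash column name, and the prefix-based 
repartitioning expression here are illustrative assumptions):

{noformat}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.substring

val spark = SparkSession.builder.appName("stage5-sketch").getOrCreate()
import spark.implicits._

// Stage 2 and stage 4 outputs: two datasets of sha256 hashes (illustrative inputs).
val stage2Hashes = spark.read.parquet("/illustrative/stage2").select($"hash")
val stage4Hashes = spark.read.parquet("/illustrative/stage4").select($"hash")

// Stage 5: repartition both sides by a prefix of each hash and keep the hashes
// that appear in stage 2 but not in stage 4, then persist the result.
val missing = stage2Hashes.repartition(substring($"hash", 1, 2))
  .except(stage4Hashes.repartition(substring($"hash", 1, 2)))

missing.write.parquet("/illustrative/output")
{noformat}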

*Failure Scenario:*
 - We take out one of the machines (do a forced shutdown, for example)
 - For some tasks, stage 5 will die immediately with one of the following:
 ** `ExecutorLostFailure (executor 24 exited caused by one of the running 
tasks) Reason: worker lost`
 ** `FetchFailed(BlockManagerId(24, [redacted], 36829, None), shuffleId=2, 
mapId=14377, reduceId=48402, message=`
 - These tasks are reused to recalculate the stage 1-2 and 3-4 data that was 
missing on the downed nodes, which Spark recomputes correctly.
 - However, some tasks still continue executing from stage 5, seemingly missing 
stage 4 data, and dump incorrect data to the stage 5 data source. We noticed the 
subtract operation taking ~1-2 minutes after the machine goes down and storing 
a lot more data than usual (which on inspection is wrong).
 - We've also seen this happen with slightly different execution plans that 
don't involve or-ing, but they end up being some variant of missing stage 4 
data.

However, we cannot reproduce this consistently - sometimes all tasks fail 
gracefully. When the downed node is handled correctly, all of these tasks fail 
and re-run stages 1-2/3-4. Note that the job produces correct results as long as 
the machines stay alive!

We were wondering whether a machine going down can leave a task in a state where 
it keeps executing even though not all of its data has been fetched, which would 
give us incorrect results (or whether there is a setting that allows this - we 
tried scanning the Spark configs up and down). This seems similar to 
https://issues.apache.org/jira/browse/SPARK-24160 (maybe we get an empty 
packet?), but it doesn't look like that change was meant to explicitly resolve a 
known bug.


[jira] [Updated] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24957:

Target Version/s: 2.4.0

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the value of the 
> correct result times {{10^14}}).
> When codegen is disabled all results are correct. 
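
For reference, the expected value can also be checked by hand; this quick sanity 
check (added here, not part of the original report) confirms which of the results 
above is the correct one:

{noformat}
// 4 values of 12.0 and 3 values of 11.88, so the per-group average should be
//   (4 * 12.0 + 3 * 11.88) / 7 = 83.64 / 7 = 11.948571428571...
// which matches df_grouped_1 and df_total_sum but not df_grouped_2.
scala> (BigDecimal("12.0") * 4 + BigDecimal("11.88") * 3) / 7
{noformat}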






[jira] [Commented] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561334#comment-16561334
 ] 

Saisai Shao commented on SPARK-24957:
-

It's not necessary to mark this as a blocker; we still have plenty of time before 
the 2.4 release. I will mark the target version as 2.4.0.

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the value of the 
> correct result times {{10^14}}).
> When codegen is disabled all results are correct. 






[jira] [Updated] (SPARK-24932) Allow update mode for streaming queries with join

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24932:

Target Version/s:   (was: 2.3.2)

> Allow update mode for streaming queries with join
> -
>
> Key: SPARK-24932
> URL: https://issues.apache.org/jira/browse/SPARK-24932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Eric Fu
>Priority: Major
>
> In issue SPARK-19140 we supported update output mode for non-aggregation 
> streaming queries. This should also be applied to streaming joins to keep the 
> semantics consistent.
> PS. The streaming join feature was added after SPARK-19140. 
> When using update _output_ mode the join will work exactly as in _append_ mode. 
> However, this will, for example, allow users to run an aggregation-after-join 
> query in update mode in order to get more real-time result output.
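
A sketch of the kind of aggregation-after-join query this change targets, using 
hypothetical rate-source streams (illustrative only; on releases where update mode 
is not allowed for stream-stream joins, the final {{start()}} is expected to be 
rejected by the unsupported-operations check):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("update-mode-join-sketch").getOrCreate()
import spark.implicits._

// Two illustrative streaming inputs keyed by an "adId" column.
val impressions = spark.readStream.format("rate").load().withColumnRenamed("value", "adId")
val clicks      = spark.readStream.format("rate").load().withColumnRenamed("value", "adId")

// Stream-stream inner join followed by an aggregation.
val counts = impressions.join(clicks, "adId").groupBy($"adId").count()

// The proposal would allow running such a query with update output mode,
// so updated per-key counts are emitted as they change.
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
{code}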






[jira] [Commented] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)

2018-07-29 Thread Albert Baker (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561330#comment-16561330
 ] 

Albert Baker commented on SPARK-24964:
--

Fair enough, thanks for your help.

>  Please add OWASP Dependency Check to all component builds (pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, and test environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP DC makes an outbound REST call to MITRE Common Vulnerabilities & 
> Exposures (CVE) to perform a lookup for each dependent .jar and list any/all 
> known vulnerabilities for each jar. This step is needed because a manual 
> MITRE CVE lookup/check on the main component does not include checking for 
> vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for most 
> Java build/make types (ant, maven, ivy, gradle). Also, add the appropriate 
> command to the nightly build to generate a report of all known 
> vulnerabilities in any/all third party libraries/dependencies that get pulled 
> in. For example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.






[jira] [Commented] (SPARK-24932) Allow update mode for streaming queries with join

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561331#comment-16561331
 ] 

Saisai Shao commented on SPARK-24932:
-

I'm going to remove the target version for this JIRA. Committers will set the 
fix version properly when merged.

> Allow update mode for streaming queries with join
> -
>
> Key: SPARK-24932
> URL: https://issues.apache.org/jira/browse/SPARK-24932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Eric Fu
>Priority: Major
>
> In issue SPARK-19140 we supported update output mode for non-aggregation 
> streaming queries. This should also be applied to streaming joins to keep the 
> semantics consistent.
> PS. The streaming join feature was added after SPARK-19140. 
> When using update _output_ mode the join will work exactly as in _append_ mode. 
> However, this will, for example, allow users to run an aggregation-after-join 
> query in update mode in order to get more real-time result output.






[jira] [Updated] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24964:

Target Version/s:   (was: 2.3.2, 2.4.0, 3.0.0, 2.3.3)

>  Please add OWASP Dependency Check to all component builds (pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, and test environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP DC makes an outbound REST call to MITRE Common Vulnerabilities & 
> Exposures (CVE) to perform a lookup for each dependent .jar and list any/all 
> known vulnerabilities for each jar. This step is needed because a manual 
> MITRE CVE lookup/check on the main component does not include checking for 
> vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for most 
> Java build/make types (ant, maven, ivy, gradle). Also, add the appropriate 
> command to the nightly build to generate a report of all known 
> vulnerabilities in any/all third party libraries/dependencies that get pulled 
> in. For example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.






[jira] [Commented] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561327#comment-16561327
 ] 

Saisai Shao commented on SPARK-24964:
-

I'm going to remove the target version that was set; usually we don't set this 
for a feature. Committers will set a fix version when the change is merged.

>  Please add OWASP Dependency Check to all component builds (pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, and test environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP DC makes an outbound REST call to MITRE Common Vulnerabilities & 
> Exposures (CVE) to perform a lookup for each dependent .jar and list any/all 
> known vulnerabilities for each jar. This step is needed because a manual 
> MITRE CVE lookup/check on the main component does not include checking for 
> vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for most 
> Java build/make types (ant, maven, ivy, gradle). Also, add the appropriate 
> command to the nightly build to generate a report of all known 
> vulnerabilities in any/all third party libraries/dependencies that get pulled 
> in. For example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.






[jira] [Commented] (SPARK-9850) Adaptive execution in Spark

2018-07-29 Thread Carson Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561326#comment-16561326
 ] 

Carson Wang commented on SPARK-9850:


We have a new proposal and implementation for Spark SQL adaptive execution, 
discussed in SPARK-23128. Optimizing the join strategy at run time and handling 
skewed joins are also supported. The full code is also available at 
[https://github.com/Intel-bigdata/spark-adaptive|https://github.com/Intel-bigdata/spark-adaptive/].

> Adaptive execution in Spark
> ---
>
> Key: SPARK-9850
> URL: https://issues.apache.org/jira/browse/SPARK-9850
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Yin Huai
>Priority: Major
> Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.






[jira] [Commented] (SPARK-24966) Fix the precedence rule for set operations.

2018-07-29 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561303#comment-16561303
 ] 

Dilip Biswal commented on SPARK-24966:
--

I am working on a fix for this.

> Fix the precedence rule for set operations.
> ---
>
> Key: SPARK-24966
> URL: https://issues.apache.org/jira/browse/SPARK-24966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Major
>
> Currently the set operations INTERSECT, UNION and EXCEPT are assigned the 
> same precedence. We need to change this so that INTERSECT is given higher 
> precedence than UNION and EXCEPT. UNION and EXCEPT should be evaluated in the 
> order they appear in the query, from left to right. 
> Given that this will result in a change in behavior, we need to keep it under 
> a config.
> Here is a reference:
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017






[jira] [Created] (SPARK-24966) Fix the precedence rule for set operations.

2018-07-29 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-24966:


 Summary: Fix the precedence rule for set operations.
 Key: SPARK-24966
 URL: https://issues.apache.org/jira/browse/SPARK-24966
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Dilip Biswal


Currently the set operations INTERSECT, UNION and EXCEPT are assigned the same 
precedence. We need to change this so that INTERSECT is given higher precedence 
than UNION and EXCEPT. UNION and EXCEPT should be evaluated in the order they 
appear in the query, from left to right. 

Given that this will result in a change in behavior, we need to keep it under a 
config.

Here is a reference:
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017
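
A worked example of the difference (a sketch in spark-shell; the groupings in the 
comments follow from the two precedence rules described above, and the query and 
column alias are illustrative):

{code}
scala> spark.sql("SELECT 1 AS v UNION SELECT 2 AS v INTERSECT SELECT 2 AS v").show()
// With equal precedence evaluated left to right (current behavior):
//   (SELECT 1 UNION SELECT 2) INTERSECT SELECT 2   ->  {2}
// With INTERSECT given higher precedence (per the reference above):
//   SELECT 1 UNION (SELECT 2 INTERSECT SELECT 2)   ->  {1, 2}
{code}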






[jira] [Commented] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition

2018-07-29 Thread Kris Geusebroek (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561278#comment-16561278
 ] 

Kris Geusebroek commented on SPARK-24965:
-

PR: https://github.com/apache/spark/pull/21893

> Spark SQL fails when reading a partitioned hive table with different formats 
> per partition
> --
>
> Key: SPARK-24965
> URL: https://issues.apache.org/jira/browse/SPARK-24965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kris Geusebroek
>Priority: Major
>  Labels: pull-request-available
>
> When a Hive Parquet partitioned table contains a partition with a different 
> format (Avro, for example), a select * fails with a read exception (the Avro 
> file is not a Parquet file).
> Selecting in Hive works as expected.
> To support this, a new SQL syntax also needed to be supported:
>  * ALTER TABLE   SET FILEFORMAT 
> This is included in the same PR since the unit test needs it to set up the 
> test data.






[jira] [Comment Edited] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread David Vogelbacher (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561277#comment-16561277
 ] 

David Vogelbacher edited comment on SPARK-24957 at 7/29/18 9:55 PM:


[~mgaido] thanks for putting up the PR!

I wasn't able to reproduce the incorrectness for the specific example I gave 
with whole-stage codegen disabled:
{noformat}
scala> spark.conf.set("spark.sql.codegen.wholeStage", false)

scala> import org.apache.spark.sql.functions
import org.apache.spark.sql.functions

scala> val df = Seq(
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("11.88"))
 | ).toDF("text", "number")
df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]

scala> val df_grouped_1 = 
df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_1.collect()
res1: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])

scala> val df_grouped_2 = 
df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_2.collect()
res2: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])

scala> val df_total_sum = 
df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]

scala> df_total_sum.collect()
res3: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
{noformat}


was (Author: dvogelbacher):
[~mgaido] I wasn't able to reproduce the incorrectness for the specific example 
I gave with whole-stage codegen disabled; that's what I meant:
{noformat}
scala> spark.conf.set("spark.sql.codegen.wholeStage", false)

scala> import org.apache.spark.sql.functions
import org.apache.spark.sql.functions

scala> val df = Seq(
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("11.88"))
 | ).toDF("text", "number")
df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]

scala> val df_grouped_1 = 
df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_1.collect()
res1: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])

scala> val df_grouped_2 = 
df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_2.collect()
res2: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])

scala> val df_total_sum = 
df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]

scala> df_total_sum.collect()
res3: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
{noformat}

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> 

[jira] [Commented] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread David Vogelbacher (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561277#comment-16561277
 ] 

David Vogelbacher commented on SPARK-24957:
---

[~mgaido] I wasn't able to reproduce the incorrectness for the specific example 
I gave with whole-stage codegen disabled; that's what I meant:
{noformat}
scala> spark.conf.set("spark.sql.codegen.wholeStage", false)

scala> import org.apache.spark.sql.functions
import org.apache.spark.sql.functions

scala> val df = Seq(
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("11.88"))
 | ).toDF("text", "number")
df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]

scala> val df_grouped_1 = 
df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_1.collect()
res1: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])

scala> val df_grouped_2 = 
df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_2.collect()
res2: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])

scala> val df_total_sum = 
df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]

scala> df_total_sum.collect()
res3: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
{noformat}

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the value of the 
> correct result times {{10^14}}).
> When codegen is disabled all results are correct. 






[jira] [Created] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition

2018-07-29 Thread Kris Geusebroek (JIRA)
Kris Geusebroek created SPARK-24965:
---

 Summary: Spark SQL fails when reading a partitioned hive table 
with different formats per partition
 Key: SPARK-24965
 URL: https://issues.apache.org/jira/browse/SPARK-24965
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Kris Geusebroek


When a Hive Parquet partitioned table contains a partition with a different 
format (Avro, for example), a select * fails with a read exception (the Avro 
file is not a Parquet file).

Selecting in Hive works as expected.

To support this, a new SQL syntax also needed to be supported:
 * ALTER TABLE   SET FILEFORMAT 

This is included in the same PR since the unit test needs it to set up the 
test data.






[jira] [Commented] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2018-07-29 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561266#comment-16561266
 ] 

Parth Gandhi commented on SPARK-24935:
--

[~cloud_fan] I had the same doubt: maybe Hive UDAFs still have issues fully 
supporting partial aggregation, though I am not so sure. Would it make sense to 
add support for complete aggregation mode to ensure backward compatibility? 
Thank you.

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> A user of the sketches library (https://github.com/DataSketches/sketches-hive) 
> reported an issue with the HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark's Hive UDAF 
> support provides partial aggregation and has removed the functionality that 
> supported complete mode aggregation (refer to 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of the 
> expected update method being called, the merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56]), 
> which throws the exception described in the forums above.






[jira] [Created] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)

2018-07-29 Thread Albert Baker (JIRA)
Albert Baker created SPARK-24964:


 Summary:  Please add OWASP Dependency Check to all component 
builds (pom.xml)
 Key: SPARK-24964
 URL: https://issues.apache.org/jira/browse/SPARK-24964
 Project: Spark
  Issue Type: New Feature
  Components: Build, MLlib, Spark Core, SparkR
Affects Versions: 2.3.1
 Environment: All development, build, and test environments.

~/workspace/spark-2.3.1/pom.xml
~/workspace/spark-2.3.1/assembly/pom.xml
~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
~/workspace/spark-2.3.1/common/kvstore/pom.xml
~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
~/workspace/spark-2.3.1/common/network-common/pom.xml
~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
~/workspace/spark-2.3.1/common/network-yarn/pom.xml
~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
~/workspace/spark-2.3.1/common/sketch/pom.xml
~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
~/workspace/spark-2.3.1/common/tags/pom.xml
~/workspace/spark-2.3.1/common/unsafe/pom.xml
~/workspace/spark-2.3.1/core/pom.xml
~/workspace/spark-2.3.1/examples/pom.xml
~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
~/workspace/spark-2.3.1/external/flume/pom.xml
~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
~/workspace/spark-2.3.1/external/flume-sink/pom.xml
~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
~/workspace/spark-2.3.1/graphx/pom.xml
~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
~/workspace/spark-2.3.1/launcher/pom.xml
~/workspace/spark-2.3.1/mllib/pom.xml
~/workspace/spark-2.3.1/mllib-local/pom.xml
~/workspace/spark-2.3.1/repl/pom.xml
~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
~/workspace/spark-2.3.1/sql/catalyst/pom.xml
~/workspace/spark-2.3.1/sql/core/pom.xml
~/workspace/spark-2.3.1/sql/hive/pom.xml
~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
~/workspace/spark-2.3.1/streaming/pom.xml
~/workspace/spark-2.3.1/tools/pom.xml
Reporter: Albert Baker


OWASP DC makes an outbound REST call to MITRE Common Vulnerabilities & 
Exposures (CVE) to perform a lookup for each dependent .jar and list any/all 
known vulnerabilities for each jar. This step is needed because a manual MITRE 
CVE lookup/check on the main component does not include checking for 
vulnerabilities in dependent libraries.

OWASP Dependency Check (https://www.owasp.org/index.php/OWASP_Dependency_Check) 
has plug-ins for most Java build/make types (ant, maven, ivy, gradle). Also, 
add the appropriate command to the nightly build to generate a report of all 
known vulnerabilities in any/all third party libraries/dependencies that get 
pulled in. For example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean 
aggregate

Generating this report nightly/weekly will help inform the project's 
development team if any dependent libraries have a reported known vulnerability. 
Project teams that keep up with removing vulnerabilities on a weekly basis will 
help protect businesses that rely on these open source components.






[jira] [Created] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default

2018-07-29 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-24963:
---

 Summary: Integration tests will fail if they run in a namespace 
not being the default
 Key: SPARK-24963
 URL: https://issues.apache.org/jira/browse/SPARK-24963
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Stavros Kontopoulos


Related discussion is here: 
[https://github.com/apache/spark/pull/21748#pullrequestreview-141048893]

If spark-rbac.yaml is used when tests are run locally, client mode tests will 
fail.






[jira] [Assigned] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24809:
---

Assignee: Lijia Liu

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Assignee: Lijia Liu
>Priority: Critical
>  Labels: correctness
> Fix For: 2.1.4, 2.2.3, 2.3.2, 2.4.0
>
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is a long or int in a broadcast join, Spark will use 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). If 
> the broadcast value is abnormally big, the executor will serialize it to disk, 
> but data is lost when serializing.






[jira] [Resolved] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24809.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.2
   2.2.3
   2.1.4

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
>  Labels: correctness
> Fix For: 2.1.4, 2.2.3, 2.3.2, 2.4.0
>
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is a long or int in a broadcast join, Spark will use 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). If 
> the broadcast value is abnormally big, the executor will serialize it to disk, 
> but data is lost when serializing.






[jira] [Assigned] (SPARK-24005) Remove usage of Scala’s parallel collection

2018-07-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24005:


Assignee: Apache Spark

> Remove usage of Scala’s parallel collection
> ---
>
> Key: SPARK-24005
> URL: https://issues.apache.org/jira/browse/SPARK-24005
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
>
> {noformat}
> val par = (1 to 100).par.flatMap { i =>
>   Thread.sleep(1000)
>   1 to 1000
> }.toSeq
> {noformat}
> We are unable to interrupt the execution of parallel collections. We need to 
> create a common utility function to do this instead of using Scala parallel 
> collections.
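
One possible shape for such a utility is a parallel map over an explicit, 
shutdown-capable thread pool instead of {{.par}}; a minimal sketch (illustrative 
only, not the actual Spark implementation; the {{parMap}} name and signature are 
assumptions):

{noformat}
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Map f over `in` in parallel on an explicit thread pool. Unlike Scala's
// parallel collections, the pool can be shut down, which interrupts the
// worker threads and lets callers cancel the computation.
def parMap[A, B](in: Seq[A], numThreads: Int)(f: A => B): Seq[B] = {
  val pool = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    Await.result(Future.sequence(in.map(a => Future(f(a)))), Duration.Inf)
  } finally {
    pool.shutdownNow()                       // interrupt any still-running tasks
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}

// Usage, mirroring the snippet above:
// val par = parMap(1 to 100, numThreads = 8) { i => Thread.sleep(1000); 1 to 1000 }.flatten
{noformat}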






[jira] [Assigned] (SPARK-24005) Remove usage of Scala’s parallel collection

2018-07-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24005:


Assignee: (was: Apache Spark)

> Remove usage of Scala’s parallel collection
> ---
>
> Key: SPARK-24005
> URL: https://issues.apache.org/jira/browse/SPARK-24005
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
> val par = (1 to 100).par.flatMap { i =>
>   Thread.sleep(1000)
>   1 to 1000
> }.toSeq
> {noformat}
> We are unable to interrupt the execution of parallel collections. We need to 
> create a common utility function to do this instead of using Scala parallel 
> collections.






[jira] [Commented] (SPARK-24005) Remove usage of Scala’s parallel collection

2018-07-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561221#comment-16561221
 ] 

Apache Spark commented on SPARK-24005:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21913

> Remove usage of Scala’s parallel collection
> ---
>
> Key: SPARK-24005
> URL: https://issues.apache.org/jira/browse/SPARK-24005
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
> val par = (1 to 100).par.flatMap { i =>
>   Thread.sleep(1000)
>   1 to 1000
> }.toSeq
> {noformat}
> We are unable to interrupt the execution of parallel collections. We need to 
> create a common utility function that supports interruption, instead of using 
> Scala parallel collections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24962) refactor CodeGenerator.createUnsafeArray

2018-07-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561211#comment-16561211
 ] 

Apache Spark commented on SPARK-24962:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21912

> refactor CodeGenerator.createUnsafeArray
> 
>
> Key: SPARK-24962
> URL: https://issues.apache.org/jira/browse/SPARK-24962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> {{CodeGenerator.createUnsafeArray()}} generates code for allocating 
> {{UnsafeArrayData}}. This method can be extended to generate code for 
> allocating either {{UnsafeArrayData}} or {{GenericArrayData}}.
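
For reference, the two array representations the generated code would choose between, shown at the Scala level rather than as generated Java. This is only a rough illustration of the runtime classes involved (internal catalyst types; the element values are arbitrary), not of the codegen change itself.

{noformat}
import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
import org.apache.spark.sql.catalyst.util.GenericArrayData
import org.apache.spark.unsafe.types.UTF8String

// UnsafeArrayData: compact binary layout, built here from a primitive array.
val unsafeArr = UnsafeArrayData.fromPrimitiveArray(Array(1, 2, 3))
assert(unsafeArr.getInt(0) == 1)

// GenericArrayData: object-based layout holding arbitrary internal values.
val genericArr = new GenericArrayData(Array[Any](UTF8String.fromString("a"), UTF8String.fromString("b")))
assert(genericArr.getUTF8String(0).toString == "a")
{noformat}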



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24962) refactor CodeGenerator.createUnsafeArray

2018-07-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24962:


Assignee: Apache Spark

> refactor CodeGenerator.createUnsafeArray
> 
>
> Key: SPARK-24962
> URL: https://issues.apache.org/jira/browse/SPARK-24962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Major
>
> {{CodeGenerator.createUnsafeArray()}} generates code for allocating 
> {{UnsafeArrayData}}. This method can be extended to generate code for 
> allocating either {{UnsafeArrayData}} or {{GenericArrayData}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24962) refactor CodeGenerator.createUnsafeArray

2018-07-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24962:


Assignee: (was: Apache Spark)

> refactor CodeGenerator.createUnsafeArray
> 
>
> Key: SPARK-24962
> URL: https://issues.apache.org/jira/browse/SPARK-24962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> {{CodeGenerator.createUnsafeArray()}} generates code for allocating 
> {{UnsafeArrayData}}. This method can be extended to generate code for 
> allocating either {{UnsafeArrayData}} or {{GenericArrayData}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24962) refactor CodeGenerator.createUnsafeArray

2018-07-29 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24962:


 Summary: refactor CodeGenerator.createUnsafeArray
 Key: SPARK-24962
 URL: https://issues.apache.org/jira/browse/SPARK-24962
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


{{CodeGenerator.createUnsafeArray()}} generates code for allocating 
{{UnsafeArrayData}}. This method can be extended to generate code for 
allocating either {{UnsafeArrayData}} or {{GenericArrayData}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24940) Coalesce Hint for SQL Queries

2018-07-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24940:


Assignee: Apache Spark

> Coalesce Hint for SQL Queries
> -
>
> Key: SPARK-24940
> URL: https://issues.apache.org/jira/browse/SPARK-24940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Major
>
> Many Spark SQL users in my company have asked for a way to control the number 
> of output files in Spark SQL. The users prefer not to use the functions 
> repartition\(n\) or coalesce(n, shuffle), which require them to write and 
> deploy Scala/Java/Python code.
>   
>  There are use cases for either reducing or increasing the number.
>   
>  The DataFrame API has had repartition/coalesce for a long time. However, 
> there is no equivalent functionality in SQL queries. We propose adding the 
> following Hive-style coalesce hints to Spark SQL.
> {noformat}
> /*+ COALESCE(n, shuffle) */
> /*+ REPARTITION(n) */
> {noformat}
> REPARTITION\(n\) is equivalent to COALESCE(n, shuffle=true).
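
To make the proposal concrete, this is how the hints would be used from SQL once supported, next to today's DataFrame workaround. The table name {{src}} and output paths are illustrative, and the hint syntax shown is the one proposed in this ticket, not necessarily the final form.

{noformat}
// Proposed SQL hint forms:
spark.sql("SELECT /*+ COALESCE(5) */ * FROM src").write.parquet("/tmp/out_coalesced")
spark.sql("SELECT /*+ REPARTITION(10) */ * FROM src").write.parquet("/tmp/out_repartitioned")

// Today the same control over output files requires the DataFrame API:
spark.table("src").coalesce(5).write.parquet("/tmp/out_coalesced")
spark.table("src").repartition(10).write.parquet("/tmp/out_repartitioned")
{noformat}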



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24940) Coalesce Hint for SQL Queries

2018-07-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24940:


Assignee: (was: Apache Spark)

> Coalesce Hint for SQL Queries
> -
>
> Key: SPARK-24940
> URL: https://issues.apache.org/jira/browse/SPARK-24940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: John Zhuge
>Priority: Major
>
> Many Spark SQL users in my company have asked for a way to control the number 
> of output files in Spark SQL. The users prefer not to use the functions 
> repartition\(n\) or coalesce(n, shuffle), which require them to write and 
> deploy Scala/Java/Python code.
>   
>  There are use cases for either reducing or increasing the number.
>   
>  The DataFrame API has had repartition/coalesce for a long time. However, 
> there is no equivalent functionality in SQL queries. We propose adding the 
> following Hive-style coalesce hints to Spark SQL.
> {noformat}
> /*+ COALESCE(n, shuffle) */
> /*+ REPARTITION(n) */
> {noformat}
> REPARTITION\(n\) is equivalent to COALESCE(n, shuffle=true).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24940) Coalesce Hint for SQL Queries

2018-07-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561175#comment-16561175
 ] 

Apache Spark commented on SPARK-24940:
--

User 'jzhuge' has created a pull request for this issue:
https://github.com/apache/spark/pull/21911

> Coalesce Hint for SQL Queries
> -
>
> Key: SPARK-24940
> URL: https://issues.apache.org/jira/browse/SPARK-24940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: John Zhuge
>Priority: Major
>
> Many Spark SQL users in my company have asked for a way to control the number 
> of output files in Spark SQL. The users prefer not to use the functions 
> repartition\(n\) or coalesce(n, shuffle), which require them to write and 
> deploy Scala/Java/Python code.
>   
>  There are use cases for either reducing or increasing the number.
>   
>  The DataFrame API has had repartition/coalesce for a long time. However, 
> there is no equivalent functionality in SQL queries. We propose adding the 
> following Hive-style coalesce hints to Spark SQL.
> {noformat}
> /*+ COALESCE(n, shuffle) */
> /*+ REPARTITION(n) */
> {noformat}
> REPARTITION\(n\) is equivalent to COALESCE(n, shuffle=true).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API

2018-07-29 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561145#comment-16561145
 ] 

Ryan Blue commented on SPARK-24882:
---

[~cloud_fan], thanks for making those changes. I'll have a look at the updated 
doc.

For scan configuration, I think this builder pattern would work. The builder's 
super-class would be provided by Spark, so the methods for pushing always 
exist. Similarly, the ScanConfig interface would be provided with default 
implementations, so Spark can always get the scan configuration. When a source 
supports push-down, it would override {{pushPredicates}} and report the 
predicates that were pushed via the ScanConfig ({{pushedPredicates}}). Then 
Spark can remove those pushed predicates from the query plan.

If the source doesn't support push-down, then it needs to implement nothing at 
all: the default {{pushPredicates}} implementation on the builder is a no-op, 
and the default {{pushedPredicates}} implementation returns {{new 
Expression[0]}} to indicate that nothing was pushed. The feedback that Spark 
needs comes from the final ScanConfig, so there is no need for instanceof 
checks against capability interfaces. Spark's code always makes the push-down 
calls, and a source implementation can simply ignore them.
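
A rough Scala sketch of the shape described above, with defaults so that a source without push-down support implements nothing. The trait and class names are illustrative stand-ins, not the actual data source v2 API, and catalyst's Expression is used as a placeholder predicate type.

{noformat}
import org.apache.spark.sql.catalyst.expressions.Expression

// Spark-provided config interface: by default nothing was pushed.
trait ScanConfig {
  def pushedPredicates: Array[Expression] = Array.empty
}

// Spark-provided builder super-class: push-down is a no-op unless overridden.
abstract class ScanConfigBuilder {
  def pushPredicates(predicates: Array[Expression]): Unit = ()
  def build(): ScanConfig
}

// A source with no push-down support only has to produce a ScanConfig.
class SimpleScanConfigBuilder extends ScanConfigBuilder {
  override def build(): ScanConfig = new ScanConfig {}
}

// A source that supports push-down overrides both the push call and the report.
class FilteringScanConfigBuilder extends ScanConfigBuilder {
  private var pushed: Array[Expression] = Array.empty
  override def pushPredicates(predicates: Array[Expression]): Unit = {
    pushed = predicates                  // pretend everything can be pushed
  }
  override def build(): ScanConfig = new ScanConfig {
    override def pushedPredicates: Array[Expression] = pushed
  }
}
{noformat}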

> separate responsibilities of the data source v2 read API
> 
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 is out for a while, see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. Details please see the attached google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4

2018-07-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24956.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.4.0

Resolved by https://github.com/apache/spark/pull/21905

> Upgrade maven from 3.3.9 to 3.5.4
> -
>
> Key: SPARK-24956
> URL: https://issues.apache.org/jira/browse/SPARK-24956
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
>
> Maven 3.3.9 is pretty old. It would be good to upgrade to the latest release.
> As suggested in SPARK-24895, the current Maven version runs into problems 
> with some plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4

2018-07-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24956:
--
Priority: Minor  (was: Major)

> Upgrade maven from 3.3.9 to 3.5.4
> -
>
> Key: SPARK-24956
> URL: https://issues.apache.org/jira/browse/SPARK-24956
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Maven 3.3.9 is pretty old. It would be good to upgrade to the latest release.
> As suggested in SPARK-24895, the current Maven version runs into problems 
> with some plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561080#comment-16561080
 ] 

Apache Spark commented on SPARK-24957:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21910

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a DataFrame containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the correct result 
> multiplied by {{10^14}}).
> When codegen is disabled, all results are correct.
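
When chasing this kind of discrepancy, one way to compare against the interpreted path is to rerun the query with whole-stage codegen switched off. This only isolates the codegen path; as noted later in the thread, the bad result may reproduce either way.

{noformat}
// Rerun the aggregation with whole-stage codegen disabled and compare results.
spark.conf.set("spark.sql.codegen.wholeStage", false)
df_grouped_2.collect()

// Restore the default afterwards.
spark.conf.set("spark.sql.codegen.wholeStage", true)
{noformat}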



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24957:


Assignee: Apache Spark

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Assignee: Apache Spark
>Priority: Major
>
> I noticed a bug when doing arithmetic on a DataFrame containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the correct result 
> multiplied by {{10^14}}).
> When codegen is disabled, all results are correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24957:


Assignee: (was: Apache Spark)

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a DataFrame containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the correct result 
> multiplied by {{10^14}}).
> When codegen is disabled, all results are correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561077#comment-16561077
 ] 

Marco Gaido commented on SPARK-24957:
-

I am not sure what you mean by "When codegen is disabled all results are 
correct": I checked and was able to reproduce the issue both with codegen 
enabled and with codegen disabled.

cc [~jerryshao] this doesn't seem like a regression to me, but it is a pretty 
serious bug; I am not sure whether we should include the fix in the next 2.3 
release.
cc [~smilegator] [~cloud_fan] I think we should consider this a blocker for 
2.4. What do you think? Thanks.

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a DataFrame containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the correct result 
> multiplied by {{10^14}}).
> When codegen is disabled, all results are correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org