[jira] [Created] (SPARK-24181) Better error message for writing sorted data

2018-05-04 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24181:
---

 Summary: Better error message for writing sorted data
 Key: SPARK-24181
 URL: https://issues.apache.org/jira/browse/SPARK-24181
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: DB Tsai


This PR is related to [SPARK-15718].

When a user tries to write sorted data using {{save}} or {{insertInto}}, Spark 
throws an exception with the message {{s"'$operation' does not support 
bucketing right now"}}. It should instead throw {{s"'$operation' does not 
support sorting right now"}}.
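
Below is a minimal standalone sketch of the check this change implies; the helper name, 
parameters, and exception type are placeholders for illustration, not the PR's actual code.

{code:scala}
// Illustrative only: report the feature that is actually unsupported for this write path.
def assertWriteSupported(
    operation: String,
    bucketColumnNames: Option[Seq[String]],
    sortColumnNames: Option[Seq[String]]): Unit = {
  if (bucketColumnNames.isDefined) {
    throw new UnsupportedOperationException(
      s"'$operation' does not support bucketing right now")
  }
  if (sortColumnNames.isDefined) {
    // Previously a sorted write surfaced the bucketing message above instead.
    throw new UnsupportedOperationException(
      s"'$operation' does not support sorting right now")
  }
}
{code}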

 






[jira] [Assigned] (SPARK-24181) Better error message for writing sorted data

2018-05-04 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24181:
---

Assignee: DB Tsai

> Better error message for writing sorted data
> 
>
> Key: SPARK-24181
> URL: https://issues.apache.org/jira/browse/SPARK-24181
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> This PR is related to [SPARK-15718].
> When a user tries to write sorted data using {{save}} or {{insertInto}}, Spark 
> throws an exception with the message {{s"'$operation' does not support 
> bucketing right now"}}. It should instead throw {{s"'$operation' does not 
> support sorting right now"}}.
>  






[jira] [Updated] (SPARK-23775) Flaky test: DataFrameRangeSuite

2018-05-04 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23775:
--
Description: 
DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes 
gets stuck in an infinite loop and times out the build.

I presume the original intention of this test is to start a job with range and 
just cancel it.
The submitted job has 2 stages, but I think the author tried to cancel the 
first stage, with ID 0, which is not the case here:

{code:java}
eventually(timeout(10.seconds), interval(1.millis)) {
  assert(DataFrameRangeSuite.stageToKill > 0)
}
{code}

All in all, if the first stage is slower than 10 seconds, the test throws 
TestFailedDueToTimeoutException and cancelStage is never called.


- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4780/

  was:
DataFrameRangeSuite.test("Cancelling stage in a query with Range.") stays 
sometimes in an infinite loop and times out the build.

I presume the original intention of this test is to start a job with range and 
just cancel it.
The submitted job has 2 stages but I think the author tried to cancel the first 
stage with ID 0 which is not the case here:

{code:java}
eventually(timeout(10.seconds), interval(1.millis)) {
  assert(DataFrameRangeSuite.stageToKill > 0)
}
{code}

All in all if the first stage is slower than 10 seconds it throws 
TestFailedDueToTimeoutException and cancelStage will be never ever called.



> Flaky test: DataFrameRangeSuite
> ---
>
> Key: SPARK-23775
> URL: https://issues.apache.org/jira/browse/SPARK-23775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
> Attachments: filtered.log, filtered_more_logs.log
>
>
> DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes 
> gets stuck in an infinite loop and times out the build.
> I presume the original intention of this test is to start a job with range 
> and just cancel it.
> The submitted job has 2 stages, but I think the author tried to cancel the 
> first stage, with ID 0, which is not the case here:
> {code:java}
> eventually(timeout(10.seconds), interval(1.millis)) {
>   assert(DataFrameRangeSuite.stageToKill > 0)
> }
> {code}
> All in all, if the first stage is slower than 10 seconds, the test throws 
> TestFailedDueToTimeoutException and cancelStage is never called.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4780/
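
For the flakiness described above, one possible direction (a sketch only, not the committed 
fix) is to wait for whichever stage is actually submitted rather than assuming its ID is 
greater than 0. This assumes the suite's SparkContext {{sc}} and the ScalaTest 
{{eventually}}/{{timeout}}/{{interval}} helpers the existing test already imports.

{code:scala}
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}

// Record the first stage the scheduler actually submits, whatever its ID turns out to be.
val submittedStage = new AtomicInteger(-1)
sc.addSparkListener(new SparkListener {
  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
    submittedStage.compareAndSet(-1, event.stageInfo.stageId)
  }
})

eventually(timeout(60.seconds), interval(1.millis)) {
  assert(submittedStage.get() >= 0)
}
sc.cancelStage(submittedStage.get())
{code}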






[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-04 Thread Albert Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464153#comment-16464153
 ] 

Albert Chan commented on SPARK-23780:
-

Thanks Ivan.  I'll give it a try.

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> The expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.






[jira] [Updated] (SPARK-24179) History Server for Kubernetes

2018-05-04 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-24179:
---
Issue Type: New Feature  (was: Task)

> History Server for Kubernetes
> -
>
> Key: SPARK-24179
> URL: https://issues.apache.org/jira/browse/SPARK-24179
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Eric Charles
>Priority: Major
>
> The History server is missing when running on Kubernetes, with the side 
> effect we can not debug post-mortem or analyze after-the-fact.






[jira] [Commented] (SPARK-23935) High-order function: map_entries(map<K, V>) → array<row<K,V>>

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464262#comment-16464262
 ] 

Apache Spark commented on SPARK-23935:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/21236

> High-order function: map_entries(map<K, V>) → array<row<K,V>>
> -
>
> Key: SPARK-23935
> URL: https://issues.apache.org/jira/browse/SPARK-23935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns an array of all entries in the given map.
> {noformat}
> SELECT map_entries(MAP(ARRAY[1, 2], ARRAY['x', 'y'])); -- [ROW(1, 'x'), 
> ROW(2, 'y')]
> {noformat}
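
Assuming the eventual Spark implementation mirrors the Presto semantics quoted above, usage 
from Scala would look roughly like the following (hypothetical until the linked work lands, 
and assuming an active {{spark}} session; the exact output rendering may differ):

{code:scala}
// map(1, 'x', 2, 'y') builds a map literal; map_entries should return an array of structs.
spark.sql("SELECT map_entries(map(1, 'x', 2, 'y'))").show(false)
// expected along the lines of: [[1, x], [2, y]]
{code}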






[jira] [Assigned] (SPARK-23935) High-order function: map_entries(map<K, V>) → array<row<K,V>>

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23935:


Assignee: Apache Spark

> High-order function: map_entries(map<K, V>) → array<row<K,V>>
> -
>
> Key: SPARK-23935
> URL: https://issues.apache.org/jira/browse/SPARK-23935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns an array of all entries in the given map.
> {noformat}
> SELECT map_entries(MAP(ARRAY[1, 2], ARRAY['x', 'y'])); -- [ROW(1, 'x'), 
> ROW(2, 'y')]
> {noformat}






[jira] [Assigned] (SPARK-23935) High-order function: map_entries(map<K, V>) → array<row<K,V>>

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23935:


Assignee: (was: Apache Spark)

> High-order function: map_entries(map<K, V>) → array<row<K,V>>
> -
>
> Key: SPARK-23935
> URL: https://issues.apache.org/jira/browse/SPARK-23935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns an array of all entries in the given map.
> {noformat}
> SELECT map_entries(MAP(ARRAY[1, 2], ARRAY['x', 'y'])); -- [ROW(1, 'x'), 
> ROW(2, 'y')]
> {noformat}






[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>

2018-05-04 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464245#comment-16464245
 ] 

Bruce Robbins commented on SPARK-23936:
---

[~ueshin]

I have a question about map_concat's behavior as it pertains to this part of 
the function description: "If a key is found in multiple given maps, that key’s 
value in the resulting map comes from the last one of those maps."

Spark maps can have duplicate keys, e.g.:
{noformat}
scala> val df = sql("select map('a', 1, 'a', 2, 'b', 3, 'c', 10) as map1, 
map('a', 7, 'b', 8, 'b', 9) as map2")
scala> df.show(truncate=false)
+---------------------------------+------------------------+
|map1                             |map2                    |
+---------------------------------+------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10]|[a -> 7, b -> 8, b -> 9]|
+---------------------------------+------------------------+
{noformat}
I'm not sure the duplicate handling part of the description makes sense for 
maps that allow duplicate keys.

I can think of three ways of handling the duplicate-key requirement:

Scheme #1: Ignore it. map_concat would be a pure concatenation. Using the 
above example maps:
{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+---------------------------------------------------------+
|map_concat(map1, map2)                                   |
+---------------------------------------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9]|
+---------------------------------------------------------+
{noformat}
Duplicate keys are preserved from the original maps, and, in this example, 
additional duplicates are introduced.

Scheme #2: Preserve duplicates within input maps, but still pick a winner 
across maps. That is, treat the maps like so:
{noformat}
map1:
a -> [1, 2]
b -> [3]
c -> [10]

map2:
a -> [7]
b -> [8, 9]
{noformat}
Then use the rule that the key's value comes from the last map in which the key 
appears:
{noformat}
resulting map
a -> [7]// from map2
b -> [8, 9] // from map2
c -> [10]   // from map1
{noformat}
In Spark, it would look like this:
{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+---------------------------------+
|map_concat(map1, map2)           |
+---------------------------------+
|[a -> 7, b -> 8, b -> 9, c -> 10]|
+---------------------------------+
{noformat}
Scheme #3: Don't allow any duplicates in the resulting map. That is, treat the 
input maps collectively as a stream of tuples, and keep only the last value for 
_any_ key:
{noformat}
a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9
        a -> 2 overwrites a -> 1
        a -> 7 overwrites a -> 2
        b -> 8 overwrites b -> 3
        b -> 9 overwrites b -> 8

scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+-------------------------+
|map_concat(map1, map2)   |
+-------------------------+
|[a -> 7, b -> 9, c -> 10]|
+-------------------------+
{noformat}
Note: This is what I've actually implemented in my PR. It made sense to me due 
to the requirement that we pick a winner across maps. But I wasn't aware then 
that the source maps could have duplicates.

As a wrinkle to this, spark-sql, for some reason, eliminates duplicates in maps 
on display:
{noformat}
spark-sql> select map1, map2 from mapsWithDupKeys;
{"a":2,"b":3,"c":10}{"a":7,"b":9}
Time taken: 0.147 seconds, Fetched 1 row(s)
spark-sql> select map_keys(map1) from mapsWithDupKeys;
["a","a","b","c"]
Time taken: 0.093 seconds, Fetched 1 row(s)
{noformat}
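
A compact way to see Scheme #3's last-writer-wins behavior outside of Spark is the 
plain-Scala sketch below (a hypothetical helper for illustration, not the PR's implementation):

{code:scala}
// Flatten all input maps into one key/value stream, in order; later pairs overwrite earlier ones.
def concatLastWins[K, V](maps: Seq[Seq[(K, V)]]): Seq[(K, V)] = {
  val result = scala.collection.mutable.LinkedHashMap.empty[K, V]
  for (m <- maps; (k, v) <- m) result(k) = v
  result.toSeq
}

concatLastWins(Seq(
  Seq("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 10),
  Seq("a" -> 7, "b" -> 8, "b" -> 9)))
// List((a,7), (b,9), (c,10)) -- the same result as the Scheme #3 example above
{code}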

> High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → 
> map<K,V>
> ---
>
> Key: SPARK-23936
> URL: https://issues.apache.org/jira/browse/SPARK-23936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Returns the union of all the given maps. If a key is found in multiple given 
> maps, that key’s value in the resulting map comes from the last one of those 
> maps.






[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-05-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464141#comment-16464141
 ] 

Marcelo Vanzin commented on SPARK-23020:


If that still fails somewhere it means there is still a bug somewhere. I don't 
think disabling the test is the right thing unless it's actually common enough 
that it's causing problems. Lots of our tests are flaky.

That's like the only failure recently, BTW.
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/lastCompletedBuild/testReport/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/






[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-05-04 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464088#comment-16464088
 ] 

Dongjoon Hyun commented on SPARK-23020:
---

Hi, All.

This seems to fail again in branch 2.3. Can we disable this in branch-2.3 for 
Apache Spark 2.3.1 at least?
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/lastCompletedBuild/testReport/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/






[jira] [Assigned] (SPARK-24181) Better error message for writing sorted data

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24181:


Assignee: Apache Spark  (was: DB Tsai)

> Better error message for writing sorted data
> 
>
> Key: SPARK-24181
> URL: https://issues.apache.org/jira/browse/SPARK-24181
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>
> This PR is related to [SPARK-15718].
> When a user tries to write sorted data using {{save}} or {{insertInto}}, Spark 
> throws an exception with the message {{s"'$operation' does not support 
> bucketing right now"}}. It should instead throw {{s"'$operation' does not 
> support sorting right now"}}.
>  






[jira] [Assigned] (SPARK-24181) Better error message for writing sorted data

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24181:


Assignee: DB Tsai  (was: Apache Spark)

> Better error message for writing sorted data
> 
>
> Key: SPARK-24181
> URL: https://issues.apache.org/jira/browse/SPARK-24181
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> This PR is related to [SPARK-15718].
> When a user tries to write sorted data using {{save}} or {{insertInto}}, Spark 
> throws an exception with the message {{s"'$operation' does not support 
> bucketing right now"}}. It should instead throw {{s"'$operation' does not 
> support sorting right now"}}.
>  






[jira] [Commented] (SPARK-24181) Better error message for writing sorted data

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464231#comment-16464231
 ] 

Apache Spark commented on SPARK-24181:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21235

> Better error message for writing sorted data
> 
>
> Key: SPARK-24181
> URL: https://issues.apache.org/jira/browse/SPARK-24181
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> This PR is related to [SPARK-15718].
> When a user tries to write sorted data using {{save}} or {{insertInto}}, Spark 
> throws an exception with the message {{s"'$operation' does not support 
> bucketing right now"}}. It should instead throw {{s"'$operation' does not 
> support sorting right now"}}.
>  






[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2018-05-04 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464553#comment-16464553
 ] 

William Shen commented on SPARK-5928:
-

[~UZiVcbfPXaNrMtT], if you increase the parallelism (use more partitions) over 
your massive data set, would that reduce the size of each partition enough to 
work around this issue?
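
As a rough illustration of that workaround (not a fix for the 2 GB limit itself), the 
reproduction from the issue can be reshaped to spread the shuffle over many keys and 
partitions; the partition counts below are arbitrary:

{code:scala}
// More map partitions, many keys, and more reduce partitions -> smaller shuffle blocks each.
val rdd = sc.parallelize(1 to 1e6.toInt, 200).map { _ =>
  val arr = new Array[Byte](3e3.toInt)
  scala.util.Random.nextBytes(arr)
  arr
}
// Unlike the repro quoted below, which deliberately funnels everything to a single key.
rdd.map(x => (scala.util.Random.nextInt(1000), x)).groupByKey(1000).count()
{code}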

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>Priority: Major
> Attachments: image-2018-03-29-11-52-32-075.png
>
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode; it only happens on 
> remote fetches. I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> 

[jira] [Commented] (SPARK-15384) Codegen CompileException "mapelements_isNull" is not an rvalue

2018-05-04 Thread howie yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464581#comment-16464581
 ] 

howie yu commented on SPARK-15384:
--

Hi, I have a similar error on Spark 2.3.0:

[https://stackoverflow.com/questions/50185228/spark-left-outer-join-cause-codegenerator-error]

 

== Physical Plan ==
*(2) Project [siteId#3, rid#2, impressionCount#0, impressionRate#1, 
clickCount#8]
+- *(2) BroadcastHashJoin [siteId#3, rid#2], [siteId#10, rid#9], LeftOuter, 
BuildRight
   :- *(2) FileScan json [impressionCount#0,impressionRate#1,rid#2,siteId#3] 
Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/tmp/test1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[2, string, 
true], input[1, string, true]))
  +- *(1) FileScan json [clickCount#8,rid#9,siteId#10] Batched: false, 
Format: JSON, Location: InMemoryFileIndex[file:/tmp/test2], PartitionFilters: 
[], PushedFilters: [], ReadSchema: 
struct
[2018-05-05 10:29:36,772][ERROR] CodeGenerator    : failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
118, Column 16: Expression "scan_isNull" is not an rvalue
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
118, Column 16: Expression "scan_isNull" is not an rvalue
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
    at 
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
    at 
org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
    at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
    at 
org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
    at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
    at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
    at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
    at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
    at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
    at 
org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
    at 
org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
    at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1532)
    at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:212)
    at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1472)
    at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1466)
    at org.codehaus.janino.Java$Block.accept(Java.java:2756)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2444)
    at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
    at 
org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
    at 
org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
    at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1532)
    at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:212)
    at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1472)
    at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1466)
    at org.codehaus.janino.Java$Block.accept(Java.java:2756)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1821)
    at org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:212)
    at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1477)
    at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1466)
    at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3031)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
    at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
    at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
    

[jira] [Created] (SPARK-24189) Spark Structured Streaming not working with Kafka Transactions

2018-05-04 Thread bharath kumar avusherla (JIRA)
bharath kumar avusherla created SPARK-24189:
---

 Summary: Spark Structured Streaming not working with Kafka 
Transactions
 Key: SPARK-24189
 URL: https://issues.apache.org/jira/browse/SPARK-24189
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: bharath kumar avusherla


I was trying to read a Kafka transactional topic using Spark Structured Streaming 
2.3.0 with the Kafka option isolation-level = "read_committed", but Spark reads 
the data immediately without waiting for the data in the topic to be committed. 
The Spark documentation mentions that Structured Streaming supports Kafka version 
0.10 or higher. I am using the command below to test the scenario:

{code}
val df = spark
 .readStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "localhost:9092")
 .option("subscribe", "test-topic")
 .option("isolation-level","read_committed")
 .load()
{code}

Can you please let me know whether transactional reads are supported in Spark 
2.3.0 Structured Streaming, or am I missing something?

 

Thank you.
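
One thing worth checking (an assumption on my part, not a confirmation that Spark 2.3.0 
honors transactional reads): Kafka consumer properties are normally passed to the source 
with a kafka. prefix, so the isolation level would be spelled {{kafka.isolation.level}} 
rather than {{isolation-level}}:

{code:scala}
// Hypothetical variant of the snippet above; "kafka."-prefixed options are forwarded to the
// underlying Kafka consumer. Whether read_committed takes effect still depends on the Kafka
// client version Spark ships with.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-topic")
  .option("kafka.isolation.level", "read_committed")
  .load()
{code}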






[jira] [Commented] (SPARK-24186) add array reverse and concat

2018-05-04 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464508#comment-16464508
 ] 

Huaxin Gao commented on SPARK-24186:


I will work on this. Thanks!

> add array reverse and concat 
> -
>
> Key: SPARK-24186
> URL: https://issues.apache.org/jira/browse/SPARK-24186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and 
> https://issues.apache.org/jira/browse/SPARK-23926
>  






[jira] [Created] (SPARK-24187) add array join

2018-05-04 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-24187:
--

 Summary: add array join
 Key: SPARK-24187
 URL: https://issues.apache.org/jira/browse/SPARK-24187
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Huaxin Gao


add R version of https://issues.apache.org/jira/browse/SPARK-23916






[jira] [Commented] (SPARK-24188) /api/v1/version not working

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464536#comment-16464536
 ] 

Apache Spark commented on SPARK-24188:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21245

> /api/v1/version not working
> ---
>
> Key: SPARK-24188
> URL: https://issues.apache.org/jira/browse/SPARK-24188
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> That URI from the REST API is currently returning a 404.






[jira] [Assigned] (SPARK-24188) /api/v1/version not working

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24188:


Assignee: (was: Apache Spark)

> /api/v1/version not working
> ---
>
> Key: SPARK-24188
> URL: https://issues.apache.org/jira/browse/SPARK-24188
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> That URI from the REST API is currently returning a 404.






[jira] [Assigned] (SPARK-24188) /api/v1/version not working

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24188:


Assignee: Apache Spark

> /api/v1/version not working
> ---
>
> Key: SPARK-24188
> URL: https://issues.apache.org/jira/browse/SPARK-24188
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> That URI from the REST API is currently returning a 404.






[jira] [Commented] (SPARK-24182) Improve error message for client mode when AM fails

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464490#comment-16464490
 ] 

Apache Spark commented on SPARK-24182:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21243

> Improve error message for client mode when AM fails
> ---
>
> Key: SPARK-24182
> URL: https://issues.apache.org/jira/browse/SPARK-24182
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Today, when the client AM fails, there's not a lot of useful information 
> printed on the output. Depending on the type of failure, the information 
> provided by the YARN AM is also not very useful. For example, you'd see this 
> in the Spark shell:
> {noformat}
> 18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
> at org.apache.spark.SparkContext.(SparkContext.scala:500)
>  [long stack trace]
> {noformat}
> Similarly, on the YARN RM, for certain failures you see a generic error like 
> this:
> {noformat}
> ExitCodeException exitCode=10: at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at 
> org.apache.hadoop.util.Shell.run(Shell.java:460) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366)
>  at 
> [blah blah blah]
> {noformat}
> It would be nice if we could provide a more accurate description of what went 
> wrong when possible.






[jira] [Assigned] (SPARK-24182) Improve error message for client mode when AM fails

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24182:


Assignee: Apache Spark

> Improve error message for client mode when AM fails
> ---
>
> Key: SPARK-24182
> URL: https://issues.apache.org/jira/browse/SPARK-24182
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Today, when the client AM fails, there's not a lot of useful information 
> printed on the output. Depending on the type of failure, the information 
> provided by the YARN AM is also not very useful. For example, you'd see this 
> in the Spark shell:
> {noformat}
> 18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
> at org.apache.spark.SparkContext.(SparkContext.scala:500)
>  [long stack trace]
> {noformat}
> Similarly, on the YARN RM, for certain failures you see a generic error like 
> this:
> {noformat}
> ExitCodeException exitCode=10: at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at 
> org.apache.hadoop.util.Shell.run(Shell.java:460) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366)
>  at 
> [blah blah blah]
> {noformat}
> It would be nice if we could provide a more accurate description of what went 
> wrong when possible.






[jira] [Assigned] (SPARK-24182) Improve error message for client mode when AM fails

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24182:


Assignee: (was: Apache Spark)

> Improve error message for client mode when AM fails
> ---
>
> Key: SPARK-24182
> URL: https://issues.apache.org/jira/browse/SPARK-24182
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Today, when the client AM fails, there's not a lot of useful information 
> printed on the output. Depending on the type of failure, the information 
> provided by the YARN AM is also not very useful. For example, you'd see this 
> in the Spark shell:
> {noformat}
> 18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
> at org.apache.spark.SparkContext.(SparkContext.scala:500)
>  [long stack trace]
> {noformat}
> Similarly, on the YARN RM, for certain failures you see a generic error like 
> this:
> {noformat}
> ExitCodeException exitCode=10: at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at 
> org.apache.hadoop.util.Shell.run(Shell.java:460) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366)
>  at 
> [blah blah blah]
> {noformat}
> It would be nice if we could provide a more accurate description of what went 
> wrong when possible.






[jira] [Resolved] (SPARK-24157) Enable no-data micro batches for streaming aggregation and deduplication

2018-05-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24157.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21220
[https://github.com/apache/spark/pull/21220]

> Enable no-data micro batches for streaming aggregation and deduplication
> 
>
> Key: SPARK-24157
> URL: https://issues.apache.org/jira/browse/SPARK-24157
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-24187) add array join

2018-05-04 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464511#comment-16464511
 ] 

Huaxin Gao commented on SPARK-24187:


I will work on this. Thanks!

> add array join
> --
>
> Key: SPARK-24187
> URL: https://issues.apache.org/jira/browse/SPARK-24187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> add R version of https://issues.apache.org/jira/browse/SPARK-23916






[jira] [Resolved] (SPARK-7924) Consolidate example code in MLlib

2018-05-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7924.
--
Resolution: Done

> Consolidate example code in MLlib
> -
>
> Key: SPARK-7924
> URL: https://issues.apache.org/jira/browse/SPARK-7924
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> This JIRA is an umbrella for consolidating example code in MLlib, now that we 
> are able to insert code snippets from examples into the user guide.  This 
> will contain tasks not already handled by [SPARK-11337].
> Goal: Have all example code in the {{examples/}} folder, and insert code 
> snippets for examples into the user guide.  This will make the example code 
> easily testable and reduce duplication.
> We will have 1 subtask per example.  If you would like to help, please either 
> create a subtask or comment below asking us to create a subtask for you.
> For an example to follow, look at:
> * 
> [https://github.com/apache/spark/blob/0171b71e9511cef512e96a759e407207037f3c49/examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala]
> * TF-IDF example in 
> [https://raw.githubusercontent.com/apache/spark/0171b71e9511cef512e96a759e407207037f3c49/docs/ml-features.md]






[jira] [Assigned] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2018-05-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-18924:
-

Assignee: (was: Xiangrui Meng)

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(100)
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe in R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) might have been discussed before, because R packages like rJava existed before 
> SparkR. I'm not sure whether we have a license issue in depending on those 
> libraries. Another option is to switch to the low-level R/C interface or Rcpp, 
> which again might have license issues. I'm not an expert here. If we have to 
> implement our own, there is still much room for improvement, as discussed 
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to local and then constructs columns. However,
> * it ignores column types and results in boxing/unboxing overhead
> * it collects all objects to the driver and results in high GC pressure
> A relatively simple change is to implement specialized column builder based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (at the order of GB/s):
> {code}
> > x <- runif(1000) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > > system.time(y <- readBin(r, double(), 1000))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain 20x or more performance gain.
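
To make the proposed column layout concrete, here is a minimal JVM-side sketch for a 
nullable double column (a hypothetical class for illustration, not the SparkR SerDe code):

{code:scala}
// size = total rows; nullIndexes = row positions that are null/NA;
// notNullValues = the non-null doubles, in row order.
final class DoubleColumn(size: Int, nullIndexes: Array[Int], notNullValues: Array[Double]) {
  /** Expand back into a boxed array with nulls at the recorded positions. */
  def toBoxed: Array[java.lang.Double] = {
    val out = new Array[java.lang.Double](size)
    val nulls = nullIndexes.toSet
    var src = 0
    var i = 0
    while (i < size) {
      if (!nulls.contains(i)) {
        out(i) = notNullValues(src)
        src += 1
      }
      i += 1
    }
    out
  }
}
{code}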






[jira] [Resolved] (SPARK-10383) Sync example code between API doc and user guide

2018-05-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10383.
---
  Resolution: Won't Do
Target Version/s:   (was: )

> Sync example code between API doc and user guide
> 
>
> Key: SPARK-10383
> URL: https://issues.apache.org/jira/browse/SPARK-10383
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> It would be nice to provide example code in both user guide and API docs. 
> However, it would become hard to keep the content in-sync. This JIRA is to 
> collect approaches/processes to make it feasible.
> This is related to SPARK-10382, where we discuss how to move example code 
> from user guide markdown to `spark/examples/`. After that, we can look for 
> solutions that can pick up example code from `spark/examples` and make them 
> available in the API doc. Though I don't know any feasible solution right 
> now, those are some relevant projects:
> * https://github.com/tkawachi/sbt-doctest
> * http://www.doctester.org/
> It would be nice to hear more ideas.






[jira] [Created] (SPARK-24185) add flatten function

2018-05-04 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-24185:
--

 Summary: add  flatten function
 Key: SPARK-24185
 URL: https://issues.apache.org/jira/browse/SPARK-24185
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Huaxin Gao


Add R versions of https://issues.apache.org/jira/browse/SPARK-23821






[jira] [Commented] (SPARK-24185) add flatten function

2018-05-04 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464506#comment-16464506
 ] 

Huaxin Gao commented on SPARK-24185:


I will submit a PR soon.

> add  flatten function
> -
>
> Key: SPARK-24185
> URL: https://issues.apache.org/jira/browse/SPARK-24185
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of https://issues.apache.org/jira/browse/SPARK-23821






[jira] [Resolved] (SPARK-12285) MLlib user guide: umbrella for missing sections

2018-05-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-12285.
---
Resolution: Done

> MLlib user guide: umbrella for missing sections
> ---
>
> Key: SPARK-12285
> URL: https://issues.apache.org/jira/browse/SPARK-12285
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for updating the MLlib user/programming guide for new 
> APIs.






[jira] [Resolved] (SPARK-5874) How to improve the current ML pipeline API?

2018-05-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5874.
--
Resolution: Done

> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedbacks about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#






[jira] [Created] (SPARK-24184) Allow escape comma in spark files

2018-05-04 Thread holdenk (JIRA)
holdenk created SPARK-24184:
---

 Summary: Allow escape comma in spark files
 Key: SPARK-24184
 URL: https://issues.apache.org/jira/browse/SPARK-24184
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Mesos, Spark Core, YARN
Affects Versions: 2.4.0
Reporter: holdenk


I'm not 100% sure we want to do this, but I was thinking it might make sense 
to unify our parsing of input file lists, and to do so in a way that allows 
files with ","s in their names. I don't think it's urgent (unless someone has a 
library which needs a file with that name in it, which I imagine is unlikely), 
but it seems reasonable (maybe in the 3.0 timeframe, since it could change the 
meaning of currently configured strings).






[jira] [Created] (SPARK-24186) add array reverse and concat

2018-05-04 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-24186:
--

 Summary: add array reverse and concat 
 Key: SPARK-24186
 URL: https://issues.apache.org/jira/browse/SPARK-24186
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Huaxin Gao


Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and 
https://issues.apache.org/jira/browse/SPARK-23926

 






[jira] [Created] (SPARK-24188) /api/v1/version not working

2018-05-04 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24188:
--

 Summary: /api/v1/version not working
 Key: SPARK-24188
 URL: https://issues.apache.org/jira/browse/SPARK-24188
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin


That URI from the REST API is currently returning a 404.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24124) Spark history server should create spark.history.store.path and set permissions properly

2018-05-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24124.

   Resolution: Fixed
 Assignee: Thomas Graves
Fix Version/s: 2.4.0

> Spark history server should create spark.history.store.path and set 
> permissions properly
> 
>
> Key: SPARK-24124
> URL: https://issues.apache.org/jira/browse/SPARK-24124
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, with the new Spark history server, you can set 
> spark.history.store.path to a location in which to store the levelDB files. The 
> directory has to be created before the history server can use that path.
> We should just have the history server create it and set the file permissions 
> on the levelDB files to be restrictive -> new FsPermission((short) 0700)
> The shuffle service already does this; it would be much more convenient to 
> use and would prevent people from making mistakes with the permissions on the 
> directory and files.
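
Purely as an illustration of the behavior described above (the helper name is hypothetical; this is not the actual history-server code), creating the store path with owner-only permissions could look like:

{code:java}
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFilePermissions

// Hypothetical sketch: create spark.history.store.path if missing and restrict it
// to the owner (equivalent to 0700), similar to what the shuffle service does.
def ensureStorePath(dir: String): Unit = {
  val perms = PosixFilePermissions.fromString("rwx------")
  val path  = Paths.get(dir)
  if (!Files.exists(path)) {
    Files.createDirectories(path, PosixFilePermissions.asFileAttribute(perms))
  } else {
    Files.setPosixFilePermissions(path, perms)
  }
}
{code}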



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-05-04 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464391#comment-16464391
 ] 

Yanbo Liang commented on SPARK-23291:
-

This should be backported to Spark 2.3, as this is a bug fix and we can't wait 
several months for the next release. [~hyukjin.kwon] Would you like to send a PR? 
Thanks.

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", 
> which is inside the string. "12" can be extracted with "starting position" as 
> "6" and "ending position" as "7"
>  (the starting position of the first character is considered as "1").
> But, the current code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument in the "substr" API, which indicates the 
> 'starting position', is mentioned as "7". 
>  Also, observe that the second argument in the "substr" API, which indicates 
> the 'ending position', is mentioned as "8".
> i.e. the number that has to be supplied to indicate the position is 
> the "actual position + 1".
> Expected behavior :
> 
> The code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.
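
For reference, the 1-based, inclusive semantics the report expects can be sketched with the Scala Column API; this is not the SparkR implementation itself, and the helper name is made up.

{code:java}
import org.apache.spark.sql.Column

// R's substr(x, start, stop) is inclusive on both ends, while Column.substr takes
// (startPos, len), so the expected mapping is:
def rStyleSubstr(col: Column, start: Int, stop: Int): Column =
  col.substr(start, stop - start + 1) // e.g. (6, 7) on "2017-12-01" extracts "12"
{code}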



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24137:


Assignee: Apache Spark

> [K8s] Mount temporary directories in emptydir volumes
> -
>
> Key: SPARK-24137
> URL: https://issues.apache.org/jira/browse/SPARK-24137
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Assignee: Apache Spark
>Priority: Major
>
> Currently the Spark local directories do not get any volumes and volume 
> mounts, which means we're writing Spark shuffle and cache contents to the 
> file system mounted by Docker. This can be terribly inefficient. We should 
> use emptydir volumes for these directories instead for significant 
> performance improvements.
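
As a rough illustration of what is being proposed (not the actual Spark-on-K8s code; the volume name and mount path below are assumptions), backing a local directory with an emptyDir volume via the fabric8 client model looks roughly like:

{code:java}
import io.fabric8.kubernetes.api.model.{VolumeBuilder, VolumeMountBuilder}

// Hypothetical names; one emptyDir volume per configured local directory.
val localDirVolume = new VolumeBuilder()
  .withName("spark-local-dir-1")
  .withNewEmptyDir()
  .endEmptyDir()
  .build()

val localDirMount = new VolumeMountBuilder()
  .withName("spark-local-dir-1")
  .withMountPath("/var/data/spark-local-dir-1")
  .build()
{code}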



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24137:


Assignee: (was: Apache Spark)

> [K8s] Mount temporary directories in emptydir volumes
> -
>
> Key: SPARK-24137
> URL: https://issues.apache.org/jira/browse/SPARK-24137
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> Currently the Spark local directories do not get any volumes and volume 
> mounts, which means we're writing Spark shuffle and cache contents to the 
> file system mounted by Docker. This can be terribly inefficient. We should 
> use emptydir volumes for these directories instead for significant 
> performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464429#comment-16464429
 ] 

Apache Spark commented on SPARK-24137:
--

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/21238

> [K8s] Mount temporary directories in emptydir volumes
> -
>
> Key: SPARK-24137
> URL: https://issues.apache.org/jira/browse/SPARK-24137
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> Currently the Spark local directories do not get any volumes and volume 
> mounts, which means we're writing Spark shuffle and cache contents to the 
> file system mounted by Docker. This can be terribly inefficient. We should 
> use emptydir volumes for these directories instead for significant 
> performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24135:


Assignee: Apache Spark

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Assignee: Apache Spark
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should handle the {{Init:Error}} cases regardless of whether 
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up being stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.
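
To make the failure mode concrete, a purely illustrative classification (names and reason strings are assumptions, not Spark's implementation) might look like:

{code:java}
// Treat a pod stuck on a failed init-container as a failed executor, so the
// allocator can request a replacement instead of counting it as pending forever.
def isInitContainerFailure(podPhase: String, initContainerReasons: Seq[String]): Boolean =
  podPhase == "Pending" &&
    initContainerReasons.exists(r => r == "Error" || r == "CrashLoopBackOff")
{code}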



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464455#comment-16464455
 ] 

Apache Spark commented on SPARK-24135:
--

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/21241

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should handle the {{Init:Error}} cases regardless of whether 
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up being stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24135:


Assignee: (was: Apache Spark)

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should handle the {{Init:Error}} cases regardless of whether 
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up being stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24039) remove restarting iterators hack

2018-05-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24039.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21200
[https://github.com/apache/spark/pull/21200]

> remove restarting iterators hack
> 
>
> Key: SPARK-24039
> URL: https://issues.apache.org/jira/browse/SPARK-24039
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, continuous processing execution calls next() to restart the query 
> iterator after it returns false. This doesn't work for complex RDDs - we need 
> to call compute() instead.
> This isn't refactoring-only; changes will be required to keep the reader from 
> starting over in each compute() call.
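
As a conceptual sketch of the difference described above (illustrative only, not the ContinuousExecution code):

{code:java}
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Rather than calling next() again on an exhausted iterator, recompute the
// partition to obtain a fresh iterator.
def freshIterator[T](rdd: RDD[T], split: Partition, ctx: TaskContext): Iterator[T] =
  rdd.compute(split, ctx)
{code}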



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464284#comment-16464284
 ] 

Apache Spark commented on SPARK-23325:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21237

> DataSourceV2 readers should always produce InternalRow.
> ---
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either 
> {{Row}} instances or {{UnsafeRow}} instances by implementing 
> {{SupportsScanUnsafeRow}}. Instead, I think that implementations should 
> always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither 
> one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by 
> {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation 
> that uses {{Row}} instances must produce data that is immediately translated 
> from the representation that was just produced by Spark. In my experience, it 
> made little sense to translate a timestamp in microseconds to a 
> (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass 
> that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is 
> already held in memory. Even the Parquet support built into Spark 
> deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce 
> unsafe rows. When I went to build an implementation that deserializes Parquet 
> or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be 
> done without first deserializing into memory because the size of an array 
> must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use 
> {{UnsafeProjection}} to convert to unsafe. There are two problems with this: 
> first, this is Scala and was difficult to call from Java (it required 
> reflection), and second, this causes double projection in the physical plan 
> (a copy for unsafe to unsafe) if there is a projection that wasn't fully 
> pushed to the data source.
> I think the solution is to have a single interface for readers that expects 
> {{InternalRow}}. Then, a projection should be added in the Spark plan to 
> convert to unsafe and avoid projection in the plan and in the data source. If 
> the data source already produces unsafe rows by deserializing directly, this 
> still minimizes the number of copies because the unsafe projection will check 
> whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.
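
A minimal sketch of the conversion step the description refers to (illustration only, not the DataSourceV2 plumbing):

{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.StructType

// Wrap a reader's InternalRow output in an UnsafeProjection. Note the projection
// reuses its output buffer, so call copy() if rows are buffered downstream.
def toUnsafe(rows: Iterator[InternalRow], schema: StructType): Iterator[InternalRow] = {
  val proj = UnsafeProjection.create(schema)
  rows.map(proj)
}
{code}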



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24183) add unit tests for ContinuousDataReader hook

2018-05-04 Thread Jose Torres (JIRA)
Jose Torres created SPARK-24183:
---

 Summary: add unit tests for ContinuousDataReader hook
 Key: SPARK-24183
 URL: https://issues.apache.org/jira/browse/SPARK-24183
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


Currently this is the class named ContinuousQueuedDataReader, but I don't know 
if this will change as we deal with stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24039) remove restarting iterators hack

2018-05-04 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-24039:
-

Assignee: Jose Torres

> remove restarting iterators hack
> 
>
> Key: SPARK-24039
> URL: https://issues.apache.org/jira/browse/SPARK-24039
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, continuous processing execution calls next() to restart the query 
> iterator after it returns false. This doesn't work for complex RDDs - we need 
> to call compute() instead.
> This isn't refactoring-only; changes will be required to keep the reader from 
> starting over in each compute() call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23657) Document InternalRow and expose it as a stable interface

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23657:


Assignee: (was: Apache Spark)

> Document InternalRow and expose it as a stable interface
> 
>
> Key: SPARK-23657
> URL: https://issues.apache.org/jira/browse/SPARK-23657
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The new DataSourceV2 API needs to stabilize the {{InternalRow}} interface so 
> that it can be used by new data source implementations. It already exposes 
> {{UnsafeRow}} for reads and {{InternalRow}} for writes, and the 
> representations are unlikely to change so this is primarily documentation 
> work.
> For more discussion, see SPARK-23325.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23657) Document InternalRow and expose it as a stable interface

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464477#comment-16464477
 ] 

Apache Spark commented on SPARK-23657:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21242

> Document InternalRow and expose it as a stable interface
> 
>
> Key: SPARK-23657
> URL: https://issues.apache.org/jira/browse/SPARK-23657
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The new DataSourceV2 API needs to stabilize the {{InternalRow}} interface so 
> that it can be used by new data source implementations. It already exposes 
> {{UnsafeRow}} for reads and {{InternalRow}} for writes, and the 
> representations are unlikely to change so this is primarily documentation 
> work.
> For more discussion, see SPARK-23325.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24182) Improve error message for client mode when AM fails

2018-05-04 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24182:
--

 Summary: Improve error message for client mode when AM fails
 Key: SPARK-24182
 URL: https://issues.apache.org/jira/browse/SPARK-24182
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin


Today, when the client AM fails, there's not a lot of useful information 
printed on the output. Depending on the type of failure, the information 
provided by the YARN AM is also not very useful. For example, you'd see this in 
the Spark shell:

{noformat}
18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might 
have been killed or unable to launch application master.
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
at org.apache.spark.SparkContext.(SparkContext.scala:500)
 [long stack trace]
{noformat}

Similarly, on the YARN RM, for certain failures you see a generic error like 
this:

{noformat}
ExitCodeException exitCode=10: at 
org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at 
org.apache.hadoop.util.Shell.run(Shell.java:460) at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366)
 at 
[blah blah blah]
{noformat}

It would be nice if we could provide a more accurate description of what went 
wrong when possible.
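
For illustration of the kind of friendlier diagnostics being asked for (the exit codes and messages below are placeholders, not an authoritative mapping):

{code:java}
// Hypothetical mapping from an AM exit code to a more actionable hint.
def amFailureHint(exitCode: Int): String = exitCode match {
  case 10    => "The ApplicationMaster hit an uncaught exception; check the AM container logs."
  case 13    => "The SparkContext did not initialize within the expected time."
  case other => s"ApplicationMaster exited with code $other; see the YARN container logs for details."
}
{code}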



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464448#comment-16464448
 ] 

Apache Spark commented on SPARK-21274:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/21240

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM tab1 t1
>  LEFT OUTER JOIN 
> tab2 t2
>  ON (
> (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>  )
> WHERE
> COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROM tab1 t1
> WHERE 
>NOT EXISTS
>(SELECT 1
> FROM t1_except_t2_df e
> WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>)
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24040) support single partition aggregates

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24040:


Assignee: (was: Apache Spark)

> support single partition aggregates
> ---
>
> Key: SPARK-24040
> URL: https://issues.apache.org/jira/browse/SPARK-24040
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> Single partition aggregates are a useful milestone because they don't involve 
> a shuffle.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21274:


Assignee: (was: Apache Spark)

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM tab1 t1
>  LEFT OUTER JOIN 
> tab2 t2
>  ON (
> (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>  )
> WHERE
> COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROM tab1 t1
> WHERE 
>NOT EXISTS
>(SELECT 1
> FROM t1_except_t2_df e
> WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>)
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24040) support single partition aggregates

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24040:


Assignee: Apache Spark

> support single partition aggregates
> ---
>
> Key: SPARK-24040
> URL: https://issues.apache.org/jira/browse/SPARK-24040
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> Single partition aggregates are a useful milestone because they don't involve 
> a shuffle.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21274:


Assignee: Apache Spark

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>Assignee: Apache Spark
>Priority: Major
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM tab1 t1
>  LEFT OUTER JOIN 
> tab2 t2
>  ON (
> (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>  )
> WHERE
> COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROM tab1 t1
> WHERE 
>NOT EXISTS
>(SELECT 1
> FROM t1_except_t2_df e
> WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>)
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24040) support single partition aggregates

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464449#comment-16464449
 ] 

Apache Spark commented on SPARK-24040:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/21239

> support single partition aggregates
> ---
>
> Key: SPARK-24040
> URL: https://issues.apache.org/jira/browse/SPARK-24040
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> Single partition aggregates are a useful milestone because they don't involve 
> a shuffle.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-04 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464467#comment-16464467
 ] 

Matt Cheah commented on SPARK-24135:


Put up the PR, see above - created a separate setting for this class of errors.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should handle the {{Init:Error}} cases regardless of whether 
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up being stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23657) Document InternalRow and expose it as a stable interface

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23657:


Assignee: Apache Spark

> Document InternalRow and expose it as a stable interface
> 
>
> Key: SPARK-23657
> URL: https://issues.apache.org/jira/browse/SPARK-23657
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> The new DataSourceV2 API needs to stabilize the {{InternalRow}} interface so 
> that it can be used by new data source implementations. It already exposes 
> {{UnsafeRow}} for reads and {{InternalRow}} for writes, and the 
> representations are unlikely to change so this is primarily documentation 
> work.
> For more discussion, see SPARK-23325.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24136) MemoryStreamDataReader.next should skip sleeping if record is available

2018-05-04 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24136:
---

Assignee: Arun Mahadevan

> MemoryStreamDataReader.next should skip sleeping if record is available
> ---
>
> Key: SPARK-24136
> URL: https://issues.apache.org/jira/browse/SPARK-24136
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently the code sleeps 10ms for each invocation of the next even if the 
> record is available.
> {code:java}
> override def next(): Boolean = {
>   current = None
>   while (current.isEmpty) {
>     Thread.sleep(10)
>     current = endpoint.askSync[Option[Row]](
>       GetRecord(ContinuousMemoryStreamPartitionOffset(partition, currentOffset)))
>   }
> {code}
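
A generic illustration of the proposed improvement (the {{poll}} function below is a stand-in for the endpoint ask, not Spark API): sleep only when the poll comes back empty.

{code:java}
def nextRecord[T](poll: () => Option[T], backoffMs: Long = 10L): T = {
  var result: Option[T] = None
  while (result.isEmpty) {
    result = poll()
    if (result.isEmpty) Thread.sleep(backoffMs) // back off only when nothing is available
  }
  result.get
}
{code}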



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-24178) Upgrade spark's py4j to 0.10.7

2018-05-04 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon deleted SPARK-24178:
-


> Upgrade spark's py4j to 0.10.7
> --
>
> Key: SPARK-24178
> URL: https://issues.apache.org/jira/browse/SPARK-24178
> Project: Spark
>  Issue Type: Bug
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Notable changes:
> - As of Py4J 0.10.7, Python 3.6 support is officially added. 
> - There was a security issue in Py4J: anyone can connect to and control the 
> JVM. That is still true, but Py4J has at least added a simple authorization 
> mechanism, so we should probably take a closer look at leveraging it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24179) History Server for Kubernetes

2018-05-04 Thread Eric Charles (JIRA)
Eric Charles created SPARK-24179:


 Summary: History Server for Kubernetes
 Key: SPARK-24179
 URL: https://issues.apache.org/jira/browse/SPARK-24179
 Project: Spark
  Issue Type: Task
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Eric Charles


The History server is missing when running on Kubernetes, with the side effect 
that we cannot debug post-mortem or analyze jobs after the fact.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24088) only HadoopRDD leverage HDFS Cache as preferred location

2018-05-04 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463653#comment-16463653
 ] 

Marco Gaido commented on SPARK-24088:
-

[~xiaojuwu] I don't understand which problem is stated here. {{FileScanRDD}} 
uses as preferred locations the hosts from which the highest number of bytes can 
be retrieved. What is the problem with this policy? Which issue are you 
experiencing?

> only HadoopRDD leverage HDFS Cache as preferred location
> 
>
> Key: SPARK-24088
> URL: https://issues.apache.org/jira/browse/SPARK-24088
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Xiaoju Wu
>Priority: Minor
>
> Only HadoopRDD implements convertSplitLocationInfo, which converts a 
> location to HDFSCacheTaskLocation based on whether the block is cached in 
> DataNode memory, while FileScanRDD does not. In FileScanRDD, all split location 
> information is dropped. 
> private[spark] def convertSplitLocationInfo(
>     infos: Array[SplitLocationInfo]): Option[Seq[String]] = {
>   Option(infos).map(_.flatMap { loc =>
>     val locationStr = loc.getLocation
>     if (locationStr != "localhost") {
>       if (loc.isInMemory) {
>         logDebug(s"Partition $locationStr is cached by Hadoop.")
>         Some(HDFSCacheTaskLocation(locationStr).toString)
>       } else {
>         Some(HostTaskLocation(locationStr).toString)
>       }
>     } else {
>       None
>     }
>   })
> }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24174) Expose Hadoop config as part of /environment API

2018-05-04 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463575#comment-16463575
 ] 

Saisai Shao commented on SPARK-24174:
-

I believe the Hadoop web UI already exposes such configurations. It seems neither 
proper nor necessary to expose them on the Spark side; this potentially mixes 
things up.

> Expose Hadoop config as part of /environment API
> 
>
> Key: SPARK-24174
> URL: https://issues.apache.org/jira/browse/SPARK-24174
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Nikolay Sokolov
>Priority: Minor
>  Labels: features, usability
>
> Currently, /environment API call exposes only system properties and 
> SparkConf. However, in some cases when Spark is used in conjunction with 
> Hadoop, it is useful to know Hadoop configuration properties. For example, 
> HDFS or GS buffer sizes, hive metastore settings, and so on.
> So it would be good to have hadoop properties being exposed in /environment 
> API, for example:
> {code:none}
> GET .../application_1525395994996_5/environment
> {
>"runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...}
>"sparkProperties": ["java.io.tmpdir","/tmp", ...],
>"systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], 
> ...],
>"classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System 
> Classpath"], ...],
>"hadoopProperties": [["dfs.stream-buffer-size": 4096], ...],
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24178) Upgrade spark's py4j to 0.10.7

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24178:


Assignee: (was: Apache Spark)

> Upgrade spark's py4j to 0.10.7
> --
>
> Key: SPARK-24178
> URL: https://issues.apache.org/jira/browse/SPARK-24178
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Notable changes:
> - As of Py4J 0.10.7, Python 3.6 support is officially added. 
> - There was a security issue in Py4J: anyone can connect to and control the 
> JVM. That is still true, but Py4J has at least added a simple authorization 
> mechanism, so we should probably take a closer look at leveraging it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24178) Upgrade spark's py4j to 0.10.7

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463535#comment-16463535
 ] 

Apache Spark commented on SPARK-24178:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21233

> Upgrade spark's py4j to 0.10.7
> --
>
> Key: SPARK-24178
> URL: https://issues.apache.org/jira/browse/SPARK-24178
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Notable changes:
> - As of Py4J 0.10.7, Python 3.6 support is officially added. 
> - There was a security issue in Py4J: anyone can connect to and control the 
> JVM. That is still true, but Py4J has at least added a simple authorization 
> mechanism, so we should probably take a closer look at leveraging it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24178) Upgrade spark's py4j to 0.10.7

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24178:


Assignee: Apache Spark

> Upgrade spark's py4j to 0.10.7
> --
>
> Key: SPARK-24178
> URL: https://issues.apache.org/jira/browse/SPARK-24178
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Notable changes:
> - As of Py4J 0.10.7, Python 3.6 support is officially added. 
> - There was a security issue in Py4J: anyone can connect to and control the 
> JVM. That is still true, but Py4J has at least added a simple authorization 
> mechanism, so we should probably take a closer look at leveraging it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))

2018-05-04 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463616#comment-16463616
 ] 

Marco Gaido commented on SPARK-24177:
-

[~ajay_monga] could you please try with a newer Spark version? 1.6 is pretty old 
and not maintained anymore...

> Spark returning inconsistent rows and data in a join query when run using 
> Spark SQL (using SQLContext.sql(...))
> ---
>
> Key: SPARK-24177
> URL: https://issues.apache.org/jira/browse/SPARK-24177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Production
>Reporter: Ajay Monga
>Priority: Major
>
> Spark SQL is returning inconsistent results for a JOIN query. It returns 
> different rows and the value of the column on which a simple multiplication 
> takes place returns different values:
> The query is like:
> SELECT
>  second_table.date_value, SUM(XXX * second_table.shift_value)
>  FROM
>  (
>  SELECT
>  date_value, SUM(value) as XXX
>  FROM first_table
>  WHERE
>  AND date IN ( '2018-01-01', '2018-01-02' )
>  GROUP BY date_value
>  )
>  intermediate LEFT OUTER
>  JOIN second_table ON second_table.date_value = (<'date_value' from first table, say if it's a Saturday or Sunday then use 
> Monday, else next valid working date>)
>  AND second_table.date_value IN (
>  '2018-01-02',
>  '2018-01-03'
>  )
>  GROUP BY second_table.date_value
>  
> The suspicion is that the execution of the above query is split into two queries - 
> one for first_table and other for second_table before joining. Then the 
> results get split across partitions, seemingly grouped/distributed by the 
> join column, which is 'date_value'. In the join there is a date shift logic 
> that fails to join in some cases when it should, primarily for the 
> date_values at the edge of the partitions distributed across the executors. 
> So, the execution is dependent on how the data (or the rdd) of the individual 
> queries is partitioned in the first place, which is not ideal as a normal 
> looking ANSI standard SQL query is not behaving consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))

2018-05-04 Thread Ajay Monga (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Monga updated SPARK-24177:
---
Description: 
Spark SQL is returning inconsistent results for a JOIN query. It returns 
different rows and the value of the column on which a simple multiplication 
takes place returns different values:

The query is like:

SELECT
 second_table.date_value, SUM(XXX * second_table.shift_value)
 FROM
 (
 SELECT
 date_value, SUM(value) as XXX
 FROM first_table
 WHERE
 AND date IN ( '2018-01-01', '2018-01-02' )
 GROUP BY date_value
 )
 intermediate LEFT OUTER
 JOIN second_table ON second_table.date_value = ()
 AND second_table.date_value IN (
 '2018-01-02',
 '2018-01-03'
 )
 GROUP BY second_table.date_value

 

The suspicion is that the execution of the above query is split into two queries - one 
for first_table and other for second_table before joining. Then the results get 
split across partitions, seemingly grouped/distributed by the join column, 
which is 'date_value'. In the join there is a date shift logic that fails to 
join in some cases when it should, primarily for the date_values at the edge of 
the partitions distributed across the executors. So, the execution is dependent 
on how the data (or the rdd) of the individual queries is partitioned in the 
first place, which is not ideal as a normal looking ANSI standard SQL query is 
not behaving consistently.

  was:
Spark SQL is returning inconsistent results for a JOIN query. It returns 
different rows and the value of the column on which a simple multiplication 
takes place returns different values:

The query is like:

SELECT
second_table.date_value, SUM(XXX * second_table.shift_value)
FROM
(
 SELECT
 date_value, SUM(value) as XXX
 FROM first_table
 WHERE
 AND date IN ( '2018-01-01', '2018-01-02' )
 GROUP BY date_value
)
intermediate LEFT OUTER
JOIN second_table ON second_table.date_value = ()
AND second_table.date_value IN (
 '2018-01-02',
 '2018-01-03'
)
GROUP BY second_table.date_value

 

The suspicion is that the execution of the above query is split into two queries - one 
for first_table and other for second_table before joining. Then the result gets 
split across partitions, seemingly grouped/distributed by the join column, 
which is 'date_value'. In the join there is a date shift logic that fails to 
join in some cases when it should, primarily for the date_values at the edge of 
the partitions across the Spark cluster. So, it's dependent on how the data (or 
the rdd) of the individual queries is partitioned in the first place.


> Spark returning inconsistent rows and data in a join query when run using 
> Spark SQL (using SQLContext.sql(...))
> ---
>
> Key: SPARK-24177
> URL: https://issues.apache.org/jira/browse/SPARK-24177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Production
>Reporter: Ajay Monga
>Priority: Major
>
> Spark SQL is returning inconsistent results for a JOIN query. It returns 
> different rows and the value of the column on which a simple multiplication 
> takes place returns different values:
> The query is like:
> SELECT
>  second_table.date_value, SUM(XXX * second_table.shift_value)
>  FROM
>  (
>  SELECT
>  date_value, SUM(value) as XXX
>  FROM first_table
>  WHERE
>  AND date IN ( '2018-01-01', '2018-01-02' )
>  GROUP BY date_value
>  )
>  intermediate LEFT OUTER
>  JOIN second_table ON second_table.date_value = (<'date_value' from first table, say if it's a Saturday or Sunday then use 
> Monday, else next valid working date>)
>  AND second_table.date_value IN (
>  '2018-01-02',
>  '2018-01-03'
>  )
>  GROUP BY second_table.date_value
>  
> The suspicion is that the execution of the above query is split into two queries - 
> one for first_table and other for second_table before joining. Then the 
> results get split across partitions, seemingly grouped/distributed by the 
> join column, which is 'date_value'. In the join there is a date shift logic 
> that fails to join in some cases when it should, primarily for the 
> date_values at the edge of the partitions distributed across the executors. 
> So, the execution is dependent on how the data (or the rdd) of the individual 
> queries is partitioned in the first place, which is not ideal as a normal 
> looking ANSI standard SQL query is not behaving consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24180) Using another dynamodb endpoint for kinesis

2018-05-04 Thread JEAN-SEBASTIEN NEY (JIRA)
JEAN-SEBASTIEN NEY created SPARK-24180:
--

 Summary: Using another dynamodb endpoint for kinesis
 Key: SPARK-24180
 URL: https://issues.apache.org/jira/browse/SPARK-24180
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: JEAN-SEBASTIEN NEY


Hello,

Using Kinesis in PySpark, I'd like to run my local tests against a local 
Kinesis instance running in Docker.

For this, I need to specify the local Kinesis endpoint and a local DynamoDB 
endpoint, but I don't see in the documentation how to specify the latter.

Could you help please?

Regards



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x

2018-05-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23697:
---

Assignee: Wenchen Fan

> Accumulators of Spark 1.x no longer work with Spark 2.x
> ---
>
> Key: SPARK-23697
> URL: https://issues.apache.org/jira/browse/SPARK-23697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
> Environment: Spark 2.2.0
> Scala 2.11
>Reporter: Sergey Zhemzhitsky
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.0.3, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>
>
> I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x 
> failing with
> {code:java}
> java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
> value copy{code}
>  It happens while serializing an accumulator 
> [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
> {code:java}
> val copyAcc = copyAndReset()
> assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
> ... although copyAndReset returns zero-value copy for sure, just consider the 
> accumulator below
> {code:java}
> val concatParam = new AccumulatorParam[jl.StringBuilder] {
>   override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
> jl.StringBuilder()
>   override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
> jl.StringBuilder = r1.append(r2)
> }{code}
> So, Spark treats zero value as non-zero due to how 
> [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
>  is implemented in LegacyAccumulatorWrapper.
> {code:java}
> override def isZero: Boolean = _value == param.zero(initialValue){code}
> All this means that the values to be accumulated must implement equals and 
> hashCode, otherwise isZero is very likely to always return false.
> So I'm wondering whether the assertion 
> {code:java}
> assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
> is really necessary and whether it can be safely removed from there?
> If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper 
> to prevent such failures?
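
A small illustration of the equality pitfall described above (standalone example, not Spark code): java.lang.StringBuilder does not override equals, so an "== zero value" check compares references.

{code:java}
val a = new java.lang.StringBuilder()
val b = new java.lang.StringBuilder()
println(a == b)                    // false: reference equality, so isZero stays false
println(a.toString == b.toString)  // true: comparing contents would work
{code}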



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x

2018-05-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23697.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.1.3
   2.4.0
   2.2.2
   2.0.3

Issue resolved by pull request 21229
[https://github.com/apache/spark/pull/21229]

> Accumulators of Spark 1.x no longer work with Spark 2.x
> ---
>
> Key: SPARK-23697
> URL: https://issues.apache.org/jira/browse/SPARK-23697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
> Environment: Spark 2.2.0
> Scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Major
> Fix For: 2.0.3, 2.2.2, 2.4.0, 2.1.3, 2.3.1
>
>
> I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x, 
> failing with
> {code:java}
> java.lang.AssertionError: assertion failed: copyAndReset must return a zero 
> value copy{code}
>  It happens while serializing an accumulator 
> [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165]
> {code:java}
> val copyAcc = copyAndReset()
> assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
> ... although copyAndReset does return a zero-value copy; consider the 
> accumulator below
> {code:java}
> import java.{lang => jl}
> val concatParam = new AccumulatorParam[jl.StringBuilder] {
>   override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new 
> jl.StringBuilder()
>   override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): 
> jl.StringBuilder = r1.append(r2)
> }{code}
> So Spark treats a zero value as non-zero because of how 
> [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489]
>  is implemented in LegacyAccumulatorWrapper.
> {code:java}
> override def isZero: Boolean = _value == param.zero(initialValue){code}
> All this means that the values to be accumulated must implement equals and 
> hashCode, otherwise isZero is very likely to always return false.
> So I'm wondering whether the assertion 
> {code:java}
> assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code}
> is really necessary, and whether it can safely be removed.
> If not, is it OK to just override writeReplace in LegacyAccumulatorWrapper 
> to prevent such failures?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24136) MemoryStreamDataReader.next should skip sleeping if record is available

2018-05-04 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24136.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21207
[https://github.com/apache/spark/pull/21207]

> MemoryStreamDataReader.next should skip sleeping if record is available
> ---
>
> Key: SPARK-24136
> URL: https://issues.apache.org/jira/browse/SPARK-24136
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently the code sleeps 10 ms on each invocation of next() even if a 
> record is already available.
> {code:java}
> override def next(): Boolean = {
>   current = None
>   while (current.isEmpty) {
>     Thread.sleep(10)
>     current = endpoint.askSync[Option[Row]](
>       GetRecord(ContinuousMemoryStreamPartitionOffset(partition, currentOffset)))
>   }{code}
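
A minimal, self-contained sketch of the idea behind the change (the poll 
function and queue below are stand-ins, not Spark internals): ask for a record 
first and sleep only when nothing came back, so an already-available record is 
returned without the extra 10 ms delay.

{code:java}
import scala.collection.mutable

// Hypothetical stand-in for the endpoint.askSync(...) loop above.
def pollUntilAvailable[A](poll: () => Option[A], sleepMs: Long = 10): A = {
  var current: Option[A] = None
  while (current.isEmpty) {
    current = poll()            // try to fetch a record first
    if (current.isEmpty) {      // back off only when nothing is available yet
      Thread.sleep(sleepMs)
    }
  }
  current.get
}

// Usage: a queue that already holds an element is returned immediately, no sleep.
val q = mutable.Queue("record-1")
println(pollUntilAvailable(() => if (q.nonEmpty) Some(q.dequeue()) else None))
{code}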



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24124) Spark history server should create spark.history.store.path and set permissions properly

2018-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463934#comment-16463934
 ] 

Apache Spark commented on SPARK-24124:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/21234

> Spark history server should create spark.history.store.path and set 
> permissions properly
> 
>
> Key: SPARK-24124
> URL: https://issues.apache.org/jira/browse/SPARK-24124
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> Currently, with the new Spark history server you can set 
> spark.history.store.path to a location to store the LevelDB files, but the 
> directory has to be created before the history server can use that path.
> We should just have the history server create it and set the file permissions 
> on the LevelDB files to be restrictive -> new FsPermission((short) 0700).
> The shuffle service already does this; it would be much more convenient and 
> would prevent people from making mistakes with the permissions on the 
> directory and files.
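
A hedged sketch of what the change might look like (the helper below is 
hypothetical, not the actual patch): create the configured directory when it is 
missing and restrict it to the owner with Hadoop's FsPermission, mirroring what 
the shuffle service does.

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Hypothetical helper: ensure spark.history.store.path exists and is
// readable/writable only by the history server's user (octal 0700).
def ensureStoreDir(storePath: String): Unit = {
  val fs = FileSystem.getLocal(new Configuration())
  val dir = new Path(storePath)
  val ownerOnly = new FsPermission(Integer.parseInt("700", 8).toShort)
  if (!fs.exists(dir)) {
    fs.mkdirs(dir)
  }
  fs.setPermission(dir, ownerOnly)  // applies to the directory holding the LevelDB files
}
{code}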



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24124) Spark history server should create spark.history.store.path and set permissions properly

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24124:


Assignee: (was: Apache Spark)

> Spark history server should create spark.history.store.path and set 
> permissions properly
> 
>
> Key: SPARK-24124
> URL: https://issues.apache.org/jira/browse/SPARK-24124
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> Currently, with the new Spark history server you can set 
> spark.history.store.path to a location to store the LevelDB files, but the 
> directory has to be created before the history server can use that path.
> We should just have the history server create it and set the file permissions 
> on the LevelDB files to be restrictive -> new FsPermission((short) 0700).
> The shuffle service already does this; it would be much more convenient and 
> would prevent people from making mistakes with the permissions on the 
> directory and files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24124) Spark history server should create spark.history.store.path and set permissions properly

2018-05-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24124:


Assignee: Apache Spark

> Spark history server should create spark.history.store.path and set 
> permissions properly
> 
>
> Key: SPARK-24124
> URL: https://issues.apache.org/jira/browse/SPARK-24124
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> Currently, with the new Spark history server you can set 
> spark.history.store.path to a location to store the LevelDB files, but the 
> directory has to be created before the history server can use that path.
> We should just have the history server create it and set the file permissions 
> on the LevelDB files to be restrictive -> new FsPermission((short) 0700).
> The shuffle service already does this; it would be much more convenient and 
> would prevent people from making mistakes with the permissions on the 
> directory and files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))

2018-05-04 Thread Ajay Monga (JIRA)
Ajay Monga created SPARK-24177:
--

 Summary: Spark returning inconsistent rows and data in a join 
query when run using Spark SQL (using SQLContext.sql(...))
 Key: SPARK-24177
 URL: https://issues.apache.org/jira/browse/SPARK-24177
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
 Environment: Production
Reporter: Ajay Monga


Spark SQL is returning inconsistent results for a JOIN query. Across runs it 
returns different rows, and the column on which a simple multiplication takes 
place returns different values:

The query is like:

SELECT
  second_table.date_value, SUM(XXX * second_table.shift_value)
FROM
(
  SELECT
    date_value, SUM(value) AS XXX
  FROM first_table
  WHERE date IN ( '2018-01-01', '2018-01-02' )
  GROUP BY date_value
) intermediate
LEFT OUTER JOIN second_table
  ON second_table.date_value = (<shifted 'date_value' from first_table: if it is
     a Saturday or Sunday then use Monday, else the next valid working date>)
  AND second_table.date_value IN (
    '2018-01-02',
    '2018-01-03'
  )
GROUP BY second_table.date_value

 

The suspicion is that the execution of the above query is split into two 
queries, one for first_table and one for second_table, before the join. The 
results then get split across partitions, seemingly grouped/distributed by the 
join column, 'date_value'. The join contains date shift logic that fails to 
match rows in some cases when it should, primarily for date_values at the edges 
of the partitions across the Spark cluster. So the outcome depends on how the 
data (or the RDD) of each individual query is partitioned in the first place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-04 Thread Albert Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463487#comment-16463487
 ] 

Albert Chan commented on SPARK-23780:
-

I have the same issue. I am trying Spark 2.3.0. I tried installing 
mages/googleVis and still had the same problem. I had to fall back to Spark 
2.1.7.

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-04 Thread Ivan Dzikovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463514#comment-16463514
 ] 

Ivan Dzikovsky commented on SPARK-23780:


[~achan]

You can still use googleVis with Spark 2.3. To do this you need to detach the 
SparkR package before loading googleVis, and then re-attach SparkR:
{code:java}
detach("package:SparkR")
library(googleVis)
suppressPackageStartupMessages(library(SparkR))

df=data.frame(country=c("US", "GB", "BR"), 
              val1=c(10,13,14), 
              val2=c(23,12,32))
Bar <- gvisBarChart(df)
print(Bar, tag = 'chart'){code}

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))

2018-05-04 Thread Ajay Monga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463494#comment-16463494
 ] 

Ajay Monga commented on SPARK-24177:


The suspicion has been strengthened by the fact that when the query is rewritten 
so that the date shift logic is computed in the SELECT clause of the subquery 
and the join is then done on plain equality, the result is correct and 
consistent across runs (see the sketch below).
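
A hedged sketch of that rewrite (table and column names follow the report; 
next_working_day is a hypothetical UDF that would have to be registered and map 
Saturday/Sunday to the next Monday): the weekend shift is computed once inside 
the subquery, so the join condition becomes a plain column equality.

{code:java}
// Hedged sketch of the workaround, not the reporter's actual query (Spark 1.6 API).
val fixed = sqlContext.sql("""
  SELECT s.date_value, SUM(intermediate.xxx * s.shift_value)
  FROM (
    SELECT next_working_day(date_value) AS shifted_date,   -- hypothetical UDF: Sat/Sun -> Monday
           SUM(value) AS xxx
    FROM first_table
    WHERE date IN ('2018-01-01', '2018-01-02')
    GROUP BY next_working_day(date_value)
  ) intermediate
  LEFT OUTER JOIN second_table s
    ON s.date_value = intermediate.shifted_date
   AND s.date_value IN ('2018-01-02', '2018-01-03')
  GROUP BY s.date_value
""")
{code}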

> Spark returning inconsistent rows and data in a join query when run using 
> Spark SQL (using SQLContext.sql(...))
> ---
>
> Key: SPARK-24177
> URL: https://issues.apache.org/jira/browse/SPARK-24177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Production
>Reporter: Ajay Monga
>Priority: Major
>
> Spark SQL is returning inconsistent results for a JOIN query. Across runs it 
> returns different rows, and the column on which a simple multiplication takes 
> place returns different values:
> The query is like:
> SELECT
>   second_table.date_value, SUM(XXX * second_table.shift_value)
> FROM
> (
>   SELECT
>     date_value, SUM(value) AS XXX
>   FROM first_table
>   WHERE date IN ( '2018-01-01', '2018-01-02' )
>   GROUP BY date_value
> ) intermediate
> LEFT OUTER JOIN second_table
>   ON second_table.date_value = (<shifted 'date_value' from first_table: if it is
>      a Saturday or Sunday then use Monday, else the next valid working date>)
>   AND second_table.date_value IN (
>     '2018-01-02',
>     '2018-01-03'
>   )
> GROUP BY second_table.date_value
>  
> The suspicion is that the execution of the above query is split into two 
> queries, one for first_table and one for second_table, before the join. The 
> results then get split across partitions, seemingly grouped/distributed by the 
> join column, 'date_value'. The join contains date shift logic that fails to 
> match rows in some cases when it should, primarily for date_values at the 
> edges of the partitions across the Spark cluster. So the outcome depends on 
> how the data (or the RDD) of each individual query is partitioned in the 
> first place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24178) Upgrade spark's py4j to 0.10.7

2018-05-04 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-24178:


 Summary: Upgrade spark's py4j to 0.10.7
 Key: SPARK-24178
 URL: https://issues.apache.org/jira/browse/SPARK-24178
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


Notable changes:

- As of Py4J 0.10.7, Python 3.6 is officially supported.
- There was a security issue in Py4J: anyone who can connect to the gateway can 
control the JVM. That is still true, but Py4J has at least added a simple 
authorization mechanism; we should probably take a closer look at leveraging it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org