[jira] [Assigned] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23094:


Assignee: Burak Yavuz  (was: Apache Spark)

> Json Readers choose wrong encoding when bad records are present and fail
> 
>
> Key: SPARK-23094
> URL: https://issues.apache.org/jira/browse/SPARK-23094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 2.3.0
>
>
> The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser
> code paths for expressions but not the readers. We should also cover the
> reader code paths that read files with bad characters.
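For context, a minimal sketch of the reader code path in question (the input
path and its contents are hypothetical): reading a JSON file that contains
records with bad characters in PERMISSIVE mode, which routes unparseable lines
into the corrupt-record column instead of failing outright.

{code:java}
// Sketch only; the input path is assumed for illustration.
val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("hdfs:///tmp/people-with-bad-bytes.json")

// Records that cannot be decoded or parsed land in _corrupt_record
// rather than failing the whole read.
df.show()
{code}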






[jira] [Assigned] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23094:


Assignee: Apache Spark  (was: Burak Yavuz)

> Json Readers choose wrong encoding when bad records are present and fail
> 
>
> Key: SPARK-23094
> URL: https://issues.apache.org/jira/browse/SPARK-23094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.3.0
>
>
> The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser
> code paths for expressions but not the readers. We should also cover the
> reader code paths that read files with bad characters.






[jira] [Reopened] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail

2018-02-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-23094:
-

> Json Readers choose wrong encoding when bad records are present and fail
> 
>
> Key: SPARK-23094
> URL: https://issues.apache.org/jira/browse/SPARK-23094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 2.3.0
>
>
> The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser
> code paths for expressions but not the readers. We should also cover the
> reader code paths that read files with bad characters.






[jira] [Updated] (SPARK-23421) Document the behavior change in SPARK-22356

2018-02-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23421:

Fix Version/s: 2.3.0

> Document the behavior change in SPARK-22356
> ---
>
> Key: SPARK-23421
> URL: https://issues.apache.org/jira/browse/SPARK-23421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> SPARK-22356 introduces a behavior change. We need to document it in the 
> migration guide. Also update the HiveExternalCatalogVersionsSuite to verify 
> it. 






[jira] [Resolved] (SPARK-23421) Document the behavior change in SPARK-22356

2018-02-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23421.
-
Resolution: Fixed

> Document the behavior change in SPARK-22356
> ---
>
> Key: SPARK-23421
> URL: https://issues.apache.org/jira/browse/SPARK-23421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> SPARK-22356 introduces a behavior change. We need to document it in the 
> migration guide. Also update the HiveExternalCatalogVersionsSuite to verify 
> it. 






[jira] [Commented] (SPARK-21521) History service requires user is in any group

2018-02-14 Thread Wei Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365208#comment-16365208
 ] 

Wei Zheng commented on SPARK-21521:
---

We came across the same problem recently: the SHS UI only shows the jobs of the
user who started the SHS service. Although that user is a superuser (both in the
local FS and HDFS), it cannot read other users' job log files (due to
rwxrwx---).

Special logic to tell whether a user is a superuser would be nice, but I don't
know if that's doable, because such logic may be vendor-specific. For those
using HDFS, maybe we can read dfs.permissions.supergroup from hdfs-site.xml and
tell (see the sketch after this comment); but other systems like MapR don't use
hdfs-site.xml at all and have different configs. I don't know whether that's the
case for other vendors.

We currently work around this issue by changing LOG_FILE_PERMISSIONS from 770 to
774. I'm not sure if that's a safe change, though.
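
A hedged sketch of that supergroup lookup (an assumption for illustration, not
existing SHS code; it presumes hdfs-site.xml is on the classpath):

{code:java}
// Hypothetical sketch: read the HDFS supergroup name from the Hadoop
// configuration. "supergroup" is the HDFS default value; vendors such as
// MapR may not use hdfs-site.xml or this key at all.
import org.apache.hadoop.conf.Configuration

val hadoopConf = new Configuration()  // loads hdfs-site.xml if present
val superGroup = hadoopConf.get("dfs.permissions.supergroup", "supergroup")
println(s"HDFS supergroup: $superGroup")
{code}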

> History service requires user is in any group
> -
>
> Key: SPARK-21521
> URL: https://issues.apache.org/jira/browse/SPARK-21521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>Priority: Major
>
> (Regression cf. 2.0.2)
> We run Spark as several users; these write to the history location, where the
> files are saved as those users with permissions of 770 (this is hardcoded in
> EventLoggingListener.scala).
> The history service runs as root so that it has permissions on these files
> (see https://spark.apache.org/docs/latest/security.html).
> This worked fine in v2.0.2; however, in v2.2.0 the events are skipped unless
> I add the root user to each user's group, at which point they are seen.
> We currently have all ACL configuration unset.






[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365194#comment-16365194
 ] 

Igor Berman commented on SPARK-23423:
-

Hi [~skonto], thanks for your response!

Yes, you are right: the slaves are not removed, but taskIds are. However, in my
case (maybe connected to the Mesos version) the only status updates I'm getting
for tasks are TASK_RUNNING, not TASK_KILLED.

I'm referencing this line:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L609]

here is a grep:
{code:java}
[root@node mycomp]# zgrep "is now" /var/log/mycomp/my-app.*.log.gz | grep -v 
TASK_RUNNING
/var/log/mycomp/my-app.12.log.gz:2018-02-12 15:01:31,329 INFO [Thread-56] 
MesosCoarseGrainedSchedulerBackend [] - Mesos task 17 is now TASK_FAILED
/var/log/mycomp/my-app.8.log.gz:2018-02-05 23:52:16,534 INFO [Thread-62] 
MesosCoarseGrainedSchedulerBackend [] - Mesos task 32 is now TASK_FAILED{code}
 

So my thought is that the updates that cause taskIds to be removed (when an
executor is killed due to scaling down) are somehow lost, unless I'm missing
something...
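
To make the arithmetic concrete, an illustrative sketch (simplified stand-in
types, not the real backend classes) of how a lost TASK_KILLED update inflates
the count used in the numExecutors < executorLimit check:

{code:java}
// Simplified model of the backend's bookkeeping (hypothetical types).
case class Slave(taskIDs: Set[String])

val slaves = Map(
  "s1" -> Slave(Set("0")),  // executor killed, but TASK_KILLED update was lost
  "s2" -> Slave(Set("1"))   // still running
)

// Mirrors slaves.values.map(_.taskIDs.size).sum from the backend.
val numExecutors = slaves.values.map(_.taskIDs.size).sum  // 2, though 1 is alive
val executorLimit = 2

println(numExecutors < executorLimit)  // false -> further offers are declined
{code}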

 

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend
> when running on Mesos with dynamic allocation on and the maximum number of
> executors limited by spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver with a cyclic pattern of resource
> consumption (with some idle time in between); due to dynamic allocation it
> receives offers and then releases them after the current chunk of work is
> processed.
> At
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
> the backend compares numExecutors < executorLimit, where numExecutors is
> defined as slaves.values.map(_.taskIDs.size).sum and slaves holds all slaves
> ever "met", i.e. both active and killed (see the comment at
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).
> On the other hand, the number of taskIds should be updated on statusUpdate;
> but suppose this update is lost (indeed, I don't see 'is now TASK_KILLED' in
> the logs), then this number of executors may be wrong.
>  
> I've created a test that "reproduces" this behavior; not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advise.
> Igor
> Tagging you, friends, since you were the last to touch this piece of code and
> can probably advise ([~vanzin], [~skonto], [~susanxhuynh])






[jira] [Updated] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Igor Berman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Berman updated SPARK-23423:

Description: 
Hi

Mesos Version:1.1.0

I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend when
running on Mesos with dynamic allocation on and the maximum number of executors
limited by spark.dynamicAllocation.maxExecutors.

Suppose we have a long-running driver with a cyclic pattern of resource
consumption (with some idle time in between); due to dynamic allocation it
receives offers and then releases them after the current chunk of work is
processed.

At
[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
the backend compares numExecutors < executorLimit, where numExecutors is defined
as slaves.values.map(_.taskIDs.size).sum and slaves holds all slaves ever "met",
i.e. both active and killed (see the comment at
[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).

On the other hand, the number of taskIds should be updated on statusUpdate; but
suppose this update is lost (indeed, I don't see 'is now TASK_KILLED' in the
logs), then this number of executors may be wrong.

 

I've created a test that "reproduces" this behavior; not sure how good it is:
{code:java}
//MesosCoarseGrainedSchedulerBackendSuite
test("max executors registered stops to accept offers when dynamic allocation 
enabled") {
  setBackend(Map(
"spark.dynamicAllocation.maxExecutors" -> "1",
"spark.dynamicAllocation.enabled" -> "true",
"spark.dynamicAllocation.testing" -> "true"))

  backend.doRequestTotalExecutors(1)

  val (mem, cpu) = (backend.executorMemory(sc), 4)

  val offer1 = createOffer("o1", "s1", mem, cpu)
  backend.resourceOffers(driver, List(offer1).asJava)
  verifyTaskLaunched(driver, "o1")

  backend.doKillExecutors(List("0"))
  verify(driver, times(1)).killTask(createTaskId("0"))

  val offer2 = createOffer("o2", "s2", mem, cpu)
  backend.resourceOffers(driver, List(offer2).asJava)
  verify(driver, times(1)).declineOffer(offer2.getId)
}{code}
 

 

Workaround: Don't set maxExecutors with dynamicAllocation on

 

Please advise.

Igor

Tagging you, friends, since you were the last to touch this piece of code and
can probably advise ([~vanzin], [~skonto], [~susanxhuynh])

  was:
Hi

I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend when
running on Mesos with dynamic allocation on and the maximum number of executors
limited by spark.dynamicAllocation.maxExecutors.

Suppose we have a long-running driver with a cyclic pattern of resource
consumption (with some idle time in between); due to dynamic allocation it
receives offers and then releases them after the current chunk of work is
processed.

At
[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
the backend compares numExecutors < executorLimit, where numExecutors is defined
as slaves.values.map(_.taskIDs.size).sum and slaves holds all slaves ever "met",
i.e. both active and killed (see the comment at
[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).

On the other hand, the number of taskIds should be updated on statusUpdate; but
suppose this update is lost (indeed, I don't see 'is now TASK_KILLED' in the
logs), then this number of executors may be wrong.

 

I've created a test that "reproduces" this behavior; not sure how good it is:
{code:java}
//MesosCoarseGrainedSchedulerBackendSuite
test("max executors registered stops to accept offers when dynamic allocation 
enabled") {
  setBackend(Map(
"spark.dynamicAllocation.maxExecutors" -> "1",
"spark.dynamicAllocation.enabled" -> "true",
"spark.dynamicAllocation.testing" -> "true"))

  backend.doRequestTotalExecutors(1)

  val (mem, cpu) = (backend.executorMemory(sc), 4)

  val offer1 = createOffer("o1", "s1", mem, cpu)
  backend.resourceOffers(driver, List(offer1).asJava)
  verifyTaskLaunched(driver, "o1")

  backend.doKillExecutors(List("0"))
  verify(driver, times(1)).killTask(createTaskId("0"))

  val offer2 = createOffer("o2", "s2", mem, cpu)
  backend.resourceOffers(driver, List(offer2).asJava)
  verify(driver, times(1)).declineOffer(offer2.getId)
}{code}
 

 

Workaround: Don't set maxExecutors with dynamicAllocation on

 

Please advise.

Igor

Tagging you, friends, since you were the last to touch this piece of code and
can probably advise ([~vanzin], [~skonto], [~susanxhuynh])


> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> 

[jira] [Updated] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23434:
--
Description: 
When Spark reads a file path (e.g. `people.json`), it warns with a misleading
error message while looking up `people.json/_spark_metadata`. The root cause of
this situation is the difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
`DistributedFileSystem.exists` raises an exception.

{code}
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
{code}

{code}
scala> spark.version
res0: String = 2.2.1

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
{code}

{code}
scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:29:53 WARN DataSource: Error while looking for metadata directory.
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+


scala> spark.version
res1: String = 2.1.2
{code}

{code}
scala> spark.version
res0: String = 2.0.2

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
{code}
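
A hedged sketch of a tolerant probe (an assumption for illustration, not
Spark's actual patch; `hasSparkMetadata` is a made-up helper):

{code:java}
// Hypothetical helper: probe for `_spark_metadata` without letting
// DistributedFileSystem's exception surface as a spurious warning.
import org.apache.hadoop.fs.{FileSystem, Path}

def hasSparkMetadata(fs: FileSystem, dir: Path): Boolean =
  try {
    // LocalFileSystem returns false here when `dir` is a plain file;
    // DistributedFileSystem raises an exception instead.
    fs.exists(new Path(dir, "_spark_metadata"))
  } catch {
    case scala.util.control.NonFatal(_) => false
  }
{code}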

  was:
When Spark reads a file path (e.g. `people.json`), it warns with a misleading
error message while looking up `people.json/_spark_metadata`. The root cause of
this situation is the difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
`DistributedFileSystem.exists` raises an exception.

{code}
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
{code}

{code}
scala> spark.version
res0: String = 2.2.1

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
{code}

{code}
scala> spark.version
res0: String = 2.0.2

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
{code}


> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> When Spark reads a file path (e.g. `people.json`), it warns with a misleading
> error message while looking up `people.json/_spark_metadata`. The root cause
> of this situation is the difference between `LocalFileSystem` and
> `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
> {code}
> {code}
> scala> spark.version
> res0: String = 2.2.1
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
> directory.
> 18/02/15 05:28:02 WARN FileStreamSink: Error 

[jira] [Updated] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23434:
--
Description: 
When Spark reads a file path (e.g. `people.json`), it warns with a misleading
error message while looking up `people.json/_spark_metadata`. The root cause of
this situation is the difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
`DistributedFileSystem.exists` raises an exception.

{code}
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
{code}

{code}
scala> spark.version
res0: String = 2.2.1

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
{code}

{code}
scala> spark.version
res0: String = 2.1.2

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:29:53 WARN DataSource: Error while looking for metadata directory.
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+
{code}

{code}
scala> spark.version
res0: String = 2.0.2

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
{code}

  was:
When Spark reads a file path (e.g. `people.json`), it warns with a misleading
error message while looking up `people.json/_spark_metadata`. The root cause of
this situation is the difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
`DistributedFileSystem.exists` raises an exception.

{code}
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
{code}

{code}
scala> spark.version
res0: String = 2.2.1

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
{code}

{code}
scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:29:53 WARN DataSource: Error while looking for metadata directory.
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+


scala> spark.version
res1: String = 2.1.2
{code}

{code}
scala> spark.version
res0: String = 2.0.2

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
{code}


> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> When Spark reads a file path (e.g. `people.json`), it warns with a misleading
> error message while looking up `people.json/_spark_metadata`. The root cause
> of this situation is the difference between `LocalFileSystem` and
> `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: 

[jira] [Updated] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23434:
--
Description: 
When Spark reads a file path (e.g. `people.json`), it warns with a misleading
error message while looking up `people.json/_spark_metadata`. The root cause of
this situation is the difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
`DistributedFileSystem.exists` raises an exception.

{code}
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
{code}

{code}
scala> spark.version
res0: String = 2.2.1

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
directory.
{code}

{code}
scala> spark.version
res0: String = 2.0.2

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
{code}

  was:
When Spark reads a file path (e.g. `people.json`), it warns with a misleading
error message while looking up `people.json/_spark_metadata`. The root cause of
this situation is the difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
`DistributedFileSystem.exists` raises an exception.

{code}
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
{code}

{code}
scala> spark.version
res0: String = 2.0.2

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
{code}


> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> When Spark reads a file path (e.g. `people.json`), it warns with a misleading
> error message while looking up `people.json/_spark_metadata`. The root cause
> of this situation is the difference between `LocalFileSystem` and
> `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
> {code}
> {code}
> scala> spark.version
> res0: String = 2.2.1
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
> directory.
> 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
> directory.
> {code}
> {code}
> scala> spark.version
> res0: String = 2.0.2
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
> {code}






[jira] [Updated] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23434:
--
Description: 
When Spark reads a file path (e.g. `people.json`), it warns with a misleading
error message while looking up `people.json/_spark_metadata`. The root cause of
this situation is the difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
`DistributedFileSystem.exists` raises an exception.

{code}
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
{code}

{code}
scala> spark.version
res0: String = 2.0.2

scala> spark.read.json("hdfs:///tmp/people.json").show
18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
{code}

  was:
When Spark reads a file path (e.g. `people.json`), it warns with a misleading
error message while looking up `people.json/_spark_metadata`. The root cause of
this situation is the difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
`DistributedFileSystem.exists` raises an exception.

{code}
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
{code}


> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> When Spark reads a file path (e.g. `people.json`), it warns with a misleading
> error message while looking up `people.json/_spark_metadata`. The root cause
> of this situation is the difference between `LocalFileSystem` and
> `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
> {code}
> {code}
> scala> spark.version
> res0: String = 2.0.2
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
> {code}






[jira] [Updated] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23434:
--
Affects Version/s: 2.0.2

> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> When Spark reads a file path (e.g. `people.json`), it warns with a misleading
> error message while looking up `people.json/_spark_metadata`. The root cause
> of this situation is the difference between `LocalFileSystem` and
> `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
> {code}






[jira] [Commented] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365150#comment-16365150
 ] 

Apache Spark commented on SPARK-23434:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20616

> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> When Spark reads a file path (e.g. `people.json`), it warns with a misleading
> error message while looking up `people.json/_spark_metadata`. The root cause
> of this situation is the difference between `LocalFileSystem` and
> `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
> {code}






[jira] [Assigned] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23434:


Assignee: (was: Apache Spark)

> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> When Spark reads a file path (e.g. `people.json`), it warns with a misleading
> error message while looking up `people.json/_spark_metadata`. The root cause
> of this situation is the difference between `LocalFileSystem` and
> `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
> {code}






[jira] [Assigned] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23434:


Assignee: Apache Spark

> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> When Spark reads a file path (e.g. `people.json`), it warns with a misleading
> error message while looking up `people.json/_spark_metadata`. The root cause
> of this situation is the difference between `LocalFileSystem` and
> `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
> {code}






[jira] [Created] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-02-14 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23434:
-

 Summary: Spark should not warn `metadata directory` for a HDFS 
file path
 Key: SPARK-23434
 URL: https://issues.apache.org/jira/browse/SPARK-23434
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.1.2, 2.3.0
Reporter: Dongjoon Hyun


When Spark reads a file path (e.g. `people.json`), it warns with a misleading
error message while looking up `people.json/_spark_metadata`. The root cause of
this situation is the difference between `LocalFileSystem` and
`DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but
`DistributedFileSystem.exists` raises an exception.

{code}
scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> 
spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
++---+
| age|   name|
++---+
|null|Michael|
|  30|   Andy|
|  19| Justin|
++---+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
metadata directory.
res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
{code}






[jira] [Commented] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM in the driver

2018-02-14 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365065#comment-16365065
 ] 

Kazuaki Ishizaki commented on SPARK-23427:
--

We would appreciate it if you could post a program that can reproduce this 
issue.

> spark.sql.autoBroadcastJoinThreshold causing OOM  in the driver 
> 
>
> Key: SPARK-23427
> URL: https://issues.apache.org/jira/browse/SPARK-23427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: SPARK 2.0 version
>Reporter: Dhiraj
>Priority: Critical
>
> We are facing an issue around the value of spark.sql.autoBroadcastJoinThreshold.
> With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), we see driver
> memory usage stay flat.
> With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K) we see driver memory usage
> go up at a rate depending on the size of the autoBroadcastJoinThreshold, and we
> get an OOM exception. The problem is that the memory used by autoBroadcast is
> not being freed up in the driver.
> The application imports Oracle tables as master DataFrames, which are
> persisted. Each job applies filters to these tables and then registers them as
> temp view tables. SQL queries are then used to process the data further. At the
> end, all the intermediate DataFrames are unpersisted.
>  
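
In the spirit of that request, a hypothetical sketch of the workflow the
reporter describes (the JDBC URL, table name, and filter are placeholders, not
the reporter's actual program):

{code:java}
// Hypothetical workload: persisted master frame, per-job filter registered
// as a temp view, SQL on top, unpersist at the end.
val master = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/SVC")  // placeholder
  .option("dbtable", "MASTER_TABLE")                      // placeholder
  .load()
  .persist()

val filtered = master.filter("region = 'EU'")             // placeholder filter
filtered.createOrReplaceTempView("tempViewTable")

// With autoBroadcastJoinThreshold > 0 the planner may broadcast a side of
// this join; the report is that the broadcast's driver-side memory is not
// freed across jobs.
val out = spark.sql(
  "SELECT t.* FROM tempViewTable t JOIN tempViewTable u ON t.id = u.id")
out.count()

filtered.unpersist()
{code}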






[jira] [Created] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage

2018-02-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-23433:


 Summary: java.lang.IllegalStateException: more than one active 
taskSet for stage
 Key: SPARK-23433
 URL: https://issues.apache.org/jira/browse/SPARK-23433
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Shixiong Zhu


The following error thrown by the DAGScheduler stopped the cluster:

{code}
18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.lang.IllegalStateException: more than one active taskSet for stage 
7580621: 7580621.2,7580621.1
at 
org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
at 
org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

{code}






[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/14/18 11:57 PM:
---

Hi [~igor.berman]. Looking at the code again, I think that when there is a
status update, the taskIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but taskIds are; maybe something else is not working.
Do you have a log from the time of the issue to attach?

The test you have is OK, but I think it does not trigger deletion of the tasks
in the case of a failure. I think you need to update the backend with the task
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

 

The following test passes:
{code}
  test("max executors registered stops to accept offers when dynamic allocation 
enabled") {
setBackend(Map(
  "spark.dynamicAllocation.maxExecutors" -> "1",
  "spark.dynamicAllocation.enabled" -> "true",
  "spark.dynamicAllocation.testing" -> "true"))

backend.doRequestTotalExecutors(1)

val (mem, cpu) = (backend.executorMemory(sc), 4)

val offer1 = createOffer("o1", "s1", mem, cpu)
backend.resourceOffers(driver, List(offer1).asJava)
verifyTaskLaunched(driver, "o1")

backend.doKillExecutors(List("0"))
verify(driver, times(1)).killTask(createTaskId("0"))

val status = createTaskStatus("0", "s1", TaskState.TASK_KILLED)
backend.statusUpdate(driver, status)

val offer2 = createOffer("o2", "s2", mem, cpu)
backend.resourceOffers(driver, List(offer2).asJava)
//verify(driver, times(1)).declineOffer(offer2.getId)
val taskInfos = verifyTaskLaunched(driver, "o2")
assert(taskInfos.length == 1)
  }{code}
 

By the way, the check against the upper limit on the number of executors that
you are referring to is defined in two different places:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added with
SPARK-16944. Essentially they check the same thing.

 


was (Author: skonto):
Hi [~igor.berman]. Looking at the code again, I think that when there is a
status update, the taskIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but taskIds are; maybe something else is not working.
Do you have a log from the time of the issue to attach?

The test you have is OK, but I think it does not trigger deletion of the tasks
in the case of a failure. I think you need to update the backend with the task
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

 

The following test passes:

{code:java}
test("max executors registered stops to accept offers when dynamic allocation enabled") {
  setBackend(Map(
    "spark.dynamicAllocation.maxExecutors" -> "1",
    "spark.dynamicAllocation.enabled" -> "true",
    "spark.dynamicAllocation.testing" -> "true"))

  backend.doRequestTotalExecutors(1)

  val (mem, cpu) = (backend.executorMemory(sc), 4)

  val offer1 = createOffer("o1", "s1", mem, cpu)
  backend.resourceOffers(driver, List(offer1).asJava)
  verifyTaskLaunched(driver, "o1")

  backend.doKillExecutors(List("0"))
  verify(driver, times(1)).killTask(createTaskId("0"))

  val status = createTaskStatus("0", "s1", TaskState.TASK_KILLED)
  backend.statusUpdate(driver, status)

  val offer2 = createOffer("o2", "s2", mem, cpu)
  backend.resourceOffers(driver, List(offer2).asJava)
  // verify(driver, times(1)).declineOffer(offer2.getId)
  val taskInfos = verifyTaskLaunched(driver, "o2")
  assert(taskInfos.length == 1)
}
{code}

By the way, the check against the upper limit on the number of executors that
you are referring to is defined in two different places:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added with
SPARK-16944.

[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/14/18 11:57 PM:
---

Hi [~igor.berman]. Looking at the code again, I think that when there is a
status update, the taskIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but taskIds are; maybe something else is not working.
Do you have a log from the time of the issue to attach?

The test you have is OK, but I think it does not trigger deletion of the tasks
in the case of a failure. I think you need to update the backend with the task
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

The following test passes:
{code:java}
  test("max executors registered stops to accept offers when dynamic allocation 
enabled") {
setBackend(Map(
  "spark.dynamicAllocation.maxExecutors" -> "1",
  "spark.dynamicAllocation.enabled" -> "true",
  "spark.dynamicAllocation.testing" -> "true"))

backend.doRequestTotalExecutors(1)

val (mem, cpu) = (backend.executorMemory(sc), 4)

val offer1 = createOffer("o1", "s1", mem, cpu)
backend.resourceOffers(driver, List(offer1).asJava)
verifyTaskLaunched(driver, "o1")

backend.doKillExecutors(List("0"))
verify(driver, times(1)).killTask(createTaskId("0"))

val status = createTaskStatus("0", "s1", TaskState.TASK_KILLED)
backend.statusUpdate(driver, status)

val offer2 = createOffer("o2", "s2", mem, cpu)
backend.resourceOffers(driver, List(offer2).asJava)
//verify(driver, times(1)).declineOffer(offer2.getId)
val taskInfos = verifyTaskLaunched(driver, "o2")
assert(taskInfos.length == 1)
  }{code}
 

By the way, the check against the upper limit on the number of executors that
you are referring to is defined in two different places:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added with
SPARK-16944. Essentially they check the same thing.

 


was (Author: skonto):
Hi [~igor.berman]. Looking at the code again, I think that when there is a
status update, the taskIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but taskIds are; maybe something else is not working.
Do you have a log from the time of the issue to attach?

The test you have is OK, but I think it does not trigger deletion of the tasks
in the case of a failure. I think you need to update the backend with the task
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

 

The following test passes:
{code}
  test("max executors registered stops to accept offers when dynamic allocation 
enabled") {
setBackend(Map(
  "spark.dynamicAllocation.maxExecutors" -> "1",
  "spark.dynamicAllocation.enabled" -> "true",
  "spark.dynamicAllocation.testing" -> "true"))

backend.doRequestTotalExecutors(1)

val (mem, cpu) = (backend.executorMemory(sc), 4)

val offer1 = createOffer("o1", "s1", mem, cpu)
backend.resourceOffers(driver, List(offer1).asJava)
verifyTaskLaunched(driver, "o1")

backend.doKillExecutors(List("0"))
verify(driver, times(1)).killTask(createTaskId("0"))

val status = createTaskStatus("0", "s1", TaskState.TASK_KILLED)
backend.statusUpdate(driver, status)

val offer2 = createOffer("o2", "s2", mem, cpu)
backend.resourceOffers(driver, List(offer2).asJava)
//verify(driver, times(1)).declineOffer(offer2.getId)
val taskInfos = verifyTaskLaunched(driver, "o2")
assert(taskInfos.length == 1)
  }{code}
 

Btw the behavior for checking the upper limit of the num of the executors you 
are referring to is defined in different places: 

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added with 
SPARK-16944. Essentially they check the same thing. 

[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/14/18 11:56 PM:
---

Hi [~igor.berman]. Looking at the code again, I think that when there is a 
status update, the task ids of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task ids are; maybe something else is not working. 
Do you have a log from the time of the issue to attach?

The test you have is OK, but I think it does not trigger deletion of the tasks 
in the case of a failure. I think you need to update the backend with the task 
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

 

The following test passes:

{code:scala}
  test("max executors registered stops to accept offers when dynamic allocation enabled") {
    setBackend(Map(
      "spark.dynamicAllocation.maxExecutors" -> "1",
      "spark.dynamicAllocation.enabled" -> "true",
      "spark.dynamicAllocation.testing" -> "true"))

    backend.doRequestTotalExecutors(1)

    val (mem, cpu) = (backend.executorMemory(sc), 4)

    val offer1 = createOffer("o1", "s1", mem, cpu)
    backend.resourceOffers(driver, List(offer1).asJava)
    verifyTaskLaunched(driver, "o1")

    backend.doKillExecutors(List("0"))
    verify(driver, times(1)).killTask(createTaskId("0"))

    val status = createTaskStatus("0", "s1", TaskState.TASK_KILLED)
    backend.statusUpdate(driver, status)

    val offer2 = createOffer("o2", "s2", mem, cpu)
    backend.resourceOffers(driver, List(offer2).asJava)
    // verify(driver, times(1)).declineOffer(offer2.getId)
    val taskInfos = verifyTaskLaunched(driver, "o2")
    assert(taskInfos.length == 1)
  }{code}

Btw, the behavior you are referring to, checking the upper limit on the number 
of executors, is defined in two different places: 

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added with 
SPARK-16944. Essentially they check the same thing. 

 


was (Author: skonto):
Hi [~igor.berman]. Looking at the code again, I think that when there is a 
status update, the task ids of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task ids are; maybe something else is not working. 
Do you have a log from the time of the issue to attach?

The test you have is OK, but I think it does not trigger deletion of the tasks 
in the case of a failure. I think you need to update the backend with the task 
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

 

The following test passes:

{code}
  test("max executors registered stops to accept offers when dynamic allocation enabled") {
    setBackend(Map(
      "spark.dynamicAllocation.maxExecutors" -> "1",
      "spark.dynamicAllocation.enabled" -> "true",
      "spark.dynamicAllocation.testing" -> "true"))

    backend.doRequestTotalExecutors(1)

    val (mem, cpu) = (backend.executorMemory(sc), 4)

    val offer1 = createOffer("o1", "s1", mem, cpu)
    backend.resourceOffers(driver, List(offer1).asJava)
    verifyTaskLaunched(driver, "o1")

    backend.doKillExecutors(List("0"))
    verify(driver, times(1)).killTask(createTaskId("0"))

    val status = createTaskStatus("0", "s1", TaskState.TASK_KILLED)
    backend.statusUpdate(driver, status)

    val offer2 = createOffer("o2", "s2", mem, cpu)
    backend.resourceOffers(driver, List(offer2).asJava)
    // verify(driver, times(1)).declineOffer(offer2.getId)
    val taskInfos = verifyTaskLaunched(driver, "o2")
    assert(taskInfos.length == 1)
  }{code}

Btw, the behavior you are referring to, checking the upper limit on the number 
of executors, is defined in two different places: 

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added with 
SPARK-16944. Essentially they check the same thing. 

 


[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-14 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364979#comment-16364979
 ] 

Edwina Lu commented on SPARK-23206:
---

After discussion with [~cltlfcjin] and the eBay Hadoop team, we would like to 
coordinate our efforts in adding more executor memory metrics. I've added some 
subtasks, and will follow up with pull requests. I think there are some design 
differences – looking forward to hearing more details. Comments and suggestions 
are very welcome.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
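
To make the wastage formula above concrete, a small illustrative calculation 
follows; the numbers are invented for the example and are not taken from the 
cluster data described in the ticket:
{code:scala}
// Illustrative only: wasted = numExecutors * (spark.executor.memory - max JVM used memory)
val executorMemoryGb = 8.0                     // assumed spark.executor.memory
val maxJvmUsedGb     = 0.35 * executorMemoryGb // assumed 35% average utilization = 2.8 GB
val numExecutors     = 128                     // assumed executor count
val wastedGb = numExecutors * (executorMemoryGb - maxJvmUsedGb) // 128 * 5.2 = 665.6 GB unused
{code}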



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/14/18 11:54 PM:
---

Hi [~igor.berman]. Looking at the code again, I think that when there is a 
status update, the task ids of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task ids are; maybe something else is not working. 
Do you have a log from the time of the issue to attach?

The test you have is OK, but I think it does not trigger deletion of the tasks 
in the case of a failure. I think you need to update the backend with the task 
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

 

The following test passes:

{code}
  test("max executors registered stops to accept offers when dynamic allocation enabled") {
    setBackend(Map(
      "spark.dynamicAllocation.maxExecutors" -> "1",
      "spark.dynamicAllocation.enabled" -> "true",
      "spark.dynamicAllocation.testing" -> "true"))

    backend.doRequestTotalExecutors(1)

    val (mem, cpu) = (backend.executorMemory(sc), 4)

    val offer1 = createOffer("o1", "s1", mem, cpu)
    backend.resourceOffers(driver, List(offer1).asJava)
    verifyTaskLaunched(driver, "o1")

    backend.doKillExecutors(List("0"))
    verify(driver, times(1)).killTask(createTaskId("0"))

    val status = createTaskStatus("0", "s1", TaskState.TASK_KILLED)
    backend.statusUpdate(driver, status)

    val offer2 = createOffer("o2", "s2", mem, cpu)
    backend.resourceOffers(driver, List(offer2).asJava)
    // verify(driver, times(1)).declineOffer(offer2.getId)
    val taskInfos = verifyTaskLaunched(driver, "o2")
    assert(taskInfos.length == 1)
  }{code}

Btw, the behavior you are referring to, checking the upper limit on the number 
of executors, is defined in two different places: 

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added with 
SPARK-16944. Essentially they check the same thing. 

 


was (Author: skonto):
Hi [~igor.berman]. Looking at the code again, I think that when there is a 
status update, the task ids of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task ids are; maybe something else is not working. 
Do you have a log from the time of the issue to attach?

The test you have is OK, but I think it does not trigger deletion of the tasks 
in the case of a failure. I think you need to update the backend with the task 
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

Btw, the behavior you are referring to, checking the upper limit on the number 
of executors, is defined in two different places: 

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added with 
SPARK-16944. Essentially they check the same thing. 

 

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting the maximum 
> number of executors via spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver with a cyclic pattern of resource 
> consumption (with some idle time in between); due to dynamic allocation it 
> receives offers and then releases them after the current chunk of work is 
> processed.
> Since 

[jira] [Commented] (SPARK-23430) Cannot sort "Executor ID" or "Host" columns in the task table

2018-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364976#comment-16364976
 ] 

Apache Spark commented on SPARK-23430:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/20615

> Cannot sort "Executor ID" or "Host" columns in the task table
> -
>
> Key: SPARK-23430
> URL: https://issues.apache.org/jira/browse/SPARK-23430
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: regression
>
> Click the "Executor ID" or "Host" header in the task table and it will fail:
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID
>   at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1009)
>   at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:686)
>   at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61)
>   at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96)
>   at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:700)
>   at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.eclipse.jetty.server.Server.handle(Server.java:534)
>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>   at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23430) Cannot sort "Executor ID" or "Host" columns in the task table

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23430:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Cannot sort "Executor ID" or "Host" columns in the task table
> -
>
> Key: SPARK-23430
> URL: https://issues.apache.org/jira/browse/SPARK-23430
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: regression
>
> Click the "Executor ID" or "Host" header in the task table and it will fail:
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID
>   at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1009)
>   at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:686)
>   at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61)
>   at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96)
>   at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:700)
>   at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.eclipse.jetty.server.Server.handle(Server.java:534)
>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>   at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23430) Cannot sort "Executor ID" or "Host" columns in the task table

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23430:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Cannot sort "Executor ID" or "Host" columns in the task table
> -
>
> Key: SPARK-23430
> URL: https://issues.apache.org/jira/browse/SPARK-23430
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: regression
>
> Click the "Executor ID" or "Host" header in the task table and it will fail:
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID
>   at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1009)
>   at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:686)
>   at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61)
>   at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96)
>   at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:700)
>   at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.eclipse.jetty.server.Server.handle(Server.java:534)
>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>   at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23432) Expose executor memory metrics in the web UI for executors and stages

2018-02-14 Thread Edwina Lu (JIRA)
Edwina Lu created SPARK-23432:
-

 Summary: Expose executor memory metrics in the web UI for 
executors and stages
 Key: SPARK-23432
 URL: https://issues.apache.org/jira/browse/SPARK-23432
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Edwina Lu


Add the new memory metrics (jvmUsedMemory, executionMemory, storageMemory, and 
unifiedMemory) to the executors tab, in the summary and for each executor. 

Also add the new memory metrics to the stages tab. Add a new Summary Metrics 
for Executors table, which will show quantile values for the executor level 
metrics. Also add columns for the new metrics to the Aggregated Metrics by 
Executor table.
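
As a hedged sketch of the kind of computation such a summary table implies 
(the names and the simple quantile rule below are assumptions for illustration):
{code:scala}
// Quantiles over per-executor peak values, as a Summary Metrics table would show.
def quantile(sorted: Seq[Long], q: Double): Long =
  sorted(math.min(sorted.size - 1, (q * sorted.size).toInt))

val peakJvmUsedMb = Seq(512L, 768L, 1024L, 2048L).sorted  // assumed per-executor peaks (MB)
val summary = Seq(0.0, 0.25, 0.5, 0.75, 1.0).map(q => q -> quantile(peakJvmUsedMb, q))
// e.g. min = 512, median = 1024, max = 2048
{code}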

This is a subtask for SPARK-23206. Please refer to the design doc for that 
ticket for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23431) Expose the new executor memory metrics at the stage level

2018-02-14 Thread Edwina Lu (JIRA)
Edwina Lu created SPARK-23431:
-

 Summary: Expose the new executor memory metrics at the stage level
 Key: SPARK-23431
 URL: https://issues.apache.org/jira/browse/SPARK-23431
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Edwina Lu


Collect and show the new executor memory metrics for each stage, to provide 
more information on how memory is used per stage.

Modify the AppStatusListener to track the peak values for JVM used memory, 
execution memory, storage memory, and unified memory for each executor for each 
stage.
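
A minimal sketch of that peak tracking, with assumed names rather than the 
actual AppStatusListener API:
{code:scala}
import scala.collection.mutable

// Running peaks per (stageId, executorId); updated on each metrics update.
case class MemPeaks(var jvmUsed: Long = 0L, var execution: Long = 0L,
                    var storage: Long = 0L, var unified: Long = 0L)

val peaks = mutable.Map.empty[(Int, String), MemPeaks]

def onMetricsUpdate(stageId: Int, execId: String,
                    jvm: Long, exec: Long, stor: Long): Unit = {
  val p = peaks.getOrElseUpdate((stageId, execId), MemPeaks())
  p.jvmUsed   = math.max(p.jvmUsed, jvm)
  p.execution = math.max(p.execution, exec)
  p.storage   = math.max(p.storage, stor)
  p.unified   = math.max(p.unified, exec + stor) // peak of the combined region
}
{code}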

Add the peak values for the metrics to the stages REST API. Also add a new 
stageSummary REST API, which will return executor summary metrics for a 
specified stage:
{code:java}
curl http://:18080/api/v1/applicationsexecutorSummary{code}
This is a subtask for SPARK-23206. Please refer to the design doc for that 
ticket for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API

2018-02-14 Thread Edwina Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edwina Lu updated SPARK-23429:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-23206

> Add executor memory metrics to heartbeat and expose in executors REST API
> -
>
> Key: SPARK-23429
> URL: https://issues.apache.org/jira/browse/SPARK-23429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
>
> Add new executor level memory metrics ( jvmUsedMemory, executionMemory, 
> storageMemory, and unifiedMemory), and expose these via the executors REST 
> API. This information will help provide insight into how executor and driver 
> JVM memory is used, and for the different memory regions. It can be used to 
> help determine good values for spark.executor.memory, spark.driver.memory, 
> spark.memory.fraction, and spark.memory.storageFraction.
> Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and 
> storageMemory. This will track the memory usage at the executor level. The 
> new ExecutorMetrics will be sent by executors to the driver as part of the 
> Heartbeat. A heartbeat will be added for the driver as well, to collect these 
> metrics for the driver.
> Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there 
> is a new peak value for one of the memory metrics for an executor and stage. 
> Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize 
> additional logging. Analysis on a set of sample applications showed an 
> increase of 0.25% in the size of the Spark history log, with this approach.
> Modify the AppStatusListener to collect snapshots of peak values for each 
> memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and 
> storageMemory, and list of active stages.
> Add the new memory metrics (snapshots of peak values for each memory metric) 
> to the executors REST API.
> This is a subtask for SPARK-23206. Please refer to the design doc for that 
> ticket for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23430) Cannot sort "Executor ID" or "Host" columns in the task table

2018-02-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-23430:


 Summary: Cannot sort "Executor ID" or "Host" columns in the task 
table
 Key: SPARK-23430
 URL: https://issues.apache.org/jira/browse/SPARK-23430
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Click the "Executor ID" or "Host" header in the task table and it will fail:
{code}
java.lang.IllegalArgumentException: Invalid sort column: Executor ID
at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1009)
at 
org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:686)
at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61)
at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96)
at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:700)
at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293)
at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282)
at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at 
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API

2018-02-14 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364967#comment-16364967
 ] 

Edwina Lu commented on SPARK-23429:
---

Subtask of SPARK-23206

> Add executor memory metrics to heartbeat and expose in executors REST API
> -
>
> Key: SPARK-23429
> URL: https://issues.apache.org/jira/browse/SPARK-23429
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
>
> Add new executor level memory metrics ( jvmUsedMemory, executionMemory, 
> storageMemory, and unifiedMemory), and expose these via the executors REST 
> API. This information will help provide insight into how executor and driver 
> JVM memory is used, and for the different memory regions. It can be used to 
> help determine good values for spark.executor.memory, spark.driver.memory, 
> spark.memory.fraction, and spark.memory.storageFraction.
> Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and 
> storageMemory. This will track the memory usage at the executor level. The 
> new ExecutorMetrics will be sent by executors to the driver as part of the 
> Heartbeat. A heartbeat will be added for the driver as well, to collect these 
> metrics for the driver.
> Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there 
> is a new peak value for one of the memory metrics for an executor and stage. 
> Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize 
> additional logging. Analysis on a set of sample applications showed an 
> increase of 0.25% in the size of the Spark history log, with this approach.
> Modify the AppStatusListener to collect snapshots of peak values for each 
> memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and 
> storageMemory, and list of active stages.
> Add the new memory metrics (snapshots of peak values for each memory metric) 
> to the executors REST API.
> This is a subtask for SPARK-23206. Please refer to the design doc for that 
> ticket for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API

2018-02-14 Thread Edwina Lu (JIRA)
Edwina Lu created SPARK-23429:
-

 Summary: Add executor memory metrics to heartbeat and expose in 
executors REST API
 Key: SPARK-23429
 URL: https://issues.apache.org/jira/browse/SPARK-23429
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Edwina Lu


Add new executor level memory metrics ( jvmUsedMemory, executionMemory, 
storageMemory, and unifiedMemory), and expose these via the executors REST API. 
This information will help provide insight into how executor and driver JVM 
memory is used, and for the different memory regions. It can be used to help 
determine good values for spark.executor.memory, spark.driver.memory, 
spark.memory.fraction, and spark.memory.storageFraction.

Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and 
storageMemory. This will track the memory usage at the executor level. The new 
ExecutorMetrics will be sent by executors to the driver as part of the 
Heartbeat. A heartbeat will be added for the driver as well, to collect these 
metrics for the driver.
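
A minimal sketch of such a payload, assuming a plain case class shape (the 
field names come from this ticket, but the exact class and heartbeat wiring 
are assumptions):
{code:scala}
// Sent from executors (and the driver) with the heartbeat; values in bytes.
case class ExecutorMetrics(
    jvmUsedMemory: Long,    // JVM heap currently in use
    executionMemory: Long,  // shuffles, joins, sorts, aggregations
    storageMemory: Long) {  // cached/propagated internal data
  def unifiedMemory: Long = executionMemory + storageMemory
}
{code}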

Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there is 
a new peak value for one of the memory metrics for an executor and stage. Only 
the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize 
additional logging. Analysis on a set of sample applications showed an increase 
of 0.25% in the size of the Spark history log, with this approach.

Modify the AppStatusListener to collect snapshots of peak values for each 
memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and 
storageMemory, and list of active stages.

Add the new memory metrics (snapshots of peak values for each memory metric) to 
the executors REST API.

This is a subtask for SPARK-23206. Please refer to the design doc for that 
ticket for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-14 Thread Edwina Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edwina Lu updated SPARK-23206:
--
Issue Type: Umbrella  (was: Improvement)

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail

2018-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364956#comment-16364956
 ] 

Apache Spark commented on SPARK-23094:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20614

> Json Readers choose wrong encoding when bad records are present and fail
> 
>
> Key: SPARK-23094
> URL: https://issues.apache.org/jira/browse/SPARK-23094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 2.3.0
>
>
> The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser 
> code paths for expressions but not the readers. We should also cover reader 
> code paths reading files with bad characters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364954#comment-16364954
 ] 

Xiao Li commented on SPARK-23410:
-

This is a regression we need to resolve in Spark 2.3. [~maxgekk], please submit 
a separate PR to fix it. I am just reverting 
https://github.com/apache/spark/pull/20302 now. Thanks!

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Blocker
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. We need to give users back 
> the ability to read json files in a specified charset and/or to detect the 
> charset automatically, as before.
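
For reference, a hedged sketch of the jackson auto-detection the description 
refers to: a byte-based parser sniffs UTF-8/16/32 from the leading bytes, so 
non-UTF-8 input used to parse fine (illustrative snippet, not Spark code):
{code:scala}
import com.fasterxml.jackson.core.JsonFactory

val utf16Bytes = """{"firstName":"Chris"}""".getBytes("UTF-16") // BOM + UTF-16
val parser = new JsonFactory().createParser(utf16Bytes) // encoding detected from bytes
while (parser.nextToken() != null) {} // parses despite the non-UTF-8 input
{code}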



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23410:

Target Version/s: 2.3.0

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. We need to give users back 
> the ability to read json files in a specified charset and/or to detect the 
> charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23410:

Priority: Blocker  (was: Major)

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Blocker
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. We need to give users back 
> the ability to read json files in a specified charset and/or to detect the 
> charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364929#comment-16364929
 ] 

Bruce Robbins edited comment on SPARK-23410 at 2/14/18 11:17 PM:
-

{quote}I am working on a fix, just in case
{quote}
Oh, OK, this one is already in progress then.


was (Author: bersprockets):
bq. I am working on a fix, just in case

Oh, OK, this one is already in progress already then.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. We need to give users back 
> the ability to read json files in a specified charset and/or to detect the 
> charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364929#comment-16364929
 ] 

Bruce Robbins commented on SPARK-23410:
---

bq. I am working on a fix, just in case

Oh, OK, this one is already in progress already then.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. We need to give users back 
> the ability to read json files in a specified charset and/or to detect the 
> charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23428) Revert [SPARK-23094] Fix invalid character handling in JsonDataSource

2018-02-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23428.
-
Resolution: Invalid

> Revert [SPARK-23094]  Fix invalid character handling in JsonDataSource
> --
>
> Key: SPARK-23428
> URL: https://issues.apache.org/jira/browse/SPARK-23428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {noformat}
>   test("invalid json with leading nulls - from dataset") {
> import testImplicits._
> withTempDir { tempDir =>
>   val path = tempDir.getAbsolutePath
>   Seq("""{"firstName":"Chris", "lastName":"Baird"}""",
> """{"firstName":"Doug", 
> "lastName":"Rood"}""").toDS().write.mode("overwrite").text(path)
>   val schema = new StructType().add("a", 
> IntegerType).add("_corrupt_record", StringType)
>   val jsonDF = spark.read.schema(schema).option("mode", 
> "DROPMALFORMED").json(path)
>   checkAnswer(jsonDF, Seq(
> Row("Chris", "Baird"), Row("Doug", "Rood")
>   ))
> }
>   }
> {noformat}
> After this PR it returns a wrong answer. 
> {noformat}
> [null,null]
> [null,null]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23410:

Component/s: (was: Input/Output)
 SQL

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. We need to give users back 
> the ability to read json files in a specified charset and/or to detect the 
> charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364916#comment-16364916
 ] 

Bruce Robbins commented on SPARK-23410:
---

On Spark 2.2.1, I got the same result as you. But with those extraneous null 
rows, it still doesn't look right.

When I converted your file to utf-8, Spark 2.2.1 gave me:
{noformat}
+---------+--------+
|firstName|lastName|
+---------+--------+
|    Chris|   Baird|
|     Doug|    Rood|
+---------+--------+
{noformat}
No extraneous null rows.

On a previous version (Spark 2.1.2), I got
{noformat}
18/02/14 14:51:47 WARN JacksonParser: Found at least one malformed records 
(sample: ��{^@"^@f^@i^@r^@s^@t^@N^@a^@m^@e^@"^@:^@"^@C^@h^@r^@i^@s^@"^@,^@ 
^@"^@l^@a^@s^@t^@N^@a^@m^@e^@"^@:^@"^@B^@a^@i^@r^@d^@"^@}^@). The JSON reader will replace
all malformed records with placeholder null in current PERMISSIVE parser mode.
To find out which corrupted records have been replaced with null, please use the
default inferred schema instead of providing a custom schema.

Code example to print all malformed records (scala):
===
// The corrupted record exists in column _corrupt_record.
val parsedJson = spark.read.json("/path/to/json/file/test.json")


+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     null|    null|
+---------+--------+
{noformat}
On a 2.4 snapshot, I got:
{noformat}
+---------+--------+
|firstName|lastName|
+---------+--------+
|     null|    null|
|     null|    null|
|     null|    null|
|     null|    null|
|     null|    null|
+---------+--------+
{noformat}
 It worked *best* on Spark 2.2.1, but even there it still wasn't right.
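
(A hedged sketch of the repro behind the outputs above; the file path and the 
spark session are assumptions:)
{code:scala}
// Read the ticket's UTF-16 attachment with schema inference and show the rows,
// matching the tables pasted above.
val df = spark.read.json("/path/to/utf16WithBOM.json")
df.show()
{code}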

 

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. We need to give users back 
> the ability to read json files in a specified charset and/or to detect the 
> charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364889#comment-16364889
 ] 

Maxim Gekk commented on SPARK-23410:


I am working on a fix, just in case

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. We need to give users back 
> the ability to read json files in a specified charset and/or to detect the 
> charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23406) Stream-stream self joins does not work

2018-02-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23406.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 20598
[https://github.com/apache/spark/pull/20598]
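
For reference, a minimal sketch of a stream-stream self join on the rate 
source, following the repro in the description (with the fix, analysis 
succeeds):
{code:scala}
import spark.implicits._

val rate = spark.readStream.format("rate").load()

// Join two projections of the same streaming source on a derived key.
val joined = rate.withColumn("key", $"value" / 10)
  .join(rate.withColumn("key", $"value" / 5), "key")
{code}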

> Stream-stream self joins do not work
> --
>
> Key: SPARK-23406
> URL: https://issues.apache.org/jira/browse/SPARK-23406
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently stream-stream self join throws the following error
> {code}
> val df = spark.readStream.format("rate").option("numRowsPerSecond", 
> "1").option("numPartitions", "1").load()
> display(df.withColumn("key", $"value" / 10).join(df.withColumn("key", 
> $"value" / 5), "key"))
> {code}
> error:
> {code}
> Failure when resolving conflicting references in Join:
> 'Join UsingJoin(Inner,List(key))
> :- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(10 
> as double)) AS key#855]
> : +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> +- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(5 
> as double)) AS key#860]
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> Conflicting attributes: timestamp#850,value#851L
> ;;
> 'Join UsingJoin(Inner,List(key))
> :- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(10 
> as double)) AS key#855]
> : +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> +- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(5 
> as double)) AS key#860]
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:101)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:378)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:98)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:148)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:98)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:101)
>  at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:71)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:73)
>  at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3063)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:787)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:756)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:731)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364875#comment-16364875
 ] 

Maxim Gekk commented on SPARK-23410:


I attached the file that I tested with on 2.2.1:

{code:scala}
import org.apache.spark.sql.types._
val schema = new StructType().add("firstName", StringType).add("lastName", 
StringType)
spark.read.schema(schema).json("utf16WithBOM.json").show
{code}

{code}
+---------+--------+
|firstName|lastName|
+---------+--------+
|    Chris|   Baird|
|     null|    null|
|     Doug|    Rood|
|     null|    null|
|     null|    null|
+---------+--------+
{code}

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which could read JSON files in UTF-16, UTF-32, and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23410:
---
Attachment: utf16WithBOM.json

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which could read JSON files in UTF-16, UTF-32, and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/14/18 10:27 PM:
---

Hi [~igor.berman]. Looking at the code again, I think the task IDs of dead 
tasks are removed when a status update arrives:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed, but task IDs are, so maybe something else is not 
working. Do you have a log from the time of the issue that you could attach?

The test you have is OK, but I think it does not trigger deletion of the tasks 
in the failure case. I think you need to update the backend with the task 
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

Btw, the check of the upper limit on the number of executors that you are 
referring to is defined in two different places:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added in 
SPARK-16944. Essentially they check the same thing.
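
A sketch of what updating the backend with the task status could look like in 
the suite (assuming its createTaskStatus helper and the Mesos TaskState 
protobuf enum, as in the linked lines; treat the names as approximate):
{code:scala}
// After killing the executor, deliver the terminal status update so the
// backend drops the task ID from its per-slave bookkeeping.
val status = createTaskStatus("0", "s1", TaskState.TASK_KILLED)
backend.statusUpdate(driver, status)
{code}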

 


was (Author: skonto):
Hi [~igor.berman]. Looking at the code again I think when there is a status 
update tasksIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task Ids are, maybe something else is not working. 
Do you have a log at the time of the issue to attach?

The test you have is ok but I think it does not trigger deletion for the tasks 
in the case of a failure. I think you need to update the backend with task 
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

 

Btw the behavior for checking the upper limit of the num of the executors you 
are referring to is defined in different places: 

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter exists for very long time. The former was added with Spark-16944. 
Essentially they do check the same thing. 

 

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors 

[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/14/18 10:26 PM:
---

Hi [~igor.berman]. Looking at the code again, I think the task IDs of dead 
tasks are removed when a status update arrives:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed, but task IDs are, so maybe something else is not 
working. Do you have a log from the time of the issue that you could attach?

The test you have is OK, but I think it does not trigger deletion of the tasks 
in the failure case. I think you need to update the backend with the task 
status:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala#L102-L103]

Btw, the check of the upper limit on the number of executors that you are 
referring to is defined in two different places:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added in 
SPARK-16944. Essentially they check the same thing.

 


was (Author: skonto):
Hi [~igor.berman]. Looking at the code again I think when there is a status 
update tasksIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task Ids are, maybe something else is not working. 
Do you have a log at the time of the issue to attach?

The test you have is ok but I suspect it does not trigger deletion for the 
tasks in the case of a failure. I will check it.

Btw the behavior for checking the upper limit of the num of the executors you 
are referring to is defined in different places: 

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter exists for very long time. The former was added with Spark-16944. 
Essentially they do check the same thing. 

 

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> 

[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/14/18 10:23 PM:
---

Hi [~igor.berman]. Looking at the code again, I think the task IDs of dead 
tasks are removed when a status update arrives:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed, but task IDs are, so maybe something else is not 
working. Do you have a log from the time of the issue that you could attach?

The test you have is OK, but I suspect it does not trigger deletion of the 
tasks in the failure case. I will check it.

Btw, the check of the upper limit on the number of executors that you are 
referring to is defined in two different places:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L354]

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]

The latter has existed for a very long time; the former was added in 
SPARK-16944. Essentially they check the same thing.

 


was (Author: skonto):
Hi [~igor.berman]. Looking at the code again I think when there is a status 
update tasksIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task Ids are, maybe something else is not working. 
Do you have a log at the time of the issue to attach?

The test you have is ok but I suspect it does not trigger deletion for the 
tasks in the case of a failure. I will check it.

Btw the behavior for checking the upper limit of the num of the executors you 
are referring to was added here: spark-16944.

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> 

[jira] [Assigned] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23368:


Assignee: (was: Apache Spark)

> Avoid unnecessary Exchange or Sort after projection
> ---
>
> Key: SPARK-23368
> URL: https://issues.apache.org/jira/browse/SPARK-23368
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Priority: Minor
>
> After column rename projection, the ProjectExec's outputOrdering and 
> outputPartitioning should reflect the projected columns as well. For example,
> {code:java}
> SELECT b1
> FROM (
> SELECT a a1, b b1
> FROM testData2
> ORDER BY a
> )
> ORDER BY a1{code}
> The inner query is ordered on a1 as well. If we had a rule to eliminate Sort 
> on sorted result, together with this fix, the order-by in the outer query 
> could have been optimized out.
>  
> Similarly, the below query
> {code:java}
> SELECT *
> FROM (
> SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> )
> JOIN testData2 t3
> ON a1 = t3.a{code}
> is equivalent to
> {code:java}
> SELECT *
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> JOIN testData2 t3
> ON t1.a = t3.a{code}
> so the unnecessary sorting and hash-partitioning that have been optimized 
> out for the second query should have been eliminated in the first query as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection

2018-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364867#comment-16364867
 ] 

Apache Spark commented on SPARK-23368:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/20613
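
As a rough sketch of the first example from the description in spark-shell 
terms (testData2 is assumed to exist, as in the Spark SQL test data; whether 
the extra Sort disappears also depends on the elimination rule mentioned 
there):
{code:scala}
import spark.implicits._

val inner = spark.table("testData2")
  .orderBy("a")
  .select($"a".as("a1"), $"b".as("b1"))

// Ordering by the alias a1: ideally ProjectExec's outputOrdering should
// already reflect a1, so this Sort could be elided.
inner.orderBy("a1").explain()
{code}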

> Avoid unnecessary Exchange or Sort after projection
> ---
>
> Key: SPARK-23368
> URL: https://issues.apache.org/jira/browse/SPARK-23368
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Priority: Minor
>
> After column rename projection, the ProjectExec's outputOrdering and 
> outputPartitioning should reflect the projected columns as well. For example,
> {code:java}
> SELECT b1
> FROM (
> SELECT a a1, b b1
> FROM testData2
> ORDER BY a
> )
> ORDER BY a1{code}
> The inner query is ordered on a1 as well. If we had a rule to eliminate Sort 
> on sorted result, together with this fix, the order-by in the outer query 
> could have been optimized out.
>  
> Similarly, the below query
> {code:java}
> SELECT *
> FROM (
> SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> )
> JOIN testData2 t3
> ON a1 = t3.a{code}
> is equivalent to
> {code:java}
> SELECT *
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> JOIN testData2 t3
> ON t1.a = t3.a{code}
> so the unnecessary sorting and hash-partitioning that have been optimized 
> out for the second query should have been eliminated in the first query as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364866#comment-16364866
 ] 

Bruce Robbins edited comment on SPARK-23410 at 2/14/18 10:21 PM:
-

[~maxgekk]

My simple test input of

{noformat}
 [{"field1": 10, "field2": "hello"},{"field1": 12, "field2": "byte"}]
{noformat}

is encoded like this (according to emacs hexl-mode):
{noformat}
0000: feff 005b 007b 0022 0066 0069 0065 006c  ...[.{.".f.i.e.l
0010: 0064 0031 0022 003a 0020 0031 0030 002c  .d.1.".:. .1.0.,
0020: 0020 0022 0066 0069 0065 006c 0064 0032  . .".f.i.e.l.d.2
0030: 0022 003a 0020 0022 0068 0065 006c 006c  .".:. .".h.e.l.l
0040: 006f 0022 007d 002c 007b 0022 0066 0069  .o.".}.,.{.".f.i
0050: 0065 006c 0064 0031 0022 003a 0020 0031  .e.l.d.1.".:. .1
0060: 0032 002c 0020 0022 0066 0069 0065 006c  .2.,. .".f.i.e.l
0070: 0064 0032 0022 003a 0020 0022 0062 0079  .d.2.".:. .".b.y
0080: 0074 0065 0022 007d 005d 000a            .t.e.".}.]..
{noformat}
 I just used iconv to convert the file from utf-8 to utf-16.

 


was (Author: bersprockets):
[~maxgekk]

My simple test input of

[{"field1": 10, "field2": "hello"},{"field1": 12, "field2": "byte"}]

is encoded like this (according to emacs hexl-mode):
{noformat}
0000: feff 005b 007b 0022 0066 0069 0065 006c  ...[.{.".f.i.e.l
0010: 0064 0031 0022 003a 0020 0031 0030 002c  .d.1.".:. .1.0.,
0020: 0020 0022 0066 0069 0065 006c 0064 0032  . .".f.i.e.l.d.2
0030: 0022 003a 0020 0022 0068 0065 006c 006c  .".:. .".h.e.l.l
0040: 006f 0022 007d 002c 007b 0022 0066 0069  .o.".}.,.{.".f.i
0050: 0065 006c 0064 0031 0022 003a 0020 0031  .e.l.d.1.".:. .1
0060: 0032 002c 0020 0022 0066 0069 0065 006c  .2.,. .".f.i.e.l
0070: 0064 0032 0022 003a 0020 0022 0062 0079  .d.2.".:. .".b.y
0080: 0074 0065 0022 007d 005d 000a            .t.e.".}.]..
{noformat}
 I just used iconv to convert the file from utf-8 to utf-16.

 

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which could read JSON files in UTF-16, UTF-32, and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23368:


Assignee: Apache Spark

> Avoid unnecessary Exchange or Sort after projection
> ---
>
> Key: SPARK-23368
> URL: https://issues.apache.org/jira/browse/SPARK-23368
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Minor
>
> After column rename projection, the ProjectExec's outputOrdering and 
> outputPartitioning should reflect the projected columns as well. For example,
> {code:java}
> SELECT b1
> FROM (
> SELECT a a1, b b1
> FROM testData2
> ORDER BY a
> )
> ORDER BY a1{code}
> The inner query is ordered on a1 as well. If we had a rule to eliminate Sort 
> on sorted result, together with this fix, the order-by in the outer query 
> could have been optimized out.
>  
> Similarly, the below query
> {code:java}
> SELECT *
> FROM (
> SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> )
> JOIN testData2 t3
> ON a1 = t3.a{code}
> is equivalent to
> {code:java}
> SELECT *
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> JOIN testData2 t3
> ON t1.a = t3.a{code}
> so the unnecessary sorting and hash-partitioning that have been optimized 
> out for the second query should have been eliminated in the first query as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364849#comment-16364849
 ] 

Maxim Gekk edited comment on SPARK-23410 at 2/14/18 10:20 PM:
--

[~bersprockets] does your JSON contain a BOM in the first 2 bytes? Jackson 
detects the encoding from the BOM: 
https://github.com/FasterXML/jackson-core/blob/2.6/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L110-L173
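
For a quick check of the BOM, a sketch in plain Scala (file name per the 
attachment):
{code:scala}
import java.nio.file.{Files, Paths}

val bytes = Files.readAllBytes(Paths.get("utf16WithBOM.json"))
// UTF-16BE begins with 0xFE 0xFF; UTF-16LE with 0xFF 0xFE.
val utf16be = bytes.length >= 2 && bytes(0) == 0xFE.toByte && bytes(1) == 0xFF.toByte
val utf16le = bytes.length >= 2 && bytes(0) == 0xFF.toByte && bytes(1) == 0xFE.toByte
{code}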


was (Author: maxgekk):
[~bersprockets] does your json contain BOM in the first 2 bytes?

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which could read JSON files in UTF-16, UTF-32, and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364866#comment-16364866
 ] 

Bruce Robbins commented on SPARK-23410:
---

[~maxgekk]

My simple test input of

[{"field1": 10, "field2": "hello"},{"field1": 12, "field2": "byte"}]

is encoded like this (according to emacs hexl-mode):
{noformat}
0000: feff 005b 007b 0022 0066 0069 0065 006c  ...[.{.".f.i.e.l
0010: 0064 0031 0022 003a 0020 0031 0030 002c  .d.1.".:. .1.0.,
0020: 0020 0022 0066 0069 0065 006c 0064 0032  . .".f.i.e.l.d.2
0030: 0022 003a 0020 0022 0068 0065 006c 006c  .".:. .".h.e.l.l
0040: 006f 0022 007d 002c 007b 0022 0066 0069  .o.".}.,.{.".f.i
0050: 0065 006c 0064 0031 0022 003a 0020 0031  .e.l.d.1.".:. .1
0060: 0032 002c 0020 0022 0066 0069 0065 006c  .2.,. .".f.i.e.l
0070: 0064 0032 0022 003a 0020 0022 0062 0079  .d.2.".:. .".b.y
0080: 0074 0065 0022 007d 005d 000a            .t.e.".}.]..
{noformat}
 I just used iconv to convert the file from utf-8 to utf-16.

 

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which could read JSON files in UTF-16, UTF-32, and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/14/18 10:17 PM:
---

Hi [~igor.berman]. Looking at the code again, I think the task IDs of dead 
tasks are removed when a status update arrives:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed, but task IDs are, so maybe something else is not 
working. Do you have a log from the time of the issue that you could attach?

The test you have is OK, but I suspect it does not trigger deletion of the 
tasks in the failure case. I will check it.

Btw, the check of the upper limit on the number of executors that you are 
referring to was added in SPARK-16944.


was (Author: skonto):
Hi [~igor.berman]. Looking at the code again I think when there is a status 
update tasksIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task Ids are, maybe something else is not working. 
Do you have a log at the time of the issue to attach?

The test you have is ok but does not trigger deletion for the tasks in the case 
of a failure.

Btw the behavior for checking the upper limit of the num of the executors you 
are referring to was added here: spark-16944.

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/14/18 10:17 PM:
---

Hi [~igor.berman]. Looking at the code again, I think the task IDs of dead 
tasks are removed when a status update arrives:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed, but task IDs are, so maybe something else is not 
working. Do you have a log from the time of the issue that you could attach?

The test you have is OK, but I suspect it does not trigger deletion of the 
tasks in the failure case. I will check it.

Btw, the check of the upper limit on the number of executors that you are 
referring to was added in SPARK-16944.


was (Author: skonto):
Hi [~igor.berman]. Looking at the code again I think when there is a status 
update tasksIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task Ids are, maybe something else is not working. 
Do you have a log at the time of the issue to attach?

The test you have is ok but I suspect it does not trigger deletion for the 
tasks in the case of a failure. i will check it.

Btw the behavior for checking the upper limit of the num of the executors you 
are referring to was added here: spark-16944.

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos commented on SPARK-23423:
-

Hi [~igor.berman]. Looking at the code again, I think the task IDs of dead 
tasks are removed when a status update arrives:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed, but task IDs are, so maybe something else is not 
working. Do you have a log from the time of the issue that you could attach?

The test you have is OK, but it does not trigger deletion of the tasks in the 
failure case.

Btw, the check of the upper limit on the number of executors that you are 
referring to was added in SPARK-16944.

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364849#comment-16364849
 ] 

Maxim Gekk commented on SPARK-23410:


[~bersprockets] does your JSON contain a BOM in the first 2 bytes?

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which could read JSON files in UTF-16, UTF-32, and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Igor Berman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Berman updated SPARK-23423:

Description: 
Hi

I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend when 
running on Mesos with dynamic allocation on and the maximum number of executors 
limited by spark.dynamicAllocation.maxExecutors.

Suppose we have a long-running driver with a cyclic pattern of resource 
consumption (with some idle time in between); due to dynamic allocation it 
receives offers and then releases them after the current chunk of work is 
processed.

At 
[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
 the backend compares numExecutors < executorLimit, where numExecutors is 
defined as slaves.values.map(_.taskIDs.size).sum, and slaves holds all slaves 
ever "met", i.e. both active and killed (see the comment at 
[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).

On the other hand, the number of task IDs should be updated via statusUpdate, 
but suppose this update is lost (in fact, I don't see any 'is now TASK_KILLED' 
log entries), so this executor count might be wrong.

I've created a test that "reproduces" this behavior; I'm not sure how good it 
is:
{code:java}
//MesosCoarseGrainedSchedulerBackendSuite
test("max executors registered stops to accept offers when dynamic allocation 
enabled") {
  setBackend(Map(
"spark.dynamicAllocation.maxExecutors" -> "1",
"spark.dynamicAllocation.enabled" -> "true",
"spark.dynamicAllocation.testing" -> "true"))

  backend.doRequestTotalExecutors(1)

  val (mem, cpu) = (backend.executorMemory(sc), 4)

  val offer1 = createOffer("o1", "s1", mem, cpu)
  backend.resourceOffers(driver, List(offer1).asJava)
  verifyTaskLaunched(driver, "o1")

  backend.doKillExecutors(List("0"))
  verify(driver, times(1)).killTask(createTaskId("0"))

  val offer2 = createOffer("o2", "s2", mem, cpu)
  backend.resourceOffers(driver, List(offer2).asJava)
  verify(driver, times(1)).declineOffer(offer2.getId)
}{code}
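
As a toy illustration of the counting concern described above (plain Scala 
with hypothetical simplified classes, not the real backend types):
{code:scala}
// maxExecutors = 1 in the scenario above.
case class Slave(taskIDs: Set[String])

val executorLimit = 1
val slaves = Map(
  "s1" -> Slave(Set("0"))) // executor was killed, but TASK_KILLED was lost

val numExecutors = slaves.values.map(_.taskIDs.size).sum // still 1
val canLaunch = numExecutors < executorLimit             // false: offers are declined
{code}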
 

 

Workaround: don't set maxExecutors when dynamic allocation is on.

 

Please advise.

Igor

Tagging you, friends, since you were the last to touch this piece of code and 
can probably advise something ([~vanzin], [~skonto], [~susanxhuynh]).

  was:
Hi

I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend when 
running on Mesos with dynamic allocation on and limiting number of max 
executors by spark.dynamicAllocation.maxExecutors.

Suppose we have long running driver that has cyclic pattern of resource 
consumption(with some idle times in between), due to dyn.allocation it receives 
offers and then releases them after current chunk of work processed.

Since at 
[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
 the backend compares numExecutors < executorLimit and 

numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
holds all executors ever "met", i.e. both active and killed (see comment 
[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
 

it means that after a while, when number of killed executors might be greater 
than maxExecutors, the application will decline any offer, thus stopping to work

 

I've created test that "reproduces" this behavior, not sure how good it is:
{code:java}
//MesosCoarseGrainedSchedulerBackendSuite
test("max executors registered stops to accept offers when dynamic allocation 
enabled") {
  setBackend(Map(
"spark.dynamicAllocation.maxExecutors" -> "1",
"spark.dynamicAllocation.enabled" -> "true",
"spark.dynamicAllocation.testing" -> "true"))

  backend.doRequestTotalExecutors(1)

  val (mem, cpu) = (backend.executorMemory(sc), 4)

  val offer1 = createOffer("o1", "s1", mem, cpu)
  backend.resourceOffers(driver, List(offer1).asJava)
  verifyTaskLaunched(driver, "o1")

  backend.doKillExecutors(List("0"))
  verify(driver, times(1)).killTask(createTaskId("0"))

  val offer2 = createOffer("o2", "s2", mem, cpu)
  backend.resourceOffers(driver, List(offer2).asJava)
  verify(driver, times(1)).declineOffer(offer2.getId)
}{code}
 

 

Workaround: Don't set maxExecutors with dynamicAllocation on

I'm not sure how to solve this problem since it seems that it's not trivial to 
change numExecutors in this scenario to count only active executors(since this 
information is not available in Slave class. On the other hand, might be that 
this behavior is "normal" and expected.

Please advice

Igor

marking you friends since you were last to touch this 

[jira] [Updated] (SPARK-20901) Feature parity for ORC with Parquet

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20901:
--
Affects Version/s: 2.3.0

> Feature parity for ORC with Parquet
> ---
>
> Key: SPARK-20901
> URL: https://issues.apache.org/jira/browse/SPARK-20901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to track the feature parity for ORC with Parquet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23428) Revert [SPARK-23094] Fix invalid character handling in JsonDataSource

2018-02-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23428:

Summary: Revert [SPARK-23094]  Fix invalid character handling in 
JsonDataSource  (was: Revert )

> Revert [SPARK-23094]  Fix invalid character handling in JsonDataSource
> --
>
> Key: SPARK-23428
> URL: https://issues.apache.org/jira/browse/SPARK-23428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {noformat}
>   test("invalid json with leading nulls - from dataset") {
> import testImplicits._
> withTempDir { tempDir =>
>   val path = tempDir.getAbsolutePath
>   Seq("""{"firstName":"Chris", "lastName":"Baird"}""",
> """{"firstName":"Doug", 
> "lastName":"Rood"}""").toDS().write.mode("overwrite").text(path)
>   val schema = new StructType().add("a", 
> IntegerType).add("_corrupt_record", StringType)
>   val jsonDF = spark.read.schema(schema).option("mode", 
> "DROPMALFORMED").json(path)
>   checkAnswer(jsonDF, Seq(
> Row("Chris", "Baird"), Row("Doug", "Rood")
>   ))
> }
>   }
> {noformat}
> After this PR it returns a wrong answer. 
> {noformat}
> [null,null]
> [null,null]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23428) Revert

2018-02-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23428:

Description: 
{noformat}
  test("invalid json with leading nulls - from dataset") {
import testImplicits._
withTempDir { tempDir =>
  val path = tempDir.getAbsolutePath
  Seq("""{"firstName":"Chris", "lastName":"Baird"}""",
"""{"firstName":"Doug", 
"lastName":"Rood"}""").toDS().write.mode("overwrite").text(path)
  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
  val jsonDF = spark.read.schema(schema).option("mode", 
"DROPMALFORMED").json(path)
  checkAnswer(jsonDF, Seq(
Row("Chris", "Baird"), Row("Doug", "Rood")
  ))
}
  }

{noformat}

After this PR it returns a wrong answer. 

{noformat}
[null,null]
[null,null]
{noformat}

  was:
  test("invalid json with leading nulls - from dataset") {
import testImplicits._
withTempDir { tempDir =>
  val path = tempDir.getAbsolutePath
  Seq("""{"firstName":"Chris", "lastName":"Baird"}""",
"""{"firstName":"Doug", 
"lastName":"Rood"}""").toDS().write.mode("overwrite").text(path)
  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
  val jsonDF = spark.read.schema(schema).option("mode", 
"DROPMALFORMED").json(path)
  checkAnswer(jsonDF, Seq(
Row("Chris", "Baird"), Row("Doug", "Rood")
  ))
}
  }

Now it returns 
{noformat}
[null,null]
[null,null]
{noformat}


> Revert 
> ---
>
> Key: SPARK-23428
> URL: https://issues.apache.org/jira/browse/SPARK-23428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {noformat}
>   test("invalid json with leading nulls - from dataset") {
> import testImplicits._
> withTempDir { tempDir =>
>   val path = tempDir.getAbsolutePath
>   Seq("""{"firstName":"Chris", "lastName":"Baird"}""",
> """{"firstName":"Doug", 
> "lastName":"Rood"}""").toDS().write.mode("overwrite").text(path)
>   val schema = new StructType().add("a", 
> IntegerType).add("_corrupt_record", StringType)
>   val jsonDF = spark.read.schema(schema).option("mode", 
> "DROPMALFORMED").json(path)
>   checkAnswer(jsonDF, Seq(
> Row("Chris", "Baird"), Row("Doug", "Rood")
>   ))
> }
>   }
> {noformat}
> After this PR it returns a wrong answer. 
> {noformat}
> [null,null]
> [null,null]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23428) Revert

2018-02-14 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23428:
---

 Summary: Revert 
 Key: SPARK-23428
 URL: https://issues.apache.org/jira/browse/SPARK-23428
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li


  test("invalid json with leading nulls - from dataset") {
import testImplicits._
withTempDir { tempDir =>
  val path = tempDir.getAbsolutePath
  Seq("""{"firstName":"Chris", "lastName":"Baird"}""",
"""{"firstName":"Doug", 
"lastName":"Rood"}""").toDS().write.mode("overwrite").text(path)
  val schema = new StructType().add("a", 
IntegerType).add("_corrupt_record", StringType)
  val jsonDF = spark.read.schema(schema).option("mode", 
"DROPMALFORMED").json(path)
  checkAnswer(jsonDF, Seq(
Row("Chris", "Baird"), Row("Doug", "Rood")
  ))
}
  }

Now it returns 
{noformat}
[null,null]
[null,null]
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364788#comment-16364788
 ] 

Bruce Robbins commented on SPARK-23410:
---

I am probably misunderstanding the issue, but I couldn't load UTF-16 (big 
endian or little endian) encoded JSON files using DataFrameReader.json() (e.g., 
spark.read.json) in Spark 2.2.1, or even Spark 2.1.2 for that matter. It always 
resulted in a Dataset with only the "_corrupt_record" column populated.
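
For reference, a minimal reproduction sketch, assuming a throwaway local path 
and schema (neither is from the report):
{code:scala}
// Write a tiny JSON file in UTF-16LE, then try to read it back.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.types._

Files.write(Paths.get("/tmp/utf16le.json"),
  """{"a": 1}""".getBytes(StandardCharsets.UTF_16LE))

val schema = new StructType().add("a", IntegerType).add("_corrupt_record", StringType)
spark.read.schema(schema).json("/tmp/utf16le.json").show()
// On the versions mentioned above, the row lands in _corrupt_record instead of column a.
{code}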

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which could read JSON files in UTF-16, UTF-32 and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.
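
A sketch of the kind of user-facing control being asked for; the "charset" 
option name below is an assumption for illustration, not an existing Spark 
option:
{code:scala}
// Hypothetical API: explicit charset, with auto-detection when the option is omitted.
val df = spark.read
  .option("charset", "UTF-16LE")  // assumed option name, for illustration only
  .json("/path/to/utf16.json")    // placeholder path
{code}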



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21783) Turn on ORC filter push-down by default

2018-02-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364723#comment-16364723
 ] 

Dongjoon Hyun commented on SPARK-21783:
---

Unfortunately, this was reopened because of SPARK-23426 and is now blocked by it.

> Turn on ORC filter push-down by default
> ---
>
> Key: SPARK-21783
> URL: https://issues.apache.org/jira/browse/SPARK-21783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet (SPARK-9207), it would be great to turn on the ORC option, too.
> This option has been turned off by default from the beginning (SPARK-2883):
> - 
> https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149
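
For reference, a usage sketch of flipping this flag per session (the path is a 
placeholder):
{code:scala}
// Enable ORC predicate push-down for the current session, then read with a
// filter that can be pushed down to the ORC reader.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.read.orc("/path/to/data.orc").where("id > 100").show()
{code}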



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21783) Turn on ORC filter push-down by default

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21783:
--
Fix Version/s: (was: 2.3.0)

> Turn on ORC filter push-down by default
> ---
>
> Key: SPARK-21783
> URL: https://issues.apache.org/jira/browse/SPARK-21783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet (SPARK-9207), it would be great to turn on the ORC option, too.
> This option has been turned off by default from the beginning (SPARK-2883):
> - 
> https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21783) Turn on ORC filter push-down by default

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-21783:
---

> Turn on ORC filter push-down by default
> ---
>
> Key: SPARK-21783
> URL: https://issues.apache.org/jira/browse/SPARK-21783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.0
>
>
> Like Parquet (SPARK-9207), it would be great to turn on the ORC option, too.
> This option has been turned off by default from the beginning (SPARK-2883):
> - 
> https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM in the driver

2018-02-14 Thread Dhiraj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhiraj updated SPARK-23427:
---
Description: 
We are facing an issue with the value of spark.sql.autoBroadcastJoinThreshold.

With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), driver memory 
usage stays flat.

With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), driver memory usage grows 
at a rate that depends on the size of the autoBroadcastJoinThreshold, and we 
eventually get an OOM exception. The problem is that the memory used by the 
auto-broadcast is not freed up in the driver.

The application imports Oracle tables as master DataFrames, which are 
persisted. Each job applies filters to these tables and then registers them as 
temporary views. SQL queries are then used to process the data further. At the 
end, all intermediate DataFrames are unpersisted.

 

  was:
We are facing an issue with the value of spark.sql.autoBroadcastJoinThreshold.

With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), driver memory 
usage stays flat.

With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), driver memory usage grows 
at a rate that depends on the size of the autoBroadcastJoinThreshold, and we 
eventually get an OOM exception. The problem is that the memory used by the 
auto-broadcast is not freed up in the driver.

 


> spark.sql.autoBroadcastJoinThreshold causing OOM  in the driver 
> 
>
> Key: SPARK-23427
> URL: https://issues.apache.org/jira/browse/SPARK-23427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: SPARK 2.0 version
>Reporter: Dhiraj
>Priority: Critical
>
> We are facing an issue with the value of spark.sql.autoBroadcastJoinThreshold.
> With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), driver memory 
> usage stays flat.
> With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), driver memory usage 
> grows at a rate that depends on the size of the autoBroadcastJoinThreshold, 
> and we eventually get an OOM exception. The problem is that the memory used 
> by the auto-broadcast is not freed up in the driver.
> The application imports Oracle tables as master DataFrames, which are 
> persisted. Each job applies filters to these tables and then registers them 
> as temporary views. SQL queries are then used to process the data further. At 
> the end, all intermediate DataFrames are unpersisted.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM in the driver

2018-02-14 Thread Dhiraj (JIRA)
Dhiraj created SPARK-23427:
--

 Summary: spark.sql.autoBroadcastJoinThreshold causing OOM  in the 
driver 
 Key: SPARK-23427
 URL: https://issues.apache.org/jira/browse/SPARK-23427
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
 Environment: SPARK 2.0 version
Reporter: Dhiraj


We are facing an issue with the value of spark.sql.autoBroadcastJoinThreshold.

With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), driver memory 
usage stays flat.

With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), driver memory usage grows 
at a rate that depends on the size of the autoBroadcastJoinThreshold, and we 
eventually get an OOM exception. The problem is that the memory used by the 
auto-broadcast is not freed up in the driver.
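
A sketch of the workaround implied above, with placeholder DataFrame names:
{code:scala}
// Disable automatic broadcast joins entirely (the -1 setting reported as
// keeping driver memory flat) ...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// ... while still broadcasting explicitly where it is known to be safe.
import org.apache.spark.sql.functions.broadcast
val joined = largeDf.join(broadcast(smallDf), "key")  // largeDf, smallDf, "key" are placeholders
{code}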

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-02-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364686#comment-16364686
 ] 

Dongjoon Hyun commented on SPARK-23415:
---

Thank you so much, [~kiszk]!

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The test suite sometimes fails due to a 60-second timeout.
> {code}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-02-14 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364682#comment-16364682
 ] 

Kazuaki Ishizaki commented on SPARK-23415:
--

I am working on this.

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The test suite sometimes fails due to a 60-second timeout.
> {code}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23234) ML python test failure due to default outputCol

2018-02-14 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364680#comment-16364680
 ] 

Marco Gaido commented on SPARK-23234:
-

[~josephkb] maybe it is not a blocker, but since this can also cause other 
issues, I'd say it is at least a very nice-to-have.

> ML python test failure due to default outputCol
> ---
>
> Key: SPARK-23234
> URL: https://issues.apache.org/jira/browse/SPARK-23234
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Blocker
>
> SPARK-22799 and SPARK-22797 are causing valid Python test failures. The 
> reason is that Python sets the default params with set(), so they are not 
> considered defaults, but params passed by the user.
> This means that outputCol is set not as a default but as a real value.
> This is a misbehavior of the Python API which can cause serious problems, 
> and I'd suggest rethinking the way this is done.
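
For background, a minimal sketch of the default-vs-set distinction described 
above, using a toy Params subclass (the class and param here are illustrative):
{code:python}
from pyspark.ml.param import Param, Params

class MyStage(Params):
    def __init__(self):
        super(MyStage, self).__init__()
        self.outputCol = Param(self, "outputCol", "output column name")
        self._setDefault(outputCol=self.uid + "__output")  # recorded as a default

stage = MyStage()
print(stage.isSet(stage.outputCol))   # False: only a default is present
stage._set(outputCol="real_value")    # now treated as user-supplied
print(stage.isSet(stage.outputCol))   # True: persistence will save it as set
{code}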



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23424) Add codegenStageId in comment

2018-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364653#comment-16364653
 ] 

Apache Spark commented on SPARK-23424:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20612

> Add codegenStageId in comment
> -
>
> Key: SPARK-23424
> URL: https://issues.apache.org/jira/browse/SPARK-23424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> SPARK-23032 introduced a per-query ID in the generated class name in 
> {{WholeStageCodegenExec}}. This JIRA also adds the ID to a comment in the 
> generated Java source file.
> This is helpful for debugging when {{spark.sql.codegen.useIdInClassName}} is 
> false.
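
For reference, a quick way to inspect the generated code and its stage IDs (the 
DataFrame is illustrative):
{code:scala}
// Dump the whole-stage-generated Java source; the codegen stage ID appears in
// the class name (and, with this change, in a comment even when
// spark.sql.codegen.useIdInClassName is false).
import org.apache.spark.sql.execution.debug._

val df = spark.range(100).selectExpr("id * 2 AS doubled")
df.debugCodegen()
{code}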



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23424) Add codegenStageId in comment

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23424:


Assignee: Apache Spark

> Add codegenStageId in comment
> -
>
> Key: SPARK-23424
> URL: https://issues.apache.org/jira/browse/SPARK-23424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-23032 introduced a per-query ID in the generated class name in 
> {{WholeStageCodegenExec}}. This JIRA also adds the ID to a comment in the 
> generated Java source file.
> This is helpful for debugging when {{spark.sql.codegen.useIdInClassName}} is 
> false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23424) Add codegenStageId in comment

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23424:


Assignee: (was: Apache Spark)

> Add codegenStageId in comment
> -
>
> Key: SPARK-23424
> URL: https://issues.apache.org/jira/browse/SPARK-23424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> SPARK-23032 introduced a per-query ID in the generated class name in 
> {{WholeStageCodegenExec}}. This JIRA also adds the ID to a comment in the 
> generated Java source file.
> This is helpful for debugging when {{spark.sql.codegen.useIdInClassName}} is 
> false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23425:


Assignee: Apache Spark

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Assignee: Apache Spark
>Priority: Major
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using 
> wildcard strings like * is not working,
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> An AnalysisException is thrown while executing this query:
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364605#comment-16364605
 ] 

Apache Spark commented on SPARK-23425:
--

User 'sujith71955' has created a pull request for this issue:
https://github.com/apache/spark/pull/20611

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Priority: Major
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using 
> wildcard strings like * is not working,
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> An AnalysisException is thrown while executing this query:
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23425:


Assignee: (was: Apache Spark)

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Priority: Major
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using 
> wildcard strings like * is not working,
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> An AnalysisException is thrown while executing this query:
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23426) Use `hive` ORC implementation for Spark 2.3.0

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23426:
--
Issue Type: Task  (was: Bug)

> Use `hive` ORC implementation for Spark 2.3.0
> -
>
> Key: SPARK-23426
> URL: https://issues.apache.org/jira/browse/SPARK-23426
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23426) Use `hive` ORC impl and disable PPD for Spark 2.3.0

2018-02-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23426:
--
Summary: Use `hive` ORC impl and disable PPD for Spark 2.3.0  (was: Use 
`hive` ORC implementation for Spark 2.3.0)

> Use `hive` ORC impl and disable PPD for Spark 2.3.0
> ---
>
> Key: SPARK-23426
> URL: https://issues.apache.org/jira/browse/SPARK-23426
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23426) Use `hive` ORC implementation for Spark 2.3.0

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23426:


Assignee: Apache Spark

> Use `hive` ORC implementation for Spark 2.3.0
> -
>
> Key: SPARK-23426
> URL: https://issues.apache.org/jira/browse/SPARK-23426
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23426) Use `hive` ORC implementation for Spark 2.3.0

2018-02-14 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23426:
-

 Summary: Use `hive` ORC implementation for Spark 2.3.0
 Key: SPARK-23426
 URL: https://issues.apache.org/jira/browse/SPARK-23426
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23426) Use `hive` ORC implementation for Spark 2.3.0

2018-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364563#comment-16364563
 ] 

Apache Spark commented on SPARK-23426:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20610

> Use `hive` ORC implementation for Spark 2.3.0
> -
>
> Key: SPARK-23426
> URL: https://issues.apache.org/jira/browse/SPARK-23426
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23426) Use `hive` ORC implementation for Spark 2.3.0

2018-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23426:


Assignee: (was: Apache Spark)

> Use `hive` ORC implementation for Spark 2.3.0
> -
>
> Key: SPARK-23426
> URL: https://issues.apache.org/jira/browse/SPARK-23426
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-02-14 Thread Sujith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364538#comment-16364538
 ] 

Sujith commented on SPARK-23425:


I am working on resolving this bug; please let me know if you have any 
suggestions or feedback.

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Priority: Major
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using 
> wildcard strings like * is not working,
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> An AnalysisException is thrown while executing this query:
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-02-14 Thread Sujith (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-23425:
---
Attachment: wildcard_issue.PNG

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Priority: Major
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using 
> wildcard strings like * is not working,
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> An AnalysisException is thrown while executing this query:
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-02-14 Thread Sujith (JIRA)
Sujith created SPARK-23425:
--

 Summary: load data for hdfs file path with wild card usage is not 
working properly
 Key: SPARK-23425
 URL: https://issues.apache.org/jira/browse/SPARK-23425
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.3.0
Reporter: Sujith


The load data command for loading data from non-local file paths using wildcard 
strings like * is not working,

e.g.:

"load data inpath 'hdfs://hacluster/user/ext*' into table t1"

An AnalysisException is thrown while executing this query:

!image-2018-02-14-23-41-39-923.png!
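
A reproduction sketch; the wildcard path is the one from the report, while the 
table schema is assumed:
{code:scala}
// Assumes a Hive-backed table t1 exists (schema is a placeholder).
spark.sql("CREATE TABLE IF NOT EXISTS t1 (value STRING)")
// On the affected versions this fails with an AnalysisException instead of
// expanding the wildcard.
spark.sql("LOAD DATA INPATH 'hdfs://hacluster/user/ext*' INTO TABLE t1")
{code}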



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23424) Add codegenStageId in comment

2018-02-14 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23424:


 Summary: Add codegenStageId in comment
 Key: SPARK-23424
 URL: https://issues.apache.org/jira/browse/SPARK-23424
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


SPARK-23032 introduced a per-query ID in the generated class name in 
{{WholeStageCodegenExec}}. This JIRA also adds the ID to a comment in the 
generated Java source file.
This is helpful for debugging when {{spark.sql.codegen.useIdInClassName}} is 
false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23234) ML python test failure due to default outputCol

2018-02-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364510#comment-16364510
 ] 

Joseph K. Bradley commented on SPARK-23234:
---

Is this still a blocker now that [SPARK-22797] has been reverted?  I assume 
[SPARK-22797] won't get into 2.3 at this point since it's a new API.

> ML python test failure due to default outputCol
> ---
>
> Key: SPARK-23234
> URL: https://issues.apache.org/jira/browse/SPARK-23234
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Blocker
>
> SPARK-22799 and SPARK-22797 are causing valid Python test failures. The 
> reason is that Python sets the default params with set(), so they are not 
> considered defaults, but params passed by the user.
> This means that outputCol is set not as a default but as a real value.
> This is a misbehavior of the Python API which can cause serious problems, 
> and I'd suggest rethinking the way this is done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23377:
--
Priority: Blocker  (was: Critical)

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Blocker
>
> A Bucketizer with multiple input/output columns gets "outputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23401) Improve test cases for all supported types and unsupported types

2018-02-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23401:
-
Issue Type: Sub-task  (was: Test)
Parent: SPARK-22216

> Improve test cases for all supported types and unsupported types
> 
>
> Key: SPARK-23401
> URL: https://issues.apache.org/jira/browse/SPARK-23401
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks like some supported types are missing from the tests. 
> For example, please see 
> https://github.com/apache/spark/blob/c338c8cf8253c037ecd4f39bbd58ed5a86581b37/python/pyspark/sql/tests.py#L4397-L4401
> We can improve this test coverage.
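
A sketch of the kind of per-type round trip meant here; the type list is 
illustrative, not the one in the linked test file:
{code:python}
import datetime
from pyspark.sql.types import (StructType, StructField, LongType,
                               DoubleType, StringType, DateType)

schema = StructType([
    StructField("a", LongType()), StructField("b", DoubleType()),
    StructField("c", StringType()), StructField("d", DateType())])
df = spark.createDataFrame([(1, 1.0, "s", datetime.date(2018, 2, 14))], schema)
df.toPandas()  # each column exercises one supported type
{code}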



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23380) Make toPandas fallback to non-Arrow optimization if possible

2018-02-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23380:
-
Summary: Make toPandas fallback to non-Arrow optimization if possible  
(was: Make toPandas fall back to Arrow optimization disabled when schema is not 
supported in the Arrow optimization )

> Make toPandas fallback to non-Arrow optimization if possible
> 
>
> Key: SPARK-23380
> URL: https://issues.apache.org/jira/browse/SPARK-23380
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> It seems we can check the schema ahead of time and fall back in toPandas.
> Please see this case below:
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o42.collectAsArrowToPython.
> ...
> java.lang.UnsupportedOperationException: Unsupported data type: 
> map
> {code}
> In the case of {{createDataFrame}}, we already fall back so that this at 
> least works, even though the optimization is disabled.
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> pdf = df.toPandas()
> spark.createDataFrame(pdf).show()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> spark.createDataFrame(pdf).show()
> {code}
> {code}
> ...
> ... UserWarning: Arrow will not be used in createDataFrame: Error inferring 
> Arrow type ...
> ++
> |  _1|
> ++
> |[a -> 1]|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364324#comment-16364324
 ] 

Sean Owen commented on SPARK-23420:
---

Hm. That's a different problem then. Those are URIs and should be interpreted 
as such. I also recall there was a fix for something like this recently. See 
https://issues.apache.org/jira/browse/SPARK-21996 or 
https://issues.apache.org/jira/browse/SPARK-22585 for example; it could be yet 
another instance.

Although encoding these characters is probably right, I don't think that's the 
issue here after all, as this check occurs on a Path object, after it has been 
parsed as an HDFS URI. I think there's no way to reliably distinguish a string 
that means "*" as a glob and "*" as a character in the path.

I think the real fix may be another way of specifying the path as a "glob" 
parameter instead of "path". That's what the HDFS API does. However here even 
that wouldn't fix the fact that this is already the expected behavior of "path".

I don't see a workaround right now other than to avoid these chars. Anyone else?
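
To make the ambiguity concrete, a small sketch against the Hadoop filesystem 
API (the paths are made up):
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Treated as a glob pattern: the stray ')' fails with "Illegal file pattern",
// mirroring the stack trace in this report.
fs.globStatus(new Path("/tmp/weird_name_)"))
// Treated literally: the same name lists fine.
fs.listStatus(new Path("/tmp/weird_name_)"))
{code}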

> Datasource loading not handling paths with regex chars.
> ---
>
> Key: SPARK-23420
> URL: https://issues.apache.org/jira/browse/SPARK-23420
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
>Reporter: Mitchell
>Priority: Major
>
> Greetings, during some recent testing I ran across an issue attempting to 
> load files with regex chars like []()* etc. in their names. The files are 
> valid in the various storage systems, and the normal Hadoop APIs all access 
> them properly.
> When my code is executed, I get the following stack trace.
> 8/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 
> 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ java.io.IOException: Illegal file pattern: Unmatched closing ')' near 
> index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71) at 
> org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:50) at 
> org.apache.hadoop.fs.Globber.doGlob(Globber.java:210) at 
> org.apache.hadoop.fs.Globber.glob(Globber.java:149) at 
> org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477) at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234) 
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
> scala.collection.immutable.List.flatMap(List.scala:344) at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at 
> org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at 
> org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) at 
> com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
>  Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' 
> near index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ at java.util.regex.Pattern.error(Pattern.java:1955) at 
> 

[jira] [Created] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Igor Berman (JIRA)
Igor Berman created SPARK-23423:
---

 Summary: Application declines any offers when killed+active 
executors reach spark.dynamicAllocation.maxExecutors
 Key: SPARK-23423
 URL: https://issues.apache.org/jira/browse/SPARK-23423
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Spark Core
Affects Versions: 2.2.1
Reporter: Igor Berman


Hi

I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend when 
running on Mesos with dynamic allocation on and the maximum number of executors 
limited by spark.dynamicAllocation.maxExecutors.

Suppose we have a long-running driver with a cyclic pattern of resource 
consumption (with some idle time in between); due to dynamic allocation it 
receives offers and then releases them after the current chunk of work is 
processed.

At 
https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573
the backend compares numExecutors < executorLimit, where numExecutors is 
defined as slaves.values.map(_.taskIDs.size).sum and slaves holds all executors 
ever "met", i.e. both active and killed (see the comment at 
https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122).

This means that after a while, once the number of killed executors exceeds 
maxExecutors, the application will decline every offer and thus stop working.

 

I've created a test that "reproduces" this behavior; I'm not sure how good it is:
{code:java}
//MesosCoarseGrainedSchedulerBackendSuite
test("max executors registered stops to accept offers when dynamic allocation enabled") {
  setBackend(Map(
    "spark.dynamicAllocation.maxExecutors" -> "1",
    "spark.dynamicAllocation.enabled" -> "true",
    "spark.dynamicAllocation.testing" -> "true"))

  backend.doRequestTotalExecutors(1)

  val (mem, cpu) = (backend.executorMemory(sc), 4)

  // The first offer is accepted and an executor task is launched on it
  val offer1 = createOffer("o1", "s1", mem, cpu)
  backend.resourceOffers(driver, List(offer1).asJava)
  verifyTaskLaunched(driver, "o1")

  // Kill the only executor; it still counts towards numExecutors
  backend.doKillExecutors(List("0"))
  verify(driver, times(1)).killTask(createTaskId("0"))

  // The second offer is declined even though no executor is running
  val offer2 = createOffer("o2", "s2", mem, cpu)
  backend.resourceOffers(driver, List(offer2).asJava)
  verify(driver, times(1)).declineOffer(offer2.getId)
}{code}
 

 

Workaround: Don't set maxExecutors with dynamicAllocation on

I'm not sure how to solve this problem, since it seems non-trivial to change 
numExecutors in this scenario to count only active executors (this information 
is not available in the Slave class). On the other hand, it might be that this 
behavior is "normal" and expected.

Please advise.

Igor

Marking you since you were the last to touch this piece of code and can 
probably advise something ([~vanzin], [~skonto], [~susanxhuynh]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-14 Thread Mitchell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364284#comment-16364284
 ] 

Mitchell commented on SPARK-23420:
--

Sean, I'm a little confused by your response. From what I've seen, the 
datasource API does not correctly handle a fully URI-encoded path, otherwise I 
would be doing that. As such, I am stuck with an unencoded path, which in this 
case obviously has these characters in it. Even a simple file with a space does 
not work if the path is passed encoded.

 

file: "/tmp/space file.csv"

Dataset<Row> input = sqlContext.read().option("header", "true").option("sep", 
",").option("quote", "\"").option("charset", "utf8").option("escape", 
"\\").csv("hdfs:///tmp/space%20file.csv"); --> File not found

Dataset<Row> input = sqlContext.read().option("header", "true").option("sep", 
",").option("quote", "\"").option("charset", "utf8").option("escape", 
"\\").csv("hdfs:///tmp/space file.csv"); --> Works fine

> Datasource loading not handling paths with regex chars.
> ---
>
> Key: SPARK-23420
> URL: https://issues.apache.org/jira/browse/SPARK-23420
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
>Reporter: Mitchell
>Priority: Major
>
> Greetings, during some recent testing I ran across an issue attempting to 
> load files with regex chars like []()* etc. in their names. The files are 
> valid in the various storage systems, and the normal Hadoop APIs all access 
> them properly.
> When my code is executed, I get the following stack trace.
> 8/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 
> 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ java.io.IOException: Illegal file pattern: Unmatched closing ')' near 
> index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71) at 
> org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:50) at 
> org.apache.hadoop.fs.Globber.doGlob(Globber.java:210) at 
> org.apache.hadoop.fs.Globber.glob(Globber.java:149) at 
> org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477) at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234) 
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
> scala.collection.immutable.List.flatMap(List.scala:344) at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at 
> org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at 
> org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) at 
> com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
>  Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' 
> near index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ at java.util.regex.Pattern.error(Pattern.java:1955) at 
> java.util.regex.Pattern.compile(Pattern.java:1700) at 
> java.util.regex.Pattern.<init>(Pattern.java:1351) at 
> 

[jira] [Commented] (SPARK-21302) history server WebUI show HTTP ERROR 500

2018-02-14 Thread bharath kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364257#comment-16364257
 ] 

bharath kumar commented on SPARK-21302:
---

I saw this issue as well with the RM Web UI, when we click the application 
master webpage. We were using sparklyr to connect to YARN in client mode.

 

!nullpointer.PNG!

> history server WebUI show HTTP ERROR 500
> 
>
> Key: SPARK-21302
> URL: https://issues.apache.org/jira/browse/SPARK-21302
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Jason Pan
>Priority: Major
> Attachments: npe.PNG, nullpointer.PNG
>
>
> When navigate to history server WebUI, and check incomplete applications, 
> show http 500
> Error logs:
> 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt 
> app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; 
> refreshing
> 17/07/05 20:17:44 WARN ServletHandler: 
> /history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:00 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:17 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21302) history server WebUI show HTTP ERROR 500

2018-02-14 Thread bharath kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar updated SPARK-21302:
--
Attachment: nullpointer.PNG

> history server WebUI show HTTP ERROR 500
> 
>
> Key: SPARK-21302
> URL: https://issues.apache.org/jira/browse/SPARK-21302
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Jason Pan
>Priority: Major
> Attachments: npe.PNG, nullpointer.PNG
>
>
> When navigate to history server WebUI, and check incomplete applications, 
> show http 500
> Error logs:
> 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt 
> app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; 
> refreshing
> 17/07/05 20:17:44 WARN ServletHandler: 
> /history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:00 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:17 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364251#comment-16364251
 ] 

Sean Owen commented on SPARK-23420:
---

I think that logic is mostly correct: those characters ought to be encoded in 
a file URI, since they are reserved either in URIs or in file names. I'm not 
entirely sure about brackets, though. At the least, encoding them should be a 
workaround; it may actually be the answer.
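
As an illustration of that kind of workaround, the sketch below 
backslash-escapes the metacharacters that Hadoop's glob matcher treats 
specially before handing the path to the reader, so a literal name such as 
file(1)[a].csv is no longer compiled as a glob. Note this is backslash 
glob-escaping rather than the URI-encoding mentioned above, and it is only a 
sketch, not an API Spark provides: the escapeGlob helper and the sample path 
are hypothetical, and it assumes Hadoop-style backslash escaping in glob 
patterns and a SparkSession in scope as spark.

    // Hypothetical helper: backslash-escape glob metacharacters so the
    // path is matched literally instead of being parsed as a glob pattern.
    def escapeGlob(path: String): String =
      path.replaceAll("""([\[\]{}()*?\\])""", """\\$1""")

    // Fails on Spark 2.2.x: the parentheses leak into the compiled glob regex.
    // val df = spark.read.csv("s3a://bucket/dir/file(1)[a].csv")

    // Workaround sketch: escape the metacharacters before loading.
    val df = spark.read.csv(escapeGlob("s3a://bucket/dir/file(1)[a].csv"))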

> Datasource loading not handling paths with regex chars.
> ---
>
> Key: SPARK-23420
> URL: https://issues.apache.org/jira/browse/SPARK-23420
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
>Reporter: Mitchell
>Priority: Major
>
> Greetings. During some recent testing I ran across an issue when attempting 
> to load files whose names contain regex characters such as []()*. The files 
> are valid in the various storage systems, and the normal Hadoop APIs all 
> access them properly.
> When my code is executed, I get the following stack trace.
> 18/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^
> java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^
> at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71)
> at org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:50)
> at org.apache.hadoop.fs.Globber.doGlob(Globber.java:210)
> at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477)
> at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234)
> at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244)
> at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618)
> at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
> at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:344)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
> at com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
> Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^
> at java.util.regex.Pattern.error(Pattern.java:1955)
> at java.util.regex.Pattern.compile(Pattern.java:1700)
> at java.util.regex.Pattern.<init>(Pattern.java:1351)
> at java.util.regex.Pattern.compile(Pattern.java:1054)
> at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
> at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
> at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)
> ... 25 more
> 18/02/14 04:52:46 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 15, (reason: User class threw exception: java.io.IOException: 
> Illegal file pattern: Unmatched closing ')' near index 130 
> 
