[jira] [Commented] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862298#comment-15862298
 ] 

Apache Spark commented on SPARK-17897:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16894

> not isnotnull is converted to the always false condition isnotnull && not 
> isnotnull
> ---
>
> Key: SPARK-17897
> URL: https://issues.apache.org/jira/browse/SPARK-17897
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Jordan Halterman
>Assignee: Xiao Li
>  Labels: correctness
> Fix For: 2.1.0
>
>
> When a logical plan is built containing the following somewhat nonsensical 
> filter:
> {{Filter (NOT isnotnull($f0#212))}}
> During optimization the filter is converted into a condition that will always 
> fail:
> {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}}
> This appears to be caused by the following check for {{NullIntolerant}}:
> https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63
> That check recurses through the expression and extracts nested {{IsNotNull}} 
> calls, converting them to {{IsNotNull}} calls on the attribute at the root 
> level:
> https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49
> This results in the nonsensical condition above.
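
For reference, a minimal sketch of the kind of query described above (the column name and data are illustrative, and whether this exact DataFrame form triggers the faulty constraint inference on a given build is an assumption):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.not

val spark = SparkSession.builder().master("local[*]").appName("spark-17897-repro").getOrCreate()
import spark.implicits._

// `NOT isnotnull(f0)` should behave like `isnull(f0)` and keep the null row,
// but with the faulty constraint inference the optimized filter becomes
// `isnotnull(f0) && NOT isnotnull(f0)`, which drops every row.
val df = Seq(Some(1), None, Some(3)).toDF("f0")
val filtered = df.filter(not($"f0".isNotNull))

filtered.explain(true) // inspect the optimized Filter condition
filtered.show()        // expected: the single null row
{code}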






[jira] [Commented] (SPARK-19555) Improve inefficient StringUtils.escapeLikeRegex() method

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862293#comment-15862293
 ] 

Apache Spark commented on SPARK-19555:
--

User 'lins05' has created a pull request for this issue:
https://github.com/apache/spark/pull/16893

> Improve inefficient StringUtils.escapeLikeRegex() method
> 
>
> Key: SPARK-19555
> URL: https://issues.apache.org/jira/browse/SPARK-19555
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>
> Spark's StringUtils.escapeLikeRegex() method is written inefficiently, 
> performing tons of object allocations due to its use of zip(), flatMap(), and 
> mkString. Instead, I think the method should be rewritten in an imperative 
> style using a Java string builder.
> This method can become a performance bottleneck in cases where regex 
> expressions are used with non-constant-foldable expressions (e.g. the regex 
> expression comes from the data rather than being part of the query).
> Here's the code in question: 
> https://github.com/apache/spark/blob/d785217b791882e075ad537852d49d78fc1ca31b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala#L28
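
For illustration, a rough sketch of the kind of imperative rewrite suggested above, assuming the escaping rules stay as described (this is not the code from the linked pull request):

{code}
import java.util.regex.Pattern

// Sketch of an imperative escapeLikeRegex using a single StringBuilder,
// avoiding the per-character tuples produced by zip()/flatMap().
def escapeLikeRegex(str: String): String = {
  val sb = new java.lang.StringBuilder("(?s)") // (?s): '.' also matches newlines
  var i = 0
  while (i < str.length) {
    val c = str.charAt(i)
    if (c == '\\' && i + 1 < str.length) {
      val next = str.charAt(i + 1)
      next match {
        case '_' | '%' => sb.append(next)                  // escaped wildcard -> literal
        case _         => sb.append(Pattern.quote("\\" + next))
      }
      i += 2
    } else {
      c match {
        case '_' => sb.append('.')                         // LIKE '_' -> any one char
        case '%' => sb.append(".*")                        // LIKE '%' -> any sequence
        case _   => sb.append(Pattern.quote(c.toString))
      }
      i += 1
    }
  }
  sb.toString
}
{code}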






[jira] [Assigned] (SPARK-19555) Improve inefficient StringUtils.escapeLikeRegex() method

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19555:


Assignee: Apache Spark

> Improve inefficient StringUtils.escapeLikeRegex() method
> 
>
> Key: SPARK-19555
> URL: https://issues.apache.org/jira/browse/SPARK-19555
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Spark's StringUtils.escapeLikeRegex() method is written inefficiently, 
> performing tons of object allocations due to its use of zip(), flatMap(), and 
> mkString. Instead, I think the method should be rewritten in an imperative 
> style using a Java string builder.
> This method can become a performance bottleneck in cases where regex 
> expressions are used with non-constant-foldable expressions (e.g. the regex 
> expression comes from the data rather than being part of the query).
> Here's the code in question: 
> https://github.com/apache/spark/blob/d785217b791882e075ad537852d49d78fc1ca31b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala#L28






[jira] [Assigned] (SPARK-19555) Improve inefficient StringUtils.escapeLikeRegex() method

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19555:


Assignee: (was: Apache Spark)

> Improve inefficient StringUtils.escapeLikeRegex() method
> 
>
> Key: SPARK-19555
> URL: https://issues.apache.org/jira/browse/SPARK-19555
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>
> Spark's StringUtils.escapeLikeRegex() method is written inefficiently, 
> performing tons of object allocations due to its use of zip(), flatMap(), and 
> mkString. Instead, I think the method should be rewritten in an imperative 
> style using a Java string builder.
> This method can become a performance bottleneck in cases where regex 
> expressions are used with non-constant-foldable expressions (e.g. the regex 
> expression comes from the data rather than being part of the query).
> Here's the code in question: 
> https://github.com/apache/spark/blob/d785217b791882e075ad537852d49d78fc1ca31b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala#L28






[jira] [Resolved] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19537.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Move the pendingPartitions variable from Stage to ShuffleMapStage
> -
>
> Key: SPARK-19537
> URL: https://issues.apache.org/jira/browse/SPARK-19537
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 2.2.0
>
>
> This variable is only used by ShuffleMapStages, and it is confusing to have 
> it in the Stage class rather than the ShuffleMapStage class.






[jira] [Closed] (SPARK-19502) Remove unnecessary code to re-submit stages in the DAGScheduler

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-19502.
--
Resolution: Not A Problem

This code actually is currently needed to handle cases where a ShuffleMapTask 
succeeds on an executor, but that executor was marked as failed (so the task 
needs to be re-run), as described in this comment: 
https://github.com/apache/spark/pull/16620#issuecomment-279125227

> Remove unnecessary code to re-submit stages in the DAGScheduler
> ---
>
> Key: SPARK-19502
> URL: https://issues.apache.org/jira/browse/SPARK-19502
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> There are a [few lines of code in the 
> DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215)
>  to re-submit shuffle map stages when some of the tasks fail.  My 
> understanding is that there should be a 1:1 mapping between pending tasks 
> (which are tasks that haven't completed successfully) and available output 
> locations, so that code should never be reachable.  Furthermore, the approach 
> taken by that code (to re-submit an entire stage as a result of task 
> failures) is not how we handle task failures in a stage (the lower-level 
> scheduler resubmits the individual tasks), which is what the 5-year-old TODO 
> on that code seems to imply should be done.
> The big caveat is that there's a bug being fixed in SPARK-19263 that means 
> there is *not* a 1:1 relationship between pendingTasks and available 
> outputLocations, so that code is serving as a (buggy) band-aid.  This should 
> be fixed once we resolve SPARK-19263.
> cc [~imranr] [~markhamstra] [~jinxing6...@126.com] (let me know if any of you 
> see any reason we actually do need that code)






[jira] [Updated] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19538:
---
Priority: Minor  (was: Major)

> DAGScheduler and TaskSetManager can have an inconsistent view of whether a 
> stage is complete.
> -
>
> Key: SPARK-19538
> URL: https://issues.apache.org/jira/browse/SPARK-19538
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> The pendingPartitions in Stage tracks partitions that still need to be 
> computed, and is used by the DAGScheduler to determine when to mark the stage 
> as complete.  In most cases, this variable is exactly consistent with the 
> tasks in the TaskSetManager (for the current version of the stage) that are 
> still pending.  However, as discussed in SPARK-19263, these can become 
> inconsistent when a ShuffleMapTask for an earlier attempt of the stage 
> completes, in which case the DAGScheduler may think the stage has finished, 
> while the TaskSetManager is still waiting for some tasks to complete (see the 
> description in this pull request: 
> https://github.com/apache/spark/pull/16620).  This leads to bugs like 
> SPARK-19263.  Another problem with this behavior is that listeners can get 
> two StageCompleted messages: once when the DAGScheduler thinks the stage is 
> complete, and a second when the TaskSetManager later decides the stage is 
> complete.  We should fix this.






[jira] [Commented] (SPARK-19560) Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862248#comment-15862248
 ] 

Apache Spark commented on SPARK-19560:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/16892

> Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask 
> from a failed executor
> 
>
> Key: SPARK-19560
> URL: https://issues.apache.org/jira/browse/SPARK-19560
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> There's some tricky code around the case when the DAGScheduler learns of a 
> ShuffleMapTask that completed successfully, but ran on an executor that 
> failed sometime after the task was launched.  This case is tricky because the 
> TaskSetManager (i.e., the lower level scheduler) thinks the task completed 
> successfully, but the DAGScheduler considers the output it generated to be no 
> longer valid (because it was probably lost when the executor was lost).  As a 
> result, the DAGScheduler needs to re-submit the stage, so that the task can 
> be re-run.  This is tested in some of the tests but not clearly documented, 
> so we should improve this to prevent future bugs (this was encountered by 
> [~markhamstra] in attempting to find a better fix for SPARK-19263).






[jira] [Assigned] (SPARK-19560) Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19560:


Assignee: Apache Spark  (was: Kay Ousterhout)

> Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask 
> from a failed executor
> 
>
> Key: SPARK-19560
> URL: https://issues.apache.org/jira/browse/SPARK-19560
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>Priority: Minor
>
> There's some tricky code around the case when the DAGScheduler learns of a 
> ShuffleMapTask that completed successfully, but ran on an executor that 
> failed sometime after the task was launched.  This case is tricky because the 
> TaskSetManager (i.e., the lower level scheduler) thinks the task completed 
> successfully, but the DAGScheduler considers the output it generated to be no 
> longer valid (because it was probably lost when the executor was lost).  As a 
> result, the DAGScheduler needs to re-submit the stage, so that the task can 
> be re-run.  This is tested in some of the tests but not clearly documented, 
> so we should improve this to prevent future bugs (this was encountered by 
> [~markhamstra] in attempting to find a better fix for SPARK-19263).






[jira] [Assigned] (SPARK-19560) Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19560:


Assignee: Kay Ousterhout  (was: Apache Spark)

> Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask 
> from a failed executor
> 
>
> Key: SPARK-19560
> URL: https://issues.apache.org/jira/browse/SPARK-19560
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> There's some tricky code around the case when the DAGScheduler learns of a 
> ShuffleMapTask that completed successfully, but ran on an executor that 
> failed sometime after the task was launched.  This case is tricky because the 
> TaskSetManager (i.e., the lower level scheduler) thinks the task completed 
> successfully, but the DAGScheduler considers the output it generated to be no 
> longer valid (because it was probably lost when the executor was lost).  As a 
> result, the DAGScheduler needs to re-submit the stage, so that the task can 
> be re-run.  This is tested in some of the tests but not clearly documented, 
> so we should improve this to prevent future bugs (this was encountered by 
> [~markhamstra] in attempting to find a better fix for SPARK-19263).






[jira] [Created] (SPARK-19560) Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor

2017-02-10 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19560:
--

 Summary: Improve tests for when DAGScheduler learns of 
"successful" ShuffleMapTask from a failed executor
 Key: SPARK-19560
 URL: https://issues.apache.org/jira/browse/SPARK-19560
 Project: Spark
  Issue Type: Test
  Components: Scheduler
Affects Versions: 2.1.1
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


There's some tricky code around the case when the DAGScheduler learns of a 
ShuffleMapTask that completed successfully, but ran on an executor that failed 
sometime after the task was launched.  This case is tricky because the 
TaskSetManager (i.e., the lower level scheduler) thinks the task completed 
successfully, but the DAGScheduler considers the output it generated to be no 
longer valid (because it was probably lost when the executor was lost).  As a 
result, the DAGScheduler needs to re-submit the stage, so that the task can be 
re-run.  This is tested in some of the tests but not clearly documented, so we 
should improve this to prevent future bugs (this was encountered by 
[~markhamstra] in attempting to find a better fix for SPARK-19263).






[jira] [Commented] (SPARK-19318) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862207#comment-15862207
 ] 

Apache Spark commented on SPARK-19318:
--

User 'sureshthalamati' has created a pull request for this issue:
https://github.com/apache/spark/pull/16891

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-19318
> URL: https://issues.apache.org/jira/browse/SPARK-19318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> = FINISHED o.a.s.sql.jdbc.OracleIntegrationSuite: 'SPARK-16625: General 
> data types to be mapped to Oracle' =
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals("class java.sql.Date") was false 
> (OracleIntegrationSuite.scala:136)






[jira] [Updated] (SPARK-19559) Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19559:
---
Description: 
This test has started failing frequently recently; e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
 and 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/

cc [~zsxwing] and [~tcondie] who seemed to have modified the related code most 
recently

  was:This test has started failing frequently recently; e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
 and 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/


> Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions
> 
>
> Key: SPARK-19559
> URL: https://issues.apache.org/jira/browse/SPARK-19559
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>
> This test has started failing frequently recently; e.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
>  and 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
> cc [~zsxwing] and [~tcondie] who seemed to have modified the related code 
> most recently






[jira] [Created] (SPARK-19559) Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions

2017-02-10 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19559:
--

 Summary: Fix flaky KafkaSourceSuite.subscribing topic by pattern 
with topic deletions
 Key: SPARK-19559
 URL: https://issues.apache.org/jira/browse/SPARK-19559
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming, Tests
Affects Versions: 2.1.0
Reporter: Kay Ousterhout


This test has started failing frequently recently; e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
 and 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/






[jira] [Created] (SPARK-19558) Provide a config option to attach QueryExecutionListener to SparkSession

2017-02-10 Thread Salil Surendran (JIRA)
Salil Surendran created SPARK-19558:
---

 Summary: Provide a config option to attach QueryExecutionListener 
to SparkSession
 Key: SPARK-19558
 URL: https://issues.apache.org/jira/browse/SPARK-19558
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Salil Surendran


Provide a configuration property (just like spark.extraListeners) to attach a 
QueryExecutionListener to a SparkSession.
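
For context, attaching such a listener today requires a programmatic register() call on each session; a sketch of that existing path (LoggingListener is an illustrative class, not part of Spark):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Illustrative listener; the proposal would let something like this be wired up
// via a configuration property instead of an explicit register() call.
class LoggingListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName succeeded in ${durationNs / 1e6} ms")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.listenerManager.register(new LoggingListener) // manual registration today
spark.range(5).count()
{code}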






[jira] [Created] (SPARK-19557) Output parameters are not present in SQL Query Plan

2017-02-10 Thread Salil Surendran (JIRA)
Salil Surendran created SPARK-19557:
---

 Summary: Output parameters are not present in SQL Query Plan
 Key: SPARK-19557
 URL: https://issues.apache.org/jira/browse/SPARK-19557
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Salil Surendran


For DataFrameWriter methods like parquet(), json(), csv(), etc., output 
parameters are not present in the QueryExecution object. For methods like 
saveAsTable(), they are. 






[jira] [Updated] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2017-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18717:
---
Fix Version/s: 2.1.1

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Damian Momot
>Assignee: Andrew Ray
> Fix For: 2.1.1, 2.2.0
>
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 307, Column 108: No applicable constructor/method found for actual parameters 
> "java.lang.String, scala.collection.Map"; candidates are: 
> "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}






[jira] [Updated] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2017-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18717:
---
Affects Version/s: 2.1.0

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Damian Momot
>Assignee: Andrew Ray
> Fix For: 2.1.1, 2.2.0
>
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 307, Column 108: No applicable constructor/method found for actual parameters 
> "java.lang.String, scala.collection.Map"; candidates are: 
> "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}






[jira] [Created] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on

2017-02-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-19556:
--

 Summary: Broadcast data is not encrypted when I/O encryption is on
 Key: SPARK-19556
 URL: https://issues.apache.org/jira/browse/SPARK-19556
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Marcelo Vanzin


{{TorrentBroadcast}} uses a couple of "back doors" into the block manager to 
write and read data:

{code}
if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true)) {
  throw new SparkException(s"Failed to store $pieceId of $broadcastId in local BlockManager")
}
{code}

{code}
bm.getLocalBytes(pieceId) match {
  case Some(block) =>
    blocks(pid) = block
    releaseLock(pieceId)
  case None =>
    bm.getRemoteBytes(pieceId) match {
      case Some(b) =>
        if (checksumEnabled) {
          val sum = calcChecksum(b.chunks(0))
          if (sum != checksums(pid)) {
            throw new SparkException(s"corrupt remote block $pieceId of $broadcastId:" +
              s" $sum != ${checksums(pid)}")
          }
        }
        // We found the block from remote executors/driver's BlockManager, so put the block
        // in this executor's BlockManager.
        if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, tellMaster = true)) {
          throw new SparkException(
            s"Failed to store $pieceId of $broadcastId in local BlockManager")
        }
        blocks(pid) = b
      case None =>
        throw new SparkException(s"Failed to get $pieceId of $broadcastId")
    }
}
{code}

The thing these block manager methods have in common is that they bypass the 
encryption code; so broadcast data is stored unencrypted in the block manager, 
causing unencrypted data to be written to disk if those blocks need to be 
evicted from memory.

The correct fix here is actually not to change {{TorrentBroadcast}}, but to fix 
the block manager so that:

- data stored in memory is not encrypted
- data written to disk is encrypted

This would simplify the code paths that use BlockManager / SerializerManager 
APIs (e.g. see SPARK-19520), but requires some tricky changes inside the 
BlockManager to still be able to use file channels to avoid reading whole 
blocks back into memory so they can be decrypted.






[jira] [Created] (SPARK-19555) Improve inefficient StringUtils.escapeLikeRegex() method

2017-02-10 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-19555:
--

 Summary: Improve inefficient StringUtils.escapeLikeRegex() method
 Key: SPARK-19555
 URL: https://issues.apache.org/jira/browse/SPARK-19555
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Josh Rosen


Spark's StringUtils.escapeLikeRegex() method is written inefficiently, 
performing tons of object allocations due to its use of zip(), flatMap(), and 
mkString. Instead, I think the method should be rewritten in an imperative 
style using a Java string builder.

This method can become a performance bottleneck in cases where regex 
expressions are used with non-constant-foldable expressions (e.g. the regex 
expression comes from the data rather than being part of the query).

Here's the code in question: 
https://github.com/apache/spark/blob/d785217b791882e075ad537852d49d78fc1ca31b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala#L28






[jira] [Resolved] (SPARK-19548) Hive UDF should support List and Map types

2017-02-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19548.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16886
[https://github.com/apache/spark/pull/16886]

> Hive UDF should support List and Map types
> --
>
> Key: SPARK-19548
> URL: https://issues.apache.org/jira/browse/SPARK-19548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> We currently do not support List and Map types for Hive UDFs. We should 
> improve this.






[jira] [Created] (SPARK-19554) YARN backend should use history server URL for tracking when UI is disabled

2017-02-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-19554:
--

 Summary: YARN backend should use history server URL for tracking 
when UI is disabled
 Key: SPARK-19554
 URL: https://issues.apache.org/jira/browse/SPARK-19554
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.2.0
Reporter: Marcelo Vanzin
Priority: Minor


Currently, if the app has disabled its UI, Spark does not set a tracking URL in 
YARN. The UI is still available, albeit with a lag, in the history server, if it 
is configured. We should use that as the tracking URL in these cases, instead 
of letting YARN show its default page for applications without a UI.
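
A sketch of the configuration scenario being described, with illustrative values:

{code}
import org.apache.spark.SparkConf

// App with its UI disabled but event logging and a history server configured;
// the proposal is to use the history server URL as the YARN tracking URL here.
val conf = new SparkConf()
  .set("spark.ui.enabled", "false")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-logs")
  .set("spark.yarn.historyServer.address", "historyserver.example.com:18080")
{code}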






[jira] [Assigned] (SPARK-17668) Support representing structs with case classes and tuples in spark sql udf inputs

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17668:


Assignee: Apache Spark

> Support representing structs with case classes and tuples in spark sql udf 
> inputs
> -
>
> Key: SPARK-17668
> URL: https://issues.apache.org/jira/browse/SPARK-17668
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: koert kuipers
>Assignee: Apache Spark
>Priority: Minor
>
> After having gotten used to having case classes represent complex structures in 
> Datasets, I am surprised to find out that when I work in DataFrames with udfs 
> no such magic exists, and I have to fall back to manipulating Row objects, 
> which is error-prone and somewhat ugly.
> For example:
> {noformat}
> case class Person(name: String, age: Int)
> val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", 
> "id")
> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age + 
> 1) }).apply(col("person")))
> df1.printSchema
> df1.show
> {noformat}
> leads to:
> {noformat}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to Person
> {noformat}






[jira] [Commented] (SPARK-17668) Support representing structs with case classes and tuples in spark sql udf inputs

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861930#comment-15861930
 ] 

Apache Spark commented on SPARK-17668:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/16889

> Support representing structs with case classes and tuples in spark sql udf 
> inputs
> -
>
> Key: SPARK-17668
> URL: https://issues.apache.org/jira/browse/SPARK-17668
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> After having gotten used to having case classes represent complex structures in 
> Datasets, I am surprised to find out that when I work in DataFrames with udfs 
> no such magic exists, and I have to fall back to manipulating Row objects, 
> which is error-prone and somewhat ugly.
> For example:
> {noformat}
> case class Person(name: String, age: Int)
> val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", 
> "id")
> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age + 
> 1) }).apply(col("person")))
> df1.printSchema
> df1.show
> {noformat}
> leads to:
> {noformat}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to Person
> {noformat}






[jira] [Assigned] (SPARK-17668) Support representing structs with case classes and tuples in spark sql udf inputs

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17668:


Assignee: (was: Apache Spark)

> Support representing structs with case classes and tuples in spark sql udf 
> inputs
> -
>
> Key: SPARK-17668
> URL: https://issues.apache.org/jira/browse/SPARK-17668
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> After having gotten used to having case classes represent complex structures in 
> Datasets, I am surprised to find out that when I work in DataFrames with udfs 
> no such magic exists, and I have to fall back to manipulating Row objects, 
> which is error-prone and somewhat ugly.
> For example:
> {noformat}
> case class Person(name: String, age: Int)
> val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", 
> "id")
> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age + 
> 1) }).apply(col("person")))
> df1.printSchema
> df1.show
> {noformat}
> leads to:
> {noformat}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to Person
> {noformat}






[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-10 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861918#comment-15861918
 ] 

Michael Armbrust commented on SPARK-19477:
--

If a lot of people are confused by this being lazy we can change it (didn't we 
already change it in 1.6 -> 2.0 in the other direction?).  It would have to be 
configurable though, since removing columns could be a breaking change.

> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}






[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-10 Thread Egor Pahomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861869#comment-15861869
 ] 

Egor Pahomov commented on SPARK-19524:
--

[~sowen], probably yes. I don't know. "Should process only new files and ignore 
existing files in the directory": if you really think about it, then I agree 
that setting this field to false does not mean old files get processed. IMHO, 
everything around this field seems to be poorly documented or architected. 
Since there is no documentation about spark.streaming.minRememberDuration in 
http://spark.apache.org/docs/2.0.2/configuration.html#spark-streaming, I do not 
feel very comfortable changing it. More than that, it would be strange to 
change it in order to process old files, when the purpose of this field is very 
different. Nevertheless, I was given an API with newFilesOnly, about which I 
made a false, but not totally unreasonable, assumption based on all accessible 
documentation. I was wrong, but it still feels like a trap I walked into, one 
that could easily not have been there. 

> newFilesOnly does not work according to docs. 
> --
>
> Key: SPARK-19524
> URL: https://issues.apache.org/jira/browse/SPARK-19524
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> The docs say:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It's not working. 
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
>  says that it shouldn't work as expected, and 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
>  is not clear at all about what the code tries to do.






[jira] [Updated] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState

2017-02-10 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-19540:
-
Affects Version/s: (was: 2.1.1)
   2.2.0

> Add ability to clone SparkSession wherein cloned session has a reference to 
> SharedState and an identical copy of the SessionState
> -
>
> Key: SPARK-19540
> URL: https://issues.apache.org/jira/browse/SPARK-19540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>
> Forking a newSession() from SparkSession currently makes a new SparkSession 
> that does not retain SessionState (i.e. temporary tables, SQL config, 
> registered functions, etc.). This change adds a method cloneSession() which 
> creates a new SparkSession with a copy of the parent's SessionState.
> Subsequent changes to the base session are not propagated to the cloned 
> session; the clone is independent after creation.
> If the base is changed after the clone has been created, say the user registers 
> a new UDF, then the new UDF will not be available inside the clone. The same 
> goes for configs and temp tables.
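
A small sketch of the current newSession() behavior that cloneSession() is meant to complement:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.range(10).createOrReplaceTempView("t")

// newSession() shares the SharedState but starts with fresh SessionState,
// so the temp view is not visible in the forked session.
val forked = spark.newSession()
spark.catalog.listTables().show()   // includes the temp view "t"
forked.catalog.listTables().show()  // "t" is absent
{code}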






[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861822#comment-15861822
 ] 

Sean Owen commented on SPARK-19524:
---

Ah, so you _don't_ want to only read new files. The behavior of 
newFilesOnly=false is _not_ to read _all_ old files. The default other behavior 
is as explained in the SO post. It reprocesses some window of recent data, 
about a minute or so. You can control the size of this lookback with 
spark.streaming.minRememberDuration which is minRememberDurationS in the code 
(this is what I meant above.)
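
A sketch of widening that lookback window (the one-hour value is only an example):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Widen fileStream's remember window from the ~1 minute default to one hour,
// so files modified within the last hour are still picked up on start.
val conf = new SparkConf()
  .setAppName("file-stream-lookback")
  .setMaster("local[2]")
  .set("spark.streaming.minRememberDuration", "3600s")
val ssc = new StreamingContext(conf, Seconds(30))
{code}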

I think this is just the same question as was answered on SO then so this 
should be closed.

> newFilesOnly does not work according to docs. 
> --
>
> Key: SPARK-19524
> URL: https://issues.apache.org/jira/browse/SPARK-19524
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> The docs say:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It's not working. 
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
>  says that it shouldn't work as expected, and 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
>  is not clear at all about what the code tries to do.






[jira] [Resolved] (SPARK-19549) Allow providing reasons for stage/job cancelling

2017-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-19549.
-
   Resolution: Fixed
 Assignee: Ala Luszczak
Fix Version/s: 2.2.0

> Allow providing reasons for stage/job cancelling
> 
>
> Key: SPARK-19549
> URL: https://issues.apache.org/jira/browse/SPARK-19549
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Ala Luszczak
>Assignee: Ala Luszczak
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently it is not possible to pass a cancellation reason to 
> SparkContext.cancelStage() and SparkContext.cancelJob(). In many situations, 
> having such a reason included in the exception message would be useful for the 
> user.
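
A sketch of the intended usage, assuming the new overloads simply take the reason as an extra string argument:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cancel-with-reason").setMaster("local[*]"))

// From a monitoring thread: cancel work and surface a human-readable reason,
// which then appears in the exception message the user sees.
// (A real jobId/stageId would come from a SparkListener or the status tracker.)
val jobId = 0
sc.cancelJob(jobId, "Query exceeded the 10 minute time budget")
{code}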






[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib

2017-02-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861789#comment-15861789
 ] 

Joseph K. Bradley commented on SPARK-14523:
---

I'd like to keep this open until we have linked tasks for the missing 
functionality.

[~hujiayin] This is for parity w.r.t. the RDD-based API, not for adding new 
functionality to MLlib.  I think there's already a JIRA for ARIMA somewhere.

> Feature parity for Statistics ML with MLlib
> ---
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> Some statistics functions are already supported by DataFrame directly. Use this 
> jira to discuss/design the statistics package in Spark.ML and its function 
> scope. Hypothesis testing and correlation computation may still need to expose 
> independent interfaces.






[jira] [Reopened] (SPARK-14523) Feature parity for Statistics ML with MLlib

2017-02-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reopened SPARK-14523:
---

> Feature parity for Statistics ML with MLlib
> ---
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> Some statistics functions are already supported by DataFrame directly. Use this 
> jira to discuss/design the statistics package in Spark.ML and its function 
> scope. Hypothesis testing and correlation computation may still need to expose 
> independent interfaces.






[jira] [Resolved] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs

2017-02-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18613.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16860
[https://github.com/apache/spark/pull/16860]

> spark.ml LDA classes should not expose spark.mllib in APIs
> --
>
> Key: SPARK-18613
> URL: https://issues.apache.org/jira/browse/SPARK-18613
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Sue Ann Hong
>Priority: Critical
> Fix For: 2.2.0
>
>
> spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it 
> should not:
> * {{def oldLocalModel: OldLocalLDAModel}}
> * {{def getModel: OldLDAModel}}
> This task is to deprecate those methods.  I recommend creating 
> {{private[ml]}} versions of the methods which are used internally in order to 
> avoid deprecation warnings.
> Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time.
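
An illustrative sketch of the suggested pattern, with toy types standing in for the actual spark.ml/spark.mllib classes:

{code}
object DeprecationSketch {
  class OldLocalLDAModel // stand-in for the spark.mllib type

  class LDAModel {
    // Internal callers use this accessor and avoid deprecation warnings.
    private[DeprecationSketch] def oldLocalModelInternal: OldLocalLDAModel =
      new OldLocalLDAModel

    @deprecated("Exposes a spark.mllib type; this will be removed.", "2.2.0")
    def oldLocalModel: OldLocalLDAModel = oldLocalModelInternal
  }
}
{code}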






[jira] [Commented] (SPARK-19553) Add GroupedData.countApprox()

2017-02-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861735#comment-15861735
 ] 

Nicholas Chammas commented on SPARK-19553:
--

I needed something like this today. I was profiling some data and didn't need 
exact counts.

> Add GroupedData.countApprox()
> -
>
> Key: SPARK-19553
> URL: https://issues.apache.org/jira/browse/SPARK-19553
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We already have a 
> [{{pyspark.sql.functions.approx_count_distinct()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.approx_count_distinct]
>  that can be applied to grouped data, but it seems odd that you can't just 
> get a regular approximate count for grouped data.
> I imagine the API would mirror that for 
> [{{RDD.countApprox()}}|http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.countApprox],
>  but I'm not sure:
> {code}
> (df
> .groupBy('col1')
> .countApprox(timeout=300, confidence=0.95)
> .show())
> {code}
> Or, if we want to mirror the {{approx_count_distinct()}} function, we can do 
> that too. I'd want to understand why that function doesn't take a timeout or 
> confidence parameter, though. Also, what does {{rsd}} mean? It's not 
> documented.






[jira] [Created] (SPARK-19553) Add GroupedData.countApprox()

2017-02-10 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-19553:


 Summary: Add GroupedData.countApprox()
 Key: SPARK-19553
 URL: https://issues.apache.org/jira/browse/SPARK-19553
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Nicholas Chammas
Priority: Minor


We already have a 
[{{pyspark.sql.functions.approx_count_distinct()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.approx_count_distinct]
 that can be applied to grouped data, but it seems odd that you can't just get 
a regular approximate count for grouped data.

I imagine the API would mirror that for 
[{{RDD.countApprox()}}|http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.countApprox],
 but I'm not sure:

{code}
(df
.groupBy('col1')
.countApprox(timeout=300, confidence=0.95)
.show())
{code}

Or, if we want to mirror the {{approx_count_distinct()}} function, we can do 
that too. I'd want to understand why that function doesn't take a timeout or 
confidence parameter, though. Also, what does {{rsd}} mean? It's not documented.
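
For reference, the grouped approximate-distinct aggregate that exists today, shown in Scala; {{rsd}} is the maximum estimation error allowed, expressed as a relative standard deviation (default 0.05):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 2)).toDF("col1", "col2")

// What exists today: approximate *distinct* counts per group.
df.groupBy("col1")
  .agg(approx_count_distinct("col2", rsd = 0.05))
  .show()
{code}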






[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-02-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861714#comment-15861714
 ] 

Joseph K. Bradley commented on SPARK-9478:
--

[~sethah] Thanks for researching this!  +1 for not using weights during bagging 
and using importance weights to compensate.  Intuitively, that seems like it 
should give better estimators for class conditional probabilities than the 
other option.

If you're splitting this into trees and forests, could you please target your 
PR against a subtask for trees?

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 






[jira] [Resolved] (SPARK-19459) ORC tables cannot be read when they contain char/varchar columns

2017-02-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19459.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> ORC tables cannot be read when they contain char/varchar columns
> 
>
> Key: SPARK-19459
> URL: https://issues.apache.org/jira/browse/SPARK-19459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> Reading from an ORC table which contains char/varchar columns can fail if the 
> table has been created using Spark. This is caused by the fact that Spark 
> internally replaces char and varchar columns with a string column; this 
> causes the ORC reader to use the wrong reader, and that eventually causes a 
> ClassCastException.






[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-10 Thread Egor Pahomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861600#comment-15861600
 ] 

Egor Pahomov commented on SPARK-19524:
--

[~sowen], 
The folder I connect my streaming to:
{code}
[egor@hadoop2 test]$ date
Fri Feb 10 09:51:16 PST 2017
[egor@hadoop2 test]$ ls -al
total 445746
drwxr-xr-x 13 egor egor              4096 Feb  8 14:27 .
drwxr-xr-x 43 egor egor              4096 Feb  9 01:38 ..
-rw-r--r--  1 root jobexecutors    241661 Dec  1 18:03 clog.1480636981858.fl.log.gz
-rw-r--r--  1 egor egor            387024 Feb  1 17:26 clog.1485986399693.fl.log.gz
-rw-r--r--  1 egor egor         128983477 Feb  8 12:43 clog.2017-01-03.1483431170180.9861.log.gz
-rw-r--r--  1 root jobexecutors  67422481 Dec  1 00:01 clog.new-1.1480579205495.fl.log.gz
-rw-r--r--  1 egor egor            287279 Feb  8 13:21 data2.log.gz
-rw-r--r--  1 egor egor         128983477 Feb  8 14:10 data300.log.gz
-rw-r--r--  1 egor egor         128983477 Feb  8 14:20 data365.log.gz
-rw-r--r--  1 egor egor            287279 Feb  8 13:23 data3.log.gz
-rw-r--r--  1 egor egor            287279 Feb  8 13:45 data4.log.gz
-rwxrwxr-x  1 egor egor            287279 Feb  8 14:04 data5.log.gz
-rwxrwxr-x  1 egor egor            287279 Feb  8 14:08 data6.log.gz
{code}
The way I connect: 
{code}
def f(path:Path): Boolean = {
  !path.getName.contains("tmp")
}

val client_log_d_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  input_folder, f _, newFilesOnly = false)
{code}

Nothing is processed. Then I add a file to the directory and it processes that 
file, but not the old ones.

> newFilesOnly does not work according to docs. 
> --
>
> Key: SPARK-19524
> URL: https://issues.apache.org/jira/browse/SPARK-19524
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> Docs says:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It's not working. 
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
>  says that it shouldn't work as expected. 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
>  is not clear at all in terms of what the code tries to do



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19493) Remove Java 7 support

2017-02-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861576#comment-15861576
 ] 

Andrew Ash commented on SPARK-19493:


+1 -- we're removing Java 7 compatibility from the core internal libraries we run 
in Spark, and we rarely encounter clusters running Java 7 anymore.

> Remove Java 7 support
> -
>
> Key: SPARK-19493
> URL: https://issues.apache.org/jira/browse/SPARK-19493
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to 
> officially remove Java 7 support in 2.2 or 2.3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-02-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861525#comment-15861525
 ] 

Sean Owen commented on SPARK-19552:
---

The question you should focus on before proceeding is what the implications of 
updating are for users. Yes, it requires Spark changes, and I think that change in 
Netty 4 leaks into the user classpath by default. Are there behavior changes? 
We've had problems along this line in the past.

Yes the other JIRA answers about the existence of 3.9.x.

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern - I don't know if 
> Spark can be used as an attack vector so let's upgrade the version we use to 
> be on the safe side. The security fix I'm especially interested in is not 
> available in the 4.0.x release line.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. I'd also like to know the purpose 
> of the additional netty (without "all" in the artifact name) in our pom 
> that's at version 3.9.9.
> This JIRA and associated pull request starts the process which I'll work on - 
> and any help would be much appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19551) Theme for PySpark documenation could do with improving

2017-02-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861522#comment-15861522
 ] 

Sean Owen commented on SPARK-19551:
---

How do you apply the theme? I don't think it's something specific to Spark nor 
something Spark would custom build. If it's easy to flip some switches though, 
that's great.

> Theme for PySpark documenation could do with improving
> --
>
> Key: SPARK-19551
> URL: https://issues.apache.org/jira/browse/SPARK-19551
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.1.0
>Reporter: Arthur Tacca
>Priority: Minor
>
> I have found the Python Spark documentation hard to navigate for two reasons:
> * Each page in the documentation is huge, because the whole of the 
> documentation is split up into only a few chunks.
> * The methods for each class are not listed in a short form, so the only way 
> to look through them is to browse past the full documentation for all methods 
> (including parameter lists, examples, etc.).
> This has irritated someone enough that they have done [their own build of the 
> pyspark documentation|http://takwatanabe.me/pyspark/index.html]. In 
> comparison to the official docs they are a delight to use. But of course it 
> is not clear whether they'll be kept up to date, which is why I'm asking here 
> that the official docs are improved. Perhaps that site could be used as 
> inspiration? I don't know much about these things, but it appears that the 
> main change they have made is to switch to the "read the docs" theme.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19552:


Assignee: (was: Apache Spark)

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern - I don't know if 
> Spark can be used as an attack vector so let's upgrade the version we use to 
> be on the safe side. The security fix I'm especially interested in is not 
> available in the 4.0.x release line.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. I'd also like to know the purpose 
> of the additional netty (without "all" in the artifact name) in our pom 
> that's at version 3.9.9.
> This JIRA and associated pull request starts the process which I'll work on - 
> and any help would be much appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19552:


Assignee: Apache Spark

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Assignee: Apache Spark
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern - I don't know if 
> Spark can be used as an attack vector so let's upgrade the version we use to 
> be on the safe side. The security fix I'm especially interested in is not 
> available in the 4.0.x release line.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. I'd also like to know the purpose 
> of the additional netty (without "all" in the artifact name) in our pom 
> that's at version 3.9.9.
> This JIRA and associated pull request starts the process which I'll work on - 
> and any help would be much appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861500#comment-15861500
 ] 

Apache Spark commented on SPARK-19552:
--

User 'a-roberts' has created a pull request for this issue:
https://github.com/apache/spark/pull/16888

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern - I don't know if 
> Spark can be used as an attack vector so let's upgrade the version we use to 
> be on the safe side. The security fix I'm especially interested in is not 
> available in the 4.0.x release line.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. I'd also like to know the purpose 
> of the additional netty (without "all" in the artifact name) in our pom 
> that's at version 3.9.9.
> This JIRA and associated pull request starts the process which I'll work on - 
> and any help would be much appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19466) Improve Fair Scheduler Logging

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19466.

   Resolution: Fixed
 Assignee: Eren Avsarogullari
Fix Version/s: 2.2.0

> Improve Fair Scheduler Logging
> --
>
> Key: SPARK-19466
> URL: https://issues.apache.org/jira/browse/SPARK-19466
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>Priority: Minor
> Fix For: 2.2.0
>
>
> Fair Scheduler logging for the following cases can be useful for the user.
> 1- If a *valid* spark.scheduler.allocation.file property is set, the user can be 
> informed, so they are aware of which scheduler file is processed when 
> SparkContext initializes.
> 2- If an *invalid* spark.scheduler.allocation.file property is set, currently 
> the following stacktrace is shown to the user. In addition, a more meaningful 
> message can be shown to the user by emphasizing that the problem occurs while 
> building the fair scheduler, and by covering other potential issues at this 
> level.
> {code:xml}
> Exception in thread "main" java.io.FileNotFoundException: INVALID_FILE (No 
> such file or directory)
>   at java.io.FileInputStream.open0(Native Method)
>   at java.io.FileInputStream.open(FileInputStream.java:195)
>   at java.io.FileInputStream.(FileInputStream.java:138)
>   at java.io.FileInputStream.(FileInputStream.java:93)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:76)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:75)
> {code}
> 3- If the spark.scheduler.allocation.file property is not set and the *default* fair 
> scheduler file (fairscheduler.xml) is found in the classpath, it will be loaded, 
> but currently the user is not informed, so logging can be useful.
> 4- If the spark.scheduler.allocation.file property is not set and the default fair 
> scheduler file does not exist, currently the user is not informed, so logging can 
> be useful.
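For context, a sketch of how the allocation file property is typically set (the path and app name are placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("fair-scheduler-logging-example")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
// Cases 1-4 above are about what gets logged while this initializes.
val sc = new SparkContext(conf)
{code}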



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-02-10 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861494#comment-15861494
 ] 

Adam Roberts commented on SPARK-19552:
--

[~srowen] interested in your thoughts and noticed your work at [SPARK-18586]

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern - I don't know if 
> Spark can be used as an attack vector so let's upgrade the version we use to 
> be on the safe side. The security fix I'm especially interested in is not 
> available in the 4.0.x release line.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. I'd also like to know the purpose 
> of the additional netty (without "all" in the artifact name) in our pom 
> that's at version 3.9.9.
> This JIRA and associated pull request starts the process which I'll work on - 
> and any help would be much appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-02-10 Thread Adam Roberts (JIRA)
Adam Roberts created SPARK-19552:


 Summary: Upgrade Netty version to 4.1.8 final
 Key: SPARK-19552
 URL: https://issues.apache.org/jira/browse/SPARK-19552
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.1.0
Reporter: Adam Roberts
Priority: Minor


Netty 4.1.8 was recently released but isn't API compatible with previous major 
versions (like Netty 4.0.x), see 
http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.

This version does include a fix for a security concern - I don't know if Spark 
can be used as an attack vector so let's upgrade the version we use to be on 
the safe side. The security fix I'm especially interested in is not available 
in the 4.0.x release line.

As this 4.1 version involves API changes we'll need to implement a few methods 
and possibly adjust the Sasl tests. I'd also like to know the purpose of the 
additional netty (without "all" in the artifact name) in our pom that's at 
version 3.9.9.

This JIRA and associated pull request starts the process which I'll work on - 
and any help would be much appreciated! Currently I know:

{code}
@Override
public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise)
  throws Exception {
  if (!foundEncryptionHandler) {
foundEncryptionHandler =
  ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
returns false and causes test failures
  }
  ctx.write(msg, promise);
}
{code}


Here's what changes will be required (at least):

{code}
common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
 requires touch, retain and transferred methods

{code}
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
 requires the above methods too

{code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}

With "dummy" implementations so we can at least compile and test, we'll see 
five new test failures to address.

These are
{code}
org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19551) Theme for PySpark documenation could do with improving

2017-02-10 Thread Arthur Tacca (JIRA)
Arthur Tacca created SPARK-19551:


 Summary: Theme for PySpark documenation could do with improving
 Key: SPARK-19551
 URL: https://issues.apache.org/jira/browse/SPARK-19551
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Affects Versions: 2.1.0
Reporter: Arthur Tacca
Priority: Minor


I have found the Python Spark documentation hard to navigate for two reasons:

* Each page in the documentation is huge, because the whole of the 
documentation is split up into only a few chunks.
* The methods for each class are not listed in a short form, so the only way to 
look through them is to browse past the full documentation for all methods 
(including parameter lists, examples, etc.).

This has irritated someone enough that they have done [their own build of the 
pyspark documentation|http://takwatanabe.me/pyspark/index.html]. In comparison 
to the official docs they are a delight to use. But of course it is not clear 
whether they'll be kept up to date, which is why I'm asking here that the 
official docs are improved. Perhaps that site could be used as inspiration? I 
don't know much about these things, but it appears that the main change they 
have made is to switch to the "read the docs" theme.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2017-02-10 Thread Danny Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861453#comment-15861453
 ] 

Danny Robinson commented on SPARK-4563:
---

I found one completely hacked-together way to get my Spark driver running in Docker 
to work against a non-Docker Spark cluster. This is Spark 1.6.2.

export SPARK_PUBLIC_DNS=IPADDR_OF_DOCKER_HOST_OR_PROXY
export SPARK_LOCAL_IP=IPADDR_OF_DOCKER_HOST_OR_PROXY

At container startup I do this:
echo -e "`hostname -i` `hostname` ${HOSTNAME_OF_DOCKER_HOST_OR_PROXY}" >> /etc/hosts

Essentially, the exports seem to control the IP that the Spark UI and BlockManager 
recognize. The hosts-file hack allows the Spark driver to resolve the external 
hostname as if it were a local hostname, so it knows which interface card to 
listen on, and it then uses that hostname in the connection info it sends to the 
executors. When the executors connect back, they are obviously resolving the 
hostname to the correct external IP.

The reason I say HOST or PROXY is that I run haproxy as a Docker load balancer at 
the front of my swarm. That ensures I never have to worry about exactly which node 
is running the Spark driver; all traffic routes via haproxy.

I agree with many here, though: this is crazy complicated and inconsistent.
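If I understand the 2.1.0 fix referenced below correctly, it splits this into separate bind and advertise settings; a rough sketch with placeholder addresses:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.bindAddress", "0.0.0.0")        // interface the driver binds to
  .set("spark.driver.host", "docker-host.example")   // address advertised to executors
{code}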


> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark driver bind ip and advertise is not configurable. spark.driver.host is 
> only bind ip. SPARK_PUBLIC_DNS does not work for spark driver. Allow option 
> to set advertised ip/hostname



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19550) Remove reflection, docs, build elements related to Java 7

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19550:


Assignee: Apache Spark  (was: Sean Owen)

> Remove reflection, docs, build elements related to Java 7
> -
>
> Key: SPARK-19550
> URL: https://issues.apache.org/jira/browse/SPARK-19550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, Spark Core
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>
> - Move external/java8-tests tests into core, streaming, sql and remove
> - Remove MaxPermGen and related options
> - Fix some reflection / TODOs around Java 8+ methods
> - Update doc references to 1.7/1.8 differences
> - Remove Java 7/8 related build profiles
> - Update some plugins for better Java 8 compatibility
> - Fix a few Java-related warnings



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19493) Remove Java 7 support

2017-02-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861395#comment-15861395
 ] 

Sean Owen commented on SPARK-19493:
---

(I broke out the base change to core, docs and build into a sub-task)

> Remove Java 7 support
> -
>
> Key: SPARK-19493
> URL: https://issues.apache.org/jira/browse/SPARK-19493
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to 
> officially remove Java 7 support in 2.2 or 2.3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19550) Remove reflection, docs, build elements related to Java 7

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861396#comment-15861396
 ] 

Apache Spark commented on SPARK-19550:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16871

> Remove reflection, docs, build elements related to Java 7
> -
>
> Key: SPARK-19550
> URL: https://issues.apache.org/jira/browse/SPARK-19550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, Spark Core
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> - Move external/java8-tests tests into core, streaming, sql and remove
> - Remove MaxPermGen and related options
> - Fix some reflection / TODOs around Java 8+ methods
> - Update doc references to 1.7/1.8 differences
> - Remove Java 7/8 related build profiles
> - Update some plugins for better Java 8 compatibility
> - Fix a few Java-related warnings



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19550) Remove reflection, docs, build elements related to Java 7

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19550:


Assignee: Sean Owen  (was: Apache Spark)

> Remove reflection, docs, build elements related to Java 7
> -
>
> Key: SPARK-19550
> URL: https://issues.apache.org/jira/browse/SPARK-19550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, Spark Core
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> - Move external/java8-tests tests into core, streaming, sql and remove
> - Remove MaxPermGen and related options
> - Fix some reflection / TODOs around Java 8+ methods
> - Update doc references to 1.7/1.8 differences
> - Remove Java 7/8 related build profiles
> - Update some plugins for better Java 8 compatibility
> - Fix a few Java-related warnings



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19550) Remove reflection, docs, build elements related to Java 7

2017-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19550:
--
Issue Type: Sub-task  (was: Task)
Parent: SPARK-19493

> Remove reflection, docs, build elements related to Java 7
> -
>
> Key: SPARK-19550
> URL: https://issues.apache.org/jira/browse/SPARK-19550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, Spark Core
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> - Move external/java8-tests tests into core, streaming, sql and remove
> - Remove MaxPermGen and related options
> - Fix some reflection / TODOs around Java 8+ methods
> - Update doc references to 1.7/1.8 differences
> - Remove Java 7/8 related build profiles
> - Update some plugins for better Java 8 compatibility
> - Fix a few Java-related warnings



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19550) Remove reflection, docs, build elements related to Java 7

2017-02-10 Thread Sean Owen (JIRA)
Sean Owen created SPARK-19550:
-

 Summary: Remove reflection, docs, build elements related to Java 7
 Key: SPARK-19550
 URL: https://issues.apache.org/jira/browse/SPARK-19550
 Project: Spark
  Issue Type: Task
  Components: Build, Documentation, Spark Core
Affects Versions: 2.2.0
Reporter: Sean Owen
Assignee: Sean Owen


- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19549) Allow providing reasons for stage/job cancelling

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861374#comment-15861374
 ] 

Apache Spark commented on SPARK-19549:
--

User 'ala' has created a pull request for this issue:
https://github.com/apache/spark/pull/16887

> Allow providing reasons for stage/job cancelling
> 
>
> Key: SPARK-19549
> URL: https://issues.apache.org/jira/browse/SPARK-19549
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Ala Luszczak
>Priority: Minor
>
> Currently it is not possible to pass a cancellation reason to 
> SparkContext.cancelStage() and SparkContext.cancelJob(). In many situations, 
> having such a reason included in the exception message would be useful for the 
> user.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19549) Allow providing reasons for stage/job cancelling

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19549:


Assignee: Apache Spark

> Allow providing reasons for stage/job cancelling
> 
>
> Key: SPARK-19549
> URL: https://issues.apache.org/jira/browse/SPARK-19549
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Ala Luszczak
>Assignee: Apache Spark
>Priority: Minor
>
> Currently it is not possible to pass a cancellation reason to 
> SparkContext.cancelStage() and SparkContext.cancelJob(). In many situations, 
> having such a reason included in the exception message would be useful for the 
> user.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19549) Allow providing reasons for stage/job cancelling

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19549:


Assignee: (was: Apache Spark)

> Allow providing reasons for stage/job cancelling
> 
>
> Key: SPARK-19549
> URL: https://issues.apache.org/jira/browse/SPARK-19549
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Ala Luszczak
>Priority: Minor
>
> Currently it is not possible to pass a cancellation reason to 
> SparkContext.cancelStage() and SparkContext.cancelJob(). In many situations, 
> having such a reason included in the exception message would be useful for the 
> user.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19549) Allow providing reasons for stage/job cancelling

2017-02-10 Thread Ala Luszczak (JIRA)
Ala Luszczak created SPARK-19549:


 Summary: Allow providing reasons for stage/job cancelling
 Key: SPARK-19549
 URL: https://issues.apache.org/jira/browse/SPARK-19549
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Ala Luszczak
Priority: Minor


Currently it is not possible to pass a cancellation reason to 
SparkContext.cancelStage() and SparkContext.cancelJob(). In many situations, 
having such a reason included in the exception message would be useful for the 
user.
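A sketch of the current API surface to make the gap concrete (the job id and the reason string are made up; the two-argument form is the proposal, not an existing method):

{code}
// Assumes an existing SparkContext named sc.
val jobId = 42  // hypothetical job id, e.g. obtained from a SparkListener
sc.cancelJob(jobId)  // today: no reason can be attached
// Proposed: sc.cancelJob(jobId, "query exceeded its time budget"), so the
// reason ends up in the SparkException message seen by the user.
{code}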



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10748) Log error instead of crashing Spark Mesos dispatcher when a job is misconfigured

2017-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-10748:
-

 Assignee: Devaraj K
Affects Version/s: 2.1.0
 Priority: Minor  (was: Major)
   Issue Type: Improvement  (was: Bug)

> Log error instead of crashing Spark Mesos dispatcher when a job is 
> misconfigured
> 
>
> Key: SPARK-10748
> URL: https://issues.apache.org/jira/browse/SPARK-10748
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Timothy Chen
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently when a dispatcher is submitting a new driver, it simply throws a 
> SparkException when the necessary configuration is not set. We should log and 
> keep the dispatcher running instead of crashing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10748) Log error instead of crashing Spark Mesos dispatcher when a job is misconfigured

2017-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10748.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 13077
[https://github.com/apache/spark/pull/13077]

> Log error instead of crashing Spark Mesos dispatcher when a job is 
> misconfigured
> 
>
> Key: SPARK-10748
> URL: https://issues.apache.org/jira/browse/SPARK-10748
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Timothy Chen
> Fix For: 2.2.0
>
>
> Currently when a dispatcher is submitting a new driver, it simply throws a 
> SparkException when the necessary configuration is not set. We should log and 
> keep the dispatcher running instead of crashing.
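A generic sketch of the intended log-and-continue behaviour (not the actual dispatcher code; the wrapper is made up):

{code}
import org.apache.spark.SparkException

// Made-up wrapper illustrating "log and keep running" instead of crashing.
def handleSubmission(submit: () => Unit): Unit = {
  try {
    submit()
  } catch {
    case e: SparkException =>
      // Log the misconfigured submission and keep the dispatcher alive.
      System.err.println(s"Rejecting misconfigured driver submission: ${e.getMessage}")
  }
}
{code}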



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861319#comment-15861319
 ] 

Sean Owen commented on SPARK-19524:
---

It's based on modification time, by the way. Are the files' modification dates 
after the current system clock time?
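If it helps debugging, here is a quick sketch for dumping the modification times being compared (local filesystem only; the path is a placeholder):

{code}
import java.io.File
import java.util.Date

new File("/path/to/input_folder").listFiles().foreach { f =>
  println(s"${f.getName} -> ${new Date(f.lastModified())}")
}
{code}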

> newFilesOnly does not work according to docs. 
> --
>
> Key: SPARK-19524
> URL: https://issues.apache.org/jira/browse/SPARK-19524
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> Docs says:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It's not working. 
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
>  says that it shouldn't work as expected. 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
>  is not clear at all in terms of what the code tries to do



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19546) Every mail to u...@spark.apache.org is getting blocked

2017-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19546.
---
Resolution: Invalid

This isn't what JIRA is for, but could you ask on dev@ who has mailing list admin 
rights? I don't even know who, if anyone, can kick people off the list.

> Every mail to u...@spark.apache.org is getting blocked
> --
>
> Key: SPARK-19546
> URL: https://issues.apache.org/jira/browse/SPARK-19546
> Project: Spark
>  Issue Type: IT Help
>  Components: Project Infra
>Affects Versions: 2.1.0
>Reporter: Shivam Sharma
>
> Each time I send mail to  u...@spark.apache.org I get an email from 
> yahoo-inc saying that "tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19545) Compilation error with method not found when build against Hadoop 2.6.0.

2017-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19545.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16884
[https://github.com/apache/spark/pull/16884]

> Compilation error with method not found when build against Hadoop 2.6.0.
> 
>
> Key: SPARK-19545
> URL: https://issues.apache.org/jira/browse/SPARK-19545
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
> Fix For: 2.2.0
>
>
> {code}
> ./build/sbt -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
> {code}
> {code}
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:249:
>  value setRolledLogsIncludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error]   
> logAggregationContext.setRolledLogsIncludePattern(includePattern)
> [error] ^
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:251:
>  value setRolledLogsExcludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error] 
> logAggregationContext.setRolledLogsExcludePattern(excludePattern)
> [error]   ^
> [error] two errors found
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19545) Compilation error with method not found when build against Hadoop 2.6.0.

2017-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19545:
-

Assignee: Saisai Shao

> Compilation error with method not found when build against Hadoop 2.6.0.
> 
>
> Key: SPARK-19545
> URL: https://issues.apache.org/jira/browse/SPARK-19545
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.2.0
>
>
> {code}
> ./build/sbt -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
> {code}
> {code}
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:249:
>  value setRolledLogsIncludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error]   
> logAggregationContext.setRolledLogsIncludePattern(includePattern)
> [error] ^
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:251:
>  value setRolledLogsExcludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error] 
> logAggregationContext.setRolledLogsExcludePattern(excludePattern)
> [error]   ^
> [error] two errors found
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19543) from_json fails when the input row is empty

2017-02-10 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19543.
---
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.2.0
   2.1.1

> from_json fails when the input row is empty 
> 
>
> Key: SPARK-19543
> URL: https://issues.apache.org/jira/browse/SPARK-19543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>
> Using from_json on a column with an empty string results in: 
> java.util.NoSuchElementException: head of empty list
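A minimal repro sketch, assuming a SparkSession named {{spark}}:

{code}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

val schema = new StructType().add("a", IntegerType)
val df = Seq("""{"a": 1}""", "").toDF("json")
// Before the fix, the empty-string row triggered
// java.util.NoSuchElementException: head of empty list.
df.select(from_json($"json", schema)).show()
{code}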



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19548) Hive UDF should support List and Map types

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861148#comment-15861148
 ] 

Apache Spark commented on SPARK-19548:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16886

> Hive UDF should support List and Map types
> --
>
> Key: SPARK-19548
> URL: https://issues.apache.org/jira/browse/SPARK-19548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> We currently do not support List and Map types for Hive UDFs. We should 
> improve this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19548) Hive UDF should support List and Map types

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19548:


Assignee: Herman van Hovell  (was: Apache Spark)

> Hive UDF should support List and Map types
> --
>
> Key: SPARK-19548
> URL: https://issues.apache.org/jira/browse/SPARK-19548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> We currently do not support List and Map types for Hive UDFs. We should 
> improve this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19548) Hive UDF should support List and Map types

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19548:


Assignee: Apache Spark  (was: Herman van Hovell)

> Hive UDF should support List and Map types
> --
>
> Key: SPARK-19548
> URL: https://issues.apache.org/jira/browse/SPARK-19548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> We currently do not support List and Map types for Hive UDFs. We should 
> improve this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19546) Every mail to u...@spark.apache.org is getting blocked

2017-02-10 Thread Shivam Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivam Sharma updated SPARK-19546:
--
Priority: Major  (was: Minor)

> Every mail to u...@spark.apache.org is getting blocked
> --
>
> Key: SPARK-19546
> URL: https://issues.apache.org/jira/browse/SPARK-19546
> Project: Spark
>  Issue Type: IT Help
>  Components: Project Infra
>Affects Versions: 2.1.0
>Reporter: Shivam Sharma
>
> Each time I send mail to  u...@spark.apache.org I get an email from 
> yahoo-inc saying that "tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19548) Hive UDF should support List and Map types

2017-02-10 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-19548:
-

 Summary: Hive UDF should support List and Map types
 Key: SPARK-19548
 URL: https://issues.apache.org/jira/browse/SPARK-19548
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell


We currently do not support List and Map types for Hive UDFs. We should improve 
this.
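A sketch of the case that currently fails, assuming a Hive-enabled SparkSession named {{spark}} and a hypothetical UDF class {{com.example.MyCollectionUdf}}:

{code}
// Both the class name and the function name are hypothetical.
spark.sql("CREATE TEMPORARY FUNCTION my_collection_udf AS 'com.example.MyCollectionUdf'")
// Passing array- and map-typed arguments is the unsupported part today.
spark.sql("SELECT my_collection_udf(array(1, 2, 3), map('k', 'v'))").show()
{code}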



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception

2017-02-10 Thread wuchang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuchang updated SPARK-19547:

Description: 
Below is my Scala code to create a Spark Kafka stream:

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "server110:2181,server110:9092",
  "zookeeper" -> "server110:2181",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("ABTest")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

But after running for 10 hours, it throws exceptions:

2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.ConsumerCoordinator: 
Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example
2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.AbstractCoordinator: 
(Re-)joining group example
2017-02-10 10:56:20,011 INFO  [JobGenerator] internals.AbstractCoordinator: 
(Re-)joining group example
2017-02-10 10:56:40,057 INFO  [JobGenerator] internals.AbstractCoordinator: 
Successfully joined group example with generation 5
2017-02-10 10:56:40,058 INFO  [JobGenerator] internals.ConsumerCoordinator: 
Setting newly assigned partitions [ABTest-1] for group example
2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error 
generating jobs for time 148669538 ms
java.lang.IllegalStateException: No current assignment for partition ABTest-0
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:246)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:246)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:182)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)


Obviously 

[jira] [Created] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception

2017-02-10 Thread wuchang (JIRA)
wuchang created SPARK-19547:
---

 Summary: KafkaUtil throw 'No current assignment for partition' 
Exception
 Key: SPARK-19547
 URL: https://issues.apache.org/jira/browse/SPARK-19547
 Project: Spark
  Issue Type: Question
  Components: DStreams
Affects Versions: 1.6.1
Reporter: wuchang


val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "server110:2181,server110:9092",
  "zookeeper" -> "server110:2181",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("ABTest")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

This is my code to create a Kafka stream. After running for 10 hours, it throws 
these exceptions:

2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.ConsumerCoordinator: 
Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example
2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.AbstractCoordinator: 
(Re-)joining group example
2017-02-10 10:56:20,011 INFO  [JobGenerator] internals.AbstractCoordinator: 
(Re-)joining group example
2017-02-10 10:56:40,057 INFO  [JobGenerator] internals.AbstractCoordinator: 
Successfully joined group example with generation 5
2017-02-10 10:56:40,058 INFO  [JobGenerator] internals.ConsumerCoordinator: 
Setting newly assigned partitions [ABTest-1] for group example
2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error 
generating jobs for time 148669538 ms
java.lang.IllegalStateException: No current assignment for partition ABTest-0
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:246)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:246)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:182)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
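For context, here is a minimal sketch of one way to sidestep the rebalance that triggers the 
seekToEnd failure: pinning the consumer to explicit partitions with ConsumerStrategies.Assign 
instead of Subscribe, so the group coordinator never revokes an assignment mid-batch. The broker 
address, topic name, partition count and batch interval are assumptions taken from the report 
above; this is only an illustrative workaround, not the resolution of this ticket.

{code}
// Illustrative sketch only (assumed broker server110:9092, topic ABTest with two
// partitions, group "example"): explicitly assigned partitions are not managed by
// the group coordinator, so a rebalance cannot revoke them.
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object AssignSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("assign-sketch"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "server110:9092",   // broker address only, no ZooKeeper port
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Fix the partitions up front instead of subscribing to the topic.
    val partitions = Seq(new TopicPartition("ABTest", 0), new TopicPartition("ABTest", 1))

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Assign[String, String](partitions, kafkaParams)
    )

    stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}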

[jira] [Updated] (SPARK-19512) codegen for compare structs fails

2017-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-19512:

Fix Version/s: 2.1.1

> codegen for compare structs fails
> -
>
> Key: SPARK-19512
> URL: https://issues.apache.org/jira/browse/SPARK-19512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
> Fix For: 2.1.1, 2.2.0
>
>
> This (1 struct field)
> {code:java|title=1 struct field}
> spark.range(10)
>   .selectExpr("named_struct('a', id) as col1", "named_struct('a', id+2) 
> as col2")
>   .filter("col1 = col2").count
> {code}
> fails with
> {code}
> [info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 144, Column 32: Expression "range_value" is not an 
> rvalue
> {code}
> This (2 struct fields)
> {code:java|title=2 struct fields}
> spark.range(10)
> .selectExpr("named_struct('a', id, 'b', id) as col1", 
> "named_struct('a',id+2, 'b',id+2) as col2")
> .filter($"col1" === $"col2").count
> {code}
> fails with 
> {code}
> Caused by: java.lang.IndexOutOfBoundsException: 1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64)
> {code}
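
While the struct-equality codegen path is broken, a purely hypothetical workaround is to express 
the predicate field by field, which keeps the filter out of the failing code path. The session 
setup and object names below are assumptions for illustration, not part of the ticket or its fix.

{code}
// Hypothetical workaround sketch: compare the struct fields individually instead
// of comparing the structs themselves.
import org.apache.spark.sql.SparkSession

object StructCompareSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("struct-compare-sketch").getOrCreate()

    val df = spark.range(10)
      .selectExpr("named_struct('a', id, 'b', id) as col1",
                  "named_struct('a', id + 2, 'b', id + 2) as col2")

    // Field-wise comparison avoids generating comparison code for the struct values.
    val n = df.filter("col1.a = col2.a AND col1.b = col2.b").count()
    println(n)

    spark.stop()
  }
}
{code}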



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19545) Compilation error with method not found when build against Hadoop 2.6.0.

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19545:


Assignee: Apache Spark

> Compilation error with method not found when build against Hadoop 2.6.0.
> 
>
> Key: SPARK-19545
> URL: https://issues.apache.org/jira/browse/SPARK-19545
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>
> {code}
> ./build/sbt -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
> {code}
> {code}
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:249:
>  value setRolledLogsIncludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error]   
> logAggregationContext.setRolledLogsIncludePattern(includePattern)
> [error] ^
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:251:
>  value setRolledLogsExcludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error] 
> logAggregationContext.setRolledLogsExcludePattern(excludePattern)
> [error]   ^
> [error] two errors found
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19545) Compilation error with method not found when build against Hadoop 2.6.0.

2017-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19545:


Assignee: (was: Apache Spark)

> Compilation error with method not found when build against Hadoop 2.6.0.
> 
>
> Key: SPARK-19545
> URL: https://issues.apache.org/jira/browse/SPARK-19545
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>
> {code}
> ./build/sbt -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
> {code}
> {code}
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:249:
>  value setRolledLogsIncludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error]   
> logAggregationContext.setRolledLogsIncludePattern(includePattern)
> [error] ^
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:251:
>  value setRolledLogsExcludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error] 
> logAggregationContext.setRolledLogsExcludePattern(excludePattern)
> [error]   ^
> [error] two errors found
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19545) Compilation error with method not found when build against Hadoop 2.6.0.

2017-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860994#comment-15860994
 ] 

Apache Spark commented on SPARK-19545:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/16884

> Compilation error with method not found when build against Hadoop 2.6.0.
> 
>
> Key: SPARK-19545
> URL: https://issues.apache.org/jira/browse/SPARK-19545
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>
> {code}
> ./build/sbt -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
> {code}
> {code}
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:249:
>  value setRolledLogsIncludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error]   
> logAggregationContext.setRolledLogsIncludePattern(includePattern)
> [error] ^
> [error] 
> /Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:251:
>  value setRolledLogsExcludePattern is not a member of 
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [error] 
> logAggregationContext.setRolledLogsExcludePattern(excludePattern)
> [error]   ^
> [error] two errors found
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19546) Every mail to u...@spark.apache.org is getting blocked

2017-02-10 Thread Shivam Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivam Sharma updated SPARK-19546:
--
Description: 
Each time I send mail to u...@spark.apache.org, I get an email from yahoo-inc 
saying that "tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc".


  was:
Each time I send mail to u...@spark.apache.org, I get an email from yahoo-inc 
saying that "tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc".

P

Summary: Every mail to u...@spark.apache.org is getting blocked  (was: 
Every mail to u...@spark.apache.org is blocked)

> Every mail to u...@spark.apache.org is getting blocked
> --
>
> Key: SPARK-19546
> URL: https://issues.apache.org/jira/browse/SPARK-19546
> Project: Spark
>  Issue Type: IT Help
>  Components: Project Infra
>Affects Versions: 2.1.0
>Reporter: Shivam Sharma
>Priority: Minor
>
> Each time I send mail to u...@spark.apache.org, I get an email from yahoo-inc 
> saying that "tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19546) Every mail to u...@spark.apache.org is blocked

2017-02-10 Thread Shivam Sharma (JIRA)
Shivam Sharma created SPARK-19546:
-

 Summary: Every mail to u...@spark.apache.org is blocked
 Key: SPARK-19546
 URL: https://issues.apache.org/jira/browse/SPARK-19546
 Project: Spark
  Issue Type: IT Help
  Components: Project Infra
Affects Versions: 2.1.0
Reporter: Shivam Sharma
Priority: Minor


Each time I send mail to u...@spark.apache.org, I get an email from yahoo-inc 
saying that "tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc".

P



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19473) Several DataFrame Methods still fail with dot in column names

2017-02-10 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang closed SPARK-19473.
---
Resolution: Not A Problem

> Several DataFrame Methods still fail with dot in column names 
> --
>
> Key: SPARK-19473
> URL: https://issues.apache.org/jira/browse/SPARK-19473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>
> Here is an example:
> {code}
> val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("y.a", "x.b")
> df.select("y.a")
> org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input 
> columns: [y.a, x.b];;
> df.withColumn("d", col("y.a") + col("x.b"))
> org.apache.spark.sql.AnalysisException: cannot resolve '`y.a`' given input 
> columns: [y.a, x.b];;
> {code}
> We can use backquote to avoid the errors, but this behavior is affecting some 
> downstream work such as RFormula and SparkR. 
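
A minimal sketch of the backquote workaround mentioned above, assuming a plain local 
SparkSession: wrapping the dotted names in backticks makes the analyzer treat them as single 
column names rather than struct field accesses.

{code}
// Sketch of the backtick workaround: `y.a` resolves as one column named "y.a".
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DottedColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dotted-column-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1.0, 2.0), (2.0, 3.0)).toDF("y.a", "x.b")

    df.select("`y.a`").show()                               // resolves fine
    df.withColumn("d", col("`y.a`") + col("`x.b`")).show()  // resolves fine

    spark.stop()
  }
}
{code}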



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19545) Compilation error with method not found when build against Hadoop 2.6.0.

2017-02-10 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-19545:
---

 Summary: Compilation error with method not found when build 
against Hadoop 2.6.0.
 Key: SPARK-19545
 URL: https://issues.apache.org/jira/browse/SPARK-19545
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Saisai Shao


{code}
./build/sbt -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
{code}

{code}
[error] 
/Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:249:
 value setRolledLogsIncludePattern is not a member of 
org.apache.hadoop.yarn.api.records.LogAggregationContext
[error]   logAggregationContext.setRolledLogsIncludePattern(includePattern)
[error] ^
[error] 
/Users/sshao/projects/apache-spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:251:
 value setRolledLogsExcludePattern is not a member of 
org.apache.hadoop.yarn.api.records.LogAggregationContext
[error] 
logAggregationContext.setRolledLogsExcludePattern(excludePattern)
[error]   ^
[error] two errors found
{code}
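
One conceivable way to keep a single build working across Hadoop versions is to call the missing 
setters reflectively and skip them when the running LogAggregationContext does not provide them. 
The sketch below is an assumption-level illustration based only on the method names in the 
compile error above; it is not the actual patch for this issue.

{code}
// Assumption-level sketch: invoke the rolled-log setters via reflection so the
// code still compiles and runs against Hadoop 2.6.0, where they do not exist.
import java.lang.reflect.Method
import org.apache.hadoop.yarn.api.records.LogAggregationContext

object RolledLogsCompat {
  // Look up a String-taking setter on LogAggregationContext, if this Hadoop version has it.
  private def findSetter(name: String): Option[Method] =
    try Some(classOf[LogAggregationContext].getMethod(name, classOf[String]))
    catch { case _: NoSuchMethodException => None }

  // On Hadoop 2.6.0 the setters are missing, so the patterns are silently skipped.
  def setRolledLogPatterns(ctx: LogAggregationContext,
                           includePattern: String,
                           excludePattern: String): Unit = {
    findSetter("setRolledLogsIncludePattern").foreach(_.invoke(ctx, includePattern))
    findSetter("setRolledLogsExcludePattern").foreach(_.invoke(ctx, excludePattern))
  }
}
{code}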



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org