[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-08-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146624#comment-16146624
 ] 

Shivaram Venkataraman commented on SPARK-21349:
---

Thanks for checking. In that case, I don't think we can do much about this
specific case. For RDDs created from the driver, it is inevitable that we need
to ship the data to the executors.

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when the task size exceeds a threshold
> (SPARK-2185). Although this is just a warning message, this issue tries to make
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 such warnings even in our unit tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20462) Spark-Kinesis Direct Connector

2017-08-29 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146614#comment-16146614
 ] 

Gaurav Shah commented on SPARK-20462:
-

related blog post: 
https://medium.com/@b23llc/exactly-once-data-processing-with-amazon-kinesis-and-spark-streaming-7e7f82303e4

> Spark-Kinesis Direct Connector 
> ---
>
> Key: SPARK-20462
> URL: https://issues.apache.org/jira/browse/SPARK-20462
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Lauren Moos
>
> I'd like to propose and vet the design for a direct connector between
> Spark and Kinesis.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-08-29 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146599#comment-16146599
 ] 

Dongjoon Hyun commented on SPARK-21349:
---

Yes. With fewer values, e.g. 24*365*1, the warning does not pop up.

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when the task size exceeds a threshold
> (SPARK-2185). Although this is just a warning message, this issue tries to make
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 such warnings even in our unit tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2017-08-29 Thread Chunsheng Ji (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146584#comment-16146584
 ] 

Chunsheng Ji commented on SPARK-21856:
--

I am working on it.

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> SPARK-12664 exposed probability in MultilayerPerceptronClassifier, so the
> Python API also needs to be updated.
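For reference, a minimal Scala sketch of the behaviour added by SPARK-12664 that the Python API would mirror; the toy dataset, layer sizes and column selection are illustrative assumptions, not part of this issue.

{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.Vectors

// Tiny two-feature, two-class toy dataset (illustrative only).
val train = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 0.0)),
  (1.0, Vectors.dense(1.0, 1.0))
)).toDF("label", "features")

// Layers: 2 inputs, one hidden layer of 4 units, 2 output classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(2, 4, 2))
  .setMaxIter(10)

val model = mlp.fit(train)

// After SPARK-12664 the model emits a probability column like other classifiers;
// exposing the same column (and probabilityCol param) from PySpark is the goal here.
model.transform(train).select("features", "probability", "prediction").show(false)
{code}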



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-08-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146585#comment-16146585
 ] 

Shivaram Venkataraman commented on SPARK-21349:
---

I think this might be because we create a ParallelCollectionRDD for the statement
`(1 to (24*365*3))` -- the values are stored in the partitions of this RDD [1].
[~dongjoon] If you use fewer values (say, 1 to 100) or more partitions (I'm
not sure how many partitions are created in this example), does the warning go
away?

[1] 
https://github.com/apache/spark/blob/e47f48c737052564e92903de16ff16707fae32c3/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L32
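A minimal spark-shell sketch of the "more partitions" variant of this suggestion; the slice count of 400 is an arbitrary illustrative value.

{code}
// Same driver-side range as in the report, but spread over many partitions so
// each ParallelCollectionPartition (and hence each serialized task) stays small.
val rdd = sc.parallelize(1 to (24 * 365 * 3), numSlices = 400)
rdd.getNumPartitions
rdd.count()
{code}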

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when the task size exceeds a threshold
> (SPARK-2185). Although this is just a warning message, this issue tries to make
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 such warnings even in our unit tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21854) Python interface for MLOR summary

2017-08-29 Thread Ming Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146579#comment-16146579
 ] 

Ming Jiang commented on SPARK-21854:


I can work on this, thanks!

> Python interface for MLOR summary
> -
>
> Key: SPARK-21854
> URL: https://issues.apache.org/jira/browse/SPARK-21854
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Python interface for MLOR summary



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2017-08-29 Thread Ming Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Jiang updated SPARK-21856:
---
Comment: was deleted

(was: I can work on it, thanks!)

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> SPARK-12664 exposed probability in MultilayerPerceptronClassifier, so the
> Python API also needs to be updated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-08-29 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146569#comment-16146569
 ] 

Dongjoon Hyun commented on SPARK-21349:
---

Hi, [~jiangxb] and all. I hit this issue again in another situation today, so
I want to share the sample case.

{code}
scala> val data = (1 to (24*365*3)).map(i => (i, s"$i", i % 2 == 
0)).toDF("col1", "part1", "part2")
17/08/29 21:07:49 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
data: org.apache.spark.sql.DataFrame = [col1: int, part1: string ... 1 more 
field]

scala> data.write.format("parquet").partitionBy("part1", 
"part2").mode("overwrite").saveAsTable("t")
17/08/29 21:08:04 WARN TaskSetManager: Stage 0 contains a task of very large 
size (190 KB). The maximum recommended task size is 100 KB.
17/08/29 21:09:34 WARN TaskSetManager: Stage 2 contains a task of very large 
size (233 KB). The maximum recommended task size is 100 KB.

scala> spark.version
res1: String = 2.3.0-SNAPSHOT
{code}
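One way to sidestep the warning in a case like this, rather than only raising the threshold, is to generate the rows on the executors instead of parallelizing a driver-side collection. A minimal sketch that mimics the schema above; the table name t2 is arbitrary.

{code}
// spark.range produces rows on the executors, so no large data blob has to
// ride along with each serialized task.
val data = spark.range(24 * 365 * 3)
  .selectExpr("cast(id as int) as col1", "cast(id as string) as part1", "id % 2 = 0 as part2")

data.write.format("parquet").partitionBy("part1", "part2").mode("overwrite").saveAsTable("t2")
{code}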


> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when the task size exceeds a threshold
> (SPARK-2185). Although this is just a warning message, this issue tries to make
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 such warnings even in our unit tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20886) HadoopMapReduceCommitProtocol to fail with message if FileOutputCommitter.getWorkPath==null

2017-08-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-20886:


Assignee: Steve Loughran

This was fixed in a way that allows the {{null}} case.

> HadoopMapReduceCommitProtocol to fail with message if 
> FileOutputCommitter.getWorkPath==null
> ---
>
> Key: SPARK-20886
> URL: https://issues.apache.org/jira/browse/SPARK-20886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Trivial
> Fix For: 2.3.0
>
>
> This is minor, and the root cause is my fault *elsewhere*, but it's the patch
> I used to track down the problem.
> If {{HadoopMapReduceCommitProtocol}} has a {{FileOutputCommitter}} for 
> committing things, and *somehow* that's been configured with a 
> {{JobAttemptContext}}, not a {{TaskAttemptContext}}, then the committer NPEs.
> A {{require()}} statement can validate the working path and so point the 
> blame at whoever's code is confused.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20886) HadoopMapReduceCommitProtocol to fail with message if FileOutputCommitter.getWorkPath==null

2017-08-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20886.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18111
[https://github.com/apache/spark/pull/18111]

> HadoopMapReduceCommitProtocol to fail with message if 
> FileOutputCommitter.getWorkPath==null
> ---
>
> Key: SPARK-20886
> URL: https://issues.apache.org/jira/browse/SPARK-20886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Steve Loughran
>Priority: Trivial
> Fix For: 2.3.0
>
>
> This is minor, and the root cause is my fault *elsewhere*, but it's the patch
> I used to track down the problem.
> If {{HadoopMapReduceCommitProtocol}} has a {{FileOutputCommitter}} for 
> committing things, and *somehow* that's been configured with a 
> {{JobAttemptContext}}, not a {{TaskAttemptContext}}, then the committer NPEs.
> A {{require()}} statement can validate the working path and so point the 
> blame at whoever's code is confused.
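A minimal sketch of the require()-style guard the description proposes (per the assignment note earlier in this thread, the eventual fix tolerates the null work path instead); the helper name and message are illustrative, not Spark's actual code.

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Fail fast with a pointed message instead of an NPE when the committer was set up
// with a JobAttemptContext and therefore has no work path.
def workPathOrFail(committer: FileOutputCommitter): Path = {
  val workPath = committer.getWorkPath
  require(workPath != null,
    s"$committer has no work path; it was probably created from a JobAttemptContext " +
      "rather than a TaskAttemptContext")
  workPath
}
{code}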



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21845) Make codegen fallback of expressions configurable

2017-08-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21845.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Make codegen fallback of expressions configurable
> -
>
> Key: SPARK-21845
> URL: https://issues.apache.org/jira/browse/SPARK-21845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> We should make the codegen fallback of expressions configurable. So far, it is
> always on. It might hide bugs when our codegen has compilation errors. Thus, we
> should also disable the codegen fallback when running test cases.
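For context, a hedged usage sketch, assuming the switch is exposed as a SQL conf key named spark.sql.codegen.fallback (the key name is an assumption here).

{code}
// Disable the fallback so codegen compilation bugs fail loudly instead of being
// masked by silently falling back to interpreted evaluation (e.g. in test runs).
spark.conf.set("spark.sql.codegen.fallback", "false")

// ... run the query under test ...

// Re-enable the fallback for normal operation.
spark.conf.set("spark.sql.codegen.fallback", "true")
{code}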



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21872) Is job duration value of Spark Jobs page on Web UI correct?

2017-08-29 Thread iamhumanbeing (JIRA)
iamhumanbeing created SPARK-21872:
-

 Summary: Is job duration value of Spark Jobs page on Web UI 
correct? 
 Key: SPARK-21872
 URL: https://issues.apache.org/jira/browse/SPARK-21872
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: iamhumanbeing
Priority: Minor


I submitted two Spark jobs at the same time, but only one actually started
running; the other was waiting for resources. The Web UI, however, displayed both
jobs as running, and the duration value of the waiting job kept increasing.
So Job 7 only ran for 14 s, but its duration value is 29 s.

Active Jobs (2)
Job Id ▾ | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total
7 | (kill) count at <console>:30 | 2017/08/30 11:33:46 | 7 s | 0/1 | 0/100
6 | (kill) count at <console>:30 | 2017/08/30 11:33:46 | 8 s | 0/2 | 15/127 (2 running)

After the jobs finished:
7 | count at <console>:30 | 2017/08/30 11:33:46 | 29 s | 1/1 | 100/100
6 | count at <console>:30 | 2017/08/30 11:33:46 | 16 s | 1/1 (1 skipped) | 27/27 (100 skipped)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146494#comment-16146494
 ] 

Apache Spark commented on SPARK-17139:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/19072

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> Add a model summary to multinomial logistic regression using the same interface as
> in other ML models.
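A minimal sketch of the intended usage, assuming the multinomial training summary follows the existing summary interface once this change lands; the toy dataset is illustrative.

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Tiny three-class toy dataset, for illustration only.
val train = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0)),
  (2.0, Vectors.dense(1.0, 1.0))
)).toDF("label", "features")

val model = new LogisticRegression().setFamily("multinomial").fit(train)

// Same summary access pattern as other ML models; making this work for the
// multinomial case is what this issue adds.
val summary = model.summary
println(summary.totalIterations)
println(summary.objectiveHistory.mkString(", "))
{code}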



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20711) MultivariateOnlineSummarizer incorrect min/max for NaN value

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20711:


Assignee: (was: Apache Spark)

> MultivariateOnlineSummarizer incorrect min/max for NaN value
> 
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.
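For illustration, a self-contained sketch of one reasonable NaN-aware semantic (not the actual MultivariateOnlineSummarizer internals): skip NaN samples when updating the running min/max, so an all-NaN column reports NaN instead of the sentinels.

{code}
// Standalone illustration, independent of MLlib. An all-NaN column leaves the
// sentinels untouched and is reported as NaN, instead of leaking
// Double.MaxValue / Double.MinValue as in the bug above.
class MinMaxTracker {
  private var curMin = Double.MaxValue
  private var curMax = Double.MinValue
  private var seen = 0L

  def add(v: Double): Unit = {
    if (!v.isNaN) {            // ignore NaN samples entirely
      if (v < curMin) curMin = v
      if (v > curMax) curMax = v
      seen += 1
    }
  }

  def min: Double = if (seen == 0) Double.NaN else curMin
  def max: Double = if (seen == 0) Double.NaN else curMax
}

val t = new MinMaxTracker
t.add(Double.NaN)
t.add(Double.NaN)
println((t.min, t.max))   // (NaN, NaN) rather than (MaxValue, MinValue)
{code}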



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21871) Check actual bytecode size when compiling generated code

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21871:


Assignee: Apache Spark

> Check actual bytecode size when compiling generated code
> 
>
> Key: SPARK-21871
> URL: https://issues.apache.org/jira/browse/SPARK-21871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> In SPARK-21603, we added code to give up code compilation and use interpreted
> execution in SparkPlan if the number of lines in generated functions goes over
> maxLinesPerFunction. But we already have code to collect metrics for the
> compiled bytecode size in the `CodeGenerator` object, so I think we could easily
> reuse that code for this purpose.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21871) Check actual bytecode size when compiling generated code

2017-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146493#comment-16146493
 ] 

Apache Spark commented on SPARK-21871:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19083

> Check actual bytecode size when compiling generated code
> 
>
> Key: SPARK-21871
> URL: https://issues.apache.org/jira/browse/SPARK-21871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-21603, we added code to give up code compilation and use interpreted
> execution in SparkPlan if the number of lines in generated functions goes over
> maxLinesPerFunction. But we already have code to collect metrics for the
> compiled bytecode size in the `CodeGenerator` object, so I think we could easily
> reuse that code for this purpose.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20711) MultivariateOnlineSummarizer incorrect min/max for NaN value

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20711:


Assignee: Apache Spark

> MultivariateOnlineSummarizer incorrect min/max for NaN value
> 
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21871) Check actual bytecode size when compiling generated code

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21871:


Assignee: (was: Apache Spark)

> Check actual bytecode size when compiling generated code
> 
>
> Key: SPARK-21871
> URL: https://issues.apache.org/jira/browse/SPARK-21871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-21603, we added code to give up code compilation and use interpreted
> execution in SparkPlan if the number of lines in generated functions goes over
> maxLinesPerFunction. But we already have code to collect metrics for the
> compiled bytecode size in the `CodeGenerator` object, so I think we could easily
> reuse that code for this purpose.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20711) MultivariateOnlineSummarizer incorrect min/max for NaN value

2017-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146492#comment-16146492
 ] 

Apache Spark commented on SPARK-20711:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19084

> MultivariateOnlineSummarizer incorrect min/max for NaN value
> 
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20711) MultivariateOnlineSummarizer incorrect min/max for NaN value

2017-08-29 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-20711:
-
Summary: MultivariateOnlineSummarizer incorrect min/max for NaN value  
(was: MultivariateOnlineSummarizer incorrect min/max for identical NaN feature)

> MultivariateOnlineSummarizer incorrect min/max for NaN value
> 
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21871) Check actual bytecode size when compiling generated code

2017-08-29 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-21871:


 Summary: Check actual bytecode size when compiling generated code
 Key: SPARK-21871
 URL: https://issues.apache.org/jira/browse/SPARK-21871
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Takeshi Yamamuro
Priority: Minor


In SPARK-21603, we added code to give up code compilation and use interpreted
execution in SparkPlan if the number of lines in generated functions goes over
maxLinesPerFunction. But we already have code to collect metrics for the compiled
bytecode size in the `CodeGenerator` object, so I think we could easily reuse that
code for this purpose.
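A heavily simplified sketch of the idea, with illustrative names rather than the real CodeGenerator API: take the largest generated method's measured bytecode size and fall back to interpreted execution when it exceeds HotSpot's limit.

{code}
// Conceptual sketch only (names are illustrative, not Spark's internal API).
// HotSpot refuses to JIT-compile methods larger than roughly 8000 bytes of bytecode,
// so the measured size of the largest generated method is a natural thing to check.
val hugeMethodLimit = 8000

// `maxMethodBytecodeSize` would come from the metrics the CodeGenerator object already
// collects when it compiles the generated source with Janino.
def chooseExecutionPath(maxMethodBytecodeSize: Int): String =
  if (maxMethodBytecodeSize > hugeMethodLimit) "interpreted"  // as SPARK-21603 does for long functions
  else "codegen"

println(chooseExecutionPath(12356))  // the KURTOSIS example in SPARK-21870 -> "interpreted"
{code}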



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20711) MultivariateOnlineSummarizer incorrect min/max for identical NaN feature

2017-08-29 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146480#comment-16146480
 ] 

zhengruifeng commented on SPARK-20711:
--

[~WeichenXu123] I notice that you have just fixed a bug in 
{{MultivariateOnlineSummarizer.variance}}. I think the computation of 
{{Min/Max}} in {{MultivariateOnlineSummarizer}} may also be wrong.

> MultivariateOnlineSummarizer incorrect min/max for identical NaN feature
> 
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20711) MultivariateOnlineSummarizer incorrect min/max for identical NaN feature

2017-08-29 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reopened SPARK-20711:
--

> MultivariateOnlineSummarizer incorrect min/max for identical NaN feature
> 
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21862) Add overflow check in PCA

2017-08-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21862:
--
Shepherd: Joseph K. Bradley

> Add overflow check in PCA
> -
>
> Key: SPARK-21862
> URL: https://issues.apache.org/jira/browse/SPARK-21862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> We should add overflow check in PCA, otherwise it is possible to throw 
> `NegativeArraySizeException` when `k` and `numFeatures` are too large.
> The overflow checking formula is here:
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87
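A minimal sketch of the kind of guard meant here; the exact condition and message are assumptions, and the linked breeze svd code performs an analogous check.

{code}
// Illustrative pre-check before running PCA: the dense (numFeatures x k) projection
// matrix is backed by a single array, so its element count must fit in an Int.
// The exact condition Spark ends up using may differ.
def checkPcaSize(k: Int, numFeatures: Int): Unit = {
  require(k > 0 && numFeatures > 0, "k and numFeatures must be positive")
  require(k.toLong * numFeatures <= Int.MaxValue,
    s"k * numFeatures = ${k.toLong * numFeatures} exceeds Int.MaxValue; " +
      "this would overflow and surface as a NegativeArraySizeException")
}

checkPcaSize(k = 10, numFeatures = 1000)          // fine
// checkPcaSize(k = 100000, numFeatures = 100000) // would fail fast instead of overflowing
{code}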



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21862) Add overflow check in PCA

2017-08-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21862:
-

Assignee: Weichen Xu

> Add overflow check in PCA
> -
>
> Key: SPARK-21862
> URL: https://issues.apache.org/jira/browse/SPARK-21862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> We should add overflow check in PCA, otherwise it is possible to throw 
> `NegativeArraySizeException` when `k` and `numFeatures` are too large.
> The overflow checking formula is here:
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21870) Split codegen'd aggregation code into small functions for the HotSpot

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21870:


Assignee: (was: Apache Spark)

> Split codegen'd aggregation code into small functions for the HotSpot
> -
>
> Key: SPARK-21870
> URL: https://issues.apache.org/jira/browse/SPARK-21870
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-21603, we got a performance regression when HotSpot didn't compile
> overly long functions (the limit is 8000 bytes of bytecode).
> I checked and found that the codegen of `HashAggregateExec` frequently goes over
> the limit, for example:
> {code}
> spark.range(1000).selectExpr("id % 1024 AS a", "id AS b").write.saveAsTable("t")
> sql("SELECT a, KURTOSIS(b) FROM t GROUP BY a")
> {code}
> This query goes over the limit and the actual bytecode size is `12356`.
> So, it might be better to split the aggregation code into pieces.
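For intuition, a toy Scala sketch of the splitting strategy (the real change would operate on the generated Java source inside HashAggregateExec, not on handwritten Scala like this): emit one small update method per aggregate expression instead of one huge method, so each stays under HotSpot's JIT limit.

{code}
// Toy illustration of "one big method" vs "many small methods"; the generated Java
// code would be split along the same lines, one helper per aggregate expression.
final case class InputRow(a: Long, b: Long)

// Before: a single update method that grows with the number/complexity of aggregates.
def updateAllInOne(state: Array[Double], r: InputRow): Unit = {
  state(0) += r.b                  // SUM-like update
  state(1) += r.b * r.b            // part of a higher-moment update (e.g. for KURTOSIS)
  state(2) += r.b * r.b * r.b      // ...and so on; real codegen inlines far more than this
}

// After: each aggregate gets its own small method, so no single method grows too
// large for HotSpot to JIT-compile.
def updateSum(state: Array[Double], r: InputRow): Unit = { state(0) += r.b }
def updateM2(state: Array[Double], r: InputRow): Unit = { state(1) += r.b * r.b }
def updateM3(state: Array[Double], r: InputRow): Unit = { state(2) += r.b * r.b * r.b }

def updateSplit(state: Array[Double], r: InputRow): Unit = {
  updateSum(state, r); updateM2(state, r); updateM3(state, r)
}
{code}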



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21870) Split codegen'd aggregation code into small functions for the HotSpot

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21870:


Assignee: Apache Spark

> Split codegen'd aggregation code into small functions for the HotSpot
> -
>
> Key: SPARK-21870
> URL: https://issues.apache.org/jira/browse/SPARK-21870
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> In SPARK-21603, we got a performance regression when HotSpot didn't compile
> overly long functions (the limit is 8000 bytes of bytecode).
> I checked and found that the codegen of `HashAggregateExec` frequently goes over
> the limit, for example:
> {code}
> spark.range(1000).selectExpr("id % 1024 AS a", "id AS b").write.saveAsTable("t")
> sql("SELECT a, KURTOSIS(b) FROM t GROUP BY a")
> {code}
> This query goes over the limit and the actual bytecode size is `12356`.
> So, it might be better to split the aggregation code into pieces.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21870) Split codegen'd aggregation code into small functions for the HotSpot

2017-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146424#comment-16146424
 ] 

Apache Spark commented on SPARK-21870:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19082

> Split codegen'd aggregation code into small functions for the HotSpot
> -
>
> Key: SPARK-21870
> URL: https://issues.apache.org/jira/browse/SPARK-21870
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-21603, we got a performance regression when HotSpot didn't compile
> overly long functions (the limit is 8000 bytes of bytecode).
> I checked and found that the codegen of `HashAggregateExec` frequently goes over
> the limit, for example:
> {code}
> spark.range(1000).selectExpr("id % 1024 AS a", "id AS b").write.saveAsTable("t")
> sql("SELECT a, KURTOSIS(b) FROM t GROUP BY a")
> {code}
> This query goes over the limit and the actual bytecode size is `12356`.
> So, it might be better to split the aggregation code into pieces.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21870) Split codegen'd aggregation code into small functions for the HotSpot

2017-08-29 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21870:
-
Description: 
In SPARK-21603, we got a performance regression when HotSpot didn't compile
overly long functions (the limit is 8000 bytes of bytecode).
I checked and found that the codegen of `HashAggregateExec` frequently goes over
the limit, for example:

{code}
spark.range(1000).selectExpr("id % 1024 AS a", "id AS b").write.saveAsTable("t")
sql("SELECT a, KURTOSIS(b) FROM t GROUP BY a")
{code}

This query goes over the limit and the actual bytecode size is `12356`.
So, it might be better to split the aggregation code into pieces.


  was:
In SPARK-21603, we got a performance regression when HotSpot didn't compile
overly long functions (the limit is 8000 bytes of bytecode).
I checked and found that the codegen of `HashAggregateExec` frequently goes over
the limit, for example:

```
spark.range(1000).selectExpr("id % 1024 AS a", "id AS b").write.saveAsTable("t")
sql("SELECT a, KURTOSIS(b) FROM t GROUP BY a")
```

This query goes over the limit and the actual bytecode size is `12356`.
So, it might be better to split the aggregation code into pieces.



> Split codegen'd aggregation code into small functions for the HotSpot
> -
>
> Key: SPARK-21870
> URL: https://issues.apache.org/jira/browse/SPARK-21870
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-21603, we got a performance regression when HotSpot didn't compile
> overly long functions (the limit is 8000 bytes of bytecode).
> I checked and found that the codegen of `HashAggregateExec` frequently goes over
> the limit, for example:
> {code}
> spark.range(1000).selectExpr("id % 1024 AS a", "id AS b").write.saveAsTable("t")
> sql("SELECT a, KURTOSIS(b) FROM t GROUP BY a")
> {code}
> This query goes over the limit and the actual bytecode size is `12356`.
> So, it might be better to split the aggregation code into pieces.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21870) Split codegen'd aggregation code into small functions for the HotSpot

2017-08-29 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21870:
-
Description: 
In SPARK-21603, we got a performance regression when HotSpot didn't compile
overly long functions (the limit is 8000 bytes of bytecode).
I checked and found that the codegen of `HashAggregateExec` frequently goes over
the limit, for example:

```
spark.range(1000).selectExpr("id % 1024 AS a", "id AS b").write.saveAsTable("t")
sql("SELECT a, KURTOSIS(b) FROM t GROUP BY a")
```

This query goes over the limit and the actual bytecode size is `12356`.
So, it might be better to split the aggregation code into pieces.


> Split codegen'd aggregation code into small functions for the HotSpot
> -
>
> Key: SPARK-21870
> URL: https://issues.apache.org/jira/browse/SPARK-21870
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-21603, we got a performance regression when HotSpot didn't compile
> overly long functions (the limit is 8000 bytes of bytecode).
> I checked and found that the codegen of `HashAggregateExec` frequently goes over
> the limit, for example:
> ```
> spark.range(1000).selectExpr("id % 1024 AS a", "id AS b").write.saveAsTable("t")
> sql("SELECT a, KURTOSIS(b) FROM t GROUP BY a")
> ```
> This query goes over the limit and the actual bytecode size is `12356`.
> So, it might be better to split the aggregation code into pieces.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21870) Split codegen'd aggregation code into small functions for the HotSpot

2017-08-29 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-21870:


 Summary: Split codegen'd aggregation code into small functions for 
the HotSpot
 Key: SPARK-21870
 URL: https://issues.apache.org/jira/browse/SPARK-21870
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Takeshi Yamamuro
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster

2017-08-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-18278:
--
Labels: SPIP  (was: )

> SPIP: Support native submission of spark jobs to a kubernetes cluster
> -
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
>  Labels: SPIP
> Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision 
> 2 (1).pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster. The submitted application runs
> in a driver executing in a Kubernetes pod, and executor lifecycles are also
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-21866:
--
Labels: SPIP  (was: )

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about the origin of the image. The content of this is 

[jira] [Commented] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2017-08-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146347#comment-16146347
 ] 

Shixiong Zhu commented on SPARK-21869:
--

[~scrapco...@gmail.com] do you want to take this task?

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2017-08-29 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-21869:


 Summary: A cached Kafka producer should not be closed if any task 
is using it.
 Key: SPARK-21869
 URL: https://issues.apache.org/jira/browse/SPARK-21869
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Shixiong Zhu


Right now a cached Kafka producer may be closed if a large task uses it for 
more than 10 minutes.
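One common way to address this, sketched below with made-up class names rather than Spark's actual producer cache: reference-count each cached producer so that cache eviction only closes it once no task still holds it.

{code}
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}
import org.apache.kafka.clients.producer.KafkaProducer

// Illustrative wrapper, not Spark's actual cache: close() from the eviction path
// is deferred until the last in-flight task releases the producer.
class RefCountedProducer[K, V](val producer: KafkaProducer[K, V]) {
  private val refCount = new AtomicInteger(0)
  private val evicted = new AtomicBoolean(false)
  private val closed = new AtomicBoolean(false)

  private def closeOnce(): Unit =
    if (closed.compareAndSet(false, true)) producer.close()

  // A task grabs the producer before sending records...
  def acquire(): KafkaProducer[K, V] = { refCount.incrementAndGet(); producer }

  // ...and releases it when done; the last release after eviction closes the producer.
  def release(): Unit =
    if (refCount.decrementAndGet() == 0 && evicted.get()) closeOnce()

  // Called by the cache's expiry/eviction path instead of closing directly.
  def markEvicted(): Unit = {
    evicted.set(true)
    if (refCount.get() == 0) closeOnce()
  }
}
{code}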



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21864) Spark 2.0.1 - SaveMode.Overwrite does not work while saving data to memsql

2017-08-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146287#comment-16146287
 ] 

Sean Owen commented on SPARK-21864:
---

My first guess is that it's a MemSQL problem, if you're not able to reproduce
it without MemSQL. If so, no, this is not the place. That's why I'm saying this
isn't sufficiently narrowed down to open a JIRA here.

> Spark 2.0.1 - SaveMode.Overwrite does not work while saving data to memsql
> --
>
> Key: SPARK-21864
> URL: https://issues.apache.org/jira/browse/SPARK-21864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Vidya
>
> We are writing Datasets and DataFrames to MemSQL via the memsql connector, but
> SaveMode.Overwrite does not work: basically, it is appending the data to the
> table.
> {code:java}
> schemaEthnicities.write.mode(SaveMode.Overwrite).format("com.memsql.spark.connector").saveAsTable("CD_ETHNICITY_SPARK")
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19307) SPARK-17387 caused ignorance of conf object passed to SparkContext:

2017-08-29 Thread Charlie Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146280#comment-16146280
 ] 

Charlie Tsai edited comment on SPARK-19307 at 8/29/17 10:48 PM:


Hi,

I am using 2.2.0 but find that command line {{--conf}} arguments are still not 
available when the {{SparkConf()}} object is instantiated. As a result, I can't 
check what has already been set using the command line {{--conf}} arguments in 
my driver and set additional configuration using {{setIfMissing}}. Instead, 
{{setIfMissing}} effectively overwrites whatever is passed in through the CLI.

For example, if my job is:
{code}
# debug.py

import pyspark

if __name__ == '__main__':
print(pyspark.SparkConf()._jconf)# is `None` but should include 
`--conf` arguments

default_conf = {
"spark.dynamicAllocation.maxExecutors": "36",
"spark.yarn.executor.memoryOverhead": "1500",
}

# these are supposed to be set only if not provided by the CLI args
spark_conf = pyspark.SparkConf()
for (k, v) in default_conf.items():
spark_conf.setIfMissing(k, v)
{code}

Running
{code}
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.yarn.executor.memoryOverhead=2500 \
--conf spark.dynamicAllocation.maxExecutors=128 \
debug.py
{code}

In 1.6.2 the CLI args take precedence, whereas in 2.2.0
{{SparkConf().getAll()}} appears empty even though the {{--conf}} args were already
passed in.

Is this a separate problem or was this intended to be addressed by this ticket?


was (Author: ctsai):
Hi,

I am using 2.2.0 but find that command line {{--conf}} arguments are still not 
available when the {{SparkConf()}} object is instantiated. As a result, I can't 
check what has already been set using the command line {{--conf}} arguments in 
my driver and set additional configuration using {{setIfMissing}}. Instead, 
{{setIfMissing}} effectively overwrites whatever is passed in through the CLI.

For example, if my job is:
{code}
# debug.py

import pyspark

if __name__ == '__main__':
print(pyspark.SparkConf()._jconf)# is `None` but should include 
`--conf` arguments

default_conf = {
"spark.dynamicAllocation.maxExecutors": "36",
"spark.yarn.executor.memoryOverhead": "1500",
}

# these are supposed to be set only if not provided by the CLI args
spark_conf = pyspark.SparkConf()
for (k, v) in default_conf.items():
spark_conf.setIfMissing(k, v)
{code}

Running
{code}
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.yarn.executor.memoryOverhead=2500 \
--conf spark.dynamicAllocation.maxExecutors=128 \
debug.py
{code}

In 1.6.2 the CLI args take precedence, whereas in 2.2.0
{{SparkConf().getAll()}} appears empty even though the {{--conf}} args were already
passed in.

> SPARK-17387 caused ignorance of conf object passed to SparkContext:
> ---
>
> Key: SPARK-19307
> URL: https://issues.apache.org/jira/browse/SPARK-19307
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: yuriy_hupalo
>Assignee: Marcelo Vanzin
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19307.patch
>
>
> After the patch for SPARK-17387 was applied, the SparkConf object is ignored when
> launching a SparkContext programmatically via Python from spark-submit:
> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L128:
> when we run a Python SparkContext(conf=xxx) from spark-submit,
> conf is set but conf._jconf is None (), so the
> conf object passed as an argument is ignored (and used only when we are
> launching the java_gateway).
> How to fix:
> python/pyspark/context.py:132
> {code:title=python/pyspark/context.py:132}
> if conf is not None and conf._jconf is not None:
>     # conf has been initialized in JVM properly, so use conf directly. This represent the
>     # scenario that JVM has been launched before SparkConf is created (e.g. SparkContext is
>     # created and then stopped, and we create a new SparkConf and new SparkContext again)
>     self._conf = conf
> else:
>     self._conf = SparkConf(_jvm=SparkContext._jvm)
> +   if conf:
> +       for key, value in conf.getAll():
> +           self._conf.set(key, value)
> +           print(key, value)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21849) Make the serializer function more robust

2017-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21849.
---
Resolution: Not A Problem

> Make the serializer function more robust
> 
>
> Key: SPARK-21849
> URL: https://issues.apache.org/jira/browse/SPARK-21849
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DjvuLee
>Priority: Trivial
>
> Make sure the `close` function is called in the `serialize` function.
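A minimal sketch of the pattern being asked for, using a generic java.io stream rather than the specific Spark serializer this issue refers to: wrap the write in try/finally so close() runs even when serialization throws.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Generic illustration of "always close the stream in serialize".
def serialize(obj: AnyRef): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  try {
    out.writeObject(obj)
    out.flush()
  } finally {
    out.close()   // guaranteed to run even if writeObject throws
  }
  bytes.toByteArray
}
{code}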



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21813) [core] Modify TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES comments

2017-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21813:
-

Assignee: he.qiao

> [core] Modify TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES comments
> 
>
> Key: SPARK-21813
> URL: https://issues.apache.org/jira/browse/SPARK-21813
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: he.qiao
>Assignee: he.qiao
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The comment on the variable "TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES" is wrong: it 
> shouldn't say 2^32-1 but 2^31-1, i.e. the maximum value of an int.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21813) [core] Modify TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES comments

2017-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21813.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19025
[https://github.com/apache/spark/pull/19025]

> [core] Modify TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES comments
> 
>
> Key: SPARK-21813
> URL: https://issues.apache.org/jira/browse/SPARK-21813
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: he.qiao
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The comment on the variable "TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES" is wrong: it 
> shouldn't say 2^32-1 but 2^31-1, i.e. the maximum value of an int.
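
(For reference on the arithmetic above; a quick check, not Spark code:)
{code}
# The maximum value of a signed 32-bit int (Java's Integer.MAX_VALUE) is 2^31 - 1.
# 2^32 - 1 is the unsigned 32-bit maximum, which does not fit in an int.
assert 2**31 - 1 == 2147483647
assert 2**32 - 1 == 4294967295
{code}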



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19307) SPARK-17387 caused ignorance of conf object passed to SparkContext:

2017-08-29 Thread Charlie Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146280#comment-16146280
 ] 

Charlie Tsai edited comment on SPARK-19307 at 8/29/17 10:43 PM:


Hi,

I am using 2.2.0 but find that command line {{--conf}} arguments are still not 
available when the {{SparkConf()}} object is instantiated. As a result, I can't 
check what has already been set using the command line {{--conf}} arguments in 
my driver and set additional configuration using {{setIfMissing}}. Instead, 
{{setIfMissing}} effectively overwrites whatever is passed in through the CLI.

For example, if my job is:
{code}
# debug.py

import pyspark

if __name__ == '__main__':
print(pyspark.SparkConf()._jconf)# is `None` but should include 
`--conf` arguments

default_conf = {
"spark.dynamicAllocation.maxExecutors": "36",
"spark.yarn.executor.memoryOverhead": "1500",
}

# these are supposed to be set only if not provided by the CLI args
spark_conf = pyspark.SparkConf()
for (k, v) in default_conf.items():
spark_conf.setIfMissing(k, v)
{code}

Running
{code}
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.yarn.executor.memoryOverhead=2500 \
--conf spark.dynamicAllocation.maxExecutors=128 \
debug.py
{code}

In 1.6.2 the CLI args take precedence, whereas in 2.2.0, {{SparkConf().getAll()}} 
appears empty even though the {{--conf}} args were already passed in.


was (Author: ctsai):
Hi,

I am using 2.2.0 but find that command line {{--conf}} arguments are still not 
available when the {{SparkConf()}} object is instantiated. As a result, I can't 
check what has already been set using the command line {{--conf}} arguments in 
my driver and set additional configuration using {{setIfMissing}}. Instead, 
{{setIfMissing}} effectively overwrites whatever is passed in through the CLI.

For example, if my job is:
{code}
# debug.py

import pyspark

if __name__ == '__main__':
print(pyspark.SparkConf()._jconf)# is `None` but should include 
`--conf` arguments

default_conf = {
"spark.dynamicAllocation.maxExecutors": "36",
"spark.yarn.executor.memoryOverhead": "1500",
}

# these are supposed to be set only if not provided by the CLI args
spark_conf = pyspark.SparkConf()
for (k, v) in default_conf.items():
spark_conf.setIfMissing(k, v)
{code}

Running
{code}
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.yarn.executor.memoryOverhead=2500 \
--conf spark.dynamicAllocation.maxExecutors=128 \
debug.py
{code}

In 1.6.2 the CLI args take precedence, whereas in 2.2.0, {{SparkConf().getAll()}} 
appears empty even though the {{--conf}} args were already passed in.

> SPARK-17387 caused ignorance of conf object passed to SparkContext:
> ---
>
> Key: SPARK-19307
> URL: https://issues.apache.org/jira/browse/SPARK-19307
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: yuriy_hupalo
>Assignee: Marcelo Vanzin
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19307.patch
>
>
> after patch SPARK-17387 was applied -- Sparkconf object is ignored when 
> launching SparkContext programmatically via python from spark-submit:
> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L128:
> in case when we are running python SparkContext(conf=xxx) from spark-submit:
> conf is set, conf._jconf is None ()
> passed as arg  conf object is ignored (and used only when we are 
> launching java_gateway).
> how to fix:
> python/pyspark/context.py:132
> {code:title=python/pyspark/context.py:132}
> if conf is not None and conf._jconf is not None:
> # conf has been initialized in JVM properly, so use conf 
> directly. This represent the
> # scenario that JVM has been launched before SparkConf is created 
> (e.g. SparkContext is
> # created and then stopped, and we create a new SparkConf and new 
> SparkContext again)
> self._conf = conf
> else:
> self._conf = SparkConf(_jvm=SparkContext._jvm)
> + if conf:
> + for key, value in conf.getAll():
> + self._conf.set(key,value)
> + print(key,value)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19307) SPARK-17387 caused ignorance of conf object passed to SparkContext:

2017-08-29 Thread Charlie Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146280#comment-16146280
 ] 

Charlie Tsai edited comment on SPARK-19307 at 8/29/17 10:43 PM:


Hi,

I am using 2.2.0 but find that command line {{--conf}} arguments are still not 
available when the {{SparkConf()}} object is instantiated. As a result, I can't 
check what has already been set using the command line {{--conf}} arguments in 
my driver and set additional configuration using {{setIfMissing}}. Instead, 
{{setIfMissing}} effectively overwrites whatever is passed in through the CLI.

For example, if my job is:
{code}
# debug.py

import pyspark

if __name__ == '__main__':
print(pyspark.SparkConf()._jconf)# is `None` but should include 
`--conf` arguments

default_conf = {
"spark.dynamicAllocation.maxExecutors": "36",
"spark.yarn.executor.memoryOverhead": "1500",
}

# these are supposed to be set only if not provided by the CLI args
spark_conf = pyspark.SparkConf()
for (k, v) in default_conf.items():
spark_conf.setIfMissing(k, v)
{code}

Running
{code}
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.yarn.executor.memoryOverhead=2500 \
--conf spark.dynamicAllocation.maxExecutors=128 \
debug.py
{code}

In 1.6.2 the CLI args take precedence, whereas in 2.2.0, {{SparkConf().getAll()}} 
appears empty even though the {{--conf}} args were already passed in.


was (Author: ctsai):
Hi,

I am using 2.2.0 but find that command line {{--conf}} arguments are still not 
available when the {{SparkConf()}} object is instantiated. As a result, I can't 
check what has already been set using the command line {{--conf}} arguments in 
my driver and set additional configuration using {{setIfMissing}}. Instead, 
{{setIfMissing}} effectively overwrites whatever is passed in through the CLI.

For example, if my job is:
{code}
# debug.py

import pyspark

if __name__ == '__main__':
print(pyspark.SparkConf()._jconf)# is `None`

default_conf = {
"spark.dynamicAllocation.maxExecutors": "36",
"spark.yarn.executor.memoryOverhead": "1500",
}

# these are supposed to be set only if not provided by the CLI args
spark_conf = pyspark.SparkConf()
for (k, v) in default_conf.items():
spark_conf.setIfMissing(k, v)
{code}

Running
{code}
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.yarn.executor.memoryOverhead=2500 \
--conf spark.dynamicAllocation.maxExecutors=128 \
debug.py
{code}

In 1.6.2 the CLI args take precedence, whereas in 2.2.0, {{SparkConf().getAll()}} 
appears empty even though the {{--conf}} args were already passed in.

> SPARK-17387 caused ignorance of conf object passed to SparkContext:
> ---
>
> Key: SPARK-19307
> URL: https://issues.apache.org/jira/browse/SPARK-19307
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: yuriy_hupalo
>Assignee: Marcelo Vanzin
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19307.patch
>
>
> after patch SPARK-17387 was applied -- Sparkconf object is ignored when 
> launching SparkContext programmatically via python from spark-submit:
> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L128:
> in case when we are running python SparkContext(conf=xxx) from spark-submit:
> conf is set, conf._jconf is None ()
> passed as arg  conf object is ignored (and used only when we are 
> launching java_gateway).
> how to fix:
> python/pyspark/context.py:132
> {code:title=python/pyspark/context.py:132}
> if conf is not None and conf._jconf is not None:
> # conf has been initialized in JVM properly, so use conf 
> directly. This represent the
> # scenario that JVM has been launched before SparkConf is created 
> (e.g. SparkContext is
> # created and then stopped, and we create a new SparkConf and new 
> SparkContext again)
> self._conf = conf
> else:
> self._conf = SparkConf(_jvm=SparkContext._jvm)
> + if conf:
> + for key, value in conf.getAll():
> + self._conf.set(key,value)
> + print(key,value)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19307) SPARK-17387 caused ignorance of conf object passed to SparkContext:

2017-08-29 Thread Charlie Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146280#comment-16146280
 ] 

Charlie Tsai commented on SPARK-19307:
--

Hi,

I am using 2.2.0 but find that command line {{--conf}} arguments are still not 
available when the {{SparkConf()}} object is instantiated. As a result, I can't 
check what has already been set using the command line {{--conf}} arguments in 
my driver and set additional configuration using {{setIfMissing}}. Instead, 
{{setIfMissing}} effectively overwrites whatever is passed in through the CLI.

For example, if my job is:
{code}
# debug.py

import pyspark

if __name__ == '__main__':
print(pyspark.SparkConf()._jconf)# is `None`

default_conf = {
"spark.dynamicAllocation.maxExecutors": "36",
"spark.yarn.executor.memoryOverhead": "1500",
}

# these are supposed to be set only if not provided by the CLI args
spark_conf = pyspark.SparkConf()
for (k, v) in default_conf.items():
spark_conf.setIfMissing(k, v)
{code}

Running
{code}
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.yarn.executor.memoryOverhead=2500 \
--conf spark.dynamicAllocation.maxExecutors=128 \
debug.py
{code}

In 1.6.2 the CLI args take precedence, whereas in 2.2.0, {{SparkConf().getAll()}} 
appears empty even though the {{--conf}} args were already passed in.

> SPARK-17387 caused ignorance of conf object passed to SparkContext:
> ---
>
> Key: SPARK-19307
> URL: https://issues.apache.org/jira/browse/SPARK-19307
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: yuriy_hupalo
>Assignee: Marcelo Vanzin
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19307.patch
>
>
> after patch SPARK-17387 was applied -- Sparkconf object is ignored when 
> launching SparkContext programmatically via python from spark-submit:
> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L128:
> in case when we are running python SparkContext(conf=xxx) from spark-submit:
> conf is set, conf._jconf is None ()
> passed as arg  conf object is ignored (and used only when we are 
> launching java_gateway).
> how to fix:
> python/pyspark/context.py:132
> {code:title=python/pyspark/context.py:132}
> if conf is not None and conf._jconf is not None:
> # conf has been initialized in JVM properly, so use conf 
> directly. This represent the
> # scenario that JVM has been launched before SparkConf is created 
> (e.g. SparkContext is
> # created and then stopped, and we create a new SparkConf and new 
> SparkContext again)
> self._conf = conf
> else:
> self._conf = SparkConf(_jvm=SparkContext._jvm)
> + if conf:
> + for key, value in conf.getAll():
> + self._conf.set(key,value)
> + print(key,value)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21834) Incorrect executor request in case of dynamic allocation

2017-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146235#comment-16146235
 ] 

Apache Spark commented on SPARK-21834:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/19081

> Incorrect executor request in case of dynamic allocation
> 
>
> Key: SPARK-21834
> URL: https://issues.apache.org/jira/browse/SPARK-21834
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> killExecutor api currently does not allow killing an executor without 
> updating the total number of executors needed. When dynamic allocation 
> is turned on and the allocator tries to kill an executor, the scheduler 
> reduces the total number of executors needed (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635)
>  which is incorrect because the allocator already takes care of setting the 
> required number of executors itself. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146193#comment-16146193
 ] 

Steve Loughran commented on SPARK-21797:


No? That's a shame. I only came across the option when I pasted the stack trace 
into the IDE and it said "enable this option". Sorry, I'm not sure what other 
strategies there are. Sean, any idea?

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3 partitioned by date (dt) with oldest date 
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of date that are not yet in 
> glacier, eg:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some partitions are 
> in Glacier. I could always read each date specifically, add the column with the 
> current date and reduce(_ union _) at the end, but that is not pretty and it 
> should not be necessary.
> Is there any tip for reading the available data in the datastore even when old 
> data is in Glacier?
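
The per-date read-and-union workaround mentioned above would look roughly like this 
in PySpark (a sketch only; it assumes every date in the chosen range is present and 
outside Glacier):
{code}
from datetime import date, timedelta
from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

start, end = date(2017, 7, 15), date(2017, 8, 24)
days = [(start + timedelta(n)).isoformat() for n in range((end - start).days + 1)]

# Read each partition directory directly and re-attach the dt column,
# since reading a single partition path drops the partition column.
frames = [
    spark.read.parquet("s3://my-bucket/my-dataset/dt=%s/" % d).withColumn("dt", lit(d))
    for d in days
]
X = reduce(lambda a, b: a.union(b), frames)
{code}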



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21728) Allow SparkSubmit to use logging

2017-08-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21728.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.3.0

> Allow SparkSubmit to use logging
> 
>
> Key: SPARK-21728
> URL: https://issues.apache.org/jira/browse/SPARK-21728
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, code in {{SparkSubmit}} cannot call classes or methods that 
> initialize the Spark {{Logging}} framework. That is because at that time 
> {{SparkSubmit}} doesn't yet know which application will run, and logging is 
> initialized differently for certain special applications (notably, the 
> shells).
> It would be better if either {{SparkSubmit}} did logging initialization 
> earlier based on the application to be run, or did it in a way that could be 
> overridden later when the app initializes.
> Without this, there are currently a few parts of {{SparkSubmit}} that 
> duplicates code from other parts of Spark just to avoid logging. For example:
> * 
> [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860]
>  replicates code from Utils.scala
> * 
> [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54]
>  replicates code from Utils.scala and installs its own shutdown hook
> * a few parts of the code could use {{SparkConf}} but can't right now because 
> of the logging issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21868) Spark job fails on java 9 NumberFormatException for input string ea

2017-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21868.
---
Resolution: Not A Problem

Java 9 is not supported

> Spark job fails on java 9 NumberFormatException for input string ea
> ---
>
> Key: SPARK-21868
> URL: https://issues.apache.org/jira/browse/SPARK-21868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: rahul  sharma
>Priority: Minor
>
> I have a sample spark job which I am successfully able to run on java 8 but 
> when I run same program on java 9 early access, it fails with 
> NumberFormatException.
> SparkConf conf = new SparkConf();
> conf.setMaster("local[*]").setAppName("dataframe join example");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Dataset ds = 
> session.read().text(Thread.currentThread().getContextClassLoader().
> getResource("employee").getPath());
> System.out.println(ds.count());
> Maven spark dependencies:
>  
> org.apache.spark
> spark-core_2.10
> 2.1.0
> 
> 
> org.apache.spark
> spark-sql_2.10
> 2.1.0
> 
> Error:
> Exception in thread "main" java.lang.NumberFormatException: For input string: 
> "ea" at 
> java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>  at java.base/java.lang.Integer.parseInt(Integer.java:695) at 
> java.base/java.lang.Integer.parseInt(Integer.java:813) at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at 
> scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at 
> org.apache.spark.SparkContext.warnDeprecatedVersions(SparkContext.scala:353) 
> at org.apache.spark.SparkContext.(SparkContext.scala:186) at 
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313) at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>  at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>  at scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) 
> at 
> Java Details:
> java -version
> java version "9-ea" 
> Java(TM) SE Runtime Environment (build 9-ea+156)
> Java HotSpot(TM) 64-Bit Server VM (build 9-ea+156, mixed mode)
>  are there different set of steps to run spark job on java 9?
> https://stackoverflow.com/questions/45945128/spark-job-fails-on-java-9-numberformatexception-for-input-string-ea/45948077#45948077
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21868) Spark job fails on java 9 NumberFormatException for input string ea

2017-08-29 Thread rahul sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rahul  sharma updated SPARK-21868:
--
Description: 
I have a sample spark job which I am successfully able to run on java 8 but 
when I run same program on java 9 early access, it fails with 
NumberFormatException.

SparkConf conf = new SparkConf();
conf.setMaster("local[*]").setAppName("dataframe join example");
SparkSession session = 
SparkSession.builder().config(conf).getOrCreate();
Dataset ds = 
session.read().text(Thread.currentThread().getContextClassLoader().
getResource("employee").getPath());
System.out.println(ds.count());


Maven spark dependencies:

 
org.apache.spark
spark-core_2.10
2.1.0



org.apache.spark
spark-sql_2.10
2.1.0


Error:

Exception in thread "main" java.lang.NumberFormatException: For input string: 
"ea" at 
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.base/java.lang.Integer.parseInt(Integer.java:695) at 
java.base/java.lang.Integer.parseInt(Integer.java:813) at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at 
scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at 
org.apache.spark.SparkContext.warnDeprecatedVersions(SparkContext.scala:353) at 
org.apache.spark.SparkContext.(SparkContext.scala:186) at 
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313) at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
 at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
 at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) 
at 

Java Details:
java -version
java version "9-ea" 
Java(TM) SE Runtime Environment (build 9-ea+156)
Java HotSpot(TM) 64-Bit Server VM (build 9-ea+156, mixed mode)

Are there different steps needed to run a Spark job on Java 9?
https://stackoverflow.com/questions/45945128/spark-job-fails-on-java-9-numberformatexception-for-input-string-ea/45948077#45948077
 


  was:
I have a sample spark job which I am successfully able to run on java 8 but 
when I run same program on java 9 early access, it fails with 
NumberFormatException.

SparkConf conf = new SparkConf();
conf.setMaster("local[*]").setAppName("dataframe join example");
SparkSession session = 
SparkSession.builder().config(conf).getOrCreate();
Dataset ds = 
session.read().text(Thread.currentThread().getContextClassLoader().
getResource("employee").getPath());
System.out.println(ds.count());


Maven spark dependencies:

 
org.apache.spark
spark-core_2.10
2.1.0



org.apache.spark
spark-sql_2.10
2.1.0


Error:

Exception in thread "main" java.lang.NumberFormatException: For input string: 
"ea" at 
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.base/java.lang.Integer.parseInt(Integer.java:695) at 
java.base/java.lang.Integer.parseInt(Integer.java:813) at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at 
scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at 
org.apache.spark.SparkContext.warnDeprecatedVersions(SparkContext.scala:353) at 
org.apache.spark.SparkContext.(SparkContext.scala:186) at 
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313) at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
 at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
 at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) 
at 

Java Details:
java -version
java version "9-ea" 
Java(TM) SE Runtime Environment (build 9-ea+156)
Java HotSpot(TM) 64-Bit Server VM (build 9-ea+156, mixed mode)



> Spark job fails on java 9 NumberFormatException for input string ea
> ---
>
> Key: SPARK-21868
> URL: https://issues.apache.org/jira/browse/SPARK-21868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: rahul  sharma
>Priority: Minor
>
> I have a sample spark job which I am successfully able to run on java 8 but 
> when I run same program on java 9 early access, it fails with 
> NumberFormatException.
> SparkConf conf = new SparkConf();
> conf.setMaster("local[*]").setAppName("dataframe join example");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Dataset ds = 
> session.read().text(Thread.currentThread().getContextClassLoader().
> getResource("employee").getPath());
> System.out.pri

[jira] [Created] (SPARK-21868) Spark job fails on java 9 NumberFormatException for input string ea

2017-08-29 Thread rahul sharma (JIRA)
rahul  sharma created SPARK-21868:
-

 Summary: Spark job fails on java 9 NumberFormatException for input 
string ea
 Key: SPARK-21868
 URL: https://issues.apache.org/jira/browse/SPARK-21868
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: rahul  sharma
Priority: Minor


I have a sample spark job which I am successfully able to run on java 8 but 
when I run same program on java 9 early access, it fails with 
NumberFormatException.

SparkConf conf = new SparkConf();
conf.setMaster("local[*]").setAppName("dataframe join example");
SparkSession session = SparkSession.builder().config(conf).getOrCreate();
Dataset<Row> ds = session.read().text(
    Thread.currentThread().getContextClassLoader().getResource("employee").getPath());
System.out.println(ds.count());


Maven spark dependencies:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>2.1.0</version>
</dependency>

Error:

Exception in thread "main" java.lang.NumberFormatException: For input string: 
"ea" at 
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.base/java.lang.Integer.parseInt(Integer.java:695) at 
java.base/java.lang.Integer.parseInt(Integer.java:813) at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at 
scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at 
org.apache.spark.SparkContext.warnDeprecatedVersions(SparkContext.scala:353) at 
org.apache.spark.SparkContext.(SparkContext.scala:186) at 
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313) at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
 at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
 at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) 
at 

Java Details:
java -version
java version "9-ea" 
Java(TM) SE Runtime Environment (build 9-ea+156)
Java HotSpot(TM) 64-Bit Server VM (build 9-ea+156, mixed mode)




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-08-29 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146060#comment-16146060
 ] 

Sital Kedia commented on SPARK-21867:
-

cc - [~rxin], [~joshrosen], [~sameer] - What do you think of the idea?

> Support async spilling in UnsafeShuffleWriter
> -
>
> Key: SPARK-21867
> URL: https://issues.apache.org/jira/browse/SPARK-21867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Priority: Minor
>
> Currently, Spark tasks are single-threaded. But we see it could greatly 
> improve the performance of the jobs, if we can multi-thread some part of it. 
> For example, profiling our map tasks, which reads large amount of data from 
> HDFS and spill to disks, we see that we are blocked on HDFS read and spilling 
> majority of the time. Since both these operations are IO intensive the 
> average CPU consumption during map phase is significantly low. In theory, 
> both HDFS read and spilling can be done in parallel if we had additional 
> memory to store data read from HDFS while we are spilling the last batch read.
> Let's say we have 1G of shuffle memory available per task. Currently, in case 
> of map task, it reads from HDFS and the records are stored in the available 
> memory buffer. Once we hit the memory limit and there is no more space to 
> store the records, we sort and spill the content to disk. While we are 
> spilling to disk, since we do not have any available memory, we can not read 
> from HDFS concurrently. 
> Here we propose supporting async spilling for UnsafeShuffleWriter, so that we 
> can support reading from HDFS when sort and spill is happening 
> asynchronously.  Let's say the total 1G of shuffle memory can be split into 
> two regions - active region and spilling region - each of size 500 MB. We 
> start with reading from HDFS and filling the active region. Once we hit the 
> limit of the active region, we issue an asynchronous spill while flipping the 
> active region and the spilling region. While the spill is happening 
> asynchronously, we still have 500 MB of memory available to read the data 
> from HDFS. This way we can amortize the high disk/network io cost during 
> spilling.
> We made a prototype hack to implement this feature and we could see our map 
> tasks were as much as 40% faster. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-08-29 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-21867:
---

 Summary: Support async spilling in UnsafeShuffleWriter
 Key: SPARK-21867
 URL: https://issues.apache.org/jira/browse/SPARK-21867
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Sital Kedia
Priority: Minor


Currently, Spark tasks are single-threaded, but we believe we could greatly improve 
job performance if we could multi-thread some parts of them. For example, profiling 
our map tasks, which read large amounts of data from HDFS and spill to disk, we see 
that we are blocked on HDFS reads and spilling the majority of the time. Since both 
of these operations are IO intensive, the average CPU consumption during the map 
phase is significantly low. In theory, HDFS reads and spilling could be done in 
parallel if we had additional memory to store the data read from HDFS while we are 
spilling the last batch read.

Let's say we have 1G of shuffle memory available per task. Currently, in case 
of map task, it reads from HDFS and the records are stored in the available 
memory buffer. Once we hit the memory limit and there is no more space to store 
the records, we sort and spill the content to disk. While we are spilling to 
disk, since we do not have any available memory, we can not read from HDFS 
concurrently. 

Here we propose supporting async spilling for UnsafeShuffleWriter, so that we 
can support reading from HDFS when sort and spill is happening asynchronously.  
Let's say the total 1G of shuffle memory can be split into two regions - active 
region and spilling region - each of size 500 MB. We start with reading from 
HDFS and filling the active region. Once we hit the limit of the active region, we 
issue an asynchronous spill while flipping the active region and the spilling 
region. While the spill is happening asynchronously, we still have 500 MB of 
memory available to read the data from HDFS. This way we can amortize the high 
disk/network io cost during spilling.

We made a prototype hack to implement this feature and we could see our map 
tasks were as much as 40% faster. 
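
To make the double-buffering idea concrete, here is a toy, non-Spark sketch in Python 
(all names hypothetical): one half of the budget is filled while the other half is 
spilled on a background thread, and a spill must finish before its region is reused.
{code}
from concurrent.futures import ThreadPoolExecutor

REGION_SIZE = 4  # records per region; stands in for the two 500 MB halves above

def read_records(n):
    """Stand-in for reading records from HDFS."""
    return iter(range(n))

def spill(region, spill_id):
    """Stand-in for sort-and-spill of one region to disk."""
    print("spilled %d records (spill #%d)" % (len(region), spill_id))

def run(total_records=10):
    executor = ThreadPoolExecutor(max_workers=1)  # one background spill at a time
    active = []
    pending = None                                # future of the in-flight spill
    spill_id = 0
    for rec in read_records(total_records):
        active.append(rec)
        if len(active) == REGION_SIZE:
            if pending is not None:
                pending.result()                  # wait: previous spill region is now free
            spill_id += 1
            active, spilling = [], active         # flip the active and spilling regions
            pending = executor.submit(spill, spilling, spill_id)
    if pending is not None:
        pending.result()
    if active:
        spill(active, spill_id + 1)               # final partial region
    executor.shutdown()

if __name__ == "__main__":
    run()
{code}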



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21714) SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again

2017-08-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-21714:
---
Fix Version/s: (was: 2.2.1)

> SparkSubmit in Yarn Client mode downloads remote files and then reuploads 
> them again
> 
>
> Key: SPARK-21714
> URL: https://issues.apache.org/jira/browse/SPARK-21714
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Saisai Shao
>Priority: Critical
> Fix For: 2.3.0
>
>
> SPARK-10643 added the ability for spark-submit to download remote file in 
> client mode.
> However in yarn mode this introduced a bug where it downloads them for the 
> client but then yarn client just reuploads them to HDFS and uses them again. 
> This should not happen when the remote file is HDFS.  This is wasting 
> resources and it defeats the distributed cache because if the original 
> object was public it would have been shared by many users. By us downloading 
> and reuploading, it becomes private.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18350) Support session local timezone

2017-08-29 Thread Vinayak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136567#comment-16136567
 ] 

Vinayak edited comment on SPARK-18350 at 8/29/17 7:30 PM:
--

[~ueshin]  
I have set the value below to set the time zone to UTC, but it is still adding the 
current (local) time zone offset even though the input is already in the UTC format.

spark.conf.set("spark.sql.session.timeZone", "UTC")

Find the attached csv data for reference.

Expected : Time should remain same as the input since it's already in UTC format

var df1 = spark.read.option("delimiter", ",").option("qualifier", 
"\"").option("inferSchema","true").option("header", "true").option("mode", 
"PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat",
 "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv");

df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields]

scala> df1.show(false);

+----+---+----+-------------------+-------------------+----------------------+-------------------+
|Name|Age|Add |Date               |SparkDate          |SparkDate1            |SparkDate2         |
+----+---+----+-------------------+-------------------+----------------------+-------------------+
|abc |21 |bvxc|04/22/2017T03:30:02|2017-03-21 03:30:02|2017-03-21 09:00:02.02|2017-03-21 05:30:00|
+----+---+----+-------------------+-------------------+----------------------+-------------------+



was (Author: vinayaksgadag):
I have set the below value to set the timeZone to UTC. It is adding the current 
timeZone value even though it is in the UTC format.

spark.conf.set("spark.sql.session.timeZone", "UTC")

Find the attached csv data for reference.

Expected : Time should remain same as the input since it's already in UTC format

var df1 = spark.read.option("delimiter", ",").option("qualifier", 
"\"").option("inferSchema","true").option("header", "true").option("mode", 
"PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat",
 "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv");

df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more fields]

scala> df1.show(false);

+----+---+----+-------------------+-------------------+----------------------+-------------------+
|Name|Age|Add |Date               |SparkDate          |SparkDate1            |SparkDate2         |
+----+---+----+-------------------+-------------------+----------------------+-------------------+
|abc |21 |bvxc|04/22/2017T03:30:02|2017-03-21 03:30:02|2017-03-21 09:00:02.02|2017-03-21 05:30:00|
+----+---+----+-------------------+-------------------+----------------------+-------------------+


> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>  Labels: releasenotes
> Fix For: 2.2.0
>
> Attachments: sample.csv
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2017-08-29 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145831#comment-16145831
 ] 

Mridul Muralidharan commented on SPARK-9213:


[~rxin] Curious what happened to this effort - did we find a replacement ? Or 
it is still a TODO which will help ?

> Improve regular expression performance (via joni)
> -
>
> Key: SPARK-9213
> URL: https://issues.apache.org/jira/browse/SPARK-9213
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> I'm creating an umbrella ticket to improve regular expression performance for 
> string expressions. Right now our use of regular expressions is inefficient 
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF8 encoded bytes into Java String, 
> and then run regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF8 
> encoded bytes. One prominent example is joni, used in JRuby.
> Note: all regex functions are in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-08-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145808#comment-16145808
 ] 

Sean Owen commented on SPARK-21866:
---

Why would this need to be part of Spark? I assume it's Spark-specific, yes, but 
it already exists as a standalone library. You're saying it will continue to be 
a stand-alone package too? It also doesn't seem to add any advantages in 
representation; this seems like what one would get reading any image into, say, 
BufferedImage and then picking out its channels.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specif

[jira] [Updated] (SPARK-21714) SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again

2017-08-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-21714:
---
Fix Version/s: 2.2.1

> SparkSubmit in Yarn Client mode downloads remote files and then reuploads 
> them again
> 
>
> Key: SPARK-21714
> URL: https://issues.apache.org/jira/browse/SPARK-21714
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Saisai Shao
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>
> SPARK-10643 added the ability for spark-submit to download remote file in 
> client mode.
> However in yarn mode this introduced a bug where it downloads them for the 
> client but then yarn client just reuploads them to HDFS and uses them again. 
> This should not happen when the remote file is HDFS.  This is wasting 
> resources and it defeats the distributed cache because if the original 
> object was public it would have been shared by many users. By us downloading 
> and reuploading, it becomes private.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-29 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-21866:
---
Attachment: SPIP - Image support for Apache Spark.pdf

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about the origin of the image. The content of th

[jira] [Created] (SPARK-21866) SPIP: Image support in Spark

2017-08-29 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-21866:
--

 Summary: SPIP: Image support in Spark
 Key: SPARK-21866
 URL: https://issues.apache.org/jira/browse/SPARK-21866
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Timothy Hunter


h2. Background and motivation
As Apache Spark is being used more and more in the industry, some new use cases 
are emerging for different data formats beyond the traditional SQL types or the 
numerical types (vectors and matrices). Deep Learning applications commonly 
deal with image processing. A number of projects add some Deep Learning 
capabilities to Spark (see list below), but they struggle to  communicate with 
each other or with MLlib pipelines because there is no standard way to 
represent an image in Spark DataFrames. We propose to federate efforts for 
representing images in Spark by defining a representation that caters to the 
most common needs of users and library developers.

This SPIP proposes a specification to represent images in Spark DataFrames and 
Datasets (based on existing industrial standards), and an interface for loading 
sources of images. It is not meant to be a full-fledged image processing 
library, but rather the core description that other libraries and users can 
rely on. Several packages already offer various processing facilities for 
transforming images or doing more complex operations, and each has various 
design tradeoffs that make them better as standalone solutions.

This project is a joint collaboration between Microsoft and Databricks, which 
have been testing this design in two open source packages: MMLSpark and Deep 
Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that 
targets low-level applications. It is significantly more liberal in memory 
usage than compressed image representations such as JPEG, PNG, etc., but it 
allows easy communication with popular image processing libraries and has no 
decoding overhead.

h2. Targets users and personas:
Data scientists, data engineers, library developers.
The following libraries define primitives for loading and representing images, 
and will gain from a common interchange format (in alphabetical order):
* BigDL
* DeepLearning4J
* Deep Learning Pipelines
* MMLSpark
* TensorFlow (Spark connector)
* TensorFlowOnSpark
* TensorFrames
* Thunder

h2. Goals:
* Simple representation of images in Spark DataFrames, based on pre-existing 
industrial standards (OpenCV)
* This format should eventually allow the development of high-performance 
integration points with image processing libraries such as libOpenCV, Google 
TensorFlow, CNTK, and other C libraries.
* The reader should be able to read popular formats of images from distributed 
sources.

h2. Non-Goals:
Images are a versatile medium and encompass a very wide range of formats and 
representations. This SPIP explicitly aims at the most common use case in the 
industry currently: multi-channel matrices of binary, int32, int64, float or 
double data that can fit comfortably in the heap of the JVM:
* the total size of an image should be restricted to less than 2GB (roughly)
* the meaning of color channels is application-specific and is not mandated by 
the standard (in line with the OpenCV standard)
* specialized formats used in meteorology, the medical field, etc. are not 
supported
* this format is specialized to images and does not attempt to solve the more 
general problem of representing n-dimensional tensors in Spark

h2. Proposed API changes
We propose to add a new package in the package structure, under the MLlib 
project:
{{org.apache.spark.image}}

h3. Data format
We propose to add the following structure:

imageSchema = StructType([
* StructField("mode", StringType(), False),
** The exact representation of the data.
** The values are described in the following OpenCV convention. Basically, the 
type has both "depth" and "number of channels" info: in particular, type 
"CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 (value 
32 in the table) with the channel order specified by convention.
** The exact channel ordering and meaning of each channel is dictated by 
convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
If the image failed to load, the value is the empty string "".

* StructField("origin", StringType(), True),
** Some information about the origin of the image. The content of this is 
application-specific.
** When the image is loaded from files, users should expect to find the file 
name in this field.

* StructField("height", IntegerType(), False),
** the height of the image, pixels
** If the image fails to load, the value is -1.

* StructField("width", IntegerType(), False),
** the width of the image, pixels
** If the image fails to load, the value is -1.

* StructField("nChannels", In

[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145716#comment-16145716
 ] 

Boris Clémençon  commented on SPARK-21797:
--

FYI, the flag spark.sql.files.ignoreCorruptFiles=true does not seem to fix the 
problem.
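
(For completeness, the flag above is typically set on an existing session like this; 
as noted, it did not help in this case:)
{code}
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
{code}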

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3 partitioned by date (dt), with the oldest 
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in 
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions 
> are in Glacier. I could always read each date separately, add the dt column, 
> and reduce(_ union _) at the end, but that is not pretty and it should not be 
> necessary.
> Is there any tip to read the available data in the datastore even with old 
> data in Glacier?
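
A sketch of the per-date workaround mentioned above (Scala; {{availableDates}} and the paths are illustrative placeholders, not part of the report):

{code}
import org.apache.spark.sql.functions.lit

// Read only the partitions known to be outside Glacier and union them manually.
val availableDates = Seq("2017-07-15", "2017-07-16", "2017-08-24") // illustrative
val basePath = "s3://my-bucket/my-dataset"

val df = availableDates
  .map(dt => spark.read.parquet(s"$basePath/dt=$dt").withColumn("dt", lit(dt)))
  .reduce(_ union _)
{code}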



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2017-08-29 Thread Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145708#comment-16145708
 ] 

Brad commented on SPARK-21097:
--

Here is a document with some of my benchmark results. I am working on adding 
more benchmarks.

https://docs.google.com/document/d/1E6_rhAAJB8Ww0n52-LYcFTO1zhJBWgfIXzNjLi29730/edit?usp=sharing

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, such as an error, the copy taking too long, or there not 
> being enough space, then we will fall back to the original behavior and drop 
> the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.
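
A sketch of how the proposed switch would be enabled, assuming the configuration name suggested in the description ({{spark.dynamicAllocation.recoverCachedData}} does not exist in released Spark versions; it is only the proposal here):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("notebook-session")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Proposed in this ticket; not an existing configuration key.
  .config("spark.dynamicAllocation.recoverCachedData", "true")
  .getOrCreate()
{code}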



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21801) SparkR unit test randomly fail on trees

2017-08-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-21801:


Assignee: Felix Cheung

> SparkR unit test randomly fail on trees
> ---
>
> Key: SPARK-21801
> URL: https://issues.apache.org/jira/browse/SPARK-21801
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.3.0
>
>
> SparkR unit tests sometimes fail randomly with errors such as:
> ```
> 1. Error: spark.randomForest (@test_mllib_tree.R#236) 
> --
> java.lang.IllegalArgumentException: requirement failed: The input column 
> stridx_87ea3065aeb2 should have at least two distinct values.
> ```
> or
> ```
> 1. Error: spark.decisionTree (@test_mllib_tree.R#353) 
> --
> java.lang.IllegalArgumentException: requirement failed: The input column 
> stridx_d6a0b492cfa1 should have at least two distinct values.
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21801) SparkR unit test randomly fail on trees

2017-08-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21801.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> SparkR unit test randomly fail on trees
> ---
>
> Key: SPARK-21801
> URL: https://issues.apache.org/jira/browse/SPARK-21801
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.3.0
>
>
> SparkR unit tests sometimes fail randomly with errors such as:
> ```
> 1. Error: spark.randomForest (@test_mllib_tree.R#236) 
> --
> java.lang.IllegalArgumentException: requirement failed: The input column 
> stridx_87ea3065aeb2 should have at least two distinct values.
> ```
> or
> ```
> 1. Error: spark.decisionTree (@test_mllib_tree.R#353) 
> --
> java.lang.IllegalArgumentException: requirement failed: The input column 
> stridx_d6a0b492cfa1 should have at least two distinct values.
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21857) Exception in thread "main" java.lang.ExceptionInInitializerError

2017-08-29 Thread Nagamanoj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145683#comment-16145683
 ] 

Nagamanoj commented on SPARK-21857:
---

Thank you very much Sean Owen... I reverted to Java 8 and now it works fine.



> Exception in thread "main" java.lang.ExceptionInInitializerError
> 
>
> Key: SPARK-21857
> URL: https://issues.apache.org/jira/browse/SPARK-21857
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nagamanoj
>
> After installing Spark using a prebuilt version, when we run ./bin/pyspark
> JAVA Version = Java 9
> I'm getting the following exception
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 17/08/28 20:06:43 INFO SparkContext: Running Spark version 2.2.0
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
> at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
> at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
> at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
> at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
> at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
> at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2430)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
> at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
> at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 1
> at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3116)
> at java.base/java.lang.String.substring(String.java:1885)
> at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21865) remove Partitioning.compatibleWith

2017-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145675#comment-16145675
 ] 

Apache Spark commented on SPARK-21865:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19080

> remove Partitioning.compatibleWith
> --
>
> Key: SPARK-21865
> URL: https://issues.apache.org/jira/browse/SPARK-21865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21865) remove Partitioning.compatibleWith

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21865:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove Partitioning.compatibleWith
> --
>
> Key: SPARK-21865
> URL: https://issues.apache.org/jira/browse/SPARK-21865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21865) remove Partitioning.compatibleWith

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21865:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove Partitioning.compatibleWith
> --
>
> Key: SPARK-21865
> URL: https://issues.apache.org/jira/browse/SPARK-21865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21822:


Assignee: Apache Spark

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Assignee: Apache Spark
>Priority: Minor
>
> When an insert into a Hive table is finished, it is better to clean out the 
> tmpLocation dir (the temp directories like 
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" 
> or "/tmp/hive/..." for an old Spark version).
> Otherwise, when lots of Spark jobs are executed, millions of temporary 
> directories are left in HDFS, and these temporary directories can only be 
> deleted by the maintainer through a shell script.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21822:


Assignee: (was: Apache Spark)

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Minor
>
> When an insert into a Hive table is finished, it is better to clean out the 
> tmpLocation dir (the temp directories like 
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" 
> or "/tmp/hive/..." for an old Spark version).
> Otherwise, when lots of Spark jobs are executed, millions of temporary 
> directories are left in HDFS, and these temporary directories can only be 
> deleted by the maintainer through a shell script.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21849) Make the serializer function more robust

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21849:


Assignee: (was: Apache Spark)

> Make the serializer function more robust
> 
>
> Key: SPARK-21849
> URL: https://issues.apache.org/jira/browse/SPARK-21849
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DjvuLee
>Priority: Trivial
>
> Make sure the {{close}} function is called in the {{serialize}} function.
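
A minimal sketch of the pattern being asked for (illustrative helper only, not the actual Spark serializer code):

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// The point of the ticket: the stream's close() must run even when
// writeObject throws.
def serialize(obj: AnyRef): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  try {
    out.writeObject(obj)
  } finally {
    out.close() // always executed, even on failure
  }
  bytes.toByteArray
}
{code}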



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21806) BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21806:


Assignee: (was: Apache Spark)

> BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
> --
>
> Key: SPARK-21806
> URL: https://issues.apache.org/jira/browse/SPARK-21806
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Marc Kaminski
>Priority: Minor
> Attachments: PRROC_example.jpeg
>
>
> I would like to refer to a [discussion in scikit-learn| 
> https://github.com/scikit-learn/scikit-learn/issues/4223], as this behavior 
> is probably based on the scikit implementation. 
> Summary: 
> Currently, the y-axis intercept of the precision recall curve is set to (0.0, 
> 1.0). This behavior is not ideal in certain edge cases (see example below) 
> and can also have an impact on cross validation, when optimization metric is 
> set to "areaUnderPR". 
> Please consider [blucena's 
> post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613]
>  for possible alternatives. 
> Edge case example: 
> Consider a bad classifier that assigns a high probability to all samples. A 
> possible output might look like this: 
> ||Real label || Score ||
> |1.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 0.95 |
> |0.0 | 0.95 |
> |1.0 | 1.0 |
> This results in the following pr points (first line set by default): 
> ||Threshold || Recall ||Precision ||
> |1.0 | 0.0 | 1.0 | 
> |0.95| 1.0 | 0.2 |
> |0.0| 1.0 | 0.16 |
> The auPRC would be around 0.6. Classifiers with a more differentiated 
> probability assignment  will be falsely assumed to perform worse in regard to 
> this auPRC.
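
A sketch that reproduces the edge case with MLlib (Scala; the (score, label) pairs mirror the table above, and {{sc}} is assumed to be a SparkContext as in the shell):

{code}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// 2 positives and 8 negatives at score 1.0, plus 2 negatives at score 0.95.
val scoreAndLabels = sc.parallelize(
  Seq((1.0, 1.0), (1.0, 1.0)) ++ Seq.fill(8)((1.0, 0.0)) ++ Seq((0.95, 0.0), (0.95, 0.0))
)

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
metrics.pr().collect().foreach(println) // first point is the debated (0.0, 1.0)
println(metrics.areaUnderPR())
{code}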



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21845) Make codegen fallback of expressions configurable

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21845:


Assignee: Apache Spark  (was: Xiao Li)

> Make codegen fallback of expressions configurable
> -
>
> Key: SPARK-21845
> URL: https://issues.apache.org/jira/browse/SPARK-21845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> We should make codegen fallback of expressions configurable. So far, it is 
> always on. We might hide it when our codegen has compilation bugs. Thus, we 
> should also disable the codegen fallback when running test cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20628) Keep track of nodes which are going to be shut down & avoid scheduling new tasks

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20628:


Assignee: (was: Apache Spark)

> Keep track of nodes which are going to be shut down & avoid scheduling new 
> tasks
> 
>
> Key: SPARK-20628
> URL: https://issues.apache.org/jira/browse/SPARK-20628
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: holdenk
>
> Keep track of nodes which are going to be shut down. We considered adding 
> this for YARN but took a different approach; for instances where we can't 
> control instance termination, though (EC2, GCE, etc.), this may make more sense.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21097) Dynamic allocation will preserve cached data

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21097:


Assignee: Apache Spark

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Assignee: Apache Spark
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, such as an error, the copy taking too long, or there not 
> being enough space, then we will fall back to the original behavior and drop 
> the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21811) Inconsistency when finding the widest common type of a combination of DateType, StringType, and NumericType

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21811:


Assignee: (was: Apache Spark)

> Inconsistency when finding the widest common type of a combination of 
> DateType, StringType, and NumericType
> ---
>
> Key: SPARK-21811
> URL: https://issues.apache.org/jira/browse/SPARK-21811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Bald
>Priority: Minor
>
> Finding the widest common type for the arguments of a variadic function (such 
> as IN or COALESCE) when the types of the arguments are a combination of 
> DateType/TimestampType, StringType, and NumericType fails with an 
> AnalysisException for some orders of the arguments and succeeds with a common 
> type of StringType for other orders of the arguments.
> The below examples used to reproduce the error assume a schema of:
> {{[c1: date, c2: string, c3: int]}}
> The following succeeds:
> {{SELECT coalesce(c1, c2, c3) FROM table}}
> While the following produces an exception:
> {{SELECT coalesce(c1, c3, c2) FROM table}}
> The order of arguments affects the behavior because it looks like the widest 
> common type is found by repeatedly looking at two arguments at a time: the 
> widest common type found thus far and the next argument. On initial thought 
> of a fix, I think the way the widest common type is found would have to be 
> changed to instead look at all arguments first before deciding what the 
> widest common type should be.
> As my boss is out of office for the rest of the day I will give a pull 
> request a shot, but as I am not super familiar with Scala or Spark's coding 
> style guidelines, a pull request is not promised. Going forward with my 
> attempted pull request, I will assume having DateType/TimestampType, 
> StringType, and NumericType arguments in an IN expression and COALESCE 
> function (and any other function/expression where this combination of 
> argument types can occur) is valid. I find it also quite reasonable to have 
> this combination of argument types to be invalid, so if that's what is 
> decided, then oh well.
> If I were a betting man, I'd say the fix would be made in the following file: 
> [TypeCoercion.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21097) Dynamic allocation will preserve cached data

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21097:


Assignee: (was: Apache Spark)

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, such as an error, the copy taking too long, or there not 
> being enough space, then we will fall back to the original behavior and drop 
> the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21849) Make the serializer function more robust

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21849:


Assignee: Apache Spark

> Make the serializer function more robust
> 
>
> Key: SPARK-21849
> URL: https://issues.apache.org/jira/browse/SPARK-21849
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DjvuLee
>Assignee: Apache Spark
>Priority: Trivial
>
> Make sure the {{close}} function is called in the {{serialize}} function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21811) Inconsistency when finding the widest common type of a combination of DateType, StringType, and NumericType

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21811:


Assignee: Apache Spark

> Inconsistency when finding the widest common type of a combination of 
> DateType, StringType, and NumericType
> ---
>
> Key: SPARK-21811
> URL: https://issues.apache.org/jira/browse/SPARK-21811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Bald
>Assignee: Apache Spark
>Priority: Minor
>
> Finding the widest common type for the arguments of a variadic function (such 
> as IN or COALESCE) when the types of the arguments are a combination of 
> DateType/TimestampType, StringType, and NumericType fails with an 
> AnalysisException for some orders of the arguments and succeeds with a common 
> type of StringType for other orders of the arguments.
> The below examples used to reproduce the error assume a schema of:
> {{[c1: date, c2: string, c3: int]}}
> The following succeeds:
> {{SELECT coalesce(c1, c2, c3) FROM table}}
> While the following produces an exception:
> {{SELECT coalesce(c1, c3, c2) FROM table}}
> The order of arguments affects the behavior because it looks like the widest 
> common type is found by repeatedly looking at two arguments at a time: the 
> widest common type found thus far and the next argument. On initial thought 
> of a fix, I think the way the widest common type is found would have to be 
> changed to instead look at all arguments first before deciding what the 
> widest common type should be.
> As my boss is out of office for the rest of the day I will give a pull 
> request a shot, but as I am not super familiar with Scala or Spark's coding 
> style guidelines, a pull request is not promised. Going forward with my 
> attempted pull request, I will assume having DateType/TimestampType, 
> StringType, and NumericType arguments in an IN expression and COALESCE 
> function (and any other function/expression where this combination of 
> argument types can occur) is valid. I find it also quite reasonable to have 
> this combination of argument types to be invalid, so if that's what is 
> decided, then oh well.
> If I were a betting man, I'd say the fix would be made in the following file: 
> [TypeCoercion.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20628) Keep track of nodes which are going to be shut down & avoid scheduling new tasks

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20628:


Assignee: Apache Spark

> Keep track of nodes which are going to be shut down & avoid scheduling new 
> tasks
> 
>
> Key: SPARK-20628
> URL: https://issues.apache.org/jira/browse/SPARK-20628
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: holdenk
>Assignee: Apache Spark
>
> Keep track of nodes which are going to be shut down. We considered adding 
> this for YARN but took a different approach; for instances where we can't 
> control instance termination, though (EC2, GCE, etc.), this may make more sense.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21806) BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21806:


Assignee: Apache Spark

> BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
> --
>
> Key: SPARK-21806
> URL: https://issues.apache.org/jira/browse/SPARK-21806
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Marc Kaminski
>Assignee: Apache Spark
>Priority: Minor
> Attachments: PRROC_example.jpeg
>
>
> I would like to refer to a [discussion in scikit-learn| 
> https://github.com/scikit-learn/scikit-learn/issues/4223], as this behavior 
> is probably based on the scikit implementation. 
> Summary: 
> Currently, the y-axis intercept of the precision recall curve is set to (0.0, 
> 1.0). This behavior is not ideal in certain edge cases (see example below) 
> and can also have an impact on cross validation, when optimization metric is 
> set to "areaUnderPR". 
> Please consider [blucena's 
> post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613]
>  for possible alternatives. 
> Edge case example: 
> Consider a bad classifier that assigns a high probability to all samples. A 
> possible output might look like this: 
> ||Real label || Score ||
> |1.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 0.95 |
> |0.0 | 0.95 |
> |1.0 | 1.0 |
> This results in the following pr points (first line set by default): 
> ||Threshold || Recall ||Precision ||
> |1.0 | 0.0 | 1.0 | 
> |0.95| 1.0 | 0.2 |
> |0.0| 1.0 | 0.16 |
> The auPRC would be around 0.6. Classifiers with a more differentiated 
> probability assignment  will be falsely assumed to perform worse in regard to 
> this auPRC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21845) Make codegen fallback of expressions configurable

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21845:


Assignee: Xiao Li  (was: Apache Spark)

> Make codegen fallback of expressions configurable
> -
>
> Key: SPARK-21845
> URL: https://issues.apache.org/jira/browse/SPARK-21845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> We should make codegen fallback of expressions configurable. So far, it is 
> always on. We might hide it when our codegen has compilation bugs. Thus, we 
> should also disable the codegen fallback when running test cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21469) Add doc and example for FeatureHasher

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21469:


Assignee: Apache Spark

> Add doc and example for FeatureHasher
> -
>
> Key: SPARK-21469
> URL: https://issues.apache.org/jira/browse/SPARK-21469
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>
> Add examples and user guide section for {{FeatureHasher}} in SPARK-13969
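
A sketch of the kind of snippet the guide section could include (Scala; the column names and values are illustrative):

{code}
import org.apache.spark.ml.feature.FeatureHasher

// Illustrative input: a mix of numeric, boolean and string columns.
val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar"),
  (4.4, false, "3", "baz"),
  (5.5, false, "4", "foo")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

hasher.transform(dataset).select("features").show(false)
{code}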



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21808) Add R interface of binarizer

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21808:


Assignee: (was: Apache Spark)

> Add R interface of binarizer
> 
>
> Key: SPARK-21808
> URL: https://issues.apache.org/jira/browse/SPARK-21808
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Jiaming Shu
>Priority: Minor
>  Labels: features
>
> add BinarizerWrapper.scala in org.apache.spark.ml.r
> add mllib_feature.R and test_mllib_feature.R in R
> update DESCRIPTION and NAMESPACE to collate 'mllib_feature.R' and export 
> 'spark.binarizer'



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21808) Add R interface of binarizer

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21808:


Assignee: Apache Spark

> Add R interface of binarizer
> 
>
> Key: SPARK-21808
> URL: https://issues.apache.org/jira/browse/SPARK-21808
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Jiaming Shu
>Assignee: Apache Spark
>Priority: Minor
>  Labels: features
>
> add BinarizerWrapper.scala in org.apache.spark.ml.r
> add mllib_feature.R and test_mllib_feature.R in R
> update DESCRIPTION and NAMESPACE to collate 'mllib_feature.R' and export 
> 'spark.binarizer'



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21469) Add doc and example for FeatureHasher

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21469:


Assignee: (was: Apache Spark)

> Add doc and example for FeatureHasher
> -
>
> Key: SPARK-21469
> URL: https://issues.apache.org/jira/browse/SPARK-21469
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>
> Add examples and user guide section for {{FeatureHasher}} in SPARK-13969



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21787) Support for pushing down filters for date types in ORC

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21787:


Assignee: (was: Apache Spark)

> Support for pushing down filters for date types in ORC
> --
>
> Key: SPARK-21787
> URL: https://issues.apache.org/jira/browse/SPARK-21787
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Stefan de Koning
>
> See related issue https://issues.apache.org/jira/browse/SPARK-16516
> It seems that DateType should also be pushed down to ORC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21728) Allow SparkSubmit to use logging

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21728:


Assignee: Apache Spark

> Allow SparkSubmit to use logging
> 
>
> Key: SPARK-21728
> URL: https://issues.apache.org/jira/browse/SPARK-21728
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, code in {{SparkSubmit}} cannot call classes or methods that 
> initialize the Spark {{Logging}} framework. That is because at that time 
> {{SparkSubmit}} doesn't yet know which application will run, and logging is 
> initialized differently for certain special applications (notably, the 
> shells).
> It would be better if either {{SparkSubmit}} did logging initialization 
> earlier based on the application to be run, or did it in a way that could be 
> overridden later when the app initializes.
> Without this, there are currently a few parts of {{SparkSubmit}} that 
> duplicate code from other parts of Spark just to avoid logging. For example:
> * 
> [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860]
>  replicates code from Utils.scala
> * 
> [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54]
>  replicates code from Utils.scala and installs its own shutdown hook
> * a few parts of the code could use {{SparkConf}} but can't right now because 
> of the logging issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21784) Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign keys

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21784:


Assignee: (was: Apache Spark)

> Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign 
> keys
> --
>
> Key: SPARK-21784
> URL: https://issues.apache.org/jira/browse/SPARK-21784
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Suresh Thalamati
>
> Currently Spark SQL does not have DDL support to define primary key and 
> foreign key constraints. This Jira is to add DDL support to define primary 
> key and foreign key informational constraint using ALTER TABLE syntax. These 
> constraints will be used in query optimization and you can find more details 
> about this in the spec in SPARK-19842
> *Syntax :*
> {code}
> ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName]
>   (PRIMARY KEY (col_names) |
>   FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)])
>   [VALIDATE | NOVALIDATE] [RELY | NORELY]
> {code}
> Examples :
> {code:sql}
> ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY
> ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES 
> employee(empno) NOVALIDATE NORELY
> {code}
> *Constraint name generated by the system:*
> {code:sql}
> ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY
> ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) 
> VALIDATE RELY;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21728) Allow SparkSubmit to use logging

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21728:


Assignee: (was: Apache Spark)

> Allow SparkSubmit to use logging
> 
>
> Key: SPARK-21728
> URL: https://issues.apache.org/jira/browse/SPARK-21728
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Currently, code in {{SparkSubmit}} cannot call classes or methods that 
> initialize the Spark {{Logging}} framework. That is because at that time 
> {{SparkSubmit}} doesn't yet know which application will run, and logging is 
> initialized differently for certain special applications (notably, the 
> shells).
> It would be better if either {{SparkSubmit}} did logging initialization 
> earlier based on the application to be run, or did it in a way that could be 
> overridden later when the app initializes.
> Without this, there are currently a few parts of {{SparkSubmit}} that 
> duplicate code from other parts of Spark just to avoid logging. For example:
> * 
> [downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860]
>  replicates code from Utils.scala
> * 
> [createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54]
>  replicates code from Utils.scala and installs its own shutdown hook
> * a few parts of the code could use {{SparkConf}} but can't right now because 
> of the logging issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21779) Simpler Dataset.sample API in Python

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21779:


Assignee: Apache Spark

> Simpler Dataset.sample API in Python
> 
>
> Key: SPARK-21779
> URL: https://issues.apache.org/jira/browse/SPARK-21779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> See parent ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21787) Support for pushing down filters for date types in ORC

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21787:


Assignee: Apache Spark

> Support for pushing down filters for date types in ORC
> --
>
> Key: SPARK-21787
> URL: https://issues.apache.org/jira/browse/SPARK-21787
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Stefan de Koning
>Assignee: Apache Spark
>
> See related issue https://issues.apache.org/jira/browse/SPARK-16516
> It seems that DateType should also be pushed down to ORC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21774:


Assignee: Apache Spark

> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result, which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21783) Turn on ORC filter push-down by default

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21783:


Assignee: Apache Spark

> Turn on ORC filter push-down by default
> ---
>
> Key: SPARK-21783
> URL: https://issues.apache.org/jira/browse/SPARK-21783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Like Parquet (SPARK-9207), it would be great to turn on the ORC option, too.
> This option was turned off by default from the beginning, SPARK-2883
> - 
> https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149
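
For reference, the option can already be turned on per session while it remains off by default; a minimal sketch:

{code}
// Enable ORC predicate push-down for the current session
// (this ticket proposes making it the default).
spark.conf.set("spark.sql.orc.filterPushdown", "true")
{code}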



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21624) Optimize communication cost of RF/GBT/DT

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21624:


Assignee: (was: Apache Spark)

> Optimize communication cost of RF/GBT/DT
> 
>
> Key: SPARK-21624
> URL: https://issues.apache.org/jira/browse/SPARK-21624
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> {quote}The implementation of RF is bound by either  the cost of statistics 
> computation on workers or by communicating the sufficient statistics.{quote}
> The statistics are stored in allStats:
> {code:java}
>   /**
>* Flat array of elements.
>* Index for start of stats for a (feature, bin) is:
>*   index = featureOffsets(featureIndex) + binIndex * statsSize
>*/
>   private var allStats: Array[Double] = new Array[Double](allStatsSize)
> {code}
> The size of allStats may be very large, and it can be very sparse, especially 
> on the nodes near the leaves of the tree. 
> I have changed allStats from an Array to a SparseVector; my tests show the 
> communication is down by about 50%.
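
A minimal sketch of the kind of change described (illustrative only, not the actual patch):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// allStats is mostly zeros near the leaves, so ship it as a sparse vector.
val allStats: Array[Double] = Array(0.0, 0.0, 3.0, 0.0, 1.0, 0.0)
val sparseStats: Vector = Vectors.dense(allStats).toSparse // keeps only non-zero entries
{code}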



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19256) Hive bucketing support

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19256:


Assignee: Apache Spark

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21720:


Assignee: Apache Spark

> Filter predicate with many conditions throw stackoverflow error
> ---
>
> Key: SPARK-21720
> URL: https://issues.apache.org/jira/browse/SPARK-21720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: srinivasan
>Assignee: Apache Spark
>
> When trying to filter a dataset with many predicate conditions, using either 
> Spark SQL or the dataset filter transformation as described below, Spark 
> throws a stack overflow exception.
> Case 1: Filter Transformation on Data
> Dataset<Row> filter = sourceDataset.filter(String.format("not(%s)", 
> buildQuery()));
> filter.show();
> where buildQuery() returns
> Field1 = "" and  Field2 = "" and  Field3 = "" and  Field4 = "" and  Field5 = 
> "" and  BLANK_5 = "" and  Field7 = "" and  Field8 = "" and  Field9 = "" and  
> Field10 = "" and  Field11 = "" and  Field12 = "" and  Field13 = "" and  
> Field14 = "" and  Field15 = "" and  Field16 = "" and  Field17 = "" and  
> Field18 = "" and  Field19 = "" and  Field20 = "" and  Field21 = "" and  
> Field22 = "" and  Field23 = "" and  Field24 = "" and  Field25 = "" and  
> Field26 = "" and  Field27 = "" and  Field28 = "" and  Field29 = "" and  
> Field30 = "" and  Field31 = "" and  Field32 = "" and  Field33 = "" and  
> Field34 = "" and  Field35 = "" and  Field36 = "" and  Field37 = "" and  
> Field38 = "" and  Field39 = "" and  Field40 = "" and  Field41 = "" and  
> Field42 = "" and  Field43 = "" and  Field44 = "" and  Field45 = "" and  
> Field46 = "" and  Field47 = "" and  Field48 = "" and  Field49 = "" and  
> Field50 = "" and  Field51 = "" and  Field52 = "" and  Field53 = "" and  
> Field54 = "" and  Field55 = "" and  Field56 = "" and  Field57 = "" and  
> Field58 = "" and  Field59 = "" and  Field60 = "" and  Field61 = "" and  
> Field62 = "" and  Field63 = "" and  Field64 = "" and  Field65 = "" and  
> Field66 = "" and  Field67 = "" and  Field68 = "" and  Field69 = "" and  
> Field70 = "" and  Field71 = "" and  Field72 = "" and  Field73 = "" and  
> Field74 = "" and  Field75 = "" and  Field76 = "" and  Field77 = "" and  
> Field78 = "" and  Field79 = "" and  Field80 = "" and  Field81 = "" and  
> Field82 = "" and  Field83 = "" and  Field84 = "" and  Field85 = "" and  
> Field86 = "" and  Field87 = "" and  Field88 = "" and  Field89 = "" and  
> Field90 = "" and  Field91 = "" and  Field92 = "" and  Field93 = "" and  
> Field94 = "" and  Field95 = "" and  Field96 = "" and  Field97 = "" and  
> Field98 = "" and  Field99 = "" and  Field100 = "" and  Field101 = "" and  
> Field102 = "" and  Field103 = "" and  Field104 = "" and  Field105 = "" and  
> Field106 = "" and  Field107 = "" and  Field108 = "" and  Field109 = "" and  
> Field110 = "" and  Field111 = "" and  Field112 = "" and  Field113 = "" and  
> Field114 = "" and  Field115 = "" and  Field116 = "" and  Field117 = "" and  
> Field118 = "" and  Field119 = "" and  Field120 = "" and  Field121 = "" and  
> Field122 = "" and  Field123 = "" and  Field124 = "" and  Field125 = "" and  
> Field126 = "" and  Field127 = "" and  Field128 = "" and  Field129 = "" and  
> Field130 = "" and  Field131 = "" and  Field132 = "" and  Field133 = "" and  
> Field134 = "" and  Field135 = "" and  Field136 = "" and  Field137 = "" and  
> Field138 = "" and  Field139 = "" and  Field140 = "" and  Field141 = "" and  
> Field142 = "" and  Field143 = "" and  Field144 = "" and  Field145 = "" and  
> Field146 = "" and  Field147 = "" and  Field148 = "" and  Field149 = "" and  
> Field150 = "" and  Field151 = "" and  Field152 = "" and  Field153 = "" and  
> Field154 = "" and  Field155 = "" and  Field156 = "" and  Field157 = "" and  
> Field158 = "" and  Field159 = "" and  Field160 = "" and  Field161 = "" and  
> Field162 = "" and  Field163 = "" and  Field164 = "" and  Field165 = "" and  
> Field166 = "" and  Field167 = "" and  Field168 = "" and  Field169 = "" and  
> Field170 = "" and  Field171 = "" and  Field172 = "" and  Field173 = "" and  
> Field174 = "" and  Field175 = "" and  Field176 = "" and  Field177 = "" and  
> Field178 = "" and  Field179 = "" and  Field180 = "" and  Field181 = "" and  
> Field182 = "" and  Field183 = "" and  Field184 = "" and  Field185 = "" and  
> Field186 = "" and  Field187 = "" and  Field188 = "" and  Field189 = "" and  
> Field190 = "" and  Field191 = "" and  Field192 = "" and  Field193 = "" and  
> Field194 = "" and  Field195 = "" and  Field196 = "" and  Field197 = "" and  
> Field198 = "" and  Field199 = "" and  Field200 = "" and  Field201 = "" and  
> Field202 = "" and  Field203 = "" and  Field204 = "" and  Field205 = "" and  
> Field206 = "" and  Field207 =

[jira] [Assigned] (SPARK-21791) ORC should support column names with dot

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21791:


Assignee: Apache Spark

> ORC should support column names with dot
> 
>
> Key: SPARK-21791
> URL: https://issues.apache.org/jira/browse/SPARK-21791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> *PARQUET*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.parquet("/tmp/parquet_dot")
> scala> spark.read.parquet("/tmp/parquet_dot").show
> +--------+
> |col.dots|
> +--------+
> |       1|
> |    null|
> +--------+
> {code}
> *ORC*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.orc("/tmp/orc_dot")
> scala> spark.read.orc("/tmp/orc_dot").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '.' expecting ':'(line 1, pos 10)
> == SQL ==
> struct<col.dots:int>
> ----------^^^
> {code}
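
Not part of the report above, but for readers hitting this before a fix lands: a minimal PySpark workaround sketch, assuming it is acceptable to keep the stored ORC column name dot-free and restore the dotted name only after reading. The path and the intermediate name col_dots are illustrative assumptions, not anything from the ticket.

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-dots-workaround").getOrCreate()

# Same data as in the report: one row with 1, one row with null.
df = spark.createDataFrame([(1,), (None,)], ["col.dots"])

# Write under a dot-free alias so the ORC schema never contains the dot.
df.withColumnRenamed("col.dots", "col_dots") \
  .write.mode("overwrite").orc("/tmp/orc_no_dot")

# Read back and restore the original dotted name for downstream code.
restored = (spark.read.orc("/tmp/orc_no_dot")
            .withColumnRenamed("col_dots", "col.dots"))
restored.show()
{code}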



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21779) Simpler Dataset.sample API in Python

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21779:


Assignee: (was: Apache Spark)

> Simpler Dataset.sample API in Python
> 
>
> Key: SPARK-21779
> URL: https://issues.apache.org/jira/browse/SPARK-21779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> See parent ticket.
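
Since the description defers to the parent ticket, a hedged sketch of the kind of simplification being discussed may help: today the PySpark call requires the withReplacement flag, while a simpler form would accept a bare fraction. The fraction-only call is shown commented out because its final signature is the parent ticket's to decide, not something stated here.

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-api-sketch").getOrCreate()
df = spark.range(100)

# Current PySpark API (as of 2.2): withReplacement must be spelled out even
# for the common "give me roughly 10%" case.
sampled_now = df.sample(False, 0.1, 42)
print(sampled_now.count())

# The simpler shape under discussion (illustrative assumption, not the
# committed API): a bare fraction, replacement defaulting to off.
# sampled_simpler = df.sample(0.1, seed=42)
{code}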



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20589) Allow limiting task concurrency per stage

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20589:


Assignee: Apache Spark

> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage. This is useful when your Spark job accesses another service and 
> you don't want to DoS that service, for instance when Spark writes to HBase 
> or issues HTTP PUTs against it. Often you want to do this without limiting 
> the number of partitions. 
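
Until a per-stage limit exists, the usual coarse workaround is to cap concurrency for the whole application by bounding executors and cores, which also shows why that is not good enough. A minimal sketch; the numbers are illustrative, not recommendations:

{code}
from pyspark.sql import SparkSession

# Application-wide cap: at most 4 executors x 2 cores = 8 concurrent tasks,
# for every stage of the job, not just the one talking to the external service.
spark = (SparkSession.builder
         .appName("capped-concurrency")
         .config("spark.dynamicAllocation.enabled", "false")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "2")
         .getOrCreate())
{code}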



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21685:


Assignee: Apache Spark

> Params isSet in scala Transformer triggered by _setDefault in pyspark
> -
>
> Key: SPARK-21685
> URL: https://issues.apache.org/jira/browse/SPARK-21685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Ratan Rai Sur
>Assignee: Apache Spark
>
> I'm trying to write a PySpark wrapper for a Transformer whose transform 
> method includes the line
> {code:java}
> require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both 
> outputNodeName and outputNodeIndex")
> {code}
> This should only throw an exception when both of these parameters are 
> explicitly set.
> In the PySpark wrapper for the Transformer, there is this line in __init__:
> {code:java}
> self._setDefault(outputNodeIndex=0)
> {code}
> Here is the line in the main Python script showing how it is configured:
> {code:java}
> cntkModel = CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark, model.uri).setOutputNodeName("z")
> {code}
> As you can see, only setOutputNodeName is explicitly set, but the exception 
> is still thrown.
> If you need more context, 
> https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the 
> branch with the code. The tracked files I'm referring to here are the 
> following:
> src/cntk-model/src/main/scala/CNTKModel.scala
> notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb
> The pyspark wrapper code is autogenerated
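
For context on why the require() above is expected to pass: on the pure-Python side, a default alone leaves a param defined but not set; the report is that the default's value still reaches the Scala side as if it had been set. A minimal sketch of the Python-side behaviour, using a toy class rather than the CNTKModel wrapper (class and param names here are made up for illustration):

{code}
from pyspark.ml.param import Param, Params

class Demo(Params):
    """Toy Params holder for illustration; not the CNTKModel wrapper."""
    def __init__(self):
        super(Demo, self).__init__()
        self.outputNodeIndex = Param(self, "outputNodeIndex",
                                     "index of the output node")
        self._setDefault(outputNodeIndex=0)

d = Demo()
print(d.isDefined(d.outputNodeIndex))     # True: a default exists
print(d.isSet(d.outputNodeIndex))         # False: never explicitly set
print(d.getOrDefault(d.outputNodeIndex))  # 0
{code}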



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21771) SparkSQLEnv creates a useless meta hive client

2017-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21771:


Assignee: Apache Spark

> SparkSQLEnv creates a useless meta hive client
> --
>
> Key: SPARK-21771
> URL: https://issues.apache.org/jira/browse/SPARK-21771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> Once a meta Hive client is created, it generates its SessionState, which 
> creates a lot of session-related directories, some marked deleteOnExit and 
> some not. If a Hive client is useless, we should not create it at the very 
> start.
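
As an aside for readers unfamiliar with the pattern being suggested: deferring the expensive client until first use is plain lazy initialization. The sketch below only illustrates that pattern in Python (for consistency with the other sketches here); SparkSQLEnv and the Hive client are Scala, and none of these names are Spark's.

{code}
import functools

@functools.lru_cache(maxsize=1)
def meta_client():
    # Stand-in for the expensive Hive metastore client whose SessionState
    # creates the session directories mentioned above.
    print("creating expensive client")
    return object()

def start_env():
    # Startup no longer touches the client at all.
    print("environment started")

start_env()          # cheap: no client, no session directories
c1 = meta_client()   # created lazily, on first real use
c2 = meta_client()   # cached: same client, no second creation
{code}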



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


