[jira] [Updated] (SPARK-19237) SparkR package install stuck when no java is found

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19237:
-
Description: When installing SparkR as an R package (install.packages), it 
will check for a Spark distribution and automatically download and cache it. But 
if there is no Java runtime on the machine, spark-submit will just hang.

> SparkR package install stuck when no java is found
> --
>
> Key: SPARK-19237
> URL: https://issues.apache.org/jira/browse/SPARK-19237
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When installing SparkR as an R package (install.packages), it will check for 
> a Spark distribution and automatically download and cache it. But if there is 
> no Java runtime on the machine, spark-submit will just hang.






[jira] [Created] (SPARK-19237) SparkR package stuck when no java is found

2017-01-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19237:


 Summary: SparkR package stuck when no java is found
 Key: SPARK-19237
 URL: https://issues.apache.org/jira/browse/SPARK-19237
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung









[jira] [Updated] (SPARK-19237) SparkR package install stuck when no java is found

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19237:
-
Summary: SparkR package install stuck when no java is found  (was: SparkR 
package stuck when no java is found)

> SparkR package install stuck when no java is found
> --
>
> Key: SPARK-19237
> URL: https://issues.apache.org/jira/browse/SPARK-19237
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>







[jira] [Created] (SPARK-19236) Add createOrReplaceGlobalTempView

2017-01-15 Thread Arman Yazdani (JIRA)
Arman Yazdani created SPARK-19236:
-

 Summary: Add createOrReplaceGlobalTempView
 Key: SPARK-19236
 URL: https://issues.apache.org/jira/browse/SPARK-19236
 Project: Spark
  Issue Type: Improvement
  Components: Java API, Spark Core, SQL
Reporter: Arman Yazdani
Priority: Minor


There are 3 methods for saving a temp table:
createTempView
createOrReplaceTempView
createGlobalTempView

but there isn't:
createOrReplaceGlobalTempView
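
For illustration, a minimal Scala sketch of the current Dataset API surface and the proposed (not-yet-existing) method; {{df}} is assumed to be an existing Dataset:
{code}
// Existing Dataset methods (Spark 2.1):
df.createTempView("t")            // fails if local temp view "t" already exists
df.createOrReplaceTempView("t")   // replaces "t" if it already exists
df.createGlobalTempView("gt")     // fails if "global_temp.gt" already exists

// Proposed addition (hypothetical at the time of this ticket):
// df.createOrReplaceGlobalTempView("gt")
{code}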






[jira] [Resolved] (SPARK-19082) The config ignoreCorruptFiles doesn't work for Parquet

2017-01-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19082.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.2.0
   2.1.1

> The config ignoreCorruptFiles doesn't work for Parquet
> --
>
> Key: SPARK-19082
> URL: https://issues.apache.org/jira/browse/SPARK-19082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.1, 2.2.0
>
>
> We have a config {{spark.sql.files.ignoreCorruptFiles}} which can be used to 
> ignore corrupt files when reading files in SQL. Currently the 
> {{ignoreCorruptFiles}} config has two issues and can't work for Parquet:
> 1. We only ignore corrupt files in {{FileScanRDD}}. Actually, we begin to 
> read those files as early as inferring the data schema from the files. For 
> corrupt files, we can't read the schema and the program fails. A related issue 
> was reported at 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
> 2. In {{FileScanRDD}}, we assume that we only begin to read the files when 
> starting to consume the iterator. However, it is possible that the files are 
> read before that. In this case, the {{ignoreCorruptFiles}} config doesn't work either.
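
For reference, a minimal usage sketch of the config (assuming an existing SparkSession {{spark}} and an illustrative path):
{code}
// Enable the flag and read a Parquet directory that contains some corrupt files.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.read.parquet("/path/to/parquet/dir")   // illustrative path
df.count()  // before the fix, schema inference could still fail on a corrupt footer
{code}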






[jira] [Assigned] (SPARK-17078) show estimated stats when doing explain

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17078:


Assignee: Apache Spark

> show estimated stats when doing explain
> ---
>
> Key: SPARK-17078
> URL: https://issues.apache.org/jira/browse/SPARK-17078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-17078) show estimated stats when doing explain

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17078:


Assignee: (was: Apache Spark)

> show estimated stats when doing explain
> ---
>
> Key: SPARK-17078
> URL: https://issues.apache.org/jira/browse/SPARK-17078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>







[jira] [Commented] (SPARK-17078) show estimated stats when doing explain

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823567#comment-15823567
 ] 

Apache Spark commented on SPARK-17078:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/16594

> show estimated stats when doing explain
> ---
>
> Key: SPARK-17078
> URL: https://issues.apache.org/jira/browse/SPARK-17078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>







[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823552#comment-15823552
 ] 

Shuai Lin commented on SPARK-19153:
---

[~windpiger] I planned to send a PR today, only to see you had already done that. 
May I suggest you leave a comment before starting to work on a ticket, so we 
don't step on each other's toes?

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19153:


Assignee: Apache Spark

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823520#comment-15823520
 ] 

Apache Spark commented on SPARK-19153:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/16593

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19153:


Assignee: (was: Apache Spark)

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-15 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823517#comment-15823517
 ] 

zhengruifeng commented on SPARK-19208:
--

cc [~josephkb] [~yanboliang]

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.
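
For illustration, a hedged Scala sketch of the single-array idea for {{MaxAbsScaler}} (an RDD-level aggregation written for this note, not the code from the PR):
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Compute only the statistic MaxAbsScaler needs: the element-wise max of absolute values.
def maxAbs(data: RDD[Vector], numFeatures: Int): Array[Double] =
  data.treeAggregate(Array.fill(numFeatures)(0.0))(
    seqOp = (acc, v) => {
      v.foreachActive((i, x) => acc(i) = math.max(acc(i), math.abs(x)))
      acc
    },
    combOp = (a, b) => {
      var i = 0
      while (i < a.length) { a(i) = math.max(a(i), b(i)); i += 1 }
      a
    })
{code}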






[jira] [Updated] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-15 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-19208:
-
Attachment: Tests.pdf

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.






[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-15 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823508#comment-15823508
 ] 

zhengruifeng commented on SPARK-19208:
--

I did tests on a dataset with 6,000,000 instances and 780 features.
For {{MaxAbs}}:
Duration 17.8s -> 13.9s,
Shuffle 11.6M -> 402.1K

For {{MinMax}}:
Duration 16.2s -> 13.8s,
Shuffle 11.6M -> 1946.4K

In general, this modification brings about a 15%~22% speed-up, and the size of 
the shuffled data is about 3%~16% of that of the current implementation.
Detailed information is given in the attached doc.

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.






[jira] [Commented] (SPARK-19235) Enable Test Cases in DDLSuite with Hive Metastore

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823475#comment-15823475
 ] 

Apache Spark commented on SPARK-19235:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16592

> Enable Test Cases in DDLSuite with Hive Metastore
> -
>
> Key: SPARK-19235
> URL: https://issues.apache.org/jira/browse/SPARK-19235
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> So far, the test cases in DDLSuites only verify the behaviors of 
> InMemoryCatalog. That means, they do not cover the scenarios using 
> HiveExternalCatalog. Thus, we need to improve the existing test suite to run 
> these cases using Hive metastore.






[jira] [Assigned] (SPARK-19235) Enable Test Cases in DDLSuite with Hive Metastore

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19235:


Assignee: Apache Spark  (was: Xiao Li)

> Enable Test Cases in DDLSuite with Hive Metastore
> -
>
> Key: SPARK-19235
> URL: https://issues.apache.org/jira/browse/SPARK-19235
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> So far, the test cases in DDLSuites only verify the behaviors of 
> InMemoryCatalog. That means, they do not cover the scenarios using 
> HiveExternalCatalog. Thus, we need to improve the existing test suite to run 
> these cases using Hive metastore.






[jira] [Assigned] (SPARK-19235) Enable Test Cases in DDLSuite with Hive Metastore

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19235:


Assignee: Xiao Li  (was: Apache Spark)

> Enable Test Cases in DDLSuite with Hive Metastore
> -
>
> Key: SPARK-19235
> URL: https://issues.apache.org/jira/browse/SPARK-19235
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> So far, the test cases in DDLSuites only verify the behaviors of 
> InMemoryCatalog. That means, they do not cover the scenarios using 
> HiveExternalCatalog. Thus, we need to improve the existing test suite to run 
> these cases using Hive metastore.






[jira] [Created] (SPARK-19235) Enable Test Cases in DDLSuite with Hive Metastore

2017-01-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19235:
---

 Summary: Enable Test Cases in DDLSuite with Hive Metastore
 Key: SPARK-19235
 URL: https://issues.apache.org/jira/browse/SPARK-19235
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


So far, the test cases in DDLSuites only verify the behaviors of 
InMemoryCatalog. That means, they do not cover the scenarios using 
HiveExternalCatalog. Thus, we need to improve the existing test suite to run 
these cases using Hive metastore.







[jira] [Updated] (SPARK-19234) AFTSurvivalRegression chokes silently or with confusing errors when any labels are zero

2017-01-15 Thread Andrew MacKinlay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew MacKinlay updated SPARK-19234:
-
Description: 
If you try and use AFTSurvivalRegression and any label in your input data is 
0.0, you get coefficients of 0.0 returned, and in many cases, errors like this:

{{17/01/16 15:10:50 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to NaN}}

Zero should, I think, be an allowed value for survival analysis. I don't know 
if this is a pathological case for AFT specifically as I don't know enough 
about it, but this behaviour is clearly undesirable. If you have any labels of 
0.0, you get either a) obscure error messages, with no knowledge of the cause 
and coefficients which are all zero, or b) no error messages at all and 
coefficients of zero (arguably worse, since you don't even have console output 
to tell you something's gone awry). If AFT doesn't work with zero-valued 
labels, Spark should fail fast and let the developer know why. If it does, we 
should get results here.


  was:
If you try and use AFTSurvivalRegression and any label in your input data is 
0.0, you get coefficients of 0.0 returned, and in many cases, errors like this:

{{17/01/16 15:10:50 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to NaN}}




> AFTSurvivalRegression chokes silently or with confusing errors when any 
> labels are zero
> ---
>
> Key: SPARK-19234
> URL: https://issues.apache.org/jira/browse/SPARK-19234
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
> Environment: spark-shell or pyspark
>Reporter: Andrew MacKinlay
> Attachments: spark-aft-failure.txt
>
>
> If you try and use AFTSurvivalRegression and any label in your input data is 
> 0.0, you get coefficients of 0.0 returned, and in many cases, errors like 
> this:
> {{17/01/16 15:10:50 ERROR StrongWolfeLineSearch: Encountered bad values in 
> function evaluation. Decreasing step size to NaN}}
> Zero should, I think, be an allowed value for survival analysis. I don't know 
> if this is a pathological case for AFT specifically as I don't know enough 
> about it, but this behaviour is clearly undesirable. If you have any labels 
> of 0.0, you get either a) obscure error messages, with no knowledge of the 
> cause and coefficients which are all zero, or b) no error messages at all and 
> coefficients of zero (arguably worse, since you don't even have console 
> output to tell you something's gone awry). If AFT doesn't work with 
> zero-valued labels, Spark should fail fast and let the developer know why. If 
> it does, we should get results here.
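
A minimal, hypothetical repro sketch in Scala (tiny made-up dataset; an existing SparkSession {{spark}} is assumed):
{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression

// One 0.0 label in an otherwise ordinary AFT dataset (values are made up).
val data = spark.createDataFrame(Seq(
  (0.0,   1.0, Vectors.dense(1.560, -0.605)),  // the zero-valued label that triggers the problem
  (4.199, 0.0, Vectors.dense(0.346,  2.158)),
  (2.749, 1.0, Vectors.dense(1.380,  0.231))
)).toDF("label", "censor", "features")

val model = new AFTSurvivalRegression().fit(data)
println(model.coefficients)  // reported to come back as all zeros, with StrongWolfeLineSearch errors in the log
{code}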






[jira] [Updated] (SPARK-19234) AFTSurvivalRegression chokes silently or with confusing errors when any labels are zero

2017-01-15 Thread Andrew MacKinlay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew MacKinlay updated SPARK-19234:
-
Attachment: spark-aft-failure.txt

Shows the failure case (a single zero-valued label, causing many errors and 
coefficients of zero) and the success case (replacing 0 with 0.001 makes 
everything work as expected).

> AFTSurvivalRegression chokes silently or with confusing errors when any 
> labels are zero
> ---
>
> Key: SPARK-19234
> URL: https://issues.apache.org/jira/browse/SPARK-19234
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
> Environment: spark-shell or pyspark
>Reporter: Andrew MacKinlay
> Attachments: spark-aft-failure.txt
>
>
> If you try and use AFTSurvivalRegression and any label in your input data is 
> 0.0, you get coefficients of 0.0 returned, and in many cases, errors like 
> this:
> {{17/01/16 15:10:50 ERROR StrongWolfeLineSearch: Encountered bad values in 
> function evaluation. Decreasing step size to NaN}}






[jira] [Created] (SPARK-19234) AFTSurvivalRegression chokes silently or with confusing errors when any labels are zero

2017-01-15 Thread Andrew MacKinlay (JIRA)
Andrew MacKinlay created SPARK-19234:


 Summary: AFTSurvivalRegression chokes silently or with confusing 
errors when any labels are zero
 Key: SPARK-19234
 URL: https://issues.apache.org/jira/browse/SPARK-19234
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
 Environment: spark-shell or pyspark
Reporter: Andrew MacKinlay


If you try and use AFTSurvivalRegression and any label in your input data is 
0.0, you get coefficients of 0.0 returned, and in many cases, errors like this:

{{17/01/16 15:10:50 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to NaN}}








[jira] [Comment Edited] (SPARK-19217) Offer easy cast from vector to array

2017-01-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823439#comment-15823439
 ] 

Hyukjin Kwon edited comment on SPARK-19217 at 1/16/17 4:18 AM:
---

To my knowledge, the vectors in SQL become a UDT, {{VectorUDT}}. AFAIK, we 
don't currently support explicit/implicit casts for UDTs in expressions via 
`sqlType`. I saw several JIRAs related to this.

FWIW, data sources such as ORC, JSON and Parquet support reading/writing this 
via {{udt.sqlType}}, IIRC.


was (Author: hyukjin.kwon):
Up to my knowledge, the vectors in SQL become a udf, {{VectorUDT}}. AFAIK, we 
don't currently support explicit/implicit cast for udt in expressions.

FWIW, data sources such as ORC, JSON and Parquet support to read/write this via 
using {{udt.sqlType}} IIRC.

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage without converting the vector columns 
> to array columns, and there doesn't appear to be an easy way to make that 
> conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?
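
For reference, a minimal Scala sketch of the UDF workaround mentioned above ({{df}} and the column name are illustrative):
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Convert the ML Vector column to an array<double> column with a UDF.
val vectorToArray = udf { v: Vector => v.toArray }
val converted = df.withColumn("features", vectorToArray(col("features")))
converted.write.parquet("/tmp/features_as_arrays")   // illustrative path; now writable as plain arrays
{code}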






[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2017-01-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823439#comment-15823439
 ] 

Hyukjin Kwon commented on SPARK-19217:
--

To my knowledge, the vectors in SQL become a UDT, {{VectorUDT}}. AFAIK, we 
don't currently support explicit/implicit casts for UDTs in expressions.

FWIW, data sources such as ORC, JSON and Parquet support reading/writing this 
via {{udt.sqlType}}, IIRC.

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage without converting the vector columns 
> to array columns, and there doesn't appear to be an easy way to make that 
> conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Commented] (SPARK-19222) Limit Query Performance issue

2017-01-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823427#comment-15823427
 ] 

Hyukjin Kwon commented on SPARK-19222:
--

(I simply inserted \{code\} ... \{code\} in the description, just for 
readability.)

> Limit Query Performance issue
> -
>
> Key: SPARK-19222
> URL: https://issues.apache.org/jira/browse/SPARK-19222
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Linux/Windows
>Reporter: Sujith
>Priority: Minor
>
> Performance/memory bottle neck occurs in the below mentioned query
> case 1:
> {code}
> create table t1 as select * from dest1 limit 1000;
> {code}
> case 2:
> {code}
> create table t1 as select * from dest1 limit 1000;
> pre-condition : partition count >=1
> {code}
> In above cases limit is being added in the terminal of the physical plan 
> {code}
> == Physical Plan  ==
> ExecutedCommand
>+- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, 
> InsertIntoHiveTable]
>  +- GlobalLimit 1000
> +- LocalLimit 1000
>+- Project [imei#101, age#102, task#103L, num#104, level#105, 
> productdate#106, name#107, point#108]
>   +- SubqueryAlias hive
>  +- 
> Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
>  csv  |
> {code}
> Issue Hints: 
> Possible Bottleneck snippet in limit.scala file under spark-sql package.
> {code}
>   protected override def doExecute(): RDD[InternalRow] = {
> val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
> val shuffled = new ShuffledRowRDD(
>   ShuffleExchange.prepareShuffleDependency(
> locallyLimited, child.output, SinglePartition, serializer))
> shuffled.mapPartitionsInternal(_.take(limit))
>   }
> {code}
> As mentioned in above case 1  (where limit value is 1000 or partition 
> count is > 1) and case 2(limit value is small(around 1000)), As per the 
> above snippet when the {{ShuffledRowRDD}}
> is created by grouping all the limit data from different partitions to a 
> single partition in executer,  memory issue occurs since all the partition 
> limit data will be collected and 
> grouped  in a single partition for processing, in both former/later case the 
> data count  can go very high which can create the memory bottleneck.
> Proposed solution for case 2:
> An accumulator value can be to send to all partitions, all executor will be 
> updating the accumulator value based on the  data fetched , 
> eg: Number of partition = 100, number of cores =10
> Ideally tasks will be launched in a group of 10 task/core, once the first 
> group finishes the tasks driver will check whether the accumulator value is 
> been reached the limit value if its reached then no further tasks will be 
> launched to executors and the result after applying limit will be returned.
> Please let me now for any suggestions or solutions for the above mentioned 
> problems
> Thanks,
> Sujith
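
For illustration, a hedged RDD-level sketch of the incremental idea behind the proposal (the helper and its names are hypothetical, not Spark's actual limit execution path): scan a few partitions at a time and stop once enough rows have been collected, instead of shuffling every partition's local limit into one partition.
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: collect up to `limit` rows by scanning partitions in batches,
// stopping early once the limit is reached so most partitions are never touched.
def takeIncremental[T: ClassTag](rdd: RDD[T], limit: Int, batchSize: Int = 10): Array[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var start = 0
  val numParts = rdd.getNumPartitions
  while (buf.size < limit && start < numParts) {
    val parts = start until math.min(start + batchSize, numParts)
    val remaining = limit - buf.size
    val results = rdd.sparkContext.runJob(
      rdd, (it: Iterator[T]) => it.take(remaining).toArray, parts)
    results.foreach(buf ++= _)
    start += batchSize
  }
  buf.take(limit).toArray
}
{code}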






[jira] [Updated] (SPARK-19222) Limit Query Performance issue

2017-01-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19222:
-
Description: 
A performance/memory bottleneck occurs in the queries mentioned below.
case 1:
{code}
create table t1 as select * from dest1 limit 1000;
{code}
case 2:
{code}
create table t1 as select * from dest1 limit 1000;
pre-condition : partition count >= 1
{code}
In the above cases the limit is added at the end of the physical plan:

{code}
== Physical Plan ==
ExecutedCommand
   +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, InsertIntoHiveTable]
         +- GlobalLimit 1000
            +- LocalLimit 1000
               +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108]
                  +- SubqueryAlias hive
                     +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv
{code}
Issue hints: 

Possible bottleneck snippet in limit.scala under the spark-sql package:
{code}
  protected override def doExecute(): RDD[InternalRow] = {
    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
    val shuffled = new ShuffledRowRDD(
      ShuffleExchange.prepareShuffleDependency(
        locallyLimited, child.output, SinglePartition, serializer))
    shuffled.mapPartitionsInternal(_.take(limit))
  }
{code}

In both case 1 (where the limit value is 1000 or the partition count is > 1) and 
case 2 (where the limit value is small, around 1000), the snippet above creates 
the {{ShuffledRowRDD}} by grouping the locally limited data from every partition 
into a single partition on one executor. A memory issue occurs because all of the 
per-partition limit data is collected and grouped into that single partition for 
processing; in either case the row count can get very high, which creates the 
memory bottleneck.

Proposed solution for case 2:
An accumulator can be sent to all partitions, and every executor updates the 
accumulator based on the data it has fetched.
e.g. number of partitions = 100, number of cores = 10
Ideally tasks are launched in groups of 10 (one task per core). Once the first 
group finishes, the driver checks whether the accumulator has reached the limit 
value; if it has, no further tasks are launched on the executors and the result, 
after applying the limit, is returned.

Please let me know of any suggestions or solutions for the above mentioned 
problems.

Thanks,
Sujith

  was:
Performance/memory bottle neck occurs in the below mentioned query
case 1:
create table t1 as select * from dest1 limit 1000;
case 2:
create table t1 as select * from dest1 limit 1000;
pre-condition : partition count >=1
(It'd be great if the code blocks are wrapped with {{ {code} {code} }}
In above cases limit is being added in the terminal of the physical plan 

== Physical Plan  ==
ExecutedCommand
   +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, 
InsertIntoHiveTable]
 +- GlobalLimit 1000
+- LocalLimit 1000
   +- Project [imei#101, age#102, task#103L, num#104, level#105, 
productdate#106, name#107, point#108]
  +- SubqueryAlias hive
 +- 
Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
 csv  |
Issue Hints: 

Possible Bottleneck snippet in limit.scala file under spark-sql package.
  protected override def doExecute(): RDD[InternalRow] = {
val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
val shuffled = new ShuffledRowRDD(
  ShuffleExchange.prepareShuffleDependency(
locallyLimited, child.output, SinglePartition, serializer))
shuffled.mapPartitionsInternal(_.take(limit))
  }

As mentioned in above case 1  (where limit value is 1000 or partition count 
is > 1) and case 2(limit value is small(around 1000)), As per the above 
snippet when the ShuffledRowRDD
is created by grouping all the limit data from different partitions to a single 
partition in executer,  memory issue occurs since all the partition limit data 
will be collected and 
grouped  in a single partition for processing, in both former/later case the 
data count  can go very high which can create the memory bottleneck.

Proposed solution for case 2:
An accumulator value can be to send to all partitions, all executor will be 
updating the accumulator value based on the  data fetched , 
eg: Number of partition = 100, number of cores =10
Ideally tasks will be launched in a group of 10 task/core, once the first group 
finishes the tasks driver will check whether the accumulator value is been 
reached the limit value if its reached then no further tasks will be launched 
to executors and the result after applying limit will be returned.

Please let me now for 

[jira] [Updated] (SPARK-19222) Limit Query Performance issue

2017-01-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19222:
-
Description: 
Performance/memory bottle neck occurs in the below mentioned query
case 1:
create table t1 as select * from dest1 limit 1000;
case 2:
create table t1 as select * from dest1 limit 1000;
pre-condition : partition count >=1
(It'd be great if the code blocks are wrapped with {{ {code} {code} }}
In above cases limit is being added in the terminal of the physical plan 

== Physical Plan  ==
ExecutedCommand
   +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, 
InsertIntoHiveTable]
 +- GlobalLimit 1000
+- LocalLimit 1000
   +- Project [imei#101, age#102, task#103L, num#104, level#105, 
productdate#106, name#107, point#108]
  +- SubqueryAlias hive
 +- 
Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
 csv  |
Issue Hints: 

Possible Bottleneck snippet in limit.scala file under spark-sql package.
  protected override def doExecute(): RDD[InternalRow] = {
val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
val shuffled = new ShuffledRowRDD(
  ShuffleExchange.prepareShuffleDependency(
locallyLimited, child.output, SinglePartition, serializer))
shuffled.mapPartitionsInternal(_.take(limit))
  }

As mentioned in above case 1  (where limit value is 1000 or partition count 
is > 1) and case 2(limit value is small(around 1000)), As per the above 
snippet when the ShuffledRowRDD
is created by grouping all the limit data from different partitions to a single 
partition in executer,  memory issue occurs since all the partition limit data 
will be collected and 
grouped  in a single partition for processing, in both former/later case the 
data count  can go very high which can create the memory bottleneck.

Proposed solution for case 2:
An accumulator value can be to send to all partitions, all executor will be 
updating the accumulator value based on the  data fetched , 
eg: Number of partition = 100, number of cores =10
Ideally tasks will be launched in a group of 10 task/core, once the first group 
finishes the tasks driver will check whether the accumulator value is been 
reached the limit value if its reached then no further tasks will be launched 
to executors and the result after applying limit will be returned.

Please let me now for any suggestions or solutions for the above mentioned 
problems

Thanks,
Sujith

  was:
Performance/memory bottle neck occurs in the below mentioned query
case 1:
create table t1 as select * from dest1 limit 1000;
case 2:
create table t1 as select * from dest1 limit 1000;
pre-condition : partition count >=1

In above cases limit is being added in the terminal of the physical plan 

== Physical Plan  ==
ExecutedCommand
   +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, 
InsertIntoHiveTable]
 +- GlobalLimit 1000
+- LocalLimit 1000
   +- Project [imei#101, age#102, task#103L, num#104, level#105, 
productdate#106, name#107, point#108]
  +- SubqueryAlias hive
 +- 
Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
 csv  |
Issue Hints: 

Possible Bottleneck snippet in limit.scala file under spark-sql package.
  protected override def doExecute(): RDD[InternalRow] = {
val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
val shuffled = new ShuffledRowRDD(
  ShuffleExchange.prepareShuffleDependency(
locallyLimited, child.output, SinglePartition, serializer))
shuffled.mapPartitionsInternal(_.take(limit))
  }

As mentioned in above case 1  (where limit value is 1000 or partition count 
is > 1) and case 2(limit value is small(around 1000)), As per the above 
snippet when the ShuffledRowRDD
is created by grouping all the limit data from different partitions to a single 
partition in executer,  memory issue occurs since all the partition limit data 
will be collected and 
grouped  in a single partition for processing, in both former/later case the 
data count  can go very high which can create the memory bottleneck.

Proposed solution for case 2:
An accumulator value can be to send to all partitions, all executor will be 
updating the accumulator value based on the  data fetched , 
eg: Number of partition = 100, number of cores =10
Ideally tasks will be launched in a group of 10 task/core, once the first group 
finishes the tasks driver will check whether the accumulator value is been 
reached the limit value if its reached then no further tasks will be launched 
to executors and the result after applying limit will be returned.

Please let me now for any suggestions or solutions for the above mentioned 

[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-15 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823422#comment-15823422
 ] 

zhengruifeng commented on SPARK-19208:
--

The code in {{MinMaxScaler}} is copied from {{MultivariateOnlineSummarizer}}, 
while the code in {{MaxAbsScaler}} is not, because we don't need to compute 
min/max (three arrays needed); instead we only need to maintain the maximum of 
the absolute values.
BTW, I found that the same issue also exists in {{StandardScaler}}.
I agree that it would be a good idea for {{MultivariateOnlineSummarizer}} to 
support computing a subset of the metrics.
I will do some performance tests on low-dimensionality datasets.

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.






[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2017-01-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823419#comment-15823419
 ] 

Hyukjin Kwon commented on SPARK-19228:
--

Yes, inferring {{DateType}} is currently not supported (we don't have 
{{tryParseDate}}). As we now have two different options, {{dateFormat}} for 
{{DateType}} and {{timestampFormat}} for {{TimestampType}}, which were previously 
unified in the single {{dateFormat}} option, I think it'd make sense to make this 
possible by introducing {{tryParseDate}}.
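
A rough, hypothetical sketch of what such a helper could look like (the name and signature are assumptions mirroring the existing tryParse* chain in CSVInferSchema, not the actual implementation):
{code}
import java.text.SimpleDateFormat
import scala.util.Try
import org.apache.spark.sql.types.{DataType, DateType}

// Try DateType with the configured dateFormat first; fall through to the next
// candidate type (e.g. tryParseTimestamp) when the field does not parse as a date.
def tryParseDate(field: String, dateFormat: String, next: String => DataType): DataType =
  if (Try(new SimpleDateFormat(dateFormat).parse(field)).isSuccess) DateType
  else next(field)
{code}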

> inferSchema function processed csv date column as string and "dateFormat" 
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.0
>Reporter: Sergey Rubtsov
>  Labels: easyfix
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> I need to process user.csv like this:
> {code}
> id,project,started,ended
> sergey.rubtsov,project0,12/12/2012,10/10/2015
> {code}
> When I add date format options:
> {code}
> Dataset<Row> users = spark.read().format("csv")
>     .option("mode", "PERMISSIVE")
>     .option("header", "true")
>     .option("inferSchema", "true")
>     .option("dateFormat", "dd/MM/yyyy")
>     .load("src/main/resources/user.csv");
>   users.printSchema();
> {code}
> expected scheme should be 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: date (nullable = true)
>  |-- ended: date (nullable = true)
> {code}
> but the actual result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: string (nullable = true)
>  |-- ended: string (nullable = true)
> {code}
> This means that the date is processed as a string and the "dateFormat" 
> option is ignored.
> If I add option 
> {code}
> .option("timestampFormat", "dd/MM/")
> {code}
> result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: timestamp (nullable = true)
>  |-- ended: timestamp (nullable = true)
> {code}
> I think the issue is somewhere in object CSVInferSchema, function 
> inferField, lines 80-97: a "tryParseDate" method needs to be added 
> before/after "tryParseTimestamp", or the date/timestamp processing logic 
> needs to be changed.






[jira] [Updated] (SPARK-19092) Save() API of DataFrameWriter should not scan all the saved files

2017-01-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19092:

Fix Version/s: 2.1.1

> Save() API of DataFrameWriter should not scan all the saved files
> -
>
> Key: SPARK-19092
> URL: https://issues.apache.org/jira/browse/SPARK-19092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.1, 2.2.0
>
>
> `DataFrameWriter`'s save() API performs an unnecessary full filesystem 
> scan of the saved files. The save() API is the most basic/core API in 
> `DataFrameWriter`. We should avoid this unnecessary file scan.






[jira] [Comment Edited] (SPARK-19222) Limit Query Performance issue

2017-01-15 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823389#comment-15823389
 ] 

Yadong Qi edited comment on SPARK-19222 at 1/16/17 2:56 AM:


Hi [~maropu], sample means `TABLESAMPLE(x ROWS)` or `TABLESAMPLE(x PERCENT)`; 
the physical plan of `TABLESAMPLE(x ROWS)` is the same as LIMIT, so I think you 
mean `TABLESAMPLE(x PERCENT)`. A user's query looks like `create table t1 as 
select * from dest1 where phoneNum = 'xxx' limit 1000`, and they want to get as 
close to 1000 records as possible; table t1 will be analyzed later. We don't 
know the number of records in the subquery `select * from dest1 where phoneNum = 
'xxx'`, so we can't know the percentage.
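
For context, a small sketch of the two sampling forms being discussed (illustrative queries run through the SQL API; an existing SparkSession {{spark}} is assumed):
{code}
// Row-based sampling: plans essentially like a LIMIT.
spark.sql("SELECT * FROM dest1 TABLESAMPLE (1000 ROWS)")

// Percent-based sampling: to target a fixed number of rows you would need to
// know the subquery's row count up front, which is the problem described above.
spark.sql("SELECT * FROM dest1 TABLESAMPLE (10 PERCENT)")
{code}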


was (Author: waterman):
Hi [~maropu], sample means `TABLESAMPLE(x ROWS)` or `TABLESAMPLE(x PERCENT)`, 
the physical of `TABLESAMPLE(x ROWS)` is same to limit, so I think you mean 
`TABLESAMPLE(x PERCENT)`. User's query like `create table t1 as select * from 
dest1 where phoneNum = 'xxx' limit 1000` and want to get 1000 records 
as more as possible, table t1 will be analyzed later. We don't know the number 
of records about the subquery `select * from dest1 where phoneNum = 'xxx'`, so 
we can't know the percent.

> Limit Query Performance issue
> -
>
> Key: SPARK-19222
> URL: https://issues.apache.org/jira/browse/SPARK-19222
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Linux/Windows
>Reporter: Sujith
>Priority: Minor
>
> Performance/memory bottle neck occurs in the below mentioned query
> case 1:
> create table t1 as select * from dest1 limit 1000;
> case 2:
> create table t1 as select * from dest1 limit 1000;
> pre-condition : partition count >=1
> In above cases limit is being added in the terminal of the physical plan 
> == Physical Plan  ==
> ExecutedCommand
>+- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, 
> InsertIntoHiveTable]
>  +- GlobalLimit 1000
> +- LocalLimit 1000
>+- Project [imei#101, age#102, task#103L, num#104, level#105, 
> productdate#106, name#107, point#108]
>   +- SubqueryAlias hive
>  +- 
> Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
>  csv  |
> Issue Hints: 
> Possible Bottleneck snippet in limit.scala file under spark-sql package.
>   protected override def doExecute(): RDD[InternalRow] = {
> val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
> val shuffled = new ShuffledRowRDD(
>   ShuffleExchange.prepareShuffleDependency(
> locallyLimited, child.output, SinglePartition, serializer))
> shuffled.mapPartitionsInternal(_.take(limit))
>   }
> As mentioned in above case 1  (where limit value is 1000 or partition 
> count is > 1) and case 2(limit value is small(around 1000)), As per the 
> above snippet when the ShuffledRowRDD
> is created by grouping all the limit data from different partitions to a 
> single partition in executer,  memory issue occurs since all the partition 
> limit data will be collected and 
> grouped  in a single partition for processing, in both former/later case the 
> data count  can go very high which can create the memory bottleneck.
> Proposed solution for case 2:
> An accumulator value can be to send to all partitions, all executor will be 
> updating the accumulator value based on the  data fetched , 
> eg: Number of partition = 100, number of cores =10
> Ideally tasks will be launched in a group of 10 task/core, once the first 
> group finishes the tasks driver will check whether the accumulator value is 
> been reached the limit value if its reached then no further tasks will be 
> launched to executors and the result after applying limit will be returned.
> Please let me now for any suggestions or solutions for the above mentioned 
> problems
> Thanks,
> Sujith






[jira] [Commented] (SPARK-19222) Limit Query Performance issue

2017-01-15 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823389#comment-15823389
 ] 

Yadong Qi commented on SPARK-19222:
---

Hi [~maropu], sample means `TABLESAMPLE(x ROWS)` or `TABLESAMPLE(x PERCENT)`; 
the physical plan of `TABLESAMPLE(x ROWS)` is the same as LIMIT, so I think you 
mean `TABLESAMPLE(x PERCENT)`. A user's query looks like `create table t1 as 
select * from dest1 where phoneNum = 'xxx' limit 1000`, and they want to get as 
close to 1000 records as possible; table t1 will be analyzed later. We don't 
know the number of records in the subquery `select * from dest1 where phoneNum = 
'xxx'`, so we can't know the percentage.

> Limit Query Performance issue
> -
>
> Key: SPARK-19222
> URL: https://issues.apache.org/jira/browse/SPARK-19222
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Linux/Windows
>Reporter: Sujith
>Priority: Minor
>
> Performance/memory bottle neck occurs in the below mentioned query
> case 1:
> create table t1 as select * from dest1 limit 1000;
> case 2:
> create table t1 as select * from dest1 limit 1000;
> pre-condition : partition count >=1
> In above cases limit is being added in the terminal of the physical plan 
> == Physical Plan  ==
> ExecutedCommand
>+- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, 
> InsertIntoHiveTable]
>  +- GlobalLimit 1000
> +- LocalLimit 1000
>+- Project [imei#101, age#102, task#103L, num#104, level#105, 
> productdate#106, name#107, point#108]
>   +- SubqueryAlias hive
>  +- 
> Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
>  csv  |
> Issue Hints: 
> Possible Bottleneck snippet in limit.scala file under spark-sql package.
>   protected override def doExecute(): RDD[InternalRow] = {
> val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
> val shuffled = new ShuffledRowRDD(
>   ShuffleExchange.prepareShuffleDependency(
> locallyLimited, child.output, SinglePartition, serializer))
> shuffled.mapPartitionsInternal(_.take(limit))
>   }
> As mentioned in case 1 above (where the limit value is 1000 or the partition 
> count is > 1) and case 2 (where the limit value is small, around 1000), as per 
> the above snippet, when the ShuffledRowRDD is created by grouping all the limit 
> data from different partitions into a single partition on the executor, a 
> memory issue occurs since all the per-partition limit data is collected and 
> grouped in a single partition for processing; in both the former and the latter 
> case the data count can grow very high, which creates the memory bottleneck.
> Proposed solution for case 2:
> An accumulator can be shared with all partitions; every executor updates the 
> accumulator based on the amount of data it has fetched.
> e.g.: number of partitions = 100, number of cores = 10
> Ideally tasks are launched in waves of 10 tasks (one per core). Once the first 
> wave finishes, the driver checks whether the accumulator has reached the limit 
> value; if it has, no further tasks are launched to the executors and the result 
> after applying the limit is returned.
> Please let me know of any suggestions or solutions for the above mentioned 
> problems
> Thanks,
> Sujith
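
For illustration, here is a minimal driver-side sketch of the wave-by-wave idea 
described above. It assumes a plain RDD, a SparkSession named `spark`, and 
illustrative names (`coresPerWave`, `rowsCollected`); it is not Spark's actual 
limit operator, and it ignores ordering and fault tolerance.

{code}
import scala.reflect.ClassTag
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object IncrementalLimitSketch {
  // Launch partitions wave by wave; stop scheduling new waves once the
  // accumulator reports that enough rows have been collected.
  def collectUpToLimit[T: ClassTag](
      spark: SparkSession,
      rdd: RDD[T],
      limit: Int,
      coresPerWave: Int = 10): Array[T] = {
    val sc = spark.sparkContext
    val rowsCollected = sc.longAccumulator("rowsCollected")
    val results = ArrayBuffer[T]()
    rdd.partitions.map(_.index).grouped(coresPerWave).foreach { wave =>
      if (rowsCollected.value < limit) {
        val waveResults = sc.runJob(rdd, (it: Iterator[T]) => {
          val rows = it.take(limit).toArray // each task keeps at most `limit` rows
          rowsCollected.add(rows.length)    // report progress back to the driver
          rows
        }, wave)
        results ++= waveResults.flatten
      }
    }
    results.take(limit).toArray             // final trim on the driver
  }
}
{code}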



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19227) Typo in `org.apache.spark.internal.config.ConfigEntry`

2017-01-15 Thread Biao Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823366#comment-15823366
 ] 

Biao Ma commented on SPARK-19227:
-

It's not an outdated comment; it simply refers to its subclass.

> Typo  in `org.apache.spark.internal.config.ConfigEntry`
> ---
>
> Key: SPARK-19227
> URL: https://issues.apache.org/jira/browse/SPARK-19227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Biao Ma
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.1
>
>
> The parameter `defaultValue` does not exist in class 
> `org.apache.spark.internal.config.ConfigEntry`; it is `_defaultValue` in its 
> subclass `ConfigEntryWithDefault`. Also, there are some unused imports. 
> Should we modify this class?
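
For readers following along, a simplified illustration of the naming pattern 
being described; these are illustrative classes, not Spark's actual source. The 
base class only exposes a `defaultValue` accessor, while the concrete subclass 
carries the constructor parameter `_defaultValue`.

{code}
// Illustrative only: mirrors the pattern discussed in this ticket.
abstract class ConfigEntrySketch[T](val key: String) {
  def defaultValue: Option[T]   // the base class has no `defaultValue` parameter
  def defaultValueString: String = defaultValue.map(_.toString).getOrElse("<undefined>")
}

class ConfigEntryWithDefaultSketch[T](key: String, _defaultValue: T)
    extends ConfigEntrySketch[T](key) {
  // The actual default lives in the subclass, stored as `_defaultValue`.
  override def defaultValue: Option[T] = Some(_defaultValue)
}
{code}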



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19233) Inconsistent Behaviour of Spark Streaming Checkpoint

2017-01-15 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823364#comment-15823364
 ] 

Nan Zhu commented on SPARK-19233:
-

[~zsxwing] so, another potential issue I found in Spark Streaming recently, if 
you agree on this...I will file a PR

> Inconsistent Behaviour of Spark Streaming Checkpoint
> 
>
> Key: SPARK-19233
> URL: https://issues.apache.org/jira/browse/SPARK-19233
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nan Zhu
>
> When checking one of our application logs, we found the following behavior 
> (simplified)
> 1. Spark application recovers from checkpoint constructed at timestamp 1000ms
> 2. The log shows that Spark application can recover RDDs generated at 
> timestamp 2000, 3000
> The root cause is that generateJobs event is pushed to the queue by a 
> separate thread (RecurTimer), before doCheckpoint event is pushed to the 
> queue, there might have been multiple generatedJobs being processed. As a 
> result, when doCheckpoint for timestamp 1000 is processed, the generatedRDDs 
> data structure containing RDDs generated at 2000, 3000 is serialized as part 
> of checkpoint of 1000.
> It brings overhead for debugging and for coordinating our offset management 
> strategy with Spark Streaming's checkpoint strategy when we are developing a 
> new type of DStream which integrates Spark Streaming with an internal message 
> middleware.
> The proposed fix is to filter generatedRDDs according to checkpoint timestamp 
> when serializing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19233) Inconsistent Behaviour of Spark Streaming Checkpoint

2017-01-15 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823359#comment-15823359
 ] 

Nan Zhu commented on SPARK-19233:
-

The category of this issue is Improvement, which is subject to revision.

> Inconsistent Behaviour of Spark Streaming Checkpoint
> 
>
> Key: SPARK-19233
> URL: https://issues.apache.org/jira/browse/SPARK-19233
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nan Zhu
>
> When checking one of our application logs, we found the following behavior 
> (simplified)
> 1. Spark application recovers from checkpoint constructed at timestamp 1000ms
> 2. The log shows that Spark application can recover RDDs generated at 
> timestamp 2000, 3000
> The root cause is that generateJobs event is pushed to the queue by a 
> separate thread (RecurTimer), before doCheckpoint event is pushed to the 
> queue, there might have been multiple generatedJobs being processed. As a 
> result, when doCheckpoint for timestamp 1000 is processed, the generatedRDDs 
> data structure containing RDDs generated at 2000, 3000 is serialized as part 
> of checkpoint of 1000.
> It brings overhead for debugging and for coordinating our offset management 
> strategy with Spark Streaming's checkpoint strategy when we are developing a 
> new type of DStream which integrates Spark Streaming with an internal message 
> middleware.
> The proposed fix is to filter generatedRDDs according to checkpoint timestamp 
> when serializing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19227) Typo in `org.apache.spark.internal.config.ConfigEntry`

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19227:


Assignee: Apache Spark

> Typo  in `org.apache.spark.internal.config.ConfigEntry`
> ---
>
> Key: SPARK-19227
> URL: https://issues.apache.org/jira/browse/SPARK-19227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Biao Ma
>Assignee: Apache Spark
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.1
>
>
> The parameter `defaultValue` does not exist in class 
> `org.apache.spark.internal.config.ConfigEntry`; it is `_defaultValue` in its 
> subclass `ConfigEntryWithDefault`. Also, there are some unused imports. 
> Should we modify this class?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19227) Typo in `org.apache.spark.internal.config.ConfigEntry`

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19227:


Assignee: (was: Apache Spark)

> Typo  in `org.apache.spark.internal.config.ConfigEntry`
> ---
>
> Key: SPARK-19227
> URL: https://issues.apache.org/jira/browse/SPARK-19227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Biao Ma
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.1
>
>
> The parameter `defaultValue` does not exist in class 
> `org.apache.spark.internal.config.ConfigEntry`; it is `_defaultValue` in its 
> subclass `ConfigEntryWithDefault`. Also, there are some unused imports. 
> Should we modify this class?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19227) Typo in `org.apache.spark.internal.config.ConfigEntry`

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823356#comment-15823356
 ] 

Apache Spark commented on SPARK-19227:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16591

> Typo  in `org.apache.spark.internal.config.ConfigEntry`
> ---
>
> Key: SPARK-19227
> URL: https://issues.apache.org/jira/browse/SPARK-19227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Biao Ma
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.1
>
>
> The parameter `defaultValue` does not exist in class 
> `org.apache.spark.internal.config.ConfigEntry`; it is `_defaultValue` in its 
> subclass `ConfigEntryWithDefault`. Also, there are some unused imports. 
> Should we modify this class?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823350#comment-15823350
 ] 

Wenchen Fan commented on SPARK-19153:
-

> Now that we can create a partitioned table using hive format, e.g. create table 
> t1 (id int, name string, dept string) using hive partitioned by (name), the 
> partition columns may not be the last columns, so I think we need to reorder 
> the schema so the partition columns would be the last ones. This is 
> consistent with data source tables.

This is expected.
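
To make the expected behaviour concrete, a hedged sketch, assuming the 
Hive-format partitioned-table support this ticket is about, a Hive-enabled 
SparkSession, and illustrative table/column names: the partition column `name` 
should end up last in the table's reported schema, as with data source tables.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-column-order")
  .enableHiveSupport()
  .getOrCreate()

// DDL taken from the comment above: `name` is declared in the middle...
spark.sql(
  "CREATE TABLE t1 (id INT, name STRING, dept STRING) USING hive PARTITIONED BY (name)")

// ...but the partition column is expected to be reported last: id, dept, name.
spark.table("t1").printSchema()
{code}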

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19225) Spark SQL round constant double return null

2017-01-15 Thread discipleforteen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823348#comment-15823348
 ] 

discipleforteen commented on SPARK-19225:
-

spark 1.4.1

> select round(4.4, 2);
2017-01-16 09:47:21,573 INFO ParseDriver: Parsing command: select round(4.4, 2)
2017-01-16 09:47:21,783 INFO ParseDriver: Parse Completed
2017-01-16 09:47:22,318 INFO SparkContext: Starting job: processCmd at 
CliDriver.java:423
2017-01-16 09:47:22,335 INFO DAGScheduler: Got job 0 (processCmd at 
CliDriver.java:423) with 1 output partitions (allowLocal=false)
2017-01-16 09:47:22,335 INFO DAGScheduler: Final stage: ResultStage 
0(processCmd at CliDriver.java:423)
2017-01-16 09:47:22,335 INFO DAGScheduler: Parents of final stage: List()
2017-01-16 09:47:22,339 INFO DAGScheduler: Missing parents: List()
2017-01-16 09:47:22,344 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[2] at processCmd at CliDriver.java:423), which has no missing 
parents
2017-01-16 09:47:22,382 INFO MemoryStore: ensureFreeSpace(3232) called with 
curMem=0, maxMem=278302556
2017-01-16 09:47:22,384 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 3.2 KB, free 265.4 MB)
2017-01-16 09:47:22,516 INFO MemoryStore: ensureFreeSpace(1852) called with 
curMem=3232, maxMem=278302556
2017-01-16 09:47:22,516 INFO MemoryStore: Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 1852.0 B, free 265.4 MB)
2017-01-16 09:47:22,519 INFO BlockManagerInfo: Added broadcast_0_piece0 in 
memory on localhost:45436 (size: 1852.0 B, free: 265.4 MB)
2017-01-16 09:47:22,520 INFO SparkContext: Created broadcast 0 from broadcast 
at DAGScheduler.scala:874
2017-01-16 09:47:22,525 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 0 (MapPartitionsRDD[2] at processCmd at CliDriver.java:423)
2017-01-16 09:47:22,526 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
2017-01-16 09:47:22,562 INFO TaskSetManager: Starting task 0.0 in stage 0.0 
(TID 0, localhost, PROCESS_LOCAL, 1392 bytes)
2017-01-16 09:47:22,571 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
2017-01-16 09:47:22,643 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 
955 bytes result sent to driver
2017-01-16 09:47:22,658 INFO TaskSetManager: Finished task 0.0 in stage 0.0 
(TID 0) in 113 ms on localhost (1/1)
2017-01-16 09:47:22,658 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose 
tasks have all completed, from pool
2017-01-16 09:47:22,661 INFO DAGScheduler: ResultStage 0 (processCmd at 
CliDriver.java:423) finished in 0.126 s
2017-01-16 09:47:22,671 INFO DAGScheduler: Job 0 finished: processCmd at 
CliDriver.java:423, took 0.352237 s
4.4

 
 
spark 2.1.0

> select round(4.4, 2);
17/01/16 09:48:13 INFO SparkSqlParser: Parsing command: select round(4.4, 2)
17/01/16 09:48:15 INFO CodeGenerator: Code generated in 215.145435 ms
17/01/16 09:48:15 INFO SparkContext: Starting job: processCmd at 
CliDriver.java:376
17/01/16 09:48:15 INFO DAGScheduler: Got job 0 (processCmd at 
CliDriver.java:376) with 1 output partitions
17/01/16 09:48:15 INFO DAGScheduler: Final stage: ResultStage 0 (processCmd at 
CliDriver.java:376)
17/01/16 09:48:15 INFO DAGScheduler: Parents of final stage: List()
17/01/16 09:48:15 INFO DAGScheduler: Missing parents: List()
17/01/16 09:48:15 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[3] at processCmd at CliDriver.java:376), which has no missing 
parents
17/01/16 09:48:15 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 5.9 KB, free 408.9 MB)
17/01/16 09:48:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 3.1 KB, free 408.9 MB)
17/01/16 09:48:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
10.9.233.62:34512 (size: 3.1 KB, free: 408.9 MB)
17/01/16 09:48:15 INFO SparkContext: Created broadcast 0 from broadcast at 
DAGScheduler.scala:996
17/01/16 09:48:15 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 0 (MapPartitionsRDD[3] at processCmd at CliDriver.java:376)
17/01/16 09:48:15 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/01/16 09:48:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, executor driver, partition 0, PROCESS_LOCAL, 6244 bytes)
17/01/16 09:48:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/01/16 09:48:16 INFO CodeGenerator: Code generated in 8.289946 ms
17/01/16 09:48:16 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1295 
bytes result sent to driver
17/01/16 09:48:16 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
in 77 ms on localhost (executor driver) (1/1)
17/01/16 09:48:16 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool
17/01/16 09:48:16 INFO DAGScheduler: ResultStage 0 (processCmd at 
CliDriver.java:376) finished in 0.097 s
17/01/16 09:48:16 INFO DAGScheduler: Job 0 finished: 

[jira] [Created] (SPARK-19233) Inconsistent Behaviour of Spark Streaming Checkpoint

2017-01-15 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-19233:
---

 Summary: Inconsistent Behaviour of Spark Streaming Checkpoint
 Key: SPARK-19233
 URL: https://issues.apache.org/jira/browse/SPARK-19233
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: Nan Zhu


When checking one of our application logs, we found the following behavior 
(simplified)

1. Spark application recovers from checkpoint constructed at timestamp 1000ms

2. The log shows that Spark application can recover RDDs generated at timestamp 
2000, 3000

The root cause is that generateJobs event is pushed to the queue by a separate 
thread (RecurTimer), before doCheckpoint event is pushed to the queue, there 
might have been multiple generatedJobs being processed. As a result, when 
doCheckpoint for timestamp 1000 is processed, the generatedRDDs data structure 
containing RDDs generated at 2000, 3000 is serialized as part of checkpoint of 
1000.

It brings overhead for debugging and for coordinating our offset management 
strategy with Spark Streaming's checkpoint strategy when we are developing a 
new type of DStream which integrates Spark Streaming with an internal message 
middleware.

The proposed fix is to filter generatedRDDs according to checkpoint timestamp 
when serializing it.
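
A minimal sketch of the proposed fix, with simplified types and names (not 
Spark's actual DStream internals): before serializing the checkpoint for a given 
time, drop any entries of `generatedRDDs` that are newer than that time.

{code}
import scala.collection.mutable

// Illustrative stand-ins for Streaming's Time and the per-batch state.
case class BatchTime(millis: Long)

class CheckpointStateSketch[R](val generatedRDDs: mutable.HashMap[BatchTime, R]) {
  // Keep only batches generated at or before the checkpoint's own timestamp,
  // so the checkpoint for t=1000 never carries RDDs generated at 2000 or 3000.
  def snapshotFor(checkpointTime: BatchTime): Map[BatchTime, R] =
    generatedRDDs.filter { case (t, _) => t.millis <= checkpointTime.millis }.toMap
}

// Usage sketch: serialize state.snapshotFor(BatchTime(1000)) instead of the live map.
{code}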



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19232:


Assignee: Apache Spark  (was: Felix Cheung)

> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Trivial
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> If we follow https://pypi.python.org/pypi/appdirs, appauthor should be 
> "Apache"?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1582#comment-1582
 ] 

Apache Spark commented on SPARK-19232:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16590

> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Trivial
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> If we follow https://pypi.python.org/pypi/appdirs, appauthor should be 
> "Apache"?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19232:


Assignee: Felix Cheung  (was: Apache Spark)

> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Trivial
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> If we follow https://pypi.python.org/pypi/appdirs, appauthor should be 
> "Apache"?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Description: 
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}

If we follow https://pypi.python.org/pypi/appdirs, appauthor should be "Apache"?


  was:
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}

it should be consistently under Cache\spark or .cache/spark



> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Trivial
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> If we follow https://pypi.python.org/pypi/appdirs, appauthor should be 
> "Apache"?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Priority: Trivial  (was: Major)

> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Trivial
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> it should be consistently under Cache\spark or .cache/spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Description: 
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}

it should be consistently under Cache\spark or .cache/spark


  was:
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}



> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> it should be consistently under Cache\spark or .cache/spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Description: 
On Linux:

{code}
~/.cache/spark# ls -lart
total 12
drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
{code}

On Windows:
{code}
C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}


> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19232:
-
Summary: SparkR distribution cache location is wrong on Windows  (was: 
SparkR distribution cache location is wrong)

> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-19231:


Assignee: Felix Cheung

> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> When there is any partial download or download error, it is not cleaned up, 
> and sparkR.session will remain stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
> Launching java with spark-submit command 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
>sparkr-shell 
> C:\Users\felix\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
> {code}
> {code}
> Directory of C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19232) SparkR distribution cache location is wrong

2017-01-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19232:


 Summary: SparkR distribution cache location is wrong
 Key: SPARK-19232
 URL: https://issues.apache.org/jira/browse/SPARK-19232
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung
Assignee: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823324#comment-15823324
 ] 

Apache Spark commented on SPARK-19231:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16589

> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When there is any partial download or download error, it is not cleaned up, 
> and sparkR.session will remain stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
> Launching java with spark-submit command 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
>sparkr-shell 
> C:\Users\felix\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
> {code}
> {code}
> Directory of C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19231:


Assignee: (was: Apache Spark)

> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When there is any partial download or download error, it is not cleaned up, 
> and sparkR.session will remain stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
> Launching java with spark-submit command 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
>sparkr-shell 
> C:\Users\felix\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
> {code}
> {code}
> Directory of C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19231:


Assignee: Apache Spark

> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>
> When there is any partial download or download error, it is not cleaned up, 
> and sparkR.session will remain stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
> Launching java with spark-submit command 
> C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
>sparkr-shell 
> C:\Users\felix\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
> {code}
> {code}
> Directory of C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19231:
-
Description: 
When there is any partial download or download error, it is not cleaned up, and 
sparkR.session will remain stuck with no error message.

{code}
> sparkR.session()
Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://www-eu.apache.org/dist/spark
Downloading spark-2.1.0 for Hadoop 2.7 from:
- http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
trying URL 
'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
downloaded 31.9 MB
 
Installing to C:\Users\felixc\AppData\Local\spark\spark\Cache
Error in untar2(tarfile, files, list, exdir) : incomplete block on file

In addition: Warning message:
In download.file(remotePath, localPath) :
  downloaded length 33471940 != reported length 195636829
> sparkR.session()
Spark not found in SPARK_HOME:
spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
Launching java with spark-submit command 
C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
   sparkr-shell 
C:\Users\felixc\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
{code}

{code}
Directory of C:\Users\felixc\AppData\Local\spark\spark\Cache
 01/13/2017  11:25 AM  .
01/13/2017  11:25 AM  ..
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}


> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When there is any partial download or download error, it is not cleaned up, 
> and sparkR.session will remain stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felixc\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
> Launching java with spark-submit command 
> C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
>sparkr-shell 
> C:\Users\felixc\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
> {code}
> {code}
> Directory of C:\Users\felixc\AppData\Local\spark\spark\Cache
>  01/13/2017  11:25 AM  .
> 01/13/2017  11:25 AM  ..
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19231:
-
Description: 
When there is any partial download or download error, it is not cleaned up, and 
sparkR.session will remain stuck with no error message.

{code}
> sparkR.session()
Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://www-eu.apache.org/dist/spark
Downloading spark-2.1.0 for Hadoop 2.7 from:
- http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
trying URL 
'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
downloaded 31.9 MB
 
Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
Error in untar2(tarfile, files, list, exdir) : incomplete block on file

In addition: Warning message:
In download.file(remotePath, localPath) :
  downloaded length 33471940 != reported length 195636829
> sparkR.session()
Spark not found in SPARK_HOME:
spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
Launching java with spark-submit command 
C:\Users\felix\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
   sparkr-shell 
C:\Users\felix\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
{code}

{code}
Directory of C:\Users\felix\AppData\Local\spark\spark\Cache
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}


  was:
When there is any partial download or download error, it is not cleaned up, and 
sparkR.session will remain stuck with no error message.

{code}
> sparkR.session()
Spark not found in SPARK_HOME:
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://www-eu.apache.org/dist/spark
Downloading spark-2.1.0 for Hadoop 2.7 from:
- http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
trying URL 
'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
downloaded 31.9 MB
 
Installing to C:\Users\felixc\AppData\Local\spark\spark\Cache
Error in untar2(tarfile, files, list, exdir) : incomplete block on file

In addition: Warning message:
In download.file(remotePath, localPath) :
  downloaded length 33471940 != reported length 195636829
> sparkR.session()
Spark not found in SPARK_HOME:
spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7
Launching java with spark-submit command 
C:\Users\felixc\AppData\Local\spark\spark\Cache/spark-2.1.0-bin-hadoop2.7/bin/spark-submit2.cmd
   sparkr-shell 
C:\Users\felixc\AppData\Local\Temp\RtmpCqNdne\backend_port16d04191e7
{code}

{code}
Directory of C:\Users\felixc\AppData\Local\spark\spark\Cache
 01/13/2017  11:25 AM  .
01/13/2017  11:25 AM  ..
01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
{code}



> SparkR hangs when there is download or untar failure
> 
>
> Key: SPARK-19231
> URL: https://issues.apache.org/jira/browse/SPARK-19231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> When there is any partial download or download error, it is not cleaned up, 
> and sparkR.session will remain stuck with no error message.
> {code}
> > sparkR.session()
> Spark not found in SPARK_HOME:
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: http://www-eu.apache.org/dist/spark
> Downloading spark-2.1.0 for Hadoop 2.7 from:
> - 
> http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
> trying URL 
> 'http://www-eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz'
> Content type 'application/x-gzip' length 195636829 bytes (186.6 MB)
> downloaded 31.9 MB
>  
> Installing to C:\Users\felix\AppData\Local\spark\spark\Cache
> Error in untar2(tarfile, files, list, exdir) : incomplete block on file
> In addition: Warning message:
> In download.file(remotePath, localPath) :
>   downloaded length 33471940 != reported length 195636829
> > sparkR.session()
> Spark not found in SPARK_HOME:
> spark-2.1.0 for Hadoop 2.7 found, setting SPARK_HOME to 
> 

[jira] [Created] (SPARK-19231) SparkR hangs when there is download or untar failure

2017-01-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19231:


 Summary: SparkR hangs when there is download or untar failure
 Key: SPARK-19231
 URL: https://issues.apache.org/jira/browse/SPARK-19231
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2017-01-15 Thread Lenni Kuff (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823297#comment-15823297
 ] 

Lenni Kuff commented on SPARK-14560:


Hi, I will be out of the office from 12/19/16 - 1/20/17. The Hive, Sentry, Pig, 
and RecordService teams have been transitioned to new leads who should be the 
primary contacts moving forward:

* Sentry - Mat Crocker
* Hive - Sangeeta Doraiswamy
* RecordService - Alexander Bibighaus
* Pig - Ferenc Denes
* Everything Else - Eli Collins

As usual, [1] Non-8x5 Escalations for each component should be routed using the 
instructions on the wiki. I will be in the NY area for most of my time away. If 
there is a critical item that comes up you can reach me on my cell @ [2] 
415-840-4577.
Thanks, Lenni

[1] https://wiki.cloudera.com/pages/viewpage.action?pageId=24935790
[2] tel:415-840-4577


> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}} s used by the old RDD api still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  Eg., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simply release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However that can come 
> as a later improvement.
> *Workaround*:  You can set 
> 
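
As a minimal sketch of the cooperative pattern the description above proposes, 
with simplified names (not Spark's actual MemoryConsumer/Spillable classes): a 
spillable collection exposes a spill callback so that other memory consumers can 
ask it to release execution memory instead of hitting an OOM.

{code}
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for a memory consumer that can be asked to free memory.
trait MemoryConsumerSketch {
  /** Asked to free up to `size` bytes; returns the number of bytes released. */
  def spill(size: Long): Long
}

// Simplified stand-in for a spillable collection that cooperates on request.
class SpillableBufferSketch extends MemoryConsumerSketch {
  private val buffer = ArrayBuffer[Array[Byte]]()
  private var bytesInMemory = 0L

  def insert(record: Array[Byte]): Unit = {
    buffer += record
    bytesInMemory += record.length
  }

  override def spill(size: Long): Long = {
    // A real implementation would write the buffered records to disk here.
    val released = bytesInMemory
    buffer.clear()
    bytesInMemory = 0L
    released
  }
}
{code}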

[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2017-01-15 Thread Morten Hornbech (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823296#comment-15823296
 ] 

Morten Hornbech commented on SPARK-14560:
-

I have also observed this error sporadically on Spark 2.0.2. Does the 
spark.shuffle.spill.numElementsForceSpillThreshold=N workaround work on 2.0.2? 
Any experience with robust values? 
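
For reference, a hedged sketch of how the workaround configuration mentioned 
above would be set (whether it actually takes effect on 2.0.2 is exactly the 
open question here; the value 500000 is only a placeholder to tune):

{code}
// Via spark-submit:
//   spark-submit --conf spark.shuffle.spill.numElementsForceSpillThreshold=500000 ...

// Or when building the session programmatically:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("force-spill-threshold-workaround")
  .config("spark.shuffle.spill.numElementsForceSpillThreshold", "500000")
  .getOrCreate()
{code}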

Stack trace:

java.lang.OutOfMemoryError: Unable to acquire 36 bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:129)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:377)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:399)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:730)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.<init>(SortMergeJoinExec.scala:605)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:162)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:100)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
at 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:106)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:98)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
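
In case it helps with the question above: a minimal Scala sketch of applying that workaround when building the session. Whether 2.0.2 honors the property for this code path is exactly the open question, and the threshold value below is an illustrative assumption that has to be tuned per workload.

{code}
import org.apache.spark.sql.SparkSession

// Illustrative only: caps the number of in-memory elements before a forced
// spill; 5000000 is a placeholder value, not a recommendation.
val spark = SparkSession.builder()
  .appName("force-spill-threshold-sketch")
  .config("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")
  .getOrCreate()
{code}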

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}} s used by the old RDD api still do not 
> 

[jira] [Created] (SPARK-19230) View creation in Derby gets SQLDataException because definition gets very big

2017-01-15 Thread Ohad Raviv (JIRA)
Ohad Raviv created SPARK-19230:
--

 Summary: View creation in Derby gets SQLDataException because 
definition gets very big
 Key: SPARK-19230
 URL: https://issues.apache.org/jira/browse/SPARK-19230
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Ohad Raviv


somewhat related to SPARK-6024.
In our test mockups we have a process that creates a pretty big table 
definition:
{quote}
create table t1 (
field_name_1 string,
field_name_2 string,
field_name_3 string,
.
.
.
field_name_1000 string
)
{quote}
which succeeds. But then we add some calculated fields on top of it with a view:
{quote}
create view v1 as 
select *, 
  some_udf(field_name_1) as field_calc1,
  some_udf(field_name_2) as field_calc2,
  .
  .
  some_udf(field_name_10) as field_calc10
from t1
{quote}
And we get this exception:
{quote}
java.sql.SQLDataException: A truncation error was encountered trying to shrink 
LONG VARCHAR 'SELECT `gen_attr_0` AS `field_name_1`, `gen_attr_1` AS 
`field_name_2&' to length 32700.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
Source)
at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeLargeUpdate(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(Unknown Source)
at 
com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:205)
at 
org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeUpdate(ParamLoggingPreparedStatement.java:399)
at 
org.datanucleus.store.rdbms.SQLController.executeStatementUpdate(SQLController.java:439)
at 
org.datanucleus.store.rdbms.request.InsertRequest.execute(InsertRequest.java:410)
at 
org.datanucleus.store.rdbms.RDBMSPersistenceHandler.insertTable(RDBMSPersistenceHandler.java:167)
at 
org.datanucleus.store.rdbms.RDBMSPersistenceHandler.insertObject(RDBMSPersistenceHandler.java:143)
at 
org.datanucleus.state.JDOStateManager.internalMakePersistent(JDOStateManager.java:3784)
at 
org.datanucleus.state.JDOStateManager.makePersistent(JDOStateManager.java:3760)
at 
org.datanucleus.ExecutionContextImpl.persistObjectInternal(ExecutionContextImpl.java:2219)
at 
org.datanucleus.ExecutionContextImpl.persistObjectWork(ExecutionContextImpl.java:2065)
at 
org.datanucleus.ExecutionContextImpl.persistObject(ExecutionContextImpl.java:1913)
at 
org.datanucleus.ExecutionContextThreadedImpl.persistObject(ExecutionContextThreadedImpl.java:217)
at 
org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:727)
at 
org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752)
at 
org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:814)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114)
at com.sun.proxy.$Proxy17.createTable(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1416)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1449)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
at com.sun.proxy.$Proxy19.create_table_with_environment_context(Unknown 
Source)
at 

[jira] [Commented] (SPARK-19092) Save() API of DataFrameWriter should not scan all the saved files

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823229#comment-15823229
 ] 

Apache Spark commented on SPARK-19092:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16588

> Save() API of DataFrameWriter should not scan all the saved files
> -
>
> Key: SPARK-19092
> URL: https://issues.apache.org/jira/browse/SPARK-19092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> `DataFrameWriter`'s save() API is performing an unnecessary full filesystem 
> scan over the saved files. The save() API is the most basic/core API in 
> `DataFrameWriter`. We should avoid this unnecessary file scan. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19229) Disallow Creating Hive Source Tables when Hive Support is Not Enabled

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19229:


Assignee: Apache Spark  (was: Xiao Li)

> Disallow Creating Hive Source Tables when Hive Support is Not Enabled
> -
>
> Key: SPARK-19229
> URL: https://issues.apache.org/jira/browse/SPARK-19229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> It is weird to create Hive source tables when using InMemoryCatalog. We are 
> unable to operate on them. We should block users from creating Hive source tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19229) Disallow Creating Hive Source Tables when Hive Support is Not Enabled

2017-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19229:


Assignee: Xiao Li  (was: Apache Spark)

> Disallow Creating Hive Source Tables when Hive Support is Not Enabled
> -
>
> Key: SPARK-19229
> URL: https://issues.apache.org/jira/browse/SPARK-19229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> It is weird to create Hive source tables when using InMemoryCatalog. We are 
> unable to operate on them. We should block users from creating Hive source tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19229) Disallow Creating Hive Source Tables when Hive Support is Not Enabled

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823227#comment-15823227
 ] 

Apache Spark commented on SPARK-19229:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16587

> Disallow Creating Hive Source Tables when Hive Support is Not Enabled
> -
>
> Key: SPARK-19229
> URL: https://issues.apache.org/jira/browse/SPARK-19229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> It is weird to create Hive source tables when using InMemoryCatalog. We are 
> unable to operate on them. We should block users from creating Hive source tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19229) Disallow Creating Hive Source Tables when Hive Support is Not Enabled

2017-01-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19229:
---

 Summary: Disallow Creating Hive Source Tables when Hive Support is 
Not Enabled
 Key: SPARK-19229
 URL: https://issues.apache.org/jira/browse/SPARK-19229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


It is weird to create Hive source tables when using InMemoryCatalog. We are 
unable to operate on them. We should block users from creating Hive source tables.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19227) Typo in `org.apache.spark.internal.config.ConfigEntry`

2017-01-15 Thread Biao Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Ma updated SPARK-19227:

Summary: Typo  in `org.apache.spark.internal.config.ConfigEntry`  (was: 
Modify some typo  in `ConfigEntry`)

> Typo  in `org.apache.spark.internal.config.ConfigEntry`
> ---
>
> Key: SPARK-19227
> URL: https://issues.apache.org/jira/browse/SPARK-19227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Biao Ma
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.1
>
>
> The parameter `defaultValue` does not exist in class 
> `org.apache.spark.internal.config.ConfigEntry`, but `_defaultValue` does in its 
> subclass `ConfigEntryWithDefault`. Also, there are some unused imports; 
> should we modify this class?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19227) Modify some typo in `ConfigEntry`

2017-01-15 Thread Biao Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Ma updated SPARK-19227:

Description: The parameter `defaultValue` does not exist in class 
`org.apache.spark.internal.config.ConfigEntry`, but `_defaultValue` does in its 
subclass `ConfigEntryWithDefault`. Also, there are some unused imports; should 
we modify this class?  (was: The parameter `defaultValue` is not exists in 
class `org.apache.spark.internal.config.ConfigEntry`, we should remove its 
annotation.)
 Issue Type: Bug  (was: Documentation)
Summary: Modify some typo  in `ConfigEntry`  (was: Remove non-existent 
parameter annotation in `ConfigEntry`)

> Modify some typo  in `ConfigEntry`
> --
>
> Key: SPARK-19227
> URL: https://issues.apache.org/jira/browse/SPARK-19227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Biao Ma
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.1
>
>
> The parameter `defaultValue` does not exist in class 
> `org.apache.spark.internal.config.ConfigEntry`, but `_defaultValue` does in its 
> subclass `ConfigEntryWithDefault`. Also, there are some unused imports; 
> should we modify this class?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11353) Writing to S3 buckets, which only support AWS4-HMAC-SHA256 fails with s3n URLs

2017-01-15 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved SPARK-11353.

Resolution: Duplicate

This is a duplicate of SPARK-13044; that's transitively a WONTFIX due to 
HADOOP-13325. The Hadoop project isn't going to upgrade jets3t, as it will 
inevitably introduce a regression somewhere.

Note also that moving to the 0.9.4 lib needs changes in the org.apache.hadoop 
libraries; see HADOOP-11086. You cannot bump up the jets3t version and expect 
all codepaths to work.

If you want to auth with V4 APIs, use s3a:// URLs.
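
For reference, a rough Scala sketch of the s3a route; the Frankfurt endpoint value, the bucket/paths and the 2.x SparkSession API are assumptions here, and the hadoop-aws module (which provides the s3a filesystem) must be on the classpath.

{code}
import org.apache.spark.sql.SparkSession

// Point the s3a connector at a V4-only region endpoint and use s3a:// URLs
// instead of s3n://. Endpoint, bucket and paths are placeholders.
val spark = SparkSession.builder().appName("s3a-v4-sketch").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

spark.read.textFile("s3a://some-bucket/input/")
  .write.text("s3a://some-bucket/output/")
{code}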

> Writing to S3 buckets, which only support AWS4-HMAC-SHA256 fails with s3n URLs
> --
>
> Key: SPARK-11353
> URL: https://issues.apache.org/jira/browse/SPARK-11353
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.3.1, 1.5.1
>Reporter: Łukasz Piepiora
>
> For certain regions like for example Frankfurt (eu-central-1) AWS supports 
> only [AWS Signature Version 
> 4|http://docs.aws.amazon.com/general/latest/gr/rande.html#d0e3788].
> Currently Spark is using jets3t library in version 0.9.3, which throws an 
> exception when code tries to save files in S3 in eu-central-1.
> {code}
> Caused by: java.lang.RuntimeException: Failed to automatically set required 
> header "x-amz-content-sha256" for request with entity 
> org.jets3t.service.impl.rest.httpclient.RepeatableRequestEntity@1e4bc601
>   at 
> org.jets3t.service.utils.SignatureUtils.awsV4GetOrCalculatePayloadHash(SignatureUtils.java:238)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.authorizeHttpRequest(RestStorageService.java:762)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:324)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1143)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.createObjectImpl(RestStorageService.java:1954)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.putObjectWithRequestEntityImpl(RestStorageService.java:1875)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.putObjectImpl(RestStorageService.java:1867)
>   at org.jets3t.service.StorageService.putObject(StorageService.java:840)
>   at org.jets3t.service.S3Service.putObject(S3Service.java:2212)
>   at org.jets3t.service.S3Service.putObject(S3Service.java:2356)
>   ... 23 more
> Caused by: java.io.IOException: Stream closed
>   at 
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
>   at java.io.BufferedInputStream.reset(BufferedInputStream.java:446)
>   at 
> org.jets3t.service.utils.SignatureUtils.awsV4GetOrCalculatePayloadHash(SignatureUtils.java:236)
>   ... 33 more
> {code}
> There is a newer version of jets3t 0.9.4, which seems to fix this issue 
> (http://www.jets3t.org/RELEASE_NOTES.html).
> Therefore I suggest upgrading the jets3t dependency from 0.9.3 to 0.9.4 for 
> the Hadoop profiles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2017-01-15 Thread Sergey Rubtsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Rubtsov updated SPARK-19228:
---
Description: 
I need to process user.csv like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code}
Dataset users = spark.read().format("csv").option("mode", 
"PERMISSIVE").option("header", "true")
.option("inferSchema", 
"true").option("dateFormat", "dd/MM/").load("src/main/resources/user.csv");
users.printSchema();
{code}
expected schema should be 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is 
ignored.
If I add the option 
{code}
.option("timestampFormat", "dd/MM/")
{code}
result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}

I think the issue is somewhere in object CSVInferSchema, function inferField, 
lines 80-97: a method "tryParseDate" needs to be added before/after 
"tryParseTimestamp", or the date/timestamp processing logic needs to be changed.

  was:
I need to process user.csv like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code}
Dataset users = spark.read().format("csv").option("mode", 
"PERMISSIVE").option("header", "true")
.option("inferSchema", 
"true").option("dateFormat", "dd/MM/").load("src/main/resources/user.csv");
users.printSchema();
{code}
expected scheme should be 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)

This mean that date processed as string and "dateFormat" option is ignored and 
date processed as string.
If I add option 
{code}
.option("timestampFormat", "dd/MM/")
{code}
result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}

I think, the issue is somewhere in object CSVInferSchema, function inferField, 
lines 80-97 and
method "tryParseDate" need to be added before/after "tryParseTimestamp", or 
date/timestamp process logic need to be changed.


> inferSchema function processed csv date column as string and "dateFormat" 
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.0
>Reporter: Sergey Rubtsov
>  Labels: easyfix
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> I need to process user.csv like this:
> {code}
> id,project,started,ended
> sergey.rubtsov,project0,12/12/2012,10/10/2015
> {code}
> When I add date format options:
> {code}
> Dataset users = spark.read().format("csv").option("mode", 
> "PERMISSIVE").option("header", "true")
> .option("inferSchema", 
> "true").option("dateFormat", 
> "dd/MM/").load("src/main/resources/user.csv");
>   users.printSchema();
> {code}
> expected schema should be 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: date (nullable = true)
>  |-- ended: date (nullable = true)
> {code}
> but the actual result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: string (nullable = true)
>  |-- ended: string (nullable = true)
> {code}
> This means that the date is processed as a string and the "dateFormat" option is 
> ignored.
> If I add the option 
> {code}
> .option("timestampFormat", "dd/MM/")
> {code}
> result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: timestamp (nullable = true)
>  |-- ended: timestamp (nullable = true)
> {code}
> I think the issue is somewhere in object CSVInferSchema, function 
> 

[jira] [Updated] (SPARK-11353) Writing to S3 buckets, which only support AWS4-HMAC-SHA256 fails with s3n URLs

2017-01-15 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-11353:
---
Summary: Writing to S3 buckets, which only support AWS4-HMAC-SHA256 fails 
with s3n URLs  (was: Writing to S3 buckets, which only support AWS4-HMAC-SHA256 
fails)

> Writing to S3 buckets, which only support AWS4-HMAC-SHA256 fails with s3n URLs
> --
>
> Key: SPARK-11353
> URL: https://issues.apache.org/jira/browse/SPARK-11353
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.3.1, 1.5.1
>Reporter: Łukasz Piepiora
>
> For certain regions like for example Frankfurt (eu-central-1) AWS supports 
> only [AWS Signature Version 
> 4|http://docs.aws.amazon.com/general/latest/gr/rande.html#d0e3788].
> Currently Spark is using jets3t library in version 0.9.3, which throws an 
> exception when code tries to save files in S3 in eu-central-1.
> {code}
> Caused by: java.lang.RuntimeException: Failed to automatically set required 
> header "x-amz-content-sha256" for request with entity 
> org.jets3t.service.impl.rest.httpclient.RepeatableRequestEntity@1e4bc601
>   at 
> org.jets3t.service.utils.SignatureUtils.awsV4GetOrCalculatePayloadHash(SignatureUtils.java:238)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.authorizeHttpRequest(RestStorageService.java:762)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:324)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1143)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.createObjectImpl(RestStorageService.java:1954)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.putObjectWithRequestEntityImpl(RestStorageService.java:1875)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.putObjectImpl(RestStorageService.java:1867)
>   at org.jets3t.service.StorageService.putObject(StorageService.java:840)
>   at org.jets3t.service.S3Service.putObject(S3Service.java:2212)
>   at org.jets3t.service.S3Service.putObject(S3Service.java:2356)
>   ... 23 more
> Caused by: java.io.IOException: Stream closed
>   at 
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
>   at java.io.BufferedInputStream.reset(BufferedInputStream.java:446)
>   at 
> org.jets3t.service.utils.SignatureUtils.awsV4GetOrCalculatePayloadHash(SignatureUtils.java:236)
>   ... 33 more
> {code}
> There is a newer version of jets3t 0.9.4, which seems to fix this issue 
> (http://www.jets3t.org/RELEASE_NOTES.html).
> Therefore I suggest upgrading the jets3t dependency from 0.9.3 to 0.9.4 for 
> the Hadoop profiles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2017-01-15 Thread Sergey Rubtsov (JIRA)
Sergey Rubtsov created SPARK-19228:
--

 Summary: inferSchema function processed csv date column as string 
and "dateFormat" DataSource option is ignored
 Key: SPARK-19228
 URL: https://issues.apache.org/jira/browse/SPARK-19228
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Affects Versions: 2.1.0
Reporter: Sergey Rubtsov


I need to process user.csv like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code}
Dataset users = spark.read().format("csv").option("mode", 
"PERMISSIVE").option("header", "true")
.option("inferSchema", 
"true").option("dateFormat", "dd/MM/").load("src/main/resources/user.csv");
users.printSchema();
{code}
expected schema should be 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is 
ignored.
If I add the option 
{code}
.option("timestampFormat", "dd/MM/")
{code}
result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}

I think the issue is somewhere in object CSVInferSchema, function inferField, 
lines 80-97: a method "tryParseDate" needs to be added before/after 
"tryParseTimestamp", or the date/timestamp processing logic needs to be changed.
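
Until inference handles this, a possible workaround is to skip inference and declare the date columns explicitly; a minimal Scala sketch follows. The dd/MM/yyyy pattern is an assumption based on values like 12/12/2012, and whether 2.1.0 applies "dateFormat" to an explicitly declared DateType column should be verified.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-date-schema-sketch").getOrCreate()

// Declare the schema up front so the date columns are cast with "dateFormat"
// rather than inferred as strings.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("project", StringType),
  StructField("started", DateType),
  StructField("ended", DateType)))

val users = spark.read
  .option("header", "true")
  .option("dateFormat", "dd/MM/yyyy")
  .schema(schema)
  .csv("src/main/resources/user.csv")

users.printSchema()
{code}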



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-01-15 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823164#comment-15823164
 ] 

Cody Koeninger commented on SPARK-19185:


This is a good error report, sorry it's taken me a while to get back to you on 
this.

My immediate suggestions to you as a workaround would be
- Try persisting before windowing, so that batches of offsets from Kafka are only 
fetched once, rather than repeatedly and possibly simultaneously for a given 
kafka partition (a rough sketch follows at the end of this comment).  I'm 
assuming that's the underlying issue, but could be wrong.
- Failing that, KafkaRDD's constructor takes a boolean parameter indicating 
whether to use the consumer cache.  You can straightforwardly modify 
DirectKafkaInputDStream.compute to pass false.  This will require rebuilding 
only the kafka consumer jar, not redeploying all of spark.  This will be a 
performance hit, especially if you're using SSL, but is better than nothing.

Fixing this in the Spark master branch (either by allowing configuration of 
whether to use the consumer cache, or replacing the consumer cache with a pool 
of consumers with different group ids for the same topicpartition) is going to 
require getting the attention of a committer.  I don't really have the time to 
mess with that right now (happy to do the work, but zero interest in tracking 
down committers and arguing design decisions).

That being said, if one of the workarounds suggested above doesn't help you, 
let me know.
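
A rough Scala sketch of the first suggestion (persist before windowing); `stream` stands in for the DStream returned by KafkaUtils.createDirectStream and is an assumption, and the durations just mirror the 180s/30s window reported above.

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds

// Persist the direct stream before windowing so each batch of Kafka records is
// fetched once instead of re-fetched by overlapping window computations.
val persisted = stream.persist(StorageLevel.MEMORY_AND_DISK_SER)
val windowed  = persisted.window(Seconds(180), Seconds(30))
windowed.foreachRDD { rdd => println(s"windowed records: ${rdd.count()}") }
{code}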


> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> 

[jira] [Updated] (SPARK-19226) Report failure reason from Reporter Thread

2017-01-15 Thread Maheedhar Reddy Chappidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maheedhar Reddy Chappidi updated SPARK-19226:
-
Description: 
With the exponential[1] increase in executor count, the Reporter thread [2] 
fails without a proper message.

==
17/01/12 09:33:44 INFO YarnAllocator: Driver requested a total number of 32767 
executor(s).
17/01/12 09:33:44 INFO YarnAllocator: Will request 24576 executor containers, 
each with 2 cores and 5632 MB memory including 512 MB overhead
17/01/12 09:33:44 INFO YarnAllocator: Canceled 0 container requests (locality 
no longer needed)
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34419 
executor(s).
17/01/12 09:33:52 INFO ApplicationMaster: Final app status: FAILED, exitCode: 
12, (reason: Exception was thrown 1 time(s) from Reporter thread.)
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34410 
executor(s).
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34409 
executor(s).
17/01/12 09:33:52 INFO ShutdownHookManager: Shutdown hook called
==

We were able to run the workflows by setting/limiting the max executor count 
(spark.dynamicAllocation.maxExecutors) to avoid more requests (35k->65k).
Also, I don't see any issues with the ApplicationMaster's container memory/compute.
Is it possible to report a more detailed ErrorReason from the if/else?

[1]  
https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
[2] 
https://github.com/apache/spark/blob/01e14bf303e61a5726f3b1418357a50c1bf8b16f/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L446-L480

  was:
With the exponential[1] increase in executor count the Reporter thread [2] 
fails without proper message.

==
17/01/12 09:33:44 INFO YarnAllocator: Driver requested a total number of 32767 
executor(s).
17/01/12 09:33:44 INFO YarnAllocator: Will request 24576 executor containers, 
each with 2 cores and 5632 MB memory including 512 MB overhead
17/01/12 09:33:44 INFO YarnAllocator: Canceled 0 container requests (locality 
no longer needed)
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34419 
executor(s).
17/01/12 09:33:52 INFO ApplicationMaster: Final app status: FAILED, exitCode: 
12, (reason: Exception was thrown 1 time(s) from Reporter thread.)
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34410 
executor(s).
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34409 
executor(s).
17/01/12 09:33:52 INFO ShutdownHookManager: Shutdown hook called
==

We were able to run the workflows by setting/limiting the maxExecutor count 
(spark.dynamicAllocation.maxExecutors) to avoid more requests(35k->65k).
Added I don't see any issues with ApplicationMaster's container memory/compute.

[1]  
https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
[2] 
https://github.com/apache/spark/blob/01e14bf303e61a5726f3b1418357a50c1bf8b16f/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L446-L480


> Report failure reason from Reporter Thread 
> ---
>
> Key: SPARK-19226
> URL: https://issues.apache.org/jira/browse/SPARK-19226
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.2
> Environment: emr-5.2.1 with Zeppelin 0.6.2/Spark2.0.2 and 10 r3.xl 
> core nodes
>Reporter: Maheedhar Reddy Chappidi
>Priority: Minor
>
> With the exponential[1] increase in executor count, the Reporter thread [2] 
> fails without a proper message.
> ==
> 17/01/12 09:33:44 INFO YarnAllocator: Driver requested a total number of 
> 32767 executor(s).
> 17/01/12 09:33:44 INFO YarnAllocator: Will request 24576 executor containers, 
> each with 2 cores and 5632 MB memory including 512 MB overhead
> 17/01/12 09:33:44 INFO YarnAllocator: Canceled 0 container requests (locality 
> no longer needed)
> 17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 
> 34419 executor(s).
> 17/01/12 09:33:52 INFO ApplicationMaster: Final app status: FAILED, exitCode: 
> 12, (reason: Exception was thrown 1 time(s) from Reporter thread.)
> 17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 
> 34410 executor(s).
> 17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 
> 34409 executor(s).
> 17/01/12 09:33:52 INFO ShutdownHookManager: Shutdown hook called
> ==
> We were able to run the workflows by setting/limiting the max executor count 
> (spark.dynamicAllocation.maxExecutors) to avoid more requests (35k->65k).
> Also, I don't see any issues with the ApplicationMaster's container 

[jira] [Created] (SPARK-19227) Remove non-existent parameter annotation in `ConfigEntry`

2017-01-15 Thread Biao Ma (JIRA)
Biao Ma created SPARK-19227:
---

 Summary: Remove non-existent parameter annotation in `ConfigEntry`
 Key: SPARK-19227
 URL: https://issues.apache.org/jira/browse/SPARK-19227
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Biao Ma
Priority: Minor
 Fix For: 2.1.1


The parameter `defaultValue` does not exist in class 
`org.apache.spark.internal.config.ConfigEntry`; we should remove its annotation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19226) Report failure reason from Reporter Thread

2017-01-15 Thread Maheedhar Reddy Chappidi (JIRA)
Maheedhar Reddy Chappidi created SPARK-19226:


 Summary: Report failure reason from Reporter Thread 
 Key: SPARK-19226
 URL: https://issues.apache.org/jira/browse/SPARK-19226
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.0.2
 Environment: emr-5.2.1 with Zeppelin 0.6.2/Spark2.0.2 and 10 r3.xl 
core nodes
Reporter: Maheedhar Reddy Chappidi
Priority: Minor


With the exponential[1] increase in executor count, the Reporter thread [2] 
fails without a proper message.

==
17/01/12 09:33:44 INFO YarnAllocator: Driver requested a total number of 32767 
executor(s).
17/01/12 09:33:44 INFO YarnAllocator: Will request 24576 executor containers, 
each with 2 cores and 5632 MB memory including 512 MB overhead
17/01/12 09:33:44 INFO YarnAllocator: Canceled 0 container requests (locality 
no longer needed)
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34419 
executor(s).
17/01/12 09:33:52 INFO ApplicationMaster: Final app status: FAILED, exitCode: 
12, (reason: Exception was thrown 1 time(s) from Reporter thread.)
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34410 
executor(s).
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34409 
executor(s).
17/01/12 09:33:52 INFO ShutdownHookManager: Shutdown hook called
==

We were able to run the workflows by setting/limiting the max executor count 
(spark.dynamicAllocation.maxExecutors) to avoid more requests (35k->65k).
Also, I don't see any issues with the ApplicationMaster's container memory/compute.

[1]  
https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
[2] 
https://github.com/apache/spark/blob/01e14bf303e61a5726f3b1418357a50c1bf8b16f/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L446-L480
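
For reference, a minimal Scala sketch of the cap described above; the value 512 is purely illustrative and depends on cluster capacity, and the external shuffle service is assumed to be enabled since dynamic allocation requires it.

{code}
import org.apache.spark.sql.SparkSession

// Cap dynamic allocation so the driver cannot keep requesting executors
// exponentially; the concrete limit is a placeholder.
val spark = SparkSession.builder()
  .appName("cap-dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "512")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()
{code}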



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823150#comment-15823150
 ] 

Apache Spark commented on SPARK-18922:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16586

> Fix more resource-closing-related and path-related test failures in 
> identified ones on Windows
> --
>
> Key: SPARK-18922
> URL: https://issues.apache.org/jira/browse/SPARK-18922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> There are more instances that fail on Windows, as below:
> - {{LauncherBackendSuite}}:
> {code}
> - local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
>   The code passed to eventually never returned normally. Attempted 283 times 
> over 30.0960053 seconds. Last failure message: The reference was null. 
> (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> - standalone/client: launcher handle *** FAILED *** (30 seconds, 47 
> milliseconds)
>   The code passed to eventually never returned normally. Attempted 282 times 
> over 30.03798710002 seconds. Last failure message: The reference was 
> null. (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> {code}
> - {{SQLQuerySuite}}:
> {code}
> - specifying database name for a temporary table is not allowed *** FAILED 
> *** (125 milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{JsonSuite}}:
> {code}
> - Loading a JSON dataset from a text file with SQL *** FAILED *** (94 
> milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{StateStoreSuite}}:
> {code}
> - SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.(Path.java:116)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   ...
>   Cause: java.net.URISyntaxException: Relative path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:203)
> {code}
> - {{HDFSMetadataLogSuite}}:
> {code}
> - FileManager: FileContextManager *** FAILED *** (94 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> - FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> {code}
> Please refer, for full logs, 
> 

[jira] [Commented] (SPARK-19117) script transformation does not work on Windows due to fixed bash executable location

2017-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823149#comment-15823149
 ] 

Apache Spark commented on SPARK-19117:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16586

> script transformation does not work on Windows due to fixed bash executable 
> location
> 
>
> Key: SPARK-19117
> URL: https://issues.apache.org/jira/browse/SPARK-19117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> There are some tests that fail on Windows via AppVeyor, as below, due to this 
> problem:
> {code}
>  - script *** FAILED *** (553 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - Star Expansion - script transform *** FAILED *** (2 seconds, 375 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stdout *** FAILED *** (2 seconds, 813 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform for stderr *** FAILED *** (2 seconds, 407 
> milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - test script transform data type *** FAILED *** (171 milliseconds)
>org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run 
> program "/bin/bash": CreateProcess error=2, The system cannot find the file 
> specified
>  - transform *** FAILED *** (359 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>   
>  - schema-less transform *** FAILED *** (344 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter *** FAILED *** (296 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot find the file specified
>  - transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
>Failed to execute query using catalyst:
>Error: Job aborted due to stage failure: Task 0 in stage 1351.0 failed 1 
> times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, 
> localhost, executor driver): java.io.IOException: Cannot run program 
> "/bin/bash": CreateProcess error=2, The system cannot 

[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823141#comment-15823141
 ] 

Shuai Lin commented on SPARK-19153:
---

bq. To clarify, we want this feature in DataFrameWriter and the official CREATE 
TABLE SQL statement; the legacy CREATE TABLE Hive syntax is not our goal.

Thanks for the reply, and I agree with this. But TBH I don't understand your 
opinion on whether the summary I gave above is correct or not. Could you be more 
clear on it?

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13735) Log for parquet relation reading files is too verbose

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13735.
---
Resolution: Duplicate

> Log for parquet relation reading files is too verbose
> -
>
> Key: SPARK-13735
> URL: https://issues.apache.org/jira/browse/SPARK-13735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Trivial
>
> The INFO level logging contains all files read by Parquet Relation, which is 
> way too verbose if the input contains lots of files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13735) Log for parquet relation reading files is too verbose

2017-01-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823118#comment-15823118
 ] 

Hyukjin Kwon commented on SPARK-13735:
--

Is this a duplicate of SPARK-8118?

> Log for parquet relation reading files is too verbose
> -
>
> Key: SPARK-13735
> URL: https://issues.apache.org/jira/browse/SPARK-13735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Trivial
>
> The INFO level logging contains all files read by Parquet Relation, which is 
> way too verbose if the input contains lots of files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19120) Returned an Empty Result after Loading a Hive Table

2017-01-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19120.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16500
[https://github.com/apache/spark/pull/16500]

> Returned an Empty Result after Loading a Hive Table
> ---
>
> Key: SPARK-19120
> URL: https://issues.apache.org/jira/browse/SPARK-19120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
> Fix For: 2.1.1, 2.2.0
>
>
> {noformat}
> sql(
>   """
> |CREATE TABLE test (a STRING)
> |STORED AS PARQUET
>   """.stripMargin)
>spark.table("test").show()
> sql(
>   s"""
>  |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
>  |INTO TABLE test
>""".stripMargin)
> spark.table("test").show()
> {noformat}
> The returned result is empty after table loading. We should refresh the 
> metadata cache after loading the data to the table. 
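
For releases without this fix, a minimal workaround sketch, assuming the stale metadata cache is indeed the cause (`spark` is the active SparkSession from the snippet above):

{code}
// Refresh the table's cached metadata after LOAD DATA so the newly loaded
// files are picked up on the next scan.
spark.catalog.refreshTable("test")
spark.table("test").show()
{code}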



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823116#comment-15823116
 ] 

Wenchen Fan commented on SPARK-19153:
-

To clarify, we want this feature in DataFrameWriter and the official CREATE 
TABLE SQL statement; the legacy CREATE TABLE Hive syntax is not our goal.
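
For concreteness, a Scala sketch of the kind of usage this targets; the exact API shape, including the "hive" format string, is an assumption rather than a settled design, and `df` stands in for any existing DataFrame.

{code}
// Hypothetical target usage: create a partitioned Hive-format table directly
// from DataFrameWriter.
df.write
  .format("hive")
  .partitionBy("year", "month")
  .saveAsTable("events")
{code}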

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2017-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823110#comment-15823110
 ] 

Sean Owen commented on SPARK-18492:
---

64K is a JVM limit as far as I know. And there are lots of ways to hit this 
with codegen. Yes, there are loads of issues about a "64K limit" and only some 
of them are duplicates. I go mostly by the exact call site.

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The 64 KB JVM limit on the code of a single method is exceeded because the code 
> generator uses an incredibly naive strategy: it emits a sequence like the one 
> shown below for each of the 1,454 groups of variables shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of the repetitive code sequences emitted for the processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean project_isNull252 = project_result249 == null;
> /* 12268 */   ArrayData project_value252 = null;
> /* 12269 */  

[jira] [Updated] (SPARK-19206) Update outdated parameter descriptions in external-kafka module

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19206:
--
Assignee: Genmao Yu

> Update outdated parameter descriptions in external-kafka module
> ---
>
> Key: SPARK-19206
> URL: https://issues.apache.org/jira/browse/SPARK-19206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 2.2.0
>
>







[jira] [Resolved] (SPARK-19206) Update outdated parameter descriptions in external-kafka module

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19206.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16569
[https://github.com/apache/spark/pull/16569]

> Update outdated parameter descriptions in external-kafka module
> ---
>
> Key: SPARK-19206
> URL: https://issues.apache.org/jira/browse/SPARK-19206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
> Fix For: 2.2.0
>
>







[jira] [Updated] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18971:
--
Assignee: Shixiong Zhu
Priority: Minor  (was: Major)

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Check https://github.com/netty/netty/issues/6153 for details






[jira] [Resolved] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18971.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16568
[https://github.com/apache/spark/pull/16568]

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
> Fix For: 2.2.0
>
>
> Check https://github.com/netty/netty/issues/6153 for details






[jira] [Updated] (SPARK-19042) Remove query string from jar url for executor

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19042:
--
Assignee: hustfxj
Priority: Minor  (was: Major)

> Remove query string from jar url for executor
> -
>
> Key: SPARK-19042
> URL: https://issues.apache.org/jira/browse/SPARK-19042
> Project: Spark
>  Issue Type: Bug
>Reporter: hustfxj
>Assignee: hustfxj
>Priority: Minor
> Fix For: 2.2.0
>
>
> spark.jars supports jar URLs with the http protocol. However, if the URL contains 
> a query string, "localName = name.split("/").last" won't get the 
> expected jar name, and then "val url = new File(SparkFiles.getRootDirectory(), 
> localName).toURI.toURL" will produce an invalid URL. The bug fix is the same as 
> [SPARK-17855]
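For illustration, a sketch of the kind of change described, stripping the query string 
before deriving the local file name (a hypothetical helper, not necessarily the code in 
the actual patch):

{code}
import java.net.URI

// Hypothetical helper: derive the local jar name from the URL path only, ignoring
// any query string, e.g. "http://repo.example.com/libs/app.jar?token=abc" -> "app.jar".
def localJarName(name: String): String =
  new URI(name).getPath.split("/").last
{code}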






[jira] [Resolved] (SPARK-19042) Remove query string from jar url for executor

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19042.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16509
[https://github.com/apache/spark/pull/16509]

> Remove query string from jar url for executor
> -
>
> Key: SPARK-19042
> URL: https://issues.apache.org/jira/browse/SPARK-19042
> Project: Spark
>  Issue Type: Bug
>Reporter: hustfxj
>Assignee: hustfxj
>Priority: Minor
> Fix For: 2.2.0
>
>
> spark.jars supports jar URLs with the http protocol. However, if the URL contains 
> a query string, "localName = name.split("/").last" won't get the 
> expected jar name, and then "val url = new File(SparkFiles.getRootDirectory(), 
> localName).toURI.toURL" will produce an invalid URL. The bug fix is the same as 
> [SPARK-17855]






[jira] [Resolved] (SPARK-19207) LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating new object via constructor

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19207.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16570
[https://github.com/apache/spark/pull/16570]

> LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating 
> new object via constructor
> ---
>
> Key: SPARK-19207
> URL: https://issues.apache.org/jira/browse/SPARK-19207
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Creating a Slf4JLoggerFactory instance via its constructor is deprecated. A 
> warning is generated:
> {code}
> [warn] 
> /Users/ozawa/workspace/spark/sql/core/src/test/scala/org/apache/spark/sql/LocalSparkSession.scala:32:
>  constructor Slf4JLoggerFactory in class Slf4JLoggerFactory is deprecated: 
> see corresponding Javadoc for more information.
> [warn] InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())
> {code}
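The replacement suggested by the deprecation notice is presumably a one-liner along 
these lines (sketch):

{code}
import io.netty.util.internal.logging.{InternalLoggerFactory, Slf4JLoggerFactory}

// Use the shared singleton rather than the deprecated constructor.
InternalLoggerFactory.setDefaultFactory(Slf4JLoggerFactory.INSTANCE)
{code}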






[jira] [Updated] (SPARK-19207) LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating new object via constructor

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19207:
--
Assignee: Tsuyoshi Ozawa
Priority: Trivial  (was: Major)

> LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating 
> new object via constructor
> ---
>
> Key: SPARK-19207
> URL: https://issues.apache.org/jira/browse/SPARK-19207
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>Priority: Trivial
>
> Creating a Slf4JLoggerFactory instance via its constructor is deprecated. A 
> warning is generated:
> {code}
> [warn] 
> /Users/ozawa/workspace/spark/sql/core/src/test/scala/org/apache/spark/sql/LocalSparkSession.scala:32:
>  constructor Slf4JLoggerFactory in class Slf4JLoggerFactory is deprecated: 
> see corresponding Javadoc for more information.
> [warn] InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())
> {code}






[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2017-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823105#comment-15823105
 ] 

Sean Owen commented on SPARK-19217:
---

Yes, it's not hard to do this with a UDF, but it seems like everyone ends up 
implementing the same two one-liner UDFs for this purpose. It's not essential, but it 
also seems reasonable to consider supporting casts to/from this common Spark type.
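For reference, the two one-liner UDFs in question, sketched in Scala against the {{ml}} 
Vector type (the PySpark equivalents are analogous):

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// The two conversions people keep re-implementing (sketch).
val vectorToArray = udf((v: Vector) => v.toArray)                      // ml Vector -> array<double>
val arrayToVector = udf((a: Seq[Double]) => Vectors.dense(a.toArray))  // array<double> -> ml Vector
{code}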

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage without converting the vector columns 
> to array columns, and there doesn't appear to be an easy way to make that 
> conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
>   .select(
>     col('features').cast('array').alias('features')
>   ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823103#comment-15823103
 ] 

Shuai Lin commented on SPARK-19153:
---

I find it quite straightforward to remove the partitioned-by restriction for the 
{{create table t1 using hive partitioned by (c1, c2) as select ...}} CTAS statement.

But another problem comes up: the partition columns must be the rightmost columns of 
the schema; otherwise the schema we store in the metastore table property (under the 
key "spark.sql.sources.schema") would be inconsistent with the schema we read back 
from the Hive client API.

The reason is that when creating a Hive table in the metastore, the schema and the 
partition columns are disjoint sets (as required by the Hive client API). And when we 
read it back, we append the partition columns to the end of the schema to get the 
catalyst schema, i.e.:
{code}
// HiveClientImpl.scala
val partCols = h.getPartCols.asScala.map(fromHiveColumn)
val schema = StructType(h.getCols.asScala.map(fromHiveColumn) ++ partCols)
{code}
This was not a problem before we had the unified "create table" syntax, because the 
old create-hive-table syntax required specifying the normal columns and the partition 
columns separately, e.g. {{create table t1 (id int, name string) partitioned by (dept 
string)}}.

Now that we can create a partitioned table using the hive format, e.g. {{create table 
t1 (id int, name string, dept string) using hive partitioned by (name)}}, the 
partition columns may not be the last ones, so I think we need to reorder the schema 
so that the partition columns come last (see the reordering sketch after the example 
below). This is consistent with data source tables, e.g.

{code}
scala> sql("create table t1 (id int, name string, dept string) using parquet 
partitioned by (name)")
scala> spark.table("t1").schema.fields.map(_.name)
res44: Array[String] = Array(id, dept, name)
{code}
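A sketch of such a reordering (a hypothetical helper, shown only to illustrate the idea):

{code}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: move the partition columns to the end of the schema so the
// catalyst schema matches what the Hive client API hands back.
def reorderToPartitionLast(schema: StructType, partitionCols: Seq[String]): StructType = {
  val (parts, others) = schema.fields.partition(f => partitionCols.contains(f.name))
  StructType(others ++ parts)
}
{code}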

[~cloud_fan] Does this sound good to you?


> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Updated] (SPARK-19225) Spark SQL round constant double return null

2017-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19225:
--
Target Version/s:   (was: 2.1.1)
   Fix Version/s: (was: 2.1.1)

Read http://spark.apache.org/contributing.html first

A self-contained reproduction would be helpful.
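Something along these lines, runnable in spark-shell (which provides {{spark}}), would 
serve as a self-contained reproduction (a sketch based on the description below):

{code}
// Expected 4.4; the report below says this returns null on 2.0.x / 2.1.0.
spark.sql("SELECT round(4.4, 2)").show()

// For comparison, forcing a double literal as in Spark 1.x behaviour.
spark.sql("SELECT round(CAST(4.4 AS DOUBLE), 2)").show()
{code}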

> Spark SQL round constant double return null 
> 
>
> Key: SPARK-19225
> URL: https://issues.apache.org/jira/browse/SPARK-19225
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: discipleforteen
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Rounding a constant double in Spark SQL may return null. For example, 'select 
> round(4.4, 2)' returns null in Spark 2.x, which is not compatible with Spark 1.x. 
> It seems 4.4 is parsed as a decimal with the new SqlBase.g4 grammar, whereas it 
> was cast to double in Spark 1.x; rounding the decimal 4.4 to 2 places returns null 
> in changePrecision...






[jira] [Closed] (SPARK-10890) "Column count does not match; SQL statement:" error in JDBCWriteSuite

2017-01-15 Thread Christian Kadner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Kadner closed SPARK-10890.

   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

This issue is no longer reproducible. It still happened in v2.1.0-rc5.

> "Column count does not match; SQL statement:" error in JDBCWriteSuite
> -
>
> Key: SPARK-10890
> URL: https://issues.apache.org/jira/browse/SPARK-10890
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Rick Hillegas
> Fix For: 2.1.1, 2.2.0
>
>
> I get the following error when I run the following test...
> mvn -Dhadoop.version=2.4.0 
> -DwildcardSuites=org.apache.spark.sql.jdbc.JDBCWriteSuite test
> {noformat}
> JDBCWriteSuite:
> 13:22:15.603 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> 13:22:16.506 WARN org.apache.spark.metrics.MetricsSystem: Using default name 
> DAGScheduler for source because spark.app.id is not set.
> - Basic CREATE
> - CREATE with overwrite
> - CREATE then INSERT to append
> - CREATE then INSERT to truncate
> 13:22:19.312 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 23.0 (TID 31)
> org.h2.jdbc.JdbcSQLException: Column count does not match; SQL statement:
> INSERT INTO TEST.INCOMPATIBLETEST VALUES (?, ?, ?) [21002-183]
>   at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
>   at org.h2.message.DbException.get(DbException.java:179)
>   at org.h2.message.DbException.get(DbException.java:155)
>   at org.h2.message.DbException.get(DbException.java:144)
>   at org.h2.command.dml.Insert.prepare(Insert.java:265)
>   at org.h2.command.Parser.prepareCommand(Parser.java:247)
>   at org.h2.engine.Session.prepareLocal(Session.java:446)
>   at org.h2.engine.Session.prepareCommand(Session.java:388)
>   at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1189)
>   at 
> org.h2.jdbc.JdbcPreparedStatement.(JdbcPreparedStatement.java:72)
>   at org.h2.jdbc.JdbcConnection.prepareStatement(JdbcConnection.java:277)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:72)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:229)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$32.apply(RDD.scala:892)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$32.apply(RDD.scala:892)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1856)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1856)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 13:22:19.312 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 23.0 (TID 32)
> org.h2.jdbc.JdbcSQLException: Column count does not match; SQL statement:
> INSERT INTO TEST.INCOMPATIBLETEST VALUES (?, ?, ?) [21002-183]
>   at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
>   at org.h2.message.DbException.get(DbException.java:179)
>   at org.h2.message.DbException.get(DbException.java:155)
>   at org.h2.message.DbException.get(DbException.java:144)
>   at org.h2.command.dml.Insert.prepare(Insert.java:265)
>   at org.h2.command.Parser.prepareCommand(Parser.java:247)
>   at org.h2.engine.Session.prepareLocal(Session.java:446)
>   at org.h2.engine.Session.prepareCommand(Session.java:388)
>   at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1189)
>   at 
> org.h2.jdbc.JdbcPreparedStatement.(JdbcPreparedStatement.java:72)
>   at org.h2.jdbc.JdbcConnection.prepareStatement(JdbcConnection.java:277)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:72)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:100)
>   at 
>