[jira] [Assigned] (SPARK-20877) Investigate if tests will time out on CRAN

2017-05-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-20877:
-

Assignee: Felix Cheung

> Investigate if tests will time out on CRAN
> --
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20877) Investigate if tests will time out on CRAN

2017-05-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-20877.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18104
[https://github.com/apache/spark/pull/18104]

> Investigate if tests will time out on CRAN
> --
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20854) extend hint syntax to support any expression, not just identifiers or strings

2017-05-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20854:

Issue Type: Improvement  (was: Bug)

> extend hint syntax to support any expression, not just identifiers or strings
> -
>
> Key: SPARK-20854
> URL: https://issues.apache.org/jira/browse/SPARK-20854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Blocker
>
> Currently the SQL hint syntax supports only identifiers as parameters, while 
> the Dataset hint syntax supports only strings.
> Both should support any expression as a parameter, for example numbers. This 
> is useful for implementing other hints in the future.
> Examples:
> {code}
> df.hint("hint1", Seq(1, 2, 3))
> df.hint("hint2", "A", 1)
> sql("select /*+ hint1((1,2,3)) */")
> sql("select /*+ hint2('A', 1) */")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20854) extend hint syntax to support any expression, not just identifiers or strings

2017-05-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20854:
---

Assignee: Bogdan Raducanu
Target Version/s: 2.2.0
Priority: Blocker  (was: Major)

> extend hint syntax to support any expression, not just identifiers or strings
> -
>
> Key: SPARK-20854
> URL: https://issues.apache.org/jira/browse/SPARK-20854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Blocker
>
> Currently the SQL hint syntax supports only identifiers as parameters, while 
> the Dataset hint syntax supports only strings.
> Both should support any expression as a parameter, for example numbers. This 
> is useful for implementing other hints in the future.
> Examples:
> {code}
> df.hint("hint1", Seq(1, 2, 3))
> df.hint("hint2", "A", 1)
> sql("select /*+ hint1((1,2,3)) */")
> sql("select /*+ hint2('A', 1) */")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-05-30 Thread pralabhkumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030629#comment-16030629
 ] 

pralabhkumar edited comment on SPARK-20199 at 5/31/17 4:56 AM:
---

Please review the pull request. 


was (Author: pralabhkumar):
please review the pull request . 
https://github.com/apache/spark/commit/16ccbdfd8862c528c90fdde94c8ec20d6631126e

> GradientBoostedTreesModel doesn't have  featureSubsetStrategy parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a featureSubsetStrategy parameter. It uses 
> random forest internally, which has featureSubsetStrategy hardcoded to "all". 
> It should be settable by the user to add randomness at the feature level.
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter.
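
For illustration, a hedged sketch of how the requested knob might be exposed through the DataFrame-based API. The setter name {{setFeatureSubsetStrategy}} is an assumption made here for the sketch, not an existing method in 2.1.0; everything else is the standard GBTRegressor API.

{code}
import org.apache.spark.ml.regression.GBTRegressor

// Standard GBT setup in spark.ml (this part exists today).
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(50)
  .setMaxDepth(5)

// The ask in this ticket: let callers control per-tree feature subsampling instead
// of the hardcoded "all". The setter below is hypothetical and only illustrates the
// desired surface; valid values would presumably mirror RandomForest
// ("auto", "all", "sqrt", "log2", "onethird", or a fraction such as "0.3").
// gbt.setFeatureSubsetStrategy("sqrt")

// val model = gbt.fit(trainingData)
{code}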



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-05-30 Thread pralabhkumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030629#comment-16030629
 ] 

pralabhkumar commented on SPARK-20199:
--

Please review the pull request:
https://github.com/apache/spark/commit/16ccbdfd8862c528c90fdde94c8ec20d6631126e

> GradientBoostedTreesModel doesn't have  featureSubsetStrategy parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a featureSubsetStrategy parameter. It uses 
> random forest internally, which has featureSubsetStrategy hardcoded to "all". 
> It should be settable by the user to add randomness at the feature level.
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20392:

Target Version/s: 2.3.0

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c17d5ef9
> 109_bucketizer_f22df23ef56f
> 110_bucketizer_bad04578bd20
> 111_bucketizer_35cfbde7e28f
> 112_bucketizer_cf89177a528b
> 
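
For reference, a hedged, self-contained sketch that reproduces the shape of this setup: one Bucketizer per column over a wide but short DataFrame. The column count, row count and splits are illustrative only and are not taken from the attached blockbuster.csv.

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object WidePipelineRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wide-pipeline-repro").master("local[*]").getOrCreate()

    val numCols = 312   // wide ...
    val numRows = 421   // ... but short
    val cols = (0 until numCols).map(i => s"c$i")

    // Build a DataFrame with many numeric columns and few rows.
    var df = spark.range(numRows).toDF("id")
    cols.foreach(c => df = df.withColumn(c, rand()))

    // One Bucketizer stage per column, mirroring the stage list quoted above.
    val stages: Array[PipelineStage] = cols.map { c =>
      new Bucketizer()
        .setInputCol(c)
        .setOutputCol(s"${c}_bucket")
        .setSplits(Array(Double.NegativeInfinity, 0.5, Double.PositiveInfinity))
    }.toArray

    val start = System.nanoTime()
    new Pipeline().setStages(stages).fit(df)
    println(s"fit took ${(System.nanoTime() - start) / 1e9} seconds")

    spark.stop()
  }
}
{code}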

[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20392:

Priority: Blocker  (was: Major)

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c17d5ef9
> 109_bucketizer_f22df23ef56f
> 110_bucketizer_bad04578bd20
> 111_bucketizer_35cfbde7e28f
> 112_bucketizer_cf89177a528b
> 

[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20392:

Issue Type: Improvement  (was: Bug)

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c17d5ef9
> 109_bucketizer_f22df23ef56f
> 110_bucketizer_bad04578bd20
> 111_bucketizer_35cfbde7e28f
> 

[jira] [Reopened] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-20392:
-

Will re-merge it at the end of Spark 2.3 to avoid conflicts when backporting 
analyzer-related PRs to 2.2 in the future.

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Liang-Chi Hsieh
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c17d5ef9
> 109_bucketizer_f22df23ef56f
> 110_bucketizer_bad04578bd20
> 

[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20392:

Fix Version/s: (was: 2.3.0)

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Liang-Chi Hsieh
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c17d5ef9
> 109_bucketizer_f22df23ef56f
> 110_bucketizer_bad04578bd20
> 111_bucketizer_35cfbde7e28f
> 112_bucketizer_cf89177a528b
> 113_bucketizer_183a0d393ef0
> 

[jira] [Commented] (SPARK-20876) If the input parameter is float type for ceil or floor, the result is not what we expected

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030604#comment-16030604
 ] 

Apache Spark commented on SPARK-20876:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/18155

> If the input parameter is float type for ceil or floor, the result is not what 
> we expected
> --
>
> Key: SPARK-20876
> URL: https://issues.apache.org/jira/browse/SPARK-20876
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Assignee: liuxian
> Fix For: 2.3.0
>
>
> spark-sql> SELECT ceil(cast(12345.1233 as float));
> spark-sql> 12345
> For this case, the expected result is 12346.
> spark-sql> SELECT floor(cast(-12345.1233 as float));
> spark-sql> -12345
> For this case, the expected result is -12346.
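
For clarity, a hedged illustration of the expected semantics in plain Scala (not the Spark internals touched by the fix): the fractional part of the float value must survive until the rounding step.

{code}
// Expected behaviour, shown with plain Scala math (illustrative only):
val f = 12345.1233f
assert(math.ceil(f.toDouble).toLong == 12346L)   // ceil should round up to 12346

val g = -12345.1233f
assert(math.floor(g.toDouble).toLong == -12346L) // floor should round down to -12346
{code}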



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20932) CountVectorizer support handle persistence

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20932:


Assignee: (was: Apache Spark)

> CountVectorizer support handle persistence
> --
>
> Key: SPARK-20932
> URL: https://issues.apache.org/jira/browse/SPARK-20932
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> in {{CountVectorizer.fit}}, RDDs {{input}} & {{wordCounts}} should be 
> unpersisted after computation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20932) CountVectorizer support handle persistence

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20932:


Assignee: Apache Spark

> CountVectorizer support handle persistence
> --
>
> Key: SPARK-20932
> URL: https://issues.apache.org/jira/browse/SPARK-20932
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> in {{CountVectorizer.fit}}, RDDs {{input}} & {{wordCounts}} should be 
> unpersisted after computation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20932) CountVectorizer support handle persistence

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030601#comment-16030601
 ] 

Apache Spark commented on SPARK-20932:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/18154

> CountVectorizer support handle persistence
> --
>
> Key: SPARK-20932
> URL: https://issues.apache.org/jira/browse/SPARK-20932
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> in {{CountVectorizer.fit}}, RDDs {{input}} & {{wordCounts}} should be 
> unpersisted after computation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20932) CountVectorizer support handle persistence

2017-05-30 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-20932:


 Summary: CountVectorizer support handle persistence
 Key: SPARK-20932
 URL: https://issues.apache.org/jira/browse/SPARK-20932
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: zhengruifeng


in {{CountVectorizer.fit}}, RDDs {{input}} & {{wordCounts}} should be 
unpersisted after computation.
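
A minimal, self-contained sketch of the persist/unpersist pattern being requested; the RDD names follow the description above, but this is illustrative code, not the actual {{CountVectorizer.fit}} body:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unpersist-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val input = sc.parallelize(Seq("a b a", "b c", "a c c")).map(_.split(" ").toSeq)
    input.persist(StorageLevel.MEMORY_AND_DISK)

    val wordCounts = input.flatMap(tokens => tokens.map(w => (w, 1L))).reduceByKey(_ + _)
    wordCounts.persist(StorageLevel.MEMORY_AND_DISK)

    // Materialize the vocabulary from the cached intermediate results.
    val vocab = wordCounts.collect().sortBy(-_._2).take(10).map(_._1)

    // The change proposed here: release the cached RDDs once they are no longer
    // needed, instead of leaving them cached for the lifetime of the application.
    wordCounts.unpersist(blocking = false)
    input.unpersist(blocking = false)

    println(vocab.mkString(", "))
    spark.stop()
  }
}
{code}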



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20931) Built-in SQL Function - ABS support string type

2017-05-30 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-20931:

Comment: was deleted

(was: I'm working on.)

> Built-in SQL Function - ABS support string type
> ---
>
> Key: SPARK-20931
> URL: https://issues.apache.org/jira/browse/SPARK-20931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>  Labels: starter
>
> {noformat}
> ABS()
> {noformat}
> Hive/MySQL support this.
> Ref: 
> https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20931) Built-in SQL Function - ABS support string type

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20931:


Assignee: Apache Spark

> Built-in SQL Function - ABS support string type
> ---
>
> Key: SPARK-20931
> URL: https://issues.apache.org/jira/browse/SPARK-20931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>  Labels: starter
>
> {noformat}
> ABS()
> {noformat}
> Hive/MySQL support this.
> Ref: 
> https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20931) Built-in SQL Function - ABS support string type

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20931:


Assignee: (was: Apache Spark)

> Built-in SQL Function - ABS support string type
> ---
>
> Key: SPARK-20931
> URL: https://issues.apache.org/jira/browse/SPARK-20931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>  Labels: starter
>
> {noformat}
> ABS()
> {noformat}
> Hive/MySQL support this.
> Ref: 
> https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20931) Built-in SQL Function - ABS support string type

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030577#comment-16030577
 ] 

Apache Spark commented on SPARK-20931:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/18153

> Built-in SQL Function - ABS support string type
> ---
>
> Key: SPARK-20931
> URL: https://issues.apache.org/jira/browse/SPARK-20931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>  Labels: starter
>
> {noformat}
> ABS()
> {noformat}
> Hive/MySQL support this.
> Ref: 
> https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20275) HistoryServer page shows incorrect complete date of inprogress apps

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20275.
-
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.2.0
   2.1.2

> HistoryServer page shows incorrect complete date of inprogress apps
> ---
>
> Key: SPARK-20275
> URL: https://issues.apache.org/jira/browse/SPARK-20275
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
> Attachments: screenshot-1.png
>
>
> The HistoryServer's incomplete page shows an in-progress application's completed 
> date as {{1969-12-31 23:59:59}}, which is not meaningful and could be 
> improved.
> !https://issues.apache.org/jira/secure/attachment/12862656/screenshot-1.png!
> So instead of showing this date, it is proposed here not to display this column, 
> since it is not required for in-progress applications.
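
For context, a hedged illustration of where the bogus date comes from, assuming the end time of an in-progress attempt is stored as -1 ms (which is what the displayed value corresponds to):

{code}
import java.text.SimpleDateFormat
import java.util.Date

// Formatting an "unset" end time of -1 ms since the epoch produces the date seen
// on the incomplete-applications page (exact output depends on the local timezone).
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
println(fmt.format(new Date(-1L)))   // e.g. 1969-12-31 23:59:59
{code}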



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20931) Built-in SQL Function - ABS support string type

2017-05-30 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-20931:
---

 Summary: Built-in SQL Function - ABS support string type
 Key: SPARK-20931
 URL: https://issues.apache.org/jira/browse/SPARK-20931
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Yuming Wang


{noformat}
ABS()
{noformat}
Hive/MySQL support this.

Ref: 
https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93
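
For illustration, the requested behaviour as seen from a spark-shell session (hedged: whether the string is cast to double and what the return type is are details left to the implementation; Hive's GenericUDFAbs, linked above, converts the string to a double first):

{code}
// Requested behaviour: a numeric string should be accepted and treated as a number.
spark.sql("SELECT abs('-12.34') AS a").show()   // expected to show 12.34
spark.sql("SELECT abs('5') AS b").show()        // expected to show 5 (or 5.0)
{code}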



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20931) Built-in SQL Function - ABS support string type

2017-05-30 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030562#comment-16030562
 ] 

Yuming Wang commented on SPARK-20931:
-

I'm working on it.

> Built-in SQL Function - ABS support string type
> ---
>
> Key: SPARK-20931
> URL: https://issues.apache.org/jira/browse/SPARK-20931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>  Labels: starter
>
> {noformat}
> ABS()
> {noformat}
> Hive/MySQL support this.
> Ref: 
> https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20930) Destroy broadcasted centers after computing cost

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20930:


Assignee: Apache Spark

>  Destroy broadcasted centers after computing cost
> -
>
> Key: SPARK-20930
> URL: https://issues.apache.org/jira/browse/SPARK-20930
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Trivial
>
> Destroy broadcasted centers after computing cost



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20930) Destroy broadcasted centers after computing cost

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20930:


Assignee: (was: Apache Spark)

>  Destroy broadcasted centers after computing cost
> -
>
> Key: SPARK-20930
> URL: https://issues.apache.org/jira/browse/SPARK-20930
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> Destroy broadcasted centers after computing cost



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20930) Destroy broadcasted centers after computing cost

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030559#comment-16030559
 ] 

Apache Spark commented on SPARK-20930:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/18152

>  Destroy broadcasted centers after computing cost
> -
>
> Key: SPARK-20930
> URL: https://issues.apache.org/jira/browse/SPARK-20930
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> Destroy broadcasted centers after computing cost



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20930) Destroy broadcasted centers after computing cost

2017-05-30 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-20930:


 Summary:  Destroy broadcasted centers after computing cost
 Key: SPARK-20930
 URL: https://issues.apache.org/jira/browse/SPARK-20930
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: zhengruifeng
Priority: Trivial


Destroy broadcasted centers after computing cost
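
A hedged sketch of the pattern the ticket asks for (illustrative, not the actual KMeans code): broadcast the centers, compute the cost, then destroy the broadcast instead of leaving it alive.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def computeCostSketch(sc: SparkContext, data: RDD[Vector], centers: Array[Vector]): Double = {
  val bcCenters = sc.broadcast(centers)
  val cost = data.map { point =>
    // squared distance from each point to its closest center
    bcCenters.value.map(center => Vectors.sqdist(center, point)).min
  }.sum()
  // The change proposed here: release the broadcast once the cost has been computed.
  bcCenters.destroy()
  cost
}
{code}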



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20213:
---

Assignee: Wenchen Fan

> DataFrameWriter operations do not show up in SQL tab
> 
>
> Key: SPARK-20213
> URL: https://issues.apache.org/jira/browse/SPARK-20213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ryan Blue
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png
>
>
> In 1.6.1, {{DataFrame}} writes triggered by {{DataFrameWriter}} actions like 
> {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no 
> longer do. The problem is that 2.0.0 and later no longer wrap execution with 
> {{SQLExecution.withNewExecutionId}}, which emits 
> {{SparkListenerSQLExecutionStart}}.
> Here are the relevant parts of the stack traces:
> {code:title=Spark 1.6.1}
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>  => holding 
> Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196)
> {code}
> {code:title=Spark 2.0.0}
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>  => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301)
> {code}
> I think this was introduced by 
> [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be 
> to add withNewExecutionId to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610
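
A schematic sketch of the proposed fix (hedged: simplified, with a hypothetical helper name, and not the patch that was actually merged): run the write's {{QueryExecution}} inside {{SQLExecution.withNewExecutionId}}, the same wrapper Dataset actions use, so that {{SparkListenerSQLExecutionStart}} is emitted again.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.{QueryExecution, SQLExecution}

// Inside DataFrameWriter (schematic; assumes the 2.x signature of withNewExecutionId
// that appears in the 1.6.1 stack trace above; runWithSqlTab is a hypothetical name):
def runWithSqlTab(session: SparkSession, qe: QueryExecution): Unit = {
  SQLExecution.withNewExecutionId(session, qe) {
    qe.toRdd   // force execution under a tracked execution id
  }
}
{code}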



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20213.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18064
[https://github.com/apache/spark/pull/18064]

> DataFrameWriter operations do not show up in SQL tab
> 
>
> Key: SPARK-20213
> URL: https://issues.apache.org/jira/browse/SPARK-20213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ryan Blue
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png
>
>
> In 1.6.1, {{DataFrame}} writes triggered by {{DataFrameWriter}} actions like 
> {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no 
> longer do. The problem is that 2.0.0 and later no longer wrap execution with 
> {{SQLExecution.withNewExecutionId}}, which emits 
> {{SparkListenerSQLExecutionStart}}.
> Here are the relevant parts of the stack traces:
> {code:title=Spark 1.6.1}
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>  => holding 
> Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196)
> {code}
> {code:title=Spark 2.0.0}
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>  => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301)
> {code}
> I think this was introduced by 
> [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be 
> to add withNewExecutionId to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20929) LinearSVC should not use shared Param HasThresholds

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20929:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> LinearSVC should not use shared Param HasThresholds
> ---
>
> Key: SPARK-20929
> URL: https://issues.apache.org/jira/browse/SPARK-20929
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> LinearSVC applies the Param 'threshold' to the rawPrediction, not the 
> probability.  It has different semantics than the shared Param HasThreshold, 
> so it should not use the shared Param.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20929) LinearSVC should not use shared Param HasThresholds

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20929:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> LinearSVC should not use shared Param HasThresholds
> ---
>
> Key: SPARK-20929
> URL: https://issues.apache.org/jira/browse/SPARK-20929
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> LinearSVC applies the Param 'threshold' to the rawPrediction, not the 
> probability.  It has different semantics than the shared Param HasThreshold, 
> so it should not use the shared Param.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20929) LinearSVC should not use shared Param HasThresholds

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030437#comment-16030437
 ] 

Apache Spark commented on SPARK-20929:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/18151

> LinearSVC should not use shared Param HasThresholds
> ---
>
> Key: SPARK-20929
> URL: https://issues.apache.org/jira/browse/SPARK-20929
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> LinearSVC applies the Param 'threshold' to the rawPrediction, not the 
> probability.  It has different semantics than the shared Param HasThreshold, 
> so it should not use the shared Param.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20929) LinearSVC should not use shared Param HasThresholds

2017-05-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20929:
--
Priority: Minor  (was: Major)

> LinearSVC should not use shared Param HasThresholds
> ---
>
> Key: SPARK-20929
> URL: https://issues.apache.org/jira/browse/SPARK-20929
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> LinearSVC applies the Param 'threshold' to the rawPrediction, not the 
> probability.  It has different semantics than the shared Param HasThreshold, 
> so it should not use the shared Param.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20929) LinearSVC should not use shared Param HasThresholds

2017-05-30 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20929:
-

 Summary: LinearSVC should not use shared Param HasThresholds
 Key: SPARK-20929
 URL: https://issues.apache.org/jira/browse/SPARK-20929
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


LinearSVC applies the Param 'threshold' to the rawPrediction, not the 
probability.  It has different semantics than the shared Param HasThreshold, so 
it should not use the shared Param.
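
A minimal Scala sketch of the semantic difference described above (not from the
ticket; the values are illustrative and assume the Spark ML 2.2 API):

{code}
import org.apache.spark.ml.classification.{LinearSVC, LogisticRegression}

// LinearSVC compares 'threshold' against rawPrediction (the SVM margin),
// so 0.0 is the natural decision boundary.
val svc = new LinearSVC().setThreshold(0.0)

// LogisticRegression's shared HasThreshold param is compared against the
// class-1 probability, so it lives in [0, 1] with 0.5 as the usual default.
val lr = new LogisticRegression().setThreshold(0.5)
{code}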



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-05-30 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030379#comment-16030379
 ] 

Nan Zhu commented on SPARK-20928:
-

Hi, is there any description of what this means?

> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20651) Speed up the new app state listener

2017-05-30 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20651.

Resolution: Won't Do

I've done some perf work to make sure live applications don't regress, and made 
a bunch of changes that make the original code I had for this milestone 
obsolete, so I removed it from my plan.

The current list of "upcoming" PRs can be seen at:
https://github.com/vanzin/spark/pulls

> Speed up the new app state listener
> ---
>
> Key: SPARK-20651
> URL: https://issues.apache.org/jira/browse/SPARK-20651
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks enhancements to the code added in previous tasks so that the 
> new app state listener is faster; it adds a caching layer and an asynchronous 
> write layer that also does deduplication, so that it avoids blocking the 
> listener thread and also avoids unnecessary writes to disk.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2183) Avoid loading/shuffling data twice in self-join query

2017-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-2183.
--
Resolution: Fixed
  Assignee: Reynold Xin

This shouldn't be an issue anymore with exchange reuse in the latest release (I 
checked 2.2).
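
A quick way to check this in a 2.2 spark-shell (a sketch, assuming a table named
"src" like the one in the quoted plan below exists in the current catalog):

{code}
val q = spark.sql("SELECT * FROM src a JOIN src b ON a.key = b.key")
q.explain()
// With exchange reuse enabled (spark.sql.exchange.reuse, true by default), the
// physical plan should show a ReusedExchange node instead of a second
// scan + shuffle of src.
{code}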



> Avoid loading/shuffling data twice in self-join query
> -
>
> Key: SPARK-2183
> URL: https://issues.apache.org/jira/browse/SPARK-2183
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Minor
>
> {code}
> scala> hql("select * from src a join src b on (a.key=b.key)")
> res2: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[3] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#3:0,value#4:1,key#5:2,value#6:3]
>  HashJoin [key#3], [key#5], BuildRight
>   Exchange (HashPartitioning [key#3:0], 200)
>HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), 
> None
>   Exchange (HashPartitioning [key#5:0], 200)
>HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), 
> None
> {code}
> The optimal execution strategy for the above example is to load the data only 
> once and repartition it once. 
> If we want to hyper-optimize it, we can also have a self-join operator that 
> builds the hashmap once and then simply traverses it ...



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-05-30 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030284#comment-16030284
 ] 

Sital Kedia commented on SPARK-20178:
-

https://github.com/apache/spark/pull/18150

> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures.  There are 4 JIRAs currently related to this.  
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753
> I will put my initial thoughts in a follow-on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20883) Improve StateStore APIs for efficiency

2017-05-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20883.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> Improve StateStore APIs for efficiency
> --
>
> Key: SPARK-20883
> URL: https://issues.apache.org/jira/browse/SPARK-20883
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.3.0
>
>
> The current state store API has a number of problems that create too many 
> transient objects, causing memory pressure.
> - StateStore.get() returns an Option, which forces creation of a Some/None 
> object for every get
> - StateStore.iterator() returns tuples, which forces creation of a new tuple 
> for each record returned
> - StateStore.updates() requires the implementation to keep track of updates, 
> while this is used minimally (only by Append mode in streaming aggregations). 
> It can be removed entirely.
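
A sketch of the direction described above, not Spark's actual interface: hand back
the row itself (null when absent) instead of an Option, and reuse one mutable pair
while iterating instead of allocating a tuple per record.

{code}
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Mutable pair that an implementation can reuse across iterator calls.
final class KeyValuePair(var key: UnsafeRow = null, var value: UnsafeRow = null)

trait LeanStateStore {
  def get(key: UnsafeRow): UnsafeRow        // returns null instead of Some/None
  def iterator(): Iterator[KeyValuePair]    // implementations may reuse one pair
  def put(key: UnsafeRow, value: UnsafeRow): Unit
}
{code}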



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19753) Remove all shuffle files on a host in case of slave lost or fetch failure

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030270#comment-16030270
 ] 

Apache Spark commented on SPARK-19753:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/18150

> Remove all shuffle files on a host in case of slave lost or fetch failure
> -
>
> Key: SPARK-19753
> URL: https://issues.apache.org/jira/browse/SPARK-19753
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: Sital Kedia
>
> Currently, when we detect a fetch failure, we only remove the shuffle files 
> produced by that executor, while the host itself might be down and none of 
> its shuffle files accessible. When we are running multiple executors on a 
> host, any host going down currently results in multiple fetch failures 
> and multiple retries of the stage, which is very inefficient. If instead we 
> remove all the shuffle files on that host on the first fetch failure, we can 
> rerun all the tasks on that host in a single stage retry. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20894) Error while checkpointing to HDFS

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20894:


Assignee: (was: Apache Spark)

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This for some reason fails with the Error 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20894) Error while checkpointing to HDFS

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20894:


Assignee: Apache Spark

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
>Assignee: Apache Spark
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This for some reason fails with the Error 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030261#comment-16030261
 ] 

Apache Spark commented on SPARK-20894:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18149

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This for some reason fails with the Error 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20926:


Assignee: Apache Spark

> Exposure to Guava libraries by directly accessing tableRelationCache in 
> SessionCatalog caused failures
> --
>
> Key: SPARK-20926
> URL: https://issues.apache.org/jira/browse/SPARK-20926
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Reza Safi
>Assignee: Apache Spark
>
> Because of the shading that we did for the Guava libraries, we see test 
> failures whenever components directly access tableRelationCache in 
> SessionCatalog.
> This can happen in any component that shades the Guava library. Failures look 
> like this:
> {noformat}
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache;
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32)
> 01:25:14   at 
> org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> 01:25:14   at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278)
> 01:25:14   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278)
> {noformat}
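
A hedged sketch of the kind of indirection that avoids this: keep the Guava-typed
cache behind methods whose signatures mention no com.google.common types, so a
component built against a shaded Guava never links against the Guava class. The
wrapper and its method names below are illustrative, not Spark's actual API.

{code}
import org.apache.spark.sql.catalyst.catalog.SessionCatalog

class CatalogCacheOps(catalog: SessionCatalog) {
  // Compiled inside Spark, so the Guava type never appears in a caller's signature.
  def invalidateAllCachedTables(): Unit = catalog.tableRelationCache.invalidateAll()
  def cachedTableCount(): Long = catalog.tableRelationCache.size()
}
{code}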



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030257#comment-16030257
 ] 

Apache Spark commented on SPARK-20926:
--

User 'rezasafi' has created a pull request for this issue:
https://github.com/apache/spark/pull/18148

> Exposure to Guava libraries by directly accessing tableRelationCache in 
> SessionCatalog caused failures
> --
>
> Key: SPARK-20926
> URL: https://issues.apache.org/jira/browse/SPARK-20926
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Reza Safi
>
> Because of the shading that we did for the Guava libraries, we see test 
> failures whenever components directly access tableRelationCache in 
> SessionCatalog.
> This can happen in any component that shades the Guava library. Failures look 
> like this:
> {noformat}
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache;
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32)
> 01:25:14   at 
> org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> 01:25:14   at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278)
> 01:25:14   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20926:


Assignee: (was: Apache Spark)

> Exposure to Guava libraries by directly accessing tableRelationCache in 
> SessionCatalog caused failures
> --
>
> Key: SPARK-20926
> URL: https://issues.apache.org/jira/browse/SPARK-20926
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Reza Safi
>
> Because of the shading that we did for the Guava libraries, we see test 
> failures whenever components directly access tableRelationCache in 
> SessionCatalog.
> This can happen in any component that shades the Guava library. Failures look 
> like this:
> {noformat}
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache;
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32)
> 01:25:14   at 
> org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> 01:25:14   at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278)
> 01:25:14   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

2017-05-30 Thread Jeffrey Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030197#comment-16030197
 ] 

Jeffrey Quinn commented on SPARK-20925:
---

Apologies, will move to the mailing list next time I have a general question 
like that.

I agree key skew is often an issue, but for the data we were testing with, the 
cardinality of the partition column is 1, which helps rule some things out.

I wanted to post again because after taking another crack at looking through 
the source I think I may have found a root cause:

The ExecuteWriteTask implementation for a partitioned table 
(org.apache.spark.sql.execution.datasources.FileFormatWriter.DynamicPartitionWriteTask)
 sorts the rows of the table by the partition keys before writing. This makes 
sense as it minimizes the number of OutputWriters that need to be created.

In the course of doing this, the ExecuteWriteTask uses 
org.apache.spark.sql.execution.UnsafeKVExternalSorter to sort the rows to be 
written. It then gets an iterator over the sorted rows via 
org.apache.spark.sql.execution.UnsafeKVExternalSorter#sortedIterator.

The scaladoc of that method advises that it is the caller's responsibility to 
call org.apache.spark.sql.execution.UnsafeKVExternalSorter#cleanupResources 
(see 
https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java#L176).

However in ExecuteWriteTask, we appear to never call cleanupResources() when we 
are done with the iterator (see 
https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L379).

This seems like it could create a memory leak, which would explain the behavior 
that we have observed.

Luckily, it seems like this possible memory leak was fixed totally 
coincidentally by this revision: 
https://github.com/apache/spark/commit/776b8f17cfc687a57c005a421a81e591c8d44a3f

That revision changes this behavior for stated performance reasons, so the best 
solution to this issue may be to upgrade to v2.1.1.
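
A minimal sketch of the caller-side cleanup being described (the helper name is
made up; the real call site is in FileFormatWriter):

{code}
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.execution.UnsafeKVExternalSorter

def drainSorted(sorter: UnsafeKVExternalSorter)(write: (UnsafeRow, UnsafeRow) => Unit): Unit = {
  val iter = sorter.sortedIterator()
  try {
    while (iter.next()) {
      write(iter.getKey, iter.getValue)
    }
  } finally {
    // The scaladoc quoted above makes this the caller's responsibility; otherwise
    // the sorter's memory and spill files are held until the task finishes.
    sorter.cleanupResources()
  }
}
{code}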


> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --
>
> Key: SPARK-20925
> URL: https://issues.apache.org/jira/browse/SPARK-20925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: Our workflow reads data from a JSON 
> format stored on S3, and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
> .read
> .schema(inputSchema)
> .json(expandedInputPath)
> .select(columnMap:_*)
> .write.partitionBy("partition_by_column")
> .parquet(outputPath)
> ```
> Unfortunately, for larger inputs, this job consistently fails with containers 
> running out of memory. We observed containers of up to 20GB OOMing, which is 
> surprising because the input data itself is only 15 GB compressed and maybe 
> 100GB uncompressed.
> The error message we get indicates yarn is killing the containers. The 
> executors are running out of memory and not the driver.
> ```Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost 
> task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor 
> 14): ExecutorLostFailure (executor 14 exited caused by one of the running 
> tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB 
> of 20.9 GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead.```
> We tried a full parameter sweep, including using dynamic allocation and 
> setting executor memory as high as 20GB. The result was the same each time, 
> with the job failing due to lost executors due to YARN killing containers.
> We were able to bisect that `partitionBy` is the problem by progressively 
> removing/commenting out parts of our workflow. Finally when we get to the 
> above state, if we remove `partitionBy` the job succeeds with no OOM.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-05-30 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-20928:


 Summary: Continuous Processing Mode for Structured Streaming
 Key: SPARK-20928
 URL: https://issues.apache.org/jira/browse/SPARK-20928
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19236) Add createOrReplaceGlobalTempView

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030165#comment-16030165
 ] 

Apache Spark commented on SPARK-19236:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18147

> Add createOrReplaceGlobalTempView
> -
>
> Key: SPARK-19236
> URL: https://issues.apache.org/jira/browse/SPARK-19236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Arman Yazdani
>Priority: Minor
>
> There are 3 methods for saving temp tables:
> createTempView
> createOrReplaceTempView
> createGlobalTempView
> but there isn't:
> createOrReplaceGlobalTempView
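
For context, a spark-shell sketch of the existing methods (the view name is made up):

{code}
val df = spark.range(5).toDF("id")
df.createOrReplaceTempView("people")   // session-scoped, has an "orReplace" variant
df.createGlobalTempView("people")      // cross-session, lives in the global_temp database
spark.sql("SELECT * FROM global_temp.people").show()
// ...but there is no createOrReplaceGlobalTempView counterpart, as noted above.
{code}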



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)

2017-05-30 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030160#comment-16030160
 ] 

Shixiong Zhu commented on SPARK-20894:
--

The root issue here is that the driver uses the local file system for checkpoints 
but the executors use HDFS.

I reopened this ticket because I think we can improve the error message here.
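
A hedged illustration of that point, rendering the reporter's snippet in Scala: pass
the checkpoint location as a fully qualified HDFS URI so the driver and the executors
resolve the same file system (the host/port are placeholders, and KafkaSink is the
reporter's ForeachWriter).

{code}
val query = df2.writeStream
  .foreach(new KafkaSink())
  .option("checkpointLocation", "hdfs://namenode-host:8020/user/spark/checkpoint")
  .outputMode("update")
  .start()
query.awaitTermination()
{code}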

> Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
> ---
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This for some reason fails with the Error 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS

2017-05-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20894:
-
Summary: Error while checkpointing to HDFS  (was: Error while checkpointing 
to HDFS (similar to JIRA SPARK-19268))

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This for some reason fails with the Error 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)

2017-05-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20894:
-
Issue Type: Improvement  (was: Bug)

> Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
> ---
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This for some reason fails with the Error 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)

2017-05-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-20894:
--

> Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
> ---
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This for some reason fails with the Error 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20924) Unable to call the function registered in the not-current database

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20924.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18146
[https://github.com/apache/spark/pull/18146]

> Unable to call the function registered in the not-current database
> --
>
> Key: SPARK-20924
> URL: https://issues.apache.org/jira/browse/SPARK-20924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> We are unable to call the function registered in the not-current database. 
> {noformat}
> sql("CREATE DATABASE dAtABaSe1")
> sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS 
> '${classOf[GenericUDAFAverage].getName}'")
> sql("SELECT dAtABaSe1.test_avg(1)")
> {noformat}
> The above code returns an error:
> {noformat}
> Undefined function: 'dAtABaSe1.test_avg'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

2017-05-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030108#comment-16030108
 ] 

Sean Owen commented on SPARK-20925:
---

This is better for the mailing list. Spark allocates off heap memory for lots 
of things (look up "spark tungsten"). Sometimes the default isn't enough. It's 
not a Spark issue per se, no, but a matter of how much YARN is asked to give 
the JVM. Partitioning isn't necessarily a trivial operation, and you might have 
some issue with key skew. By the way, the error message tells you about  
spark.yarn.executor.memoryOverhead already.
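
One way to apply that advice when building the session (a sketch; the numbers are
illustrative only, and spark.yarn.executor.memoryOverhead is given in megabytes):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-parquet-write")
  .config("spark.executor.memory", "16g")
  .config("spark.yarn.executor.memoryOverhead", "4096")
  .getOrCreate()
{code}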

> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --
>
> Key: SPARK-20925
> URL: https://issues.apache.org/jira/browse/SPARK-20925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: Our workflow reads data from a JSON 
> format stored on S3, and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
> .read
> .schema(inputSchema)
> .json(expandedInputPath)
> .select(columnMap:_*)
> .write.partitionBy("partition_by_column")
> .parquet(outputPath)
> ```
> Unfortunately, for larger inputs, this job consistently fails with containers 
> running out of memory. We observed containers of up to 20GB OOMing, which is 
> surprising because the input data itself is only 15 GB compressed and maybe 
> 100GB uncompressed.
> The error message we get indicates yarn is killing the containers. The 
> executors are running out of memory and not the driver.
> ```Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost 
> task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor 
> 14): ExecutorLostFailure (executor 14 exited caused by one of the running 
> tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB 
> of 20.9 GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead.```
> We tried a full parameter sweep, including using dynamic allocation and 
> setting executor memory as high as 20GB. The result was the same each time, 
> with the job failing due to lost executors due to YARN killing containers.
> We were able to bisect that `partitionBy` is the problem by progressively 
> removing/commenting out parts of our workflow. Finally when we get to the 
> above state, if we remove `partitionBy` the job succeeds with no OOM.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark

2017-05-30 Thread Ruben Berenguel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030102#comment-16030102
 ] 

Ruben Berenguel commented on SPARK-19732:
-

I'll give this a go!

> DataFrame.fillna() does not work for bools in PySpark
> -
>
> Key: SPARK-19732
> URL: https://issues.apache.org/jira/browse/SPARK-19732
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Len Frodgers
>Priority: Minor
>
> In PySpark, the fillna function of DataFrame inadvertently casts bools to 
> ints, so fillna cannot be used to fill True/False.
> e.g. 
> `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` 
> yields
> `[Row(a=True), Row(a=None)]`
> It should be a=True for the second Row
> The cause is this bit of code: 
> {code}
> if isinstance(value, (int, long)):
> value = float(value)
> {code}
> There needs to be a separate check for isinstance(bool), since in python, 
> bools are ints too
> Additionally there's another anomaly:
> Spark (and pyspark) supports filling of bools if you specify the args as a 
> map: 
> {code}
> fillna({"a": False})
> {code}
> , but not if you specify it as
> {code}
> fillna(False)
> {code}
> This is because (scala-)Spark has no
> {code}
> def fill(value: Boolean): DataFrame = fill(value, df.columns)
> {code}
>  method. I find that strange/buggy
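
The map form mentioned above also exists on the Scala side; a minimal sketch built
against a hand-made nullable boolean column (names and values are illustrative):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BooleanType, StructField, StructType}

val schema = StructType(Seq(StructField("a", BooleanType, nullable = true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(true), Row(null))),
  schema)

df.na.fill(Map("a" -> false)).show()   // the null in column "a" becomes false
// ...whereas a plain df.na.fill(false) overload does not exist, as the report notes.
{code}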



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

2017-05-30 Thread Jeffrey Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030089#comment-16030089
 ] 

Jeffrey Quinn commented on SPARK-20925:
---

Thanks Sean,

Sorry to continue to comment on a resolved issue, but I'm extremely curious to 
learn how this works since I have run into this issue several times before on 
other applications.

In the scenario you describe, the Spark application logic thinks that it can 
allocate more memory, but that calculation is incorrect because there is a 
significant amount of off-heap memory already in use that the Spark 
application logic does not take into account. Instead, to provide for off-heap 
allocation, a static overhead expressed as a percentage of total JVM memory is 
used. Is that correct?

The thing that really boggles my mind is: what could be using that off-heap 
memory? As you can see from my question, we do not have any custom UDF code 
here; we are just calling the Spark API in the most straightforward way 
possible. Why would the default setting not be sufficient for this case? Our 
schema has a significant number of columns (~100); perhaps that is to blame? Is 
Catalyst using the off-heap memory, maybe?

Thanks,

Jeff

> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --
>
> Key: SPARK-20925
> URL: https://issues.apache.org/jira/browse/SPARK-20925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: Our workflow reads data from a JSON 
> format stored on S3, and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
> .read
> .schema(inputSchema)
> .json(expandedInputPath)
> .select(columnMap:_*)
> .write.partitionBy("partition_by_column")
> .parquet(outputPath)
> ```
> Unfortunately, for larger inputs, this job consistently fails with containers 
> running out of memory. We observed containers of up to 20GB OOMing, which is 
> surprising because the input data itself is only 15 GB compressed and maybe 
> 100GB uncompressed.
> The error message we get indicates yarn is killing the containers. The 
> executors are running out of memory and not the driver.
> ```Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost 
> task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor 
> 14): ExecutorLostFailure (executor 14 exited caused by one of the running 
> tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB 
> of 20.9 GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead.```
> We tried a full parameter sweep, including using dynamic allocation and 
> setting executor memory as high as 20GB. The result was the same each time, 
> with the job failing due to lost executors due to YARN killing containers.
> We were able to bisect that `partitionBy` is the problem by progressively 
> removing/commenting out parts of our workflow. Finally when we get to the 
> above state, if we remove `partitionBy` the job succeeds with no OOM.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

2017-05-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20925.
---
Resolution: Not A Problem

That doesn't mean the JVM is out of memory; it kind of means the opposite: the JVM 
thinks it can use more memory than YARN will allow, due to off-heap allocation. 
Setting the heap size higher only helps if you make it so high that the default 
off-heap cushion is sufficient. Increase spark.yarn.executor.memoryOverhead instead, 
as your heap is likely already far too big.

> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --
>
> Key: SPARK-20925
> URL: https://issues.apache.org/jira/browse/SPARK-20925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: Our workflow reads data from a JSON 
> format stored on S3, and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
> .read
> .schema(inputSchema)
> .json(expandedInputPath)
> .select(columnMap:_*)
> .write.partitionBy("partition_by_column")
> .parquet(outputPath)
> ```
> Unfortunately, for larger inputs, this job consistently fails with containers 
> running out of memory. We observed containers of up to 20GB OOMing, which is 
> surprising because the input data itself is only 15 GB compressed and maybe 
> 100GB uncompressed.
> The error message we get indicates yarn is killing the containers. The 
> executors are running out of memory and not the driver.
> ```Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost 
> task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor 
> 14): ExecutorLostFailure (executor 14 exited caused by one of the running 
> tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB 
> of 20.9 GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead.```
> We tried a full parameter sweep, including using dynamic allocation and 
> setting executor memory as high as 20GB. The result was the same each time, 
> with the job failing due to lost executors due to YARN killing containers.
> We were able to bisect that `partitionBy` is the problem by progressively 
> removing/commenting out parts of our workflow. Finally when we get to the 
> above state, if we remove `partitionBy` the job succeeds with no OOM.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-05-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030059#comment-16030059
 ] 

Thomas Graves commented on SPARK-20923:
---

Taking a quick look at the history of _updatedBlockStatuses, it looks like this 
used to be used by StorageStatusListener, but that has since been changed to 
happen on the SparkListenerBlockUpdated event. That BlockUpdated event comes from 
BlockManagerMaster.updateBlockInfo, which is called by the executors. So I'm not 
seeing anything use _updatedBlockStatuses. I'll start to rip it out and see what 
I hit.

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  The # of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated, and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses was already removed from the task end event under 
> SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
> know if I missed something?
>  If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18881) Spark never finishes jobs and stages, JobProgressListener fails

2017-05-30 Thread Mathieu D (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030025#comment-16030025
 ] 

Mathieu D edited comment on SPARK-18881 at 5/30/17 7:52 PM:


Just to mention a workaround for those experiencing the problem: try increasing 
{{spark.scheduler.listenerbus.eventqueue.size}} (default 10000). 
It may only postpone the problem if the queue fills faster than the listeners can 
drain it for a long time. In our case, we have bursts of activity and raising this 
limit helps.


was (Author: mathieude):
Just to mention a workaround for those experiencing the problem : try increase 
{{spark.scheduler.listenerbus.eventqueue.size}} (default 1). 
It may only postpone the problem, if the queue filling is faster than listeners 
for a long time. In our case, we have bursts of activity and raising this 
limits helps.

> Spark never finishes jobs and stages, JobProgressListener fails
> ---
>
> Key: SPARK-18881
> URL: https://issues.apache.org/jira/browse/SPARK-18881
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
> Environment: yarn, deploy-mode = client
>Reporter: Mathieu D
>
> We have a Spark application that continuously processes a lot of incoming jobs. 
> Several jobs are processed in parallel, on multiple threads.
> During intensive workloads, at some point, we start to have hundreds of  
> warnings like this :
> {code}
> 16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379
> 16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job 
> 64610
> 16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage 
> 147405
> 16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406
> 16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job 
> 64622
> {code}
> Starting from that, the performance of the app plummets, and most Stages and 
> Jobs never finish. On the Spark UI, I can see figures like 13000 pending jobs.
> I can't clearly see another related exception happening before this. Maybe the 
> one below, but it concerns another listener:
> {code}
> 16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because 
> no remaining room in event queue. This likely means one of the SparkListeners 
> is too slow and cannot keep up with the rate at which tasks are being started 
> by the scheduler.
> 16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since 
> Thu Jan 01 01:00:00 CET 1970
> {code}
> This is very problematic for us, since it's hard to detect, and requires an 
> app restart.
> *EDIT :*
> I confirm the sequence :
> 1- ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining 
> room in event queue
> then
> 2- JobProgressListener losing track of job and stages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18881) Spark never finishes jobs and stages, JobProgressListener fails

2017-05-30 Thread Mathieu D (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030025#comment-16030025
 ] 

Mathieu D commented on SPARK-18881:
---

Just to mention a workaround for those experiencing the problem: try increasing 
{{spark.scheduler.listenerbus.eventqueue.size}} (default 10000). 
It may only postpone the problem if the queue fills faster than the listeners can 
drain it for a long time. In our case, we have bursts of activity and raising this 
limit helps.
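
The same workaround expressed as a conf setting (a sketch; the value is illustrative
and must be set before the SparkContext is created):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("bursty-app")
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")
val sc = new SparkContext(conf)
{code}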

> Spark never finishes jobs and stages, JobProgressListener fails
> ---
>
> Key: SPARK-18881
> URL: https://issues.apache.org/jira/browse/SPARK-18881
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
> Environment: yarn, deploy-mode = client
>Reporter: Mathieu D
>
> We have a Spark application that continuously processes a lot of incoming jobs. 
> Several jobs are processed in parallel, on multiple threads.
> During intensive workloads, at some point, we start to have hundreds of  
> warnings like this :
> {code}
> 16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379
> 16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job 
> 64610
> 16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage 
> 147405
> 16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406
> 16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job 
> 64622
> {code}
> Starting from that, the performance of the app plummets, and most Stages and 
> Jobs never finish. On the Spark UI, I can see figures like 13000 pending jobs.
> I can't clearly see another related exception happening before this. Maybe the 
> one below, but it concerns another listener:
> {code}
> 16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because 
> no remaining room in event queue. This likely means one of the SparkListeners 
> is too slow and cannot keep up with the rate at which tasks are being started 
> by the scheduler.
> 16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since 
> Thu Jan 01 01:00:00 CET 1970
> {code}
> This is very problematic for us, since it's hard to detect, and requires an 
> app restart.
> *EDIT :*
> I confirm the sequence :
> 1- ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining 
> room in event queue
> then
> 2- JobProgressListener losing track of job and stages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19044) PySpark dropna() can fail with AnalysisException

2017-05-30 Thread Ruben Berenguel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030021#comment-16030021
 ] 

Ruben Berenguel commented on SPARK-19044:
-

Oh, there's a typo in the "equivalent Scala code" in the bug report, where we 
have v1 instead of v2:

val v1 = spark.range(10)
val v2 = v1.crossJoin(v1)
v2.na.drop().explain()
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: 
id#0L, id#3L.;

I think this can't really be considered a bug, since it's at least as bad in 
Scala as it is in Python (not much more can be done here, since this is the 
expected behaviour of a LogicalPlan). [~holdenk], do you know who I need to ping 
in this case?

> PySpark dropna() can fail with AnalysisException
> 
>
> Key: SPARK-19044
> URL: https://issues.apache.org/jira/browse/SPARK-19044
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Josh Rosen
>Priority: Minor
>
> In PySpark, the following fails with an AnalysisException:
> {code}
> v1 = spark.range(10)
> v2 = v1.crossJoin(v1)
> v2.dropna()
> {code}
> {code}
> AnalysisException: u"Reference 'id' is ambiguous, could be: id#66L, id#69L.;"
> {code}
> However, the equivalent Scala code works fine:
> {code}
> val v1 = spark.range(10)
> val v2 = v1.crossJoin(v1)
> v1.na.drop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20803) KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is n

2017-05-30 Thread Bettadapura Srinath Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030004#comment-16030004
 ] 

Bettadapura Srinath Sharma commented on SPARK-20803:


In Java, the (correct) result is:
Code:
{code}
KernelDensity kd = new KernelDensity().setSample(col3).setBandwidth(3.0);
double[] densities = kd.estimate(samplePoints);
{code}

[0.06854408498726733, 1.4028730306237974E-174, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


> KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws 
> net.razorvine.pickle.PickleException when input data is normally distributed 
> (no error when data is not normally distributed)
> ---
>
> Key: SPARK-20803
> URL: https://issues.apache.org/jira/browse/SPARK-20803
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.1.1
> Environment: Linux version 4.4.14-smp
> x86/fpu: Legacy x87 FPU detected.
> using command line: 
> bash-4.3$ ./bin/spark-submit ~/work/python/Features.py
> bash-4.3$ pwd
> /home/bsrsharma/spark-2.1.1-bin-hadoop2.7
> export JAVA_HOME=/home/bsrsharma/jdk1.8.0_121
>Reporter: Bettadapura Srinath Sharma
>
> When data is NOT normally distributed (correct behavior):
> This code:
>   vecRDD = sc.parallelize(colVec)
> kd = KernelDensity()
> kd.setSample(vecRDD)
> kd.setBandwidth(3.0)
> # Find density estimates for the given values
> densities = kd.estimate(samplePoints)
> produces:
> 17/05/18 15:40:36 INFO SparkContext: Starting job: aggregate at 
> KernelDensity.scala:92
> 17/05/18 15:40:36 INFO DAGScheduler: Got job 21 (aggregate at 
> KernelDensity.scala:92) with 1 output partitions
> 17/05/18 15:40:36 INFO DAGScheduler: Final stage: ResultStage 24 (aggregate 
> at KernelDensity.scala:92)
> 17/05/18 15:40:36 INFO DAGScheduler: Parents of final stage: List()
> 17/05/18 15:40:36 INFO DAGScheduler: Missing parents: List()
> 17/05/18 15:40:36 INFO DAGScheduler: Submitting ResultStage 24 
> (MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345), which 
> has no missing parents
> 17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25 stored as values in 
> memory (estimated size 6.6 KB, free 413.6 MB)
> 17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes 
> in memory (estimated size 3.6 KB, free 413.6 MB)
> 17/05/18 15:40:36 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory 
> on 192.168.0.115:38645 (size: 3.6 KB, free: 413.9 MB)
> 17/05/18 15:40:36 INFO SparkContext: Created broadcast 25 from broadcast at 
> DAGScheduler.scala:996
> 17/05/18 15:40:36 INFO DAGScheduler: Submitting 1 missing tasks from 
> ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at 
> PythonMLLibAPI.scala:1345)
> 17/05/18 15:40:36 INFO TaskSchedulerImpl: Adding task set 24.0 with 1 tasks
> 17/05/18 15:40:36 INFO TaskSetManager: Starting task 0.0 in stage 24.0 (TID 
> 24, localhost, executor driver, partition 0, PROCESS_LOCAL, 96186 bytes)
> 17/05/18 15:40:36 INFO Executor: Running task 0.0 in stage 24.0 (TID 24)
> 17/05/18 15:40:37 INFO PythonRunner: Times: total = 66, boot = -1831, init = 
> 1844, finish = 53
> 17/05/18 15:40:37 INFO Executor: Finished task 0.0 in stage 24.0 (TID 24). 
> 2476 bytes result sent to driver
> 17/05/18 15:40:37 INFO DAGScheduler: ResultStage 24 (aggregate at 
> KernelDensity.scala:92) finished in 1.001 s
> 17/05/18 15:40:37 INFO TaskSetManager: Finished task 0.0 in stage 24.0 (TID 
> 24) in 1004 ms on localhost (executor driver) (1/1)
> 17/05/18 15:40:37 INFO TaskSchedulerImpl: Removed TaskSet 24.0, whose tasks 
> have all completed, from pool 
> 17/05/18 15:40:37 INFO DAGScheduler: Job 21 finished: aggregate at 
> KernelDensity.scala:92, took 1.136263 s
> 17/05/18 15:40:37 INFO BlockManagerInfo: Removed broadcast_25_piece0 on 
> 192.168.0.115:38645 in memory (size: 3.6 KB, free: 413.9 MB)
> 5.6654703477e-05,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001
> ,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
> 

[jira] [Created] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming

2017-05-30 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-20927:
---

 Summary: Add cache operator to Unsupported Operations in 
Structured Streaming 
 Key: SPARK-20927
 URL: https://issues.apache.org/jira/browse/SPARK-20927
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jacek Laskowski
Priority: Trivial


Just [found 
out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries]
 that {{cache}} is not allowed on streaming datasets.

{{cache}} on streaming datasets leads to the following exception:

{code}
scala> spark.readStream.text("files").cache
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be 
executed with writeStream.start();;
FileSource[files]
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
  at 
org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
  at 
org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
  at 
org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
  at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
  at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
  at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
  at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
  at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
  at 
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
  at 
org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
  at 
org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
  at 
org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
  at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
  at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
  ... 48 elided
{code}

It should be included in Structured Streaming's [Unsupported 
Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations].
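
For contrast, a minimal sketch of the supported path (the source path and sink format 
below are illustrative assumptions):

{code}
// A streaming Dataset has to be started as a streaming query instead of cached.
val query = spark.readStream
  .text("files")
  .writeStream
  .format("console")
  .start()
{code}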



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures

2017-05-30 Thread Reza Safi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029996#comment-16029996
 ] 

Reza Safi commented on SPARK-20926:
---

I will post a pull request for this issue soon, by tonight at the latest.

> Exposure to Guava libraries by directly accessing tableRelationCache in 
> SessionCatalog caused failures
> --
>
> Key: SPARK-20926
> URL: https://issues.apache.org/jira/browse/SPARK-20926
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Reza Safi
>
> Because of the shading that we did for the Guava libraries, we see test failures 
> whenever such components directly access tableRelationCache in 
> SessionCatalog.
> This can happen in any component that shades the Guava library. Failures look 
> like this:
> {noformat}
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache;
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32)
> 01:25:14   at 
> org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> 01:25:14   at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278)
> 01:25:14   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278)
> {noformat}
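
One possible direction (a sketch only, under the assumption that hiding the Guava type 
behind plain methods is acceptable; the class and method names below are illustrative, 
not the actual Spark API or the pending pull request):

{code}
import com.google.common.cache.{Cache, CacheBuilder}

// Keep the Guava Cache private and expose plain accessor methods, so components
// that shade Guava never reference com.google.common types directly.
class TableRelationCacheFacade[K <: AnyRef, V <: AnyRef](maxSize: Long) {
  private val cache: Cache[K, V] = CacheBuilder.newBuilder().maximumSize(maxSize).build[K, V]()

  def getIfPresent(key: K): V = cache.getIfPresent(key)
  def put(key: K, value: V): Unit = cache.put(key, value)
  def invalidate(key: K): Unit = cache.invalidate(key)
  def invalidateAll(): Unit = cache.invalidateAll()
}
{code}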



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

2017-05-30 Thread Jeffrey Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Quinn updated SPARK-20925:
--
Description: 
Observed under the following conditions:

Spark Version: Spark 2.1.0
Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
spark.submit.deployMode = client
spark.master = yarn
spark.driver.memory = 10g
spark.shuffle.service.enabled = true
spark.dynamicAllocation.enabled = true

The job we are running is very simple: our workflow reads data in JSON format 
stored on S3 and writes out partitioned Parquet files to HDFS.

As a one-liner, the whole workflow looks like this:

```
sparkSession.sqlContext
.read
.schema(inputSchema)
.json(expandedInputPath)
.select(columnMap:_*)
.write.partitionBy("partition_by_column")
.parquet(outputPath)
```

Unfortunately, for larger inputs, this job consistently fails with containers 
running out of memory. We observed containers of up to 20GB OOMing, which is 
surprising because the input data itself is only 15 GB compressed and maybe 
100GB uncompressed.

The error message we get indicates YARN is killing the containers. The 
executors are running out of memory, not the driver.

```Caused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost task 
184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor 14): 
ExecutorLostFailure (executor 14 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 21.5 GB of 20.9 
GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.```

We tried a full parameter sweep, including using dynamic allocation and setting 
executor memory as high as 20GB. The result was the same each time: the job 
failed because YARN killed containers and the executors were lost.

We were able to bisect that `partitionBy` is the problem by progressively 
removing/commenting out parts of our workflow. Finally when we get to the above 
state, if we remove `partitionBy` the job succeeds with no OOM.

  was:
Observed under the following conditions:

Spark Version: Spark 2.1.0
Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
spark.submit.deployMode = client
spark.master = yarn
spark.driver.memory = 10g
spark.shuffle.service.enabled = true
spark.dynamicAllocation.enabled = true

The job we are running is very simple: our workflow reads data in JSON format 
stored on S3 and writes out partitioned Parquet files to HDFS.

As a one-liner, the whole workflow looks like this:

```
sparkSession.sqlContext
.read
.schema(inputSchema)
.json(expandedInputPath)
.select(columnMap:_*)
.write.partitionBy("partition_by_column")
.parquet(outputPath)
```

Unfortunately, for larger inputs, this job consistently fails with containers 
running out of memory. We observed containers of up to 20GB OOMing, which is 
surprising because the input data itself is only 15 GB compressed and maybe 
100GB uncompressed.

We were able to bisect that `partitionBy` is the problem by progressively 
removing/commenting out parts of our workflow. Finally when we get to the above 
state, if we remove `partitionBy` the job succeeds with no OOM.


> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --
>
> Key: SPARK-20925
> URL: https://issues.apache.org/jira/browse/SPARK-20925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: our workflow reads data in JSON format 
> stored on S3 and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
> .read
> .schema(inputSchema)
> .json(expandedInputPath)
> .select(columnMap:_*)
> .write.partitionBy("partition_by_column")
> .parquet(outputPath)
> ```
> Unfortunately, for larger inputs, this job consistently fails with containers 
> running out of memory. We observed containers of up to 20GB OOMing, which is 
> surprising because the input data itself is only 15 GB compressed and maybe 
> 100GB uncompressed.
> The error message we get indicates YARN is killing the containers. The 
> executors are running out of memory, not the driver.
> ```Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 184 in 

[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

2017-05-30 Thread Jeffrey Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029993#comment-16029993
 ] 

Jeffrey Quinn commented on SPARK-20925:
---

Hi Sean,

Sorry for not providing adequate information. Thank you for responding so 
quickly.

The error message we get indicates YARN is killing the containers. The 
executors are running out of memory, not the driver.

```Caused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost task 
184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor 14): 
ExecutorLostFailure (executor 14 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 21.5 GB of 20.9 
GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.```

We tried a full parameter sweep, including using dynamic allocation and setting 
executor memory as high as 20GB. The result was the same each time: the job 
failed because YARN killed containers and the executors were lost.

I attempted to trace through the source code to see how `partitionBy` is 
implemented, and I was surprised to see it cause this problem since it seems like 
it should not require a shuffle. Unfortunately I am not experienced enough with 
the DataFrame source code to figure out what is going on. For now we are 
reimplementing the partitioning ourselves as a workaround, but I am very curious 
to know what could have been happening here. My next step was going to be to 
obtain a full heap dump and poke around in it with my profiler; does that sound 
like a reasonable approach?

Thanks!

Jeff
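
(One mitigation that is sometimes suggested for this pattern, sketched below as an 
assumption rather than the reporter's actual workaround: repartition by the partition 
column before writing, so each task holds open Parquet writers for only a small number 
of output partitions.)

{code}
import org.apache.spark.sql.functions.col

// Sketch only: cluster rows by the partition column first, so each task writes
// far fewer partition directories concurrently and buffers less writer state.
sparkSession.sqlContext
  .read
  .schema(inputSchema)
  .json(expandedInputPath)
  .select(columnMap:_*)
  .repartition(col("partition_by_column"))
  .write.partitionBy("partition_by_column")
  .parquet(outputPath)
{code}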



> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --
>
> Key: SPARK-20925
> URL: https://issues.apache.org/jira/browse/SPARK-20925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: our workflow reads data in JSON format 
> stored on S3 and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
> .read
> .schema(inputSchema)
> .json(expandedInputPath)
> .select(columnMap:_*)
> .write.partitionBy("partition_by_column")
> .parquet(outputPath)
> ```
> Unfortunately, for larger inputs, this job consistently fails with containers 
> running out of memory. We observed containers of up to 20GB OOMing, which is 
> surprising because the input data itself is only 15 GB compressed and maybe 
> 100GB uncompressed.
> We were able to bisect that `partitionBy` is the problem by progressively 
> removing/commenting out parts of our workflow. Finally when we get to the 
> above state, if we remove `partitionBy` the job succeeds with no OOM.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures

2017-05-30 Thread Reza Safi (JIRA)
Reza Safi created SPARK-20926:
-

 Summary: Exposure to Guava libraries by directly accessing 
tableRelationCache in SessionCatalog caused failures
 Key: SPARK-20926
 URL: https://issues.apache.org/jira/browse/SPARK-20926
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Reza Safi


Because of shading that we did for guava libraries, we see test failures 
whenever those components directly access tableRelationCache in SessionCatalog.
This can happen in any component that shaded guava library. Failures looks like 
this:
{noformat}
java.lang.NoSuchMethodError: 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache;
01:25:14   at 
org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492)
01:25:14   at 
org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138)
01:25:14   at 
org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32)
01:25:14   at 
org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34)
01:25:14   at 
org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
01:25:14   at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31)
01:25:14   at 
org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280)
01:25:14   at 
org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278)
01:25:14   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377)
01:25:14   at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

2017-05-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029984#comment-16029984
 ] 

Sean Owen commented on SPARK-20925:
---

Not enough info here -- is the JVM running out of memory? is YARN killing it? 
is the driver or executor running out of memory?
All of those are typically matters of setting memory config properly, and not a 
Spark issue, so I am not sure this stands as a JIRA.
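
For reference, the kind of configuration being suggested looks like the sketch below 
(the values are illustrative assumptions, not recommendations for this particular job):

{code}
// Illustrative only: raise executor memory and the off-heap overhead that YARN
// accounts for when it kills containers for exceeding memory limits.
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.executor.memory", "16g")
  .config("spark.yarn.executor.memoryOverhead", "4096") // in MB for Spark 2.1
  .getOrCreate()
{code}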

> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --
>
> Key: SPARK-20925
> URL: https://issues.apache.org/jira/browse/SPARK-20925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: our workflow reads data in JSON format 
> stored on S3 and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
> .read
> .schema(inputSchema)
> .json(expandedInputPath)
> .select(columnMap:_*)
> .write.partitionBy("partition_by_column")
> .parquet(outputPath)
> ```
> Unfortunately, for larger inputs, this job consistently fails with containers 
> running out of memory. We observed containers of up to 20GB OOMing, which is 
> surprising because the input data itself is only 15 GB compressed and maybe 
> 100GB uncompressed.
> We were able to bisect that `partitionBy` is the problem by progressively 
> removing/commenting out parts of our workflow. Finally when we get to the 
> above state, if we remove `partitionBy` the job succeeds with no OOM.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

2017-05-30 Thread Jeffrey Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Quinn updated SPARK-20925:
--
Description: 
Observed under the following conditions:

Spark Version: Spark 2.1.0
Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
spark.submit.deployMode = client
spark.master = yarn
spark.driver.memory = 10g
spark.shuffle.service.enabled = true
spark.dynamicAllocation.enabled = true

The job we are running is very simple: our workflow reads data in JSON format 
stored on S3 and writes out partitioned Parquet files to HDFS.

As a one-liner, the whole workflow looks like this:

```
sparkSession.sqlContext
.read
.schema(inputSchema)
.json(expandedInputPath)
.select(columnMap:_*)
.write.partitionBy("partition_by_column")
.parquet(outputPath)
```

Unfortunately, for larger inputs, this job consistently fails with containers 
running out of memory. We observed containers of up to 20GB OOMing, which is 
surprising because the input data itself is only 15 GB compressed and maybe 
100GB uncompressed.

We were able to bisect that `partitionBy` is the problem by progressively 
removing/commenting out parts of our workflow. Finally when we get to the above 
state, if we remove `partitionBy` the job succeeds with no OOM.

  was:
Observed under the following conditions:

Spark Version: Spark 2.1.0
Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
spark.submit.deployMode = client
spark.master = yarn
spark.driver.memory = 10g
spark.shuffle.service.enabled = true
spark.dynamicAllocation.enabled = true

The job we are running is very simple: our workflow reads data in JSON format 
stored on S3 and writes out partitioned Parquet files to HDFS.

As a one-liner, the whole workflow looks like this:

```
sparkSession.sqlContext
.read
.schema(inputSchema)
.json(expandedInputPath)
.select(columnMap:_*)
.write.partitionBy("partition_by_column")
.parquet(outputPath)
```

Unfortunately, for larger inputs, this job consistently fails with containers 
running out of memory. We observed containers of up to 20GB OOMing, which is 
surprising because the input data itself is only 15 GB compressed and maybe 
100GB uncompressed.


> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --
>
> Key: SPARK-20925
> URL: https://issues.apache.org/jira/browse/SPARK-20925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: our workflow reads data in JSON format 
> stored on S3 and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
> .read
> .schema(inputSchema)
> .json(expandedInputPath)
> .select(columnMap:_*)
> .write.partitionBy("partition_by_column")
> .parquet(outputPath)
> ```
> Unfortunately, for larger inputs, this job consistently fails with containers 
> running out of memory. We observed containers of up to 20GB OOMing, which is 
> surprising because the input data itself is only 15 GB compressed and maybe 
> 100GB uncompressed.
> We were able to bisect that `partitionBy` is the problem by progressively 
> removing/commenting out parts of our workflow. Finally when we get to the 
> above state, if we remove `partitionBy` the job succeeds with no OOM.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

2017-05-30 Thread Jeffrey Quinn (JIRA)
Jeffrey Quinn created SPARK-20925:
-

 Summary: Out of Memory Issues With 
org.apache.spark.sql.DataFrameWriter#partitionBy
 Key: SPARK-20925
 URL: https://issues.apache.org/jira/browse/SPARK-20925
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Jeffrey Quinn


Observed under the following conditions:

Spark Version: Spark 2.1.0
Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
spark.submit.deployMode = client
spark.master = yarn
spark.driver.memory = 10g
spark.shuffle.service.enabled = true
spark.dynamicAllocation.enabled = true

The job we are running is very simple: our workflow reads data in JSON format 
stored on S3 and writes out partitioned Parquet files to HDFS.

As a one-liner, the whole workflow looks like this:

```
sparkSession.sqlContext
.read
.schema(inputSchema)
.json(expandedInputPath)
.select(columnMap:_*)
.write.partitionBy("partition_by_column")
.parquet(outputPath)
```

Unfortunately, for larger inputs, this job consistently fails with containers 
running out of memory. We observed containers of up to 20GB OOMing, which is 
surprising because the input data itself is only 15 GB compressed and maybe 
100GB uncompressed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-05-30 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029962#comment-16029962
 ] 

Josh Rosen commented on SPARK-20923:


It doesn't seem to be used, as far as I can tell from a quick skim. The best 
way to confirm would probably be to start removing it, deleting things which 
depend on this as you go (e.g. the TaskMetrics getter method for accessing the 
current value) and see if you run into anything which looks like a non-test 
use. I'll be happy to review a patch to clean this up.

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance I had a user reading data from 
> Hive and caching it.  The # of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated and it looks like there was 5GB of a 10GB heap being used up 
> by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses was already removed from the task end event under 
> SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
> know if I missed something?
>  If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite

2017-05-30 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-20333:


Assignee: jin xing

> Fix HashPartitioner in DAGSchedulerSuite
> 
>
> Key: SPARK-20333
> URL: https://issues.apache.org/jira/browse/SPARK-20333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Minor
> Fix For: 2.3.0
>
>
> In the tests
> "don't submit stage until its dependencies map outputs are registered 
> (SPARK-5259)"
> "run trivial shuffle with out-of-band executor failure and retry"
> "reduce tasks should be placed locally with map output"
> the HashPartitioner should be compatible with the number of the child RDD's partitions. 
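
(In other words, a sketch of the intended pattern; the helper below is illustrative 
only, not the suite's actual code:)

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Derive the partitioner's partition count from the downstream RDD
// instead of hard-coding an unrelated constant.
def shuffleCompatiblePartitioner(child: RDD[_]): HashPartitioner =
  new HashPartitioner(child.getNumPartitions)
{code}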



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20802) kolmogorovSmirnovTest in pyspark.mllib.stat.Statistics throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not n

2017-05-30 Thread Bettadapura Srinath Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029950#comment-16029950
 ] 

Bettadapura Srinath Sharma commented on SPARK-20802:


In Java (correct behavior), the code:
{code}
KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(col1, 
"norm", mean[1], stdDev[1]);
{code}
produces:
Kolmogorov-Smirnov test summary:
degrees of freedom = 0 
statistic = 0.005983051038968901 
pValue = 0.8643736171652615 
No presumption against null hypothesis: Sample follows theoretical distribution.


> kolmogorovSmirnovTest in pyspark.mllib.stat.Statistics throws 
> net.razorvine.pickle.PickleException when input data is normally distributed 
> (no error when data is not normally distributed)
> ---
>
> Key: SPARK-20802
> URL: https://issues.apache.org/jira/browse/SPARK-20802
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.1.1
> Environment: Linux version 4.4.14-smp
> x86/fpu: Legacy x87 FPU detected.
> using command line: 
> bash-4.3$ ./bin/spark-submit ~/work/python/Features.py
> bash-4.3$ pwd
> /home/bsrsharma/spark-2.1.1-bin-hadoop2.7
> export JAVA_HOME=/home/bsrsharma/jdk1.8.0_121
>Reporter: Bettadapura Srinath Sharma
>
> In Scala,(correct behavior)
> code:
> testResult = Statistics.kolmogorovSmirnovTest(vecRDD, "norm", means(j), 
> stdDev(j))
> produces:
> 17/05/18 10:52:53 INFO FeatureLogger: Kolmogorov-Smirnov test summary:
> degrees of freedom = 0 
> statistic = 0.005495681749849268 
> pValue = 0.9216108887428276 
> No presumption against null hypothesis: Sample follows theoretical 
> distribution.
> in python (incorrect behavior):
> the code:
> testResult = Statistics.kolmogorovSmirnovTest(vecRDD, 'norm', numericMean[j], 
> numericSD[j])
> causes this error:
> 17/05/17 21:59:23 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
> net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite

2017-05-30 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20333.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17634
[https://github.com/apache/spark/pull/17634]

> Fix HashPartitioner in DAGSchedulerSuite
> 
>
> Key: SPARK-20333
> URL: https://issues.apache.org/jira/browse/SPARK-20333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
>Priority: Minor
> Fix For: 2.3.0
>
>
> In the tests
> "don't submit stage until its dependencies map outputs are registered 
> (SPARK-5259)"
> "run trivial shuffle with out-of-band executor failure and retry"
> "reduce tasks should be placed locally with map output"
> the HashPartitioner should be compatible with the number of the child RDD's partitions. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-05-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029917#comment-16029917
 ] 

Thomas Graves commented on SPARK-20923:
---

[~joshrosen] [~zsxwing] [~eseyfe] I think you have looked at this fairly 
recently; do you know if this is used by anything or anybody? I'm not finding it 
used anywhere in the code or UI, but maybe I'm missing some obscure reference.

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance I had a user reading data from 
> Hive and caching it.  The # of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated and it looks like there was 5GB of a 10GB heap being used up 
> by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses was already removed from the task end event under 
> SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
> know if I missed something?
>  If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15905) Driver hung while writing to console progress bar

2017-05-30 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil closed SPARK-15905.
---
Resolution: Cannot Reproduce

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. After looking at the locking dependencies amongst the driver 
> threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-05-30 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029847#comment-16029847
 ] 

Ryan Blue commented on SPARK-20923:
---

I didn't look at the code path up to writing history files. I just confirmed 
that nothing based on the history file actually used them. Sounds like if we 
can stop tracking this entirely, that would be great!

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance I had a user reading data from 
> Hive and caching it.  The # of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated and it looks like there was 5GB of a 10GB heap being used up 
> by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses was already removed from the task end event under 
> SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
> know if I missed something?
>  If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2017-05-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029846#comment-16029846
 ] 

Tejas Patil commented on SPARK-15905:
-

I haven't seen this in a while with Spark 2.0. Closing. If anyone is still 
experiencing this issue, please comment with details.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. After looking at the locking dependencies amongst the driver 
> threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic

2017-05-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20597:
-
Labels: starter  (was: )

> KafkaSourceProvider falls back on path as synonym for topic
> ---
>
> Key: SPARK-20597
> URL: https://issues.apache.org/jira/browse/SPARK-20597
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> # {{KafkaSourceProvider}} supports a {{topic}} option that sets the Kafka topic 
> to save a DataFrame's rows to
> # {{KafkaSourceProvider}} can use a {{topic}} column to assign rows to Kafka 
> topics for writing
> What seems like a quite interesting option is to support {{start(path: String)}} 
> as the lowest-precedence option, in which {{path}} would designate the default 
> topic when no other options are used.
> {code}
> df.writeStream.format("kafka").start("topic")
> {code}
> See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html
>  for discussion
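
The intended option precedence (a sketch of the proposal only; the helper below is 
illustrative and not part of KafkaSourceProvider) would resolve the target topic 
roughly like this:

{code}
// Illustrative resolution order: explicit "topic" option, then a per-row "topic"
// column, then the path passed to start() as the default.
def resolveTopic(topicOption: Option[String],
                 topicColumn: Option[String],
                 startPath: Option[String]): Option[String] =
  topicOption.orElse(topicColumn).orElse(startPath)
{code}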



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20599) ConsoleSink should work with write (batch)

2017-05-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20599:
-
Labels: starter  (was: )

> ConsoleSink should work with write (batch)
> --
>
> Key: SPARK-20599
> URL: https://issues.apache.org/jira/browse/SPARK-20599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>  Labels: starter
>
> I think the following should just work.
> {code}
> spark.
>   read.  // <-- it's a batch query not streaming query if that matters
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   load.
>   write.
>   format("console").  // <-- that's not supported currently
>   save
> {code}
> The above combination of {{kafka}} source and {{console}} sink leads to the 
> following exception:
> {code}
> java.lang.RuntimeException: 
> org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow 
> create table as select.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
>   ... 48 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20919) Simplification of CachedKafkaConsumer using guava cache.

2017-05-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20919:
-
Affects Version/s: (was: 2.3.0)
   2.2.0
 Target Version/s: 2.3.0

> Simplification of CachedKafkaConsumer using guava cache.
> 
>
> Key: SPARK-20919
> URL: https://issues.apache.org/jira/browse/SPARK-20919
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>
> Along the lines of SPARK-19968, a Guava cache can be used to simplify the code in 
> CachedKafkaConsumer as well, with the additional feature of automatically cleaning 
> up a consumer that has been unused for a configurable time.
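
A minimal sketch of the idea (assuming Guava's {{CacheBuilder}}; the key type, 
consumer stand-in, and timeout below are illustrative, not the actual 
CachedKafkaConsumer internals):

{code}
import java.util.concurrent.TimeUnit
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache, RemovalListener, RemovalNotification}

// Hypothetical key and consumer stand-ins, for illustration only.
case class ConsumerKey(topic: String, partition: Int)
class CachedConsumer(val key: ConsumerKey) { def close(): Unit = () }

val consumerCache: LoadingCache[ConsumerKey, CachedConsumer] = CacheBuilder.newBuilder()
  .maximumSize(64)                        // bound the number of cached consumers
  .expireAfterAccess(5, TimeUnit.MINUTES) // auto-evict consumers unused for a while
  .removalListener[ConsumerKey, CachedConsumer](new RemovalListener[ConsumerKey, CachedConsumer] {
    override def onRemoval(n: RemovalNotification[ConsumerKey, CachedConsumer]): Unit =
      n.getValue.close()                  // release the evicted consumer's resources
  })
  .build(new CacheLoader[ConsumerKey, CachedConsumer] {
    override def load(key: ConsumerKey): CachedConsumer = new CachedConsumer(key)
  })
{code}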



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20919) Simplification of CachedKafkaConsumer using guava cache.

2017-05-30 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20919:
-
Issue Type: Improvement  (was: Bug)

> Simplification of CachedKafkaConsumer using guava cache.
> 
>
> Key: SPARK-20919
> URL: https://issues.apache.org/jira/browse/SPARK-20919
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>
> Along the lines of SPARK-19968, a Guava cache can be used to simplify the code in 
> CachedKafkaConsumer as well, with the additional feature of automatically cleaning 
> up a consumer that has been unused for a configurable time.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20924) Unable to call the function registered in the not-current database

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20924:


Assignee: Apache Spark  (was: Xiao Li)

> Unable to call the function registered in the not-current database
> --
>
> Key: SPARK-20924
> URL: https://issues.apache.org/jira/browse/SPARK-20924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> We are unable to call a function that is registered in a database other than the current one. 
> {noformat}
> sql("CREATE DATABASE dAtABaSe1")
> sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS 
> '${classOf[GenericUDAFAverage].getName}'")
> sql("SELECT dAtABaSe1.test_avg(1)")
> {noformat}
> The above code returns an error:
> {noformat}
> Undefined function: 'dAtABaSe1.test_avg'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20924) Unable to call the function registered in the not-current database

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20924:


Assignee: Xiao Li  (was: Apache Spark)

> Unable to call the function registered in the not-current database
> --
>
> Key: SPARK-20924
> URL: https://issues.apache.org/jira/browse/SPARK-20924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>
> We are unable to call a function that is registered in a database other than the current one. 
> {noformat}
> sql("CREATE DATABASE dAtABaSe1")
> sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS 
> '${classOf[GenericUDAFAverage].getName}'")
> sql("SELECT dAtABaSe1.test_avg(1)")
> {noformat}
> The above code returns an error:
> {noformat}
> Undefined function: 'dAtABaSe1.test_avg'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20924) Unable to call the function registered in the not-current database

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029815#comment-16029815
 ] 

Apache Spark commented on SPARK-20924:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18146

> Unable to call the function registered in the not-current database
> --
>
> Key: SPARK-20924
> URL: https://issues.apache.org/jira/browse/SPARK-20924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>
> We are unable to call a function that is registered in a database other than the current one. 
> {noformat}
> sql("CREATE DATABASE dAtABaSe1")
> sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS 
> '${classOf[GenericUDAFAverage].getName}'")
> sql("SELECT dAtABaSe1.test_avg(1)")
> {noformat}
> The above code returns an error:
> {noformat}
> Undefined function: 'dAtABaSe1.test_avg'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20924) Unable to call the function registered in the not-current database

2017-05-30 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20924:
---

 Summary: Unable to call the function registered in the not-current 
database
 Key: SPARK-20924
 URL: https://issues.apache.org/jira/browse/SPARK-20924
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.0.2, 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li
Priority: Critical


We are unable to call a function that is registered in a database other than the current one. 

{noformat}
sql("CREATE DATABASE dAtABaSe1")
sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS 
'${classOf[GenericUDAFAverage].getName}'")
sql("SELECT dAtABaSe1.test_avg(1)")
{noformat}

The above code returns an error:
{noformat}
Undefined function: 'dAtABaSe1.test_avg'. This function is neither a registered 
temporary function nor a permanent function registered in the database 
'default'.; line 1 pos 7
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs

2017-05-30 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029778#comment-16029778
 ] 

Jiang Xingbo commented on SPARK-20832:
--

I'm working on this.

> Standalone master should explicitly inform drivers of worker deaths and 
> invalidate external shuffle service outputs
> ---
>
> Key: SPARK-20832
> URL: https://issues.apache.org/jira/browse/SPARK-20832
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>
> In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added 
> logic to the DAGScheduler to mark external shuffle service instances as 
> unavailable upon task failure when the task failure reason was "SlaveLost" 
> and this was known to be caused by worker death. If the Spark Master 
> discovered that a worker was dead then it would notify any drivers with 
> executors on those workers to mark those executors as dead. The linked patch 
> simply piggybacked on this logic to have the executor death notification also 
> imply worker death and to have worker-death-caused-executor-death imply 
> shuffle file loss.
> However, there are modes of external shuffle service loss which this 
> mechanism does not detect, leaving the system prone to race conditions. Consider 
> the following:
> * Spark standalone is configured to run an external shuffle service embedded 
> in the Worker.
> * Application has shuffle outputs and executors on Worker A.
> * Stage depending on outputs of tasks that ran on Worker A starts.
> * All executors on worker A are removed, either by dying with exceptions or 
> by scaling down via the dynamic allocation APIs, but _not_ due to worker 
> death. Worker A is still healthy at this point.
> * At this point the MapOutputTracker still records map output locations on 
> Worker A's shuffle service. This is expected behavior. 
> * Worker A dies at an instant where the application has no executors running 
> on it.
> * The Master knows that Worker A died but does not inform the driver (which 
> had no executors on that worker at the time of its death).
> * Some task from the running stage attempts to fetch map outputs from Worker 
> A but these requests time out because Worker A's shuffle service isn't 
> available.
> * Due to other logic in the scheduler, these preventable FetchFailures don't 
> wind up invalidating the now-unavailable map output locations (this is a 
> distinct bug / behavior which I'll discuss in a separate JIRA ticket).
> * This behavior leads to several unsuccessful stage reattempts and ultimately 
> to a job failure.
> A simple way to address this would be to have the Master explicitly notify 
> drivers of all Worker deaths, even if those drivers don't currently have 
> executors. The Spark Standalone scheduler backend can receive the explicit 
> WorkerLost message and can bubble up the right calls to the task scheduler 
> and DAGScheduler to invalidate map output locations from the now-dead 
> external shuffle service.
> This relates to SPARK-20115 in the sense that both tickets aim to address 
> issues where the external shuffle service is unavailable. The key difference 
> is the mechanism for detection: SPARK-20115 marks the external shuffle 
> service as unavailable whenever any fetch failure occurs from it, whereas the 
> proposal here relies on more explicit signals. This JIRA ticket's proposal is 
> scoped only to Spark Standalone mode. As a compromise, we might be able to 
> consider treating all of a single shuffle's outputs on a single external 
> shuffle service as lost following a fetch failure (to be discussed in a 
> separate JIRA). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula

2017-05-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-20899:

Component/s: ML

> PySpark supports stringIndexerOrderType in RFormula
> ---
>
> Key: SPARK-20899
> URL: https://issues.apache.org/jira/browse/SPARK-20899
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula

2017-05-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-20899.
-
   Resolution: Fixed
 Assignee: Wayne Zhang
Fix Version/s: 2.3.0

> PySpark supports stringIndexerOrderType in RFormula
> ---
>
> Key: SPARK-20899
> URL: https://issues.apache.org/jira/browse/SPARK-20899
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-05-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029701#comment-16029701
 ] 

Thomas Graves commented on SPARK-20923:
---

[~rdblue]  with SPARK-20084, did you see anything using these 
updatedblockStatuses in TaskMetrics?

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  There were around 62,000 tasks to read, they were 
> using 1000 executors, and the job ended up caching a couple of TB of data.  The 
> driver kept running out of memory. 
> I investigated and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses were already removed from the task end event under 
> SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
> know if I missed something?
> If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-05-30 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-20923:
-

 Summary: TaskMetrics._updatedBlockStatuses uses a lot of memory
 Key: SPARK-20923
 URL: https://issues.apache.org/jira/browse/SPARK-20923
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Thomas Graves


The driver appears to use a ton of memory in certain cases to store the task 
metrics' updated block statuses.  For instance, I had a user reading data from 
Hive and caching it.  There were around 62,000 tasks to read, they were using 
1000 executors, and the job ended up caching a couple of TB of data.  The driver 
kept running out of memory. 

I investigated and it looks like 5GB of a 10GB heap was being used up by 
TaskMetrics._updatedBlockStatuses because there are a lot of blocks.

The updatedBlockStatuses were already removed from the task end event under 
SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
know if I missed something?

If it's not being used we should remove it; otherwise we need to figure out a 
better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection

2017-05-30 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029693#comment-16029693
 ] 

Marcelo Vanzin commented on SPARK-20922:


Yeah, it's not as simple to exploit, but I guess we'll need custom 
serialization to avoid issues with 3rd-party libraries here... :-/
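
One common shape for such custom handling, sketched only to illustrate the 
idea (the class and allow-list below are mine, not Spark's): restrict which 
classes the launcher socket will deserialize.

{code}
import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}

// Illustrative sketch: an ObjectInputStream that refuses any class outside an
// explicit allow-list, so gadget chains on the classpath cannot be triggered.
class RestrictedObjectInputStream(in: InputStream, allowed: Set[String])
    extends ObjectInputStream(in) {
  override protected def resolveClass(desc: ObjectStreamClass): Class[_] = {
    if (!allowed.contains(desc.getName)) {
      throw new InvalidClassException(desc.getName, "class not allowed on this channel")
    }
    super.resolveClass(desc)
  }
}
{code}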

> Unsafe deserialization in Spark LauncherConnection
> --
>
> Key: SPARK-20922
> URL: https://issues.apache.org/jira/browse/SPARK-20922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Aditya Sharad
>  Labels: security
> Attachments: spark-deserialize-master.zip
>
>
> The {{run()}} method of the class 
> {{org.apache.spark.launcher.LauncherConnection}} performs unsafe 
> deserialization of data received by its socket. This makes Spark applications 
> launched programmatically using the {{SparkLauncher}} framework potentially 
> vulnerable to remote code execution by an attacker with access to any user 
> account on the local machine. Such an attacker could send a malicious 
> serialized Java object to multiple ports on the local machine, and if this 
> port matches the one (randomly) chosen by the Spark launcher, the malicious 
> object will be deserialized. By making use of gadget chains in code present 
> on the Spark application classpath, the deserialization process can lead to 
> RCE or privilege escalation.
> This vulnerability is identified by the “Unsafe deserialization” rule on 
> lgtm.com:
> https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58
>  
> Attached is a proof-of-concept exploit involving a simple 
> {{SparkLauncher}}-based application and a known gadget chain in the Apache 
> Commons Beanutils library referenced by Spark.
> See the readme file for demonstration instructions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20909) Build-in SQL Function Support - DAYOFWEEK

2017-05-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20909:
---

Assignee: Yuming Wang

> Build-in SQL Function Support - DAYOFWEEK
> -
>
> Key: SPARK-20909
> URL: https://issues.apache.org/jira/browse/SPARK-20909
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>  Labels: starter
> Fix For: 2.3.0
>
>
> {noformat}
> DAYOFWEEK(date)
> {noformat}
> Return the weekday index of the argument.
> Ref: 
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_dayofweek
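
A minimal usage sketch, assuming the builtin lands with the MySQL semantics 
referenced above (1 = Sunday ... 7 = Saturday):

{code}
// Hedged example: 2017-05-30 is a Tuesday, so this should return 3 under
// MySQL's DAYOFWEEK convention.
sql("SELECT dayofweek('2017-05-30')").show()
{code}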



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20633) FileFormatWriter wrap the FetchFailedException which breaks job's failover

2017-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029635#comment-16029635
 ] 

Apache Spark commented on SPARK-20633:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/18145

> FileFormatWriter wrap the FetchFailedException which breaks job's failover
> --
>
> Key: SPARK-20633
> URL: https://issues.apache.org/jira/browse/SPARK-20633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Liu Shaohui
>
> The task scheduler handles FetchFailedException specially so tasks can fail 
> over. But FileFormatWriter wraps the FetchFailedException in a 
> SparkException, so the job cannot recover from failures such as an external 
> shuffle service going down.
> See the stacktrace:
> {code}
> 2017-04-30,05:02:42,348 ERROR 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Task 
> attempt attempt_201704300443_0018_m_96_1 aborted.
> 2017-04-30,05:02:42,392 ERROR org.apache.spark.executor.Executor: Exception 
> in task 96.1 in stage 18.0 (TID 26538)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: 
> java.lang.RuntimeException: Executor is not registered 
> (appId=application_1491898760056_636981, execId=546)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:319)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:87)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> {code}
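
A hedged sketch (not the actual patch) of the unwrapping the report calls for: 
rethrow fetch failures unchanged so the scheduler still sees them and can 
retry the stage, and only wrap other exceptions.

{code}
import org.apache.spark.SparkException
import org.apache.spark.shuffle.FetchFailedException

// Hypothetical helper: preserve FetchFailedException's type so failover
// logic is not defeated by the generic SparkException wrapper.
def writeRowsSafely(write: () => Unit): Unit = {
  try {
    write()
  } catch {
    case ffe: FetchFailedException => throw ffe
    case t: Throwable => throw new SparkException("Task failed while writing rows", t)
  }
}
{code}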



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18683) REST APIs for standalone Master、Workers and Applications

2017-05-30 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029591#comment-16029591
 ] 

Imran Rashid commented on SPARK-18683:
--

[~stanzhai] commented here 
https://github.com/apache/spark/pull/10991#issuecomment-301462761 that this 
functionality was removed in SPARK-12299.  I believe that was accidental; the 
goal there was really just to remove the history server from the Master.  I 
don't see any reason why the master shouldn't expose the same info it has in 
the UI in a REST API.

[~bryanc] [~andrewor14] do you have any thoughts on adding this back from 
SPARK-12299?

> REST APIs for standalone Master、Workers and Applications
> 
>
> Key: SPARK-18683
> URL: https://issues.apache.org/jira/browse/SPARK-18683
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>
> It would be great to have REST APIs to access Master, Workers, and 
> Applications information. Right now the only way to get it is through the Web 
> UI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20900) ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set

2017-05-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20900.
---
Resolution: Not A Problem

> ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set
> --
>
> Key: SPARK-20900
> URL: https://issues.apache.org/jira/browse/SPARK-20900
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0, 1.6.0, 2.1.0
> Environment: Spark 2.1.0
>Reporter: Alexander Bessonov
>Priority: Minor
>
> When running {{ApplicationMaster}} directly, if {{SPARK_YARN_STAGING_DIR}} is 
> not set or is set to an empty string, the {{org.apache.hadoop.fs.Path}} 
> constructor throws {{IllegalArgumentException}} instead of returning {{null}}. 
> This is not handled, and the exception crashes the job.
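
A defensive check of the kind the report implies, written as a hypothetical 
helper for illustration only (the issue was resolved as Not A Problem):

{code}
import org.apache.hadoop.fs.Path

// Hypothetical helper: treat a missing or empty SPARK_YARN_STAGING_DIR as
// absent instead of passing it to new Path(...), which throws
// IllegalArgumentException for null or empty strings.
def stagingDirPath(env: Map[String, String]): Option[Path] =
  env.get("SPARK_YARN_STAGING_DIR").filter(_.trim.nonEmpty).map(new Path(_))
{code}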



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection

2017-05-30 Thread Aditya Sharad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029455#comment-16029455
 ] 

Aditya Sharad edited comment on SPARK-20922 at 5/30/17 3:16 PM:


Yes, this is different from SPARK-11652, which focused on preventing a specific 
known gadget chain (found in Commons Collections). This issue involves the 
general problem of unconditionally deserializing untrusted data, and the 
proof-of-concept is simply an example of a gadget chain (in Commons Beanutils, 
and which cannot be addressed by updating the dependency) that works against 
the latest Spark dependencies.

I believe you are correct about the deserialization leading to code execution 
before the shared secret is established or checked.

Indeed, due to how the socket is opened, you must have access to the local 
machine to connect, but not necessarily to the same user that is running the 
Spark master or task.


was (Author: adityasharad):
Yes, this is different from SPARK-11652, which focused on preventing a specific 
known gadget chain (found in Commons Collections). This issue involves the 
general problem of unconditionally deserializing untrusted data, and the 
proof-of-concept is simply an example of a gadget chain (in Commons Beanutils, 
and which cannot be addressed by updating the dependency) that works against 
the latest Spark dependencies.

Indeed, due to how the socket is opened, you must have access to the local 
machine to connect, but not necessarily to the same user that is running the 
Spark master or task.

> Unsafe deserialization in Spark LauncherConnection
> --
>
> Key: SPARK-20922
> URL: https://issues.apache.org/jira/browse/SPARK-20922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Aditya Sharad
>  Labels: security
> Attachments: spark-deserialize-master.zip
>
>
> The {{run()}} method of the class 
> {{org.apache.spark.launcher.LauncherConnection}} performs unsafe 
> deserialization of data received by its socket. This makes Spark applications 
> launched programmatically using the {{SparkLauncher}} framework potentially 
> vulnerable to remote code execution by an attacker with access to any user 
> account on the local machine. Such an attacker could send a malicious 
> serialized Java object to multiple ports on the local machine, and if this 
> port matches the one (randomly) chosen by the Spark launcher, the malicious 
> object will be deserialized. By making use of gadget chains in code present 
> on the Spark application classpath, the deserialization process can lead to 
> RCE or privilege escalation.
> This vulnerability is identified by the “Unsafe deserialization” rule on 
> lgtm.com:
> https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58
>  
> Attached is a proof-of-concept exploit involving a simple 
> {{SparkLauncher}}-based application and a known gadget chain in the Apache 
> Commons Beanutils library referenced by Spark.
> See the readme file for demonstration instructions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20912) map function with columns as strings

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20912:


Assignee: Apache Spark

> map function with columns as strings
> 
>
> Key: SPARK-20912
> URL: https://issues.apache.org/jira/browse/SPARK-20912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>
> There's only a {{map}} function, and it accepts {{Column}} values only. It'd 
> be very helpful to have a variant that accepts column names as {{String}}s, 
> as {{array}} or {{struct}} do.
> {code}
> scala> val kvs = Seq(("key", "value")).toDF("k", "v")
> kvs: org.apache.spark.sql.DataFrame = [k: string, v: string]
> scala> kvs.printSchema
> root
>  |-- k: string (nullable = true)
>  |-- v: string (nullable = true)
> scala> kvs.withColumn("map", map("k", "v")).show
> :26: error: type mismatch;
>  found   : String("k")
>  required: org.apache.spark.sql.Column
>kvs.withColumn("map", map("k", "v")).show
>  ^
> :26: error: type mismatch;
>  found   : String("v")
>  required: org.apache.spark.sql.Column
>kvs.withColumn("map", map("k", "v")).show
>   ^
> // note $ to create Columns per string
> // not very dev-friendly
> scala> kvs.withColumn("map", map($"k", $"v")).show
> +---+-+-+
> |  k|v|  map|
> +---+-+-+
> |key|value|Map(key -> value)|
> +---+-+-+
> {code}
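
A user-side shim along the lines of what the ticket asks for, written as a 
hypothetical helper rather than a change to Spark's API:

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, map}

// Hypothetical helper: accept column names as strings and delegate to the
// existing Column-based map function.
def mapFromNames(names: String*): Column = map(names.map(col): _*)

// e.g. kvs.withColumn("map", mapFromNames("k", "v"))
{code}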



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20912) map function with columns as strings

2017-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20912:


Assignee: (was: Apache Spark)

> map function with columns as strings
> 
>
> Key: SPARK-20912
> URL: https://issues.apache.org/jira/browse/SPARK-20912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> There's only a {{map}} function, and it accepts {{Column}} values only. It'd 
> be very helpful to have a variant that accepts column names as {{String}}s, 
> as {{array}} or {{struct}} do.
> {code}
> scala> val kvs = Seq(("key", "value")).toDF("k", "v")
> kvs: org.apache.spark.sql.DataFrame = [k: string, v: string]
> scala> kvs.printSchema
> root
>  |-- k: string (nullable = true)
>  |-- v: string (nullable = true)
> scala> kvs.withColumn("map", map("k", "v")).show
> :26: error: type mismatch;
>  found   : String("k")
>  required: org.apache.spark.sql.Column
>kvs.withColumn("map", map("k", "v")).show
>  ^
> :26: error: type mismatch;
>  found   : String("v")
>  required: org.apache.spark.sql.Column
>kvs.withColumn("map", map("k", "v")).show
>   ^
> // note $ to create Columns per string
> // not very dev-friendly
> scala> kvs.withColumn("map", map($"k", $"v")).show
> +---+-+-+
> |  k|v|  map|
> +---+-+-+
> |key|value|Map(key -> value)|
> +---+-+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


