[jira] [Comment Edited] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984173#comment-15984173
 ] 

Liang-Chi Hsieh edited comment on SPARK-20392 at 4/26/17 6:44 AM:
--

[~barrybecker4] Currently I think the performance degradation is caused by the 
cost of converting between the DataFrame/Dataset abstraction and logical plans. 
Some operations on DataFrames convert back and forth between DataFrames and 
logical plans. That cost is negligible in typical SQL usage, but it is not rare 
to chain dozens of pipeline stages in ML. As the query plan grows incrementally 
while running those stages, the conversion cost grows too. In particular, the 
Analyzer walks the whole large query plan even though most of it has already 
been analyzed.

By eliminating part of that cost, the time to run the example code locally is 
reduced from about 1 min to about 30 secs.

In particular, the time spent applying the pipeline locally goes mostly into the 
{{transform}} calls of the 137 {{Bucketizer}}s. Before the change, each 
{{Bucketizer}} {{transform}} call costs about 0.4 sec, so the total time spent 
in all the {{Bucketizer}} {{transform}} calls is about 50 secs. After the 
change, each call costs only about 0.1 sec.
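
For reference, a minimal sketch of the kind of pipeline being discussed, 
assuming a DataFrame {{df}} with numeric columns named c0 ... c136 (the column 
names and splits here are made up for illustration, not taken from the attached 
dataset):

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer

// Hypothetical bucket boundaries shared by all columns.
val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)

val stages: Array[PipelineStage] = (0 until 137).map { i =>
  new Bucketizer()
    .setInputCol(s"c$i")
    .setOutputCol(s"c${i}_bin")
    .setSplits(splits)
}.toArray

// Each Bucketizer's transform adds a column on top of the previous stage's
// output, so the logical plan grows with every stage and is re-analyzed each time.
val model = new Pipeline().setStages(stages).fit(df)
val result = model.transform(df)
{code}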







was (Author: viirya):
[~barrybecker4] Currently I think the performance degradation is caused by the 
cost of converting between the DataFrame/Dataset abstraction and logical plans. 
Some operations on DataFrames convert back and forth between DataFrames and 
logical plans. That cost is negligible in typical SQL usage, but it is not rare 
to chain dozens of pipeline stages in ML. As the query plan grows incrementally 
while running those stages, the conversion cost grows too. In particular, the 
Analyzer walks the whole large query plan even though most of it has already 
been analyzed.

By eliminating part of that cost, the time to run the example code locally is 
reduced from about 1 min to about 30 secs.

In particular, the time spent applying the pipeline locally goes mostly into the 
{{transform}} calls of the 137 {{Bucketizer}}s. Before the change, each 
{{Bucketizer}} {{transform}} call costs about 0.4 sec, so the total time spent 
in all the {{Bucketizer}} {{transform}} calls is about 50 secs. After the 
change, each call costs only about 0.1 sec.






> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60 GB of 
> memory. This seems much too long for such a tiny dataset. Similar pipelines 
> run quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8


[jira] [Comment Edited] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter

2017-04-25 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984161#comment-15984161
 ] 

Yan Facai (颜发才) edited comment on SPARK-20199 at 4/26/17 6:11 AM:
--

The work is easy; however, a **public method** is added and some adjustments 
are needed in the inner implementation. Hence, I suggest delaying it until at 
least one of the committers agrees to shepherd the issue.

I have two questions:

1. Since both GBDT and RandomForest share this attribute, we can pull the 
`featureSubsetStrategy` parameter up to either TreeEnsembleParams or 
DecisionTreeParams. Which one is appropriate? (A rough sketch follows below.)

2. Is it right to add the new parameter `featureSubsetStrategy` to the Strategy 
class, or to add it to DecisionTree's train method?
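
For illustration only, a rough sketch of what a shared param could look like if 
pulled into a common params trait. This is not Spark's actual internal code; 
the trait name, doc string, and allowed values below are simplified assumptions:

{code}
import org.apache.spark.ml.param.{Param, Params, ParamValidators}

// Simplified stand-in for a shared trait such as TreeEnsembleParams/DecisionTreeParams.
trait HasFeatureSubsetStrategy extends Params {

  final val featureSubsetStrategy: Param[String] = new Param[String](
    this,
    "featureSubsetStrategy",
    "The number of features to consider for splits at each tree node",
    ParamValidators.inArray(Array("auto", "all", "onethird", "sqrt", "log2")))

  setDefault(featureSubsetStrategy -> "auto")

  final def getFeatureSubsetStrategy: String = $(featureSubsetStrategy)
}
{code}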


was (Author: facai):
The work is easy, however Public method is added and some adjustments are 
needed in inner implementation. Hence, I suggest to delay it until one expert 
agree to shepherd the issue.

I have two questions:

1. For both GBDT and RandomForest share the attribute,  we can pull 
`featureSubsetStrategy` parameter up to either TreeEnsembleParams or 
DecisionTreeParams. Which one is appropriate?

2. Is it right to add new parameter `featureSubsetStrategy` to Strategy class? 
Or add it to DecisionTree's train method?

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have a column sampling rate 
> parameter. This parameter is available in H2O and XGBoost.
> Sample from H2O.ai:
> gbmParams._col_sample_rate
> Please provide the parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984194#comment-15984194
 ] 

Liang-Chi Hsieh commented on SPARK-20392:
-

By also disabling the {{spark.sql.constraintPropagation.enabled}} flag, to 
eliminate the cost of constraint propagation on the big query plan, the running 
time can be further reduced to about 20 secs.
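
For anyone who wants to try this locally, the flag can be toggled at runtime, 
e.g.:

{code}
// Disable constraint propagation before fitting the pipeline.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")

// ...fit/transform the ML pipeline here...

// Re-enable it afterwards if other queries benefit from it.
spark.conf.set("spark.sql.constraintPropagation.enabled", "true")
{code}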

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60 GB of 
> memory. This seems much too long for such a tiny dataset. Similar pipelines 
> run quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7

[jira] [Assigned] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20465:


Assignee: Apache Spark

> Throws a proper exception rather than ArrayIndexOutOfBoundsException when 
> temp directories could not be got/created
> ---
>
> Key: SPARK-20465
> URL: https://issues.apache.org/jira/browse/SPARK-20465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> If none of the temp directories can be created, it throws an 
> {{ArrayIndexOutOfBoundsException}} as below:
> {code}
> ./bin/spark-shell --conf 
> spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
> Ignoring this directory.
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
> Ignoring this directory.
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.repl.Main$.(Main.scala:37)
>   at org.apache.spark.repl.Main$.(Main.scala)
>   ... 10 more
> {code}
> It seems we should throw a proper exception with a better message.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20465:


Assignee: (was: Apache Spark)

> Throws a proper exception rather than ArrayIndexOutOfBoundsException when 
> temp directories could not be got/created
> ---
>
> Key: SPARK-20465
> URL: https://issues.apache.org/jira/browse/SPARK-20465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> If none of the temp directories can be created, it throws an 
> {{ArrayIndexOutOfBoundsException}} as below:
> {code}
> ./bin/spark-shell --conf 
> spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
> Ignoring this directory.
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
> Ignoring this directory.
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.repl.Main$.(Main.scala:37)
>   at org.apache.spark.repl.Main$.(Main.scala)
>   ... 10 more
> {code}
> It seems we should throw a proper exception with a better message.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984180#comment-15984180
 ] 

Apache Spark commented on SPARK-20465:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17768

> Throws a proper exception rather than ArrayIndexOutOfBoundsException when 
> temp directories could not be got/created
> ---
>
> Key: SPARK-20465
> URL: https://issues.apache.org/jira/browse/SPARK-20465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> If none of the temp directories can be created, it throws an 
> {{ArrayIndexOutOfBoundsException}} as below:
> {code}
> ./bin/spark-shell --conf 
> spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
> Ignoring this directory.
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
> Ignoring this directory.
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.repl.Main$.(Main.scala:37)
>   at org.apache.spark.repl.Main$.(Main.scala)
>   ... 10 more
> {code}
> It seems we should throw a proper exception with a better message.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20437) R wrappers for rollup and cube

2017-04-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20437.
--
  Resolution: Fixed
Assignee: Maciej Szymkiewicz
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> R wrappers for rollup and cube
> --
>
> Key: SPARK-20437
> URL: https://issues.apache.org/jira/browse/SPARK-20437
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.3.0
>
>
> Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}.
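
For context, the underlying Dataset operations that these SparkR wrappers 
expose can be used like this on the Scala side (the column names below are 
hypothetical):

{code}
import org.apache.spark.sql.functions.sum

// rollup: subtotals for (dept, team), then per dept, then the grand total.
df.rollup("dept", "team").agg(sum("salary")).show()

// cube: subtotals for every combination of the grouping columns.
df.cube("dept", "team").agg(sum("salary")).show()
{code}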



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984173#comment-15984173
 ] 

Liang-Chi Hsieh commented on SPARK-20392:
-

[~barrybecker4] Currently I think the performance degradation is caused by the 
cost of converting between the DataFrame/Dataset abstraction and logical plans. 
Some operations on DataFrames convert back and forth between DataFrames and 
logical plans. That cost is negligible in typical SQL usage, but it is not rare 
to chain dozens of pipeline stages in ML. As the query plan grows incrementally 
while running those stages, the conversion cost grows too. In particular, the 
Analyzer walks the whole large query plan even though most of it has already 
been analyzed.

By eliminating part of that cost, the time to run the example code locally is 
reduced from about 1 min to about 30 secs.


> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60 GB of 
> memory. This seems much too long for such a tiny dataset. Similar pipelines 
> run quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_buck

[jira] [Updated] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20465:
-
Component/s: Spark Core

> Throws a proper exception rather than ArrayIndexOutOfBoundsException when 
> temp directories could not be got/created
> ---
>
> Key: SPARK-20465
> URL: https://issues.apache.org/jira/browse/SPARK-20465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> If none of the temp directories can be created, it throws an 
> {{ArrayIndexOutOfBoundsException}} as below:
> {code}
> ./bin/spark-shell --conf 
> spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
> Ignoring this directory.
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
> Ignoring this directory.
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.repl.Main$.(Main.scala:37)
>   at org.apache.spark.repl.Main$.(Main.scala)
>   ... 10 more
> {code}
> It seems we should throw a proper exception with a better message.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-20465:


 Summary: Throws a proper exception rather than 
ArrayIndexOutOfBoundsException when temp directories could not be got/created
 Key: SPARK-20465
 URL: https://issues.apache.org/jira/browse/SPARK-20465
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Trivial


If none of the temp directories can be created, it throws an 
{{ArrayIndexOutOfBoundsException}} as below:

{code}
./bin/spark-shell --conf 
spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
Ignoring this directory.
17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
Ignoring this directory.
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.repl.Main$.(Main.scala:37)
at org.apache.spark.repl.Main$.(Main.scala)
... 10 more
{code}

It seems we should throw a proper exception with a better message.
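
A minimal sketch of the kind of guard that could replace the bare array 
indexing (illustrative only, not the actual patch; the helper name is made up):

{code}
import java.io.IOException

// Illustrative helper: fail with a descriptive error instead of indexing into
// an empty array when no configured local directory could be created.
def firstUsableLocalDir(configuredDirs: Seq[String], createdDirs: Array[String]): String = {
  if (createdDirs.isEmpty) {
    throw new IOException(
      s"Failed to get a temp directory under [${configuredDirs.mkString(",")}].")
  }
  createdDirs(0)
}
{code}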



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter

2017-04-25 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984161#comment-15984161
 ] 

Yan Facai (颜发才) commented on SPARK-20199:
-

The work is easy; however, a public method is added and some adjustments are 
needed in the inner implementation. Hence, I suggest delaying it until one 
expert agrees to shepherd the issue.

I have two questions:

1. Since both GBDT and RandomForest share this attribute, we can pull the 
`featureSubsetStrategy` parameter up to either TreeEnsembleParams or 
DecisionTreeParams. Which one is appropriate?

2. Is it right to add the new parameter `featureSubsetStrategy` to the Strategy 
class, or to add it to DecisionTree's train method?

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have a column sampling rate 
> parameter. This parameter is available in H2O and XGBoost.
> Sample from H2O.ai:
> gbmParams._col_sample_rate
> Please provide the parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data

2017-04-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16548.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.0

> java.io.CharConversionException: Invalid UTF-32 character  prevents me from 
> querying my data
> 
>
> Key: SPARK-16548
> URL: https://issues.apache.org/jira/browse/SPARK-16548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Egor Pahomov
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
>
> Basically, when I query my json data I get 
> {code}
> java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above 
> 10)  at char #192, byte #771)
>   at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
>   at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
> {code}
> I do not like it. If you cannot process one JSON record among 100500, please 
> return null; do not fail everything. I have a dirty one-line fix, and I 
> understand how to make it more reasonable. What is our position - what 
> behaviour do we want?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables

2017-04-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20439:

Fix Version/s: 2.1.1

> Catalog.listTables() depends on all libraries used to create tables
> ---
>
> Key: SPARK-20439
> URL: https://issues.apache.org/jira/browse/SPARK-20439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.1, 2.2.0, 2.3.0
>
>
> With spark.catalog.listTables() and getTable you may get an error on the 
> table's serde library:
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> com.amazon.emr.kinesis.hive.KinesisHiveInputFormat
> Also, if the database contains any table (e.g., an index) with a table type 
> that is not accessible by Spark SQL, it will fail the whole listTables API.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark

2017-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20456:
-
Component/s: PySpark

> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Add examples for the common aggregate functions (`min`, `max`, `mean`, 
> `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Add example for `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-20456:
--
Summary: Add examples for functions collection for pyspark  (was: Document 
major aggregation functions for pyspark)

> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-20456:
--
Description: 
Document `sql.functions.py`:

1. Add examples for the common aggregate functions (`min`, `max`, `mean`, 
`count`, `collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Add example for `lit`

  was:
Document `sql.functions.py`:

1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
`collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Document `lit`


> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Add examples for the common aggregate functions (`min`, `max`, `mean`, 
> `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Add example for `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20457) Spark CSV is not able to Override Schema while reading data

2017-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20457.
--
Resolution: Duplicate

Currently, the nullability seems to be ignored. I am pretty sure that it 
duplicates SPARK-19950, so I am resolving this.

> Spark CSV is not able to Override Schema while reading data
> ---
>
> Key: SPARK-20457
> URL: https://issues.apache.org/jira/browse/SPARK-20457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Himanshu Gupta
>
> I have a CSV file, test.csv:
> {code:xml}
> col
> 1
> 2
> 3
> 4
> {code}
> When I read it using Spark, it gets the schema of the data right:
> {code:java}
> val df = spark.read.option("header", "true").option("inferSchema", 
> "true").csv("test.csv")
> 
> df.printSchema
> root
> |-- col: integer (nullable = true)
> {code}
> But when I override the schema of the CSV file and set `inferSchema` to 
> false, SparkSession picks up the custom schema only partially.
> {code:java}
> val df = spark.read.option("header", "true").option("inferSchema", 
> "false").schema(StructType(List(StructField("custom", StringType, 
> false)))).csv("test.csv")
> df.printSchema
> root
> |-- custom: string (nullable = true)
> {code}
> I mean only the column name (`custom`) and the DataType (`StringType`) are 
> picked up, but the `nullable` part is ignored: it still comes out as 
> `nullable = true`, which is incorrect.
> I am not able to understand this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983896#comment-15983896
 ] 

Hyukjin Kwon commented on SPARK-20336:
--

Thank you guys for confirming this.

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load 
> data that has multi-line records.
> However, non-ASCII characters are not loaded properly.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without the wholeFile=True option, non-ASCII characters are 
> shown correctly, although multi-line records are parsed incorrectly, as 
> follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken, as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> +----+----+----+
> {code}
> The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983892#comment-15983892
 ] 

Hyukjin Kwon commented on SPARK-20445:
--

I meant the current codebase, i.e. the latest build. Probably testing against 
2.1.0 would be enough.

From the guidelines - http://spark.apache.org/contributing.html

> For issues that can’t be reproduced against master as reported, resolve as 
> Cannot Reproduce

I could not reproduce this in the current master, so I resolved it as 
{{Cannot Reproduce}}. I assume there is a JIRA that fixed this issue.
You (or anyone) can identify that JIRA and then make a backport if applicable.

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
>     lp = ( row["SPECIES"], row["IND_SPECIES"], \
>         Vectors.dense([row["SEPAL_LENGTH"], \
>             row["SEPAL_WIDTH"], \
>             row["PETAL_LENGTH"], \
>             row["PETAL_WIDTH"]]))
>     return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
> dtModel = dtClassifer.fit(trainingData)
>
> issue part:-
>
> dtModel = 

[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983888#comment-15983888
 ] 

Hyukjin Kwon commented on SPARK-20456:
--

I simply left the comment above because the current status does not seem to 
match the description and the title. Let's fix the title and the description 
here.

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18127) Add hooks and extension points to Spark

2017-04-25 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983872#comment-15983872
 ] 

Frederick Reiss commented on SPARK-18127:
-

Is there a design document or a public design and requirements discussion 
associated with this JIRA?

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this can be some datasource 
> specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods.
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan; an example of this would be to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would be supporting my own dialect, or being able to add constructs to 
> the current SQL language. I should not have to implement a complete parser, 
> and should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18127) Add hooks and extension points to Spark

2017-04-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18127.
-
Resolution: Fixed

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this can be some datasource 
> specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods.
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan; an example of this would be to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would be supporting my own dialect, or being able to add constructs to 
> the current SQL language. I should not have to implement a complete parser, 
> and should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18127) Add hooks and extension points to Spark

2017-04-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18127:

Fix Version/s: 2.2.0

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this can be some datasource 
> specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods.
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan; an example of this would be to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would be supporting my own dialect, or being able to add constructs to 
> the current SQL language. I should not have to implement a complete parser, 
> and should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983853#comment-15983853
 ] 

Ismael Juma commented on SPARK-18057:
-

It's worth noting that no one is working on that ticket at the moment, so a fix 
may take some time. And even if it lands soon, it's likely to be in 0.11.0.0 
first (0.10.2.1 is being voted on and will be out very soon).

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983832#comment-15983832
 ] 

Helena Edelson commented on SPARK-18057:


It is the timeout. I think waiting is better; I will be watching that ticket in 
Kafka.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983820#comment-15983820
 ] 

Michael Armbrust commented on SPARK-18057:
--

I guess I'd like to understand more about what problems people are running into 
with the current version.  Are there more pressing issues than hanging when 
topics are deleted?

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20130) Flaky test: BlockManagerProactiveReplicationSuite

2017-04-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20130.

Resolution: Cannot Reproduce

Seems a lot more stable now, so closing this until it becomes flaky again.

> Flaky test: BlockManagerProactiveReplicationSuite
> -
>
> Key: SPARK-20130
> URL: https://issues.apache.org/jira/browse/SPARK-20130
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> See following page:
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.storage.BlockManagerProactiveReplicationSuite&test_name=proactive+block+replication+-+5+replicas+-+4+block+manager+deletions
> I also have seen it fail intermittently during local unit test runs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20421:


Assignee: (was: Apache Spark)

> Mark JobProgressListener (and related classes) as deprecated
> 
>
> Key: SPARK-20421
> URL: https://issues.apache.org/jira/browse/SPARK-20421
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> This class (and others) were made {{@DeveloperApi}} as part of 
> https://github.com/apache/spark/pull/648. But as part of the work in 
> SPARK-18085, I plan to get rid of a lot of that code, so we should mark these 
> as deprecated in case anyone is still trying to use them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20421:


Assignee: Apache Spark

> Mark JobProgressListener (and related classes) as deprecated
> 
>
> Key: SPARK-20421
> URL: https://issues.apache.org/jira/browse/SPARK-20421
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> This class (and others) were made {{@DeveloperApi}} as part of 
> https://github.com/apache/spark/pull/648. But as part of the work in 
> SPARK-18085, I plan to get rid of a lot of that code, so we should mark these 
> as deprecated in case anyone is still trying to use them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983809#comment-15983809
 ] 

Shixiong Zhu commented on SPARK-18057:
--

I prefer to just wait. The user can still use Kafka 0.10.2.0 with the current 
Spark Kafka source in their application; the APIs are compatible. Commenting 
tests out means we cannot prevent future changes from breaking them.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983808#comment-15983808
 ] 

Apache Spark commented on SPARK-20421:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17766

> Mark JobProgressListener (and related classes) as deprecated
> 
>
> Key: SPARK-20421
> URL: https://issues.apache.org/jira/browse/SPARK-20421
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> This class (and others) were made {{@DeveloperApi}} as part of 
> https://github.com/apache/spark/pull/648. But as part of the work in 
> SPARK-18085, I plan to get rid of a lot of that code, so we should mark these 
> as deprecated in case anyone is still trying to use them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983796#comment-15983796
 ] 

Helena Edelson commented on SPARK-18057:


I have a branch off branch-2.2 with the 0.10.2.0 upgrade and changes done. All 
the delete-topic-related tests fail (mainly just in streaming kafka sql).

I can open a PR with those few tests commented out, but that doesn't sound 
right. Or should I wait to open the PR?

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20464) Add a job group and an informative description for streaming queries

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20464:


Assignee: Apache Spark

> Add a job group and an informative description for streaming queries
> 
>
> Key: SPARK-20464
> URL: https://issues.apache.org/jira/browse/SPARK-20464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20464) Add a job group and an informative description for streaming queries

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20464:


Assignee: (was: Apache Spark)

> Add a job group and an informative description for streaming queries
> 
>
> Key: SPARK-20464
> URL: https://issues.apache.org/jira/browse/SPARK-20464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20464) Add a job group and an informative description for streaming queries

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983751#comment-15983751
 ] 

Apache Spark commented on SPARK-20464:
--

User 'kunalkhamar' has created a pull request for this issue:
https://github.com/apache/spark/pull/17765

> Add a job group and an informative description for streaming queries
> 
>
> Key: SPARK-20464
> URL: https://issues.apache.org/jira/browse/SPARK-20464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20464) Add a job group and an informative description for streaming queries

2017-04-25 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-20464:
-
Summary: Add a job group and an informative description for streaming 
queries  (was: Add a job group and an informative job description for streaming 
queries)

> Add a job group and an informative description for streaming queries
> 
>
> Key: SPARK-20464
> URL: https://issues.apache.org/jira/browse/SPARK-20464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20239) Improve HistoryServer ACL mechanism

2017-04-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-20239:
---
Fix Version/s: 2.1.2
   2.0.3

> Improve HistoryServer ACL mechanism
> ---
>
> Key: SPARK-20239
> URL: https://issues.apache.org/jira/browse/SPARK-20239
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The current SHS (Spark History Server) has two different ACLs:
> * ACL of the base URL. It is controlled by "spark.acls.enabled" or 
> "spark.ui.acls.enabled"; with this enabled, only users configured in 
> "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user 
> who started the SHS, can list all the applications; otherwise none of them 
> can be listed. This also affects the REST APIs that list the summary of all 
> apps and of one app.
> * Per-application ACL. This is controlled by "spark.history.ui.acls.enabled". 
> With this enabled, only the history admin user and the user/group who ran an 
> app can access the details of that app.
> With these two ACLs, we may encounter several unexpected behaviors:
> 1. If the base URL's ACL is enabled but user "A" has no view permission, user 
> "A" cannot see the app list but can still access the details of its own app.
> 2. If the ACL of the base URL is disabled, then user "A" can see the summary 
> of all the apps, even some not run by user "A", but cannot access the details.
> 3. The history admin ACL has no permission to list all apps if the admin user 
> is not added to the base URL's ACL.
> These unexpected behaviors arise mainly because we have two different ACLs; 
> ideally we should have only one to manage both.
> So to improve the SHS's ACL mechanism, we should:
> * Unify the two different ACLs into one, and always honor it (both for the 
> base URL and for app details).
> * Let a user partially list and display the apps they ran, according to the 
> ACLs in the event log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20464) Add a job group and an informative job description for streaming queries

2017-04-25 Thread Kunal Khamar (JIRA)
Kunal Khamar created SPARK-20464:


 Summary: Add a job group and an informative job description for 
streaming queries
 Key: SPARK-20464
 URL: https://issues.apache.org/jira/browse/SPARK-20464
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Kunal Khamar






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20463:


Assignee: Apache Spark

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>Assignee: Apache Spark
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"]).collect())
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20463:


Assignee: (was: Apache Spark)

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"]).collect())
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20463:


Assignee: Apache Spark

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>Assignee: Apache Spark
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"]).collect())
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20463:


Assignee: (was: Apache Spark)

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"]).collect())
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983654#comment-15983654
 ] 

Apache Spark commented on SPARK-20463:
--

User 'ptkool' has created a pull request for this issue:
https://github.com/apache/spark/pull/17764

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"]).collect())
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Michael Styles (JIRA)
Michael Styles created SPARK-20463:
--

 Summary: Expose SPARK SQL <=> operator in PySpark
 Key: SPARK-20463
 URL: https://issues.apache.org/jira/browse/SPARK-20463
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Michael Styles


Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
*isNotDistinctFrom*. For example:

{panel}
{noformat}
data = [(10, 20), (30, 30), (40, None), (None, None)]
df2 = sc.parallelize(data).toDF("c1", "c2")
df2.where(df2["c1"].isNotDistinctFrom(df2["c2"]).collect())
[Row(c1=30, c2=30), Row(c1=None, c2=None)]
{noformat}
{panel}
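
As a minimal, hedged PySpark sketch (data copied from the example above), the same 
null-safe comparison can already be reached today through a SQL expression; the 
proposed *isNotDistinctFrom* column function itself does not exist yet:

{code}
# Sketch only: uses functions.expr to reach the existing SQL '<=>' operator.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [(10, 20), (30, 30), (40, None), (None, None)]
df2 = spark.createDataFrame(data, ["c1", "c2"])

# Equivalent of the proposed df2["c1"].isNotDistinctFrom(df2["c2"])
df2.where(F.expr("c1 <=> c2")).collect()
# [Row(c1=30, c2=30), Row(c1=None, c2=None)]
{code}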



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983615#comment-15983615
 ] 

Shixiong Zhu commented on SPARK-13747:
--

[~dnaumenko] Unfortunately, Spark uses ThreadLocal variables a lot but 
ForkJoinPool doesn't support that very well (It's easy to leak ThreadLocal 
variables to other tasks). Could you check if 
https://github.com/apache/spark/pull/17763 can fix your issue?

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13747:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13747:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983612#comment-15983612
 ] 

Apache Spark commented on SPARK-13747:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17763

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13747:
-
Fix Version/s: (was: 2.2.0)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-13747:
--

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20462) Spark-Kinesis Direct Connector

2017-04-25 Thread Lauren Moos (JIRA)
Lauren Moos created SPARK-20462:
---

 Summary: Spark-Kinesis Direct Connector 
 Key: SPARK-20462
 URL: https://issues.apache.org/jira/browse/SPARK-20462
 Project: Spark
  Issue Type: New Feature
  Components: Input/Output
Affects Versions: 2.1.0
Reporter: Lauren Moos


I'd like to propose and the vet the design for a direct connector between Spark 
and Kinesis. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983590#comment-15983590
 ] 

Michael Patterson commented on SPARK-20456:
---

I saw that there are short docstrings for the aggregate functions, but I think 
they can be unclear for people new to Spark or to relational algebra. For 
example, some of my coworkers didn't know you could do `df.agg(mean(col))` 
without doing a `groupby`. There are also no links to `groupby` in any of the 
aggregate functions. I also didn't know about `collect_set` for a long time. I 
think adding examples would help with visibility and understanding.

The same thing applies to `lit`. It took me a while to learn what it did.
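As a small, hypothetical sketch of the kind of examples that could be added (all 
data is made up):

{code}
# Sketch only: aggregating without a groupBy, collect_set per group, and lit.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 2)], ["key", "value"])

df.agg(F.mean("value"), F.max("value")).show()        # whole-frame aggregate, no groupBy needed
df.groupBy("key").agg(F.collect_set("value")).show()  # distinct values collected per group
df.withColumn("source", F.lit("batch_1")).show()      # lit turns a Python constant into a Column
{code}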

For the datetime stuff, for example this line has a column named 'd': 
https://github.com/map222/spark/blob/master/python/pyspark/sql/functions.py#L926

I think it would be more informative to name it 'date' or 'time'.

Do these sound reasonable?

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9103) Tracking spark's memory usage

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983531#comment-15983531
 ] 

Apache Spark commented on SPARK-9103:
-

User 'jsoltren' has created a pull request for this issue:
https://github.com/apache/spark/pull/17762

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark provides only a little memory usage information (RDD cache on 
> the web UI) for the executors. Users have no idea of the memory consumption 
> when they are running Spark applications that use a lot of executor memory. 
> Especially when they encounter an OOM, it’s really hard to know what the 
> cause of the problem is. So it would be helpful to expose detailed memory 
> consumption information for each part of Spark, so that users can have a 
> clear picture of where the memory is actually used.
> The memory usage info to expose should include, but not be limited to, 
> shuffle, cache, network, serializer, etc.
> Users can optionally enable this functionality, since it is mainly for 
> debugging and tuning.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-25 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983490#comment-15983490
 ] 

Kazuaki Ishizaki commented on SPARK-20392:
--

Here are my observations:

According to 
[Bucketizer.transform()|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L104-L122]:
* A {{Projection}} is generated by calling {{dataset.withColumn()}} at line 121. 
Since {{Bucketizer.transform}} is called multiple times in the pipeline, 
deeply-nested projects are generated in the plan.
* {{bucketizer}} at line 119 is implemented as a {{UDF}}. Calling a {{UDF}} 
incurs some serialization/deserialization overhead.
* If the number of fields is less than 101 (i.e. 
{{"spark.sql.codegen.maxFields"}}), whole-stage codegen is enabled. For the 
fewCols case it is enabled; however, in the original case, code generation is 
not enabled for the nested project.

Is there a better approach to use {{Bucketizer}} effectively in this case? For 
example, can we reduce the number of calls to {{Bucketizer}}? cc:[~mlnick]
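
As a minimal sketch of the pattern discussed above (column names and splits are 
made up), chaining one single-column {{Bucketizer}} per input column stacks one 
projection per stage on top of the plan:

{code}
# Sketch only: many single-column Bucketizers chained in one Pipeline.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float(i), float(i) * 2.0) for i in range(10)],
                           ["c0", "c1"])

splits = [-float("inf"), 0.0, 5.0, float("inf")]
stages = [Bucketizer(splits=splits, inputCol=c, outputCol=c + "_bucket")
          for c in df.columns]

bucketed = Pipeline(stages=stages).fit(df).transform(df)
bucketed.explain(True)  # the analyzed plan shows one Project added per Bucketizer
{code}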


> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much to long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here are a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53

[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2017-04-25 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983440#comment-15983440
 ] 

Xiao Li commented on SPARK-20427:
-

cc [~tsuresh] Are you interested in this? 

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>
> In Oracle there is a data type NUMBER. When defining a field in a table of 
> type NUMBER, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it has created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame could be read from a table on one schema but 
> could not be inserted into the identical table on another schema.
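
A hedged sketch of one possible workaround (all identifiers are hypothetical, and 
the Oracle JDBC driver is assumed to be on the classpath): push a CAST into the 
{{dbtable}} subquery so that the column Spark sees fits within its Decimal(38) 
limit:

{code}
# Sketch only: cast the Oracle NUMBER column inside the dbtable subquery.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable",
              "(SELECT id, CAST(amount AS NUMBER(28,10)) AS amount "
              "FROM some_schema.some_table) t")
      .option("user", "some_user")
      .option("password", "some_password")
      .load())

df.printSchema()  # amount should now map to decimal(28,10), within the 38 limit
{code}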



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20459) JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException

2017-04-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20459:

Target Version/s: 2.2.0

> JdbcUtils throws IllegalStateException: Cause already initialized after 
> getting SQLException
> 
>
> Key: SPARK-20459
> URL: https://issues.apache.org/jira/browse/SPARK-20459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.0.2, 2.1.0
>Reporter: Jessie Yu
>Priority: Minor
>
> While testing some failure scenarios, JdbcUtils throws an IllegalStateException 
> instead of the expected SQLException:
> {code}
> scala> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(prodtbl,url3,"DB2.D_ITEM_INFO",prop1)
>  
> 17/04/03 17:19:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)  
>   
> java.lang.IllegalStateException: Cause already initialized
>   
> .at java.lang.Throwable.setCause(Throwable.java:365)  
>   
> .at java.lang.Throwable.initCause(Throwable.java:341) 
>   
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:241)
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
> .at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> .at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> .at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>  
> .at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   
> .at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   
> .at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   
> .at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> .at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153
> .at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628
> .at java.lang.Thread.run(Thread.java:785) 
>   
> {code}
> The code in JdbcUtils.savePartition has 
> {code}
> } catch {
>   case e: SQLException =>
> val cause = e.getNextException
> if (cause != null && e.getCause != cause) {
>   if (e.getCause == null) {
> e.initCause(cause)
>   } else {
> e.addSuppressed(cause)
>   }
> }
> {code}
> According to Throwable Java doc, {{initCause()}} throws an 
> {{IllegalStateException}} "if this throwable was created with 
> Throwable(Throwable) or Throwable(String,Throwable), or this method has 
> already been called on this throwable". The code does check whether {{cause}} 
> is {{null}} before initializing it. However, {{getCause()}} "returns the 
> cause of this throwable or null if the cause is nonexistent or unknown." In 
> other words, {{null}} is returned if {{cause}} already exists (which would 
> result in {{IllegalStateException}}) but is unknown. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983396#comment-15983396
 ] 

Shixiong Zhu edited comment on SPARK-18057 at 4/25/17 6:29 PM:
---

[~guozhang] We have a stress test that exercises the Spark Kafka connector in 
various cases, and it frequently deletes and re-creates topics.

Topic deletion may not be common in production, but when it happens the job just 
hangs forever and wastes resources. It forces users to set up monitoring scripts 
to restart a job that has not made progress for a long time.


was (Author: zsxwing):
[~guozhang] We have a stress test to test Spark Kafka connector for various 
cases, and it will try to frequently delete / re-create topics.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983396#comment-15983396
 ] 

Shixiong Zhu commented on SPARK-18057:
--

[~guozhang] We have a stress test to test Spark Kafka connector for various 
cases, and it will try to frequently delete / re-create topics.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20461:


Assignee: (was: Apache Spark)

> CachedKafkaConsumer may hang forever when it's interrupted
> --
>
> Key: SPARK-20461
> URL: https://issues.apache.org/jira/browse/SPARK-20461
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> CachedKafkaConsumer may hang forever when it's interrupted because of 
> KAFKA-1894



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20461:


Assignee: Apache Spark

> CachedKafkaConsumer may hang forever when it's interrupted
> --
>
> Key: SPARK-20461
> URL: https://issues.apache.org/jira/browse/SPARK-20461
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> CachedKafkaConsumer may hang forever when it's interrupted because of 
> KAFKA-1894



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2017-04-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-5484.
-
  Resolution: Fixed
Assignee: dingding  (was: Ankur Dave)
   Fix Version/s: 2.3.0
  2.2.0
Target Version/s: 2.2.0, 2.3.0

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: dingding
> Fix For: 2.2.0, 2.3.0
>
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.
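
The proposed pattern, truncating lineage by checkpointing every N iterations, 
looks roughly like the sketch below. It is a generic RDD-based illustration 
rather than the actual GraphX change; the interval of 25 and the checkpoint 
directory are arbitrary assumptions.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def iterateWithCheckpoints(
    sc: SparkContext,
    initial: RDD[(Long, Double)],
    maxIterations: Int,
    checkpointInterval: Int = 25): RDD[(Long, Double)] = {
  sc.setCheckpointDir("/tmp/pregel-checkpoints") // assumed path for the sketch
  var current = initial
  for (i <- 1 to maxIterations) {
    // One "superstep": a placeholder transformation standing in for the real
    // message exchange / vertex-program step that Pregel performs.
    current = current.mapValues(_ * 0.85 + 0.15).cache()
    if (i % checkpointInterval == 0) {
      // Persist to reliable storage and cut the lineage so the DAG (and the
      // serialized task closures) stop growing with the iteration count.
      current.checkpoint()
    }
    // Force evaluation; on checkpoint iterations this also writes the checkpoint.
    current.count()
  }
  current
}
{code}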



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983387#comment-15983387
 ] 

Apache Spark commented on SPARK-20461:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17761

> CachedKafkaConsumer may hang forever when it's interrupted
> --
>
> Key: SPARK-20461
> URL: https://issues.apache.org/jira/browse/SPARK-20461
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> CachedKafkaConsumer may hang forever when it's interrupted because of 
> KAFKA-1894



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted

2017-04-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-20461:


 Summary: CachedKafkaConsumer may hang forever when it's interrupted
 Key: SPARK-20461
 URL: https://issues.apache.org/jira/browse/SPARK-20461
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Shixiong Zhu


CachedKafkaConsumer may hang forever when it's interrupted because of KAFKA-1894



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20447) spark mesos scheduler suppress call

2017-04-25 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983314#comment-15983314
 ] 

Michael Gummelt commented on SPARK-20447:
-

The scheduler doesn't support suppression, no, but it does reject offers for 
120s: 
https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L375,
 and this is configurable.

With Mesos' 1s offer cycle, this should allow offers to be sent to 119 other 
frameworks before being re-offered to Spark.

Is this not sufficient? 
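
For reference, a sketch of how a job could lengthen that rejection window; the 
property name {{spark.mesos.rejectOfferDuration}} is read from the 
MesosSchedulerUtils code linked above and should be verified against the Spark 
version in use:

{code}
import org.apache.spark.SparkConf

// Decline offers for 10 minutes after rejecting them (instead of the default
// 120s), so an idle framework is re-offered resources less often.
// The value is illustrative.
val conf = new SparkConf()
  .setAppName("mesos-idle-example") // illustrative name
  .set("spark.mesos.rejectOfferDuration", "600s")
{code}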

> spark mesos scheduler suppress call
> ---
>
> Key: SPARK-20447
> URL: https://issues.apache.org/jira/browse/SPARK-20447
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Pavel Plotnikov
>Priority: Minor
>
> The Spark Mesos scheduler never sends the suppress call to Mesos to exclude the 
> application from Mesos' batch allocation process (HierarchicalDRF allocator) 
> when the Spark application is idle and there are no tasks in the queue. As a 
> result, the application receives a 0% cluster share because of dynamic resource 
> allocation, while other applications that need additional resources can't 
> receive an offer because their cluster share is significantly more than 0%.
> About suppress call: 
> http://mesos.apache.org/documentation/latest/app-framework-development-guide/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20449) Upgrade breeze version to 0.13.1

2017-04-25 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-20449.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   3.0.0

Issue resolved by pull request 17746
[https://github.com/apache/spark/pull/17746]

> Upgrade breeze version to 0.13.1
> 
>
> Key: SPARK-20449
> URL: https://issues.apache.org/jira/browse/SPARK-20449
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 3.0.0, 2.2.0
>
>
> Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983224#comment-15983224
 ] 

Apache Spark commented on SPARK-20439:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17760

> Catalog.listTables() depends on all libraries used to create tables
> ---
>
> Key: SPARK-20439
> URL: https://issues.apache.org/jira/browse/SPARK-20439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0, 2.3.0
>
>
> spark.catalog.listTables() and getTable may fail with an error about the table's 
> serde library:
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> com.amazon.emr.kinesis.hive.KinesisHiveInputFormat
> Also, if the database contains any table (e.g., an index) with a table type that 
> is not accessible to Spark SQL, the whole listTables API fails.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2017-04-25 Thread Furcy Pin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983094#comment-15983094
 ] 

Furcy Pin commented on SPARK-13802:
---

Hi, I ran into similar issues and found this JIRA, so I would like to add some 
grist to the mill.

I ran this in the pyspark shell (v2.1.0):
{code}
>>> from pyspark.sql.types import *
>>> from pyspark.sql import Row
>>> 
>>> rdd = spark.sparkContext.parallelize(range(1, 4))
>>> 
>>> schema = StructType([StructField('a', IntegerType()), StructField('b', 
>>> StringType())])
>>> spark.createDataFrame(rdd.map(lambda r: Row(a=1, b=None)), schema).collect()
[Row(a=1, b=None), Row(a=1, b=None), Row(a=1, b=None)]
>>> 
>>> schema = StructType([StructField('b', IntegerType()), StructField('a', 
>>> StringType())])
>>> spark.createDataFrame(rdd.map(lambda r: Row(b=1, a=None)), schema).collect()
[Row(b=1, a=None), Row(b=1, a=None), Row(b=1, a=None)]
{code}

When applying a schema, it seems that the Row's field names are correctly 
matched and reordered according to the schema, which is quite nice,
even though, when creating a Row alone, the fields are ordered differently:

{code}
>>> Row(b=1, a=None)
Row(a=None, b=1)
{code}

However, I get inconsistent behavior when I start using structs:

{code}
>>> schema = StructType([StructField('a', IntegerType()), StructField('b', 
>>> StructType([StructField('c', StringType())]))])
>>> spark.createDataFrame(rdd.map(lambda r: Row(a=1, b=None)), schema).collect()
[Row(a=1, b=None), Row(a=1, b=None), Row(a=1, b=None)]
>>> 
>>> schema = StructType([StructField('b', IntegerType()), StructField('a', 
>>> StructType([StructField('c', StringType())]))])
>>> spark.createDataFrame(rdd.map(lambda r: Row(b=1, a=None)), schema).collect()
...
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent 
call last):
  File "spark/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
process()
  File "spark/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "spark/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in 
dump_stream
vs = list(itertools.islice(iterator, batch))
  File "spark/python/pyspark/sql/types.py", line 576, in toInternal
return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "spark/python/pyspark/sql/types.py", line 576, in 
return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 436, in 
toInternal
return self.dataType.toInternal(obj)
  File "spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 593, in 
toInternal
raise ValueError("Unexpected tuple %r with StructType" % obj)
ValueError: Unexpected tuple 1 with StructType
{code}

So it seems to me that pyspark can match a Row's field names against a schema, but 
only when no struct is involved.
This doesn't seem very consistent, so I believe it should be considered a bug.



> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using the Row constructor with kwargs, the fields in the underlying tuple 
> are sorted by name. When the schema reads the row, it does not use the fields 
> in this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +--+--+
> |id|first_name|
> +--+--+
> |Szymon|39|
> +--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20459) JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException

2017-04-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983092#comment-15983092
 ] 

Sean Owen commented on SPARK-20459:
---

Ugh, so there's no actual way to detect whether the exception has no cause 
initialized? I guess it's OK here to catch the IllegalStateException and resort 
to addSuppressed if it fails. Go ahead.
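
A minimal sketch of that approach, reusing the shape of the snippet quoted below 
(illustrative helper only, not the actual patch):

{code}
import java.sql.SQLException

// Sketch of the approach discussed above: try initCause first, and if the cause
// slot was already initialized (even to null), record the driver's "next
// exception" as a suppressed exception instead.
def chainNextException(e: SQLException): SQLException = {
  val cause = e.getNextException
  if (cause != null && e.getCause != cause) {
    try {
      e.initCause(cause)
    } catch {
      case _: IllegalStateException => e.addSuppressed(cause)
    }
  }
  e
}
{code}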

> JdbcUtils throws IllegalStateException: Cause already initialized after 
> getting SQLException
> 
>
> Key: SPARK-20459
> URL: https://issues.apache.org/jira/browse/SPARK-20459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.0.2, 2.1.0
>Reporter: Jessie Yu
>Priority: Minor
>
> While testing some failure scenarios, JdbcUtils throws an IllegalStateException 
> instead of the expected SQLException:
> {code}
> scala> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(prodtbl,url3,"DB2.D_ITEM_INFO",prop1)
>  
> 17/04/03 17:19:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)  
>   
> java.lang.IllegalStateException: Cause already initialized
>   
> .at java.lang.Throwable.setCause(Throwable.java:365)  
>   
> .at java.lang.Throwable.initCause(Throwable.java:341) 
>   
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:241)
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
> .at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> .at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> .at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>  
> .at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   
> .at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   
> .at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   
> .at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> .at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153
> .at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628
> .at java.lang.Thread.run(Thread.java:785) 
>   
> {code}
> The code in JdbcUtils.savePartition has 
> {code}
> } catch {
>   case e: SQLException =>
> val cause = e.getNextException
> if (cause != null && e.getCause != cause) {
>   if (e.getCause == null) {
> e.initCause(cause)
>   } else {
> e.addSuppressed(cause)
>   }
> }
> {code}
> According to Throwable Java doc, {{initCause()}} throws an 
> {{IllegalStateException}} "if this throwable was created with 
> Throwable(Throwable) or Throwable(String,Throwable), or this method has 
> already been called on this throwable". The code does check whether {{cause}} 
> is {{null}} before initializing it. However, {{getCause()}} "returns the 
> cause of this throwable or null if the cause is nonexistent or unknown." In 
> other words, {{null}} is returned if {{cause}} already exists (which would 
> result in {{IllegalStateException}}) but is unknown. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20460) Make it more consistent to handle column name duplication

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20460:


Assignee: (was: Apache Spark)

> Make it more consistent to handle column name duplication
> -
>
> Key: SPARK-20460
> URL: https://issues.apache.org/jira/browse/SPARK-20460
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>
> In the current master, error handling is different when hitting column name 
> duplication.
> {code}
> // json
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("a", IntegerType) :: Nil)
> scala> Seq("""{"a":1, 
> "a":1}"").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
> scala> spark.read.format("json").schema(schema).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
> a#12, a#13.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
> scala> spark.read.format("json").load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, 
> cannot save to JSON format;
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
> // csv
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("a", IntegerType) :: Nil)
> scala> Seq("a,a", 
> "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
> scala> spark.read.format("csv").schema(schema).option("header", 
> false).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
> a#41, a#42.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
> // If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
> scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
> +---+---+
> | a0| a1|
> +---+---+
> |  1|  1|
> +---+---+
> // parquet
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("a", IntegerType) :: Nil)
> scala> Seq((1, 1)).toDF("a", 
> "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
> scala> spark.read.format("parquet").schema(schema).option("header", 
> false).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
> a#110, a#111.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> {code}
> To make the reason for this error clearer, IMO we had better handle column name 
> duplication more consistently.
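
One way to make this uniform would be to validate user-specified schemas in a 
single shared place before any per-format code runs, e.g. a helper along these 
lines (names and error wording are hypothetical, not the eventual Spark API):

{code}
import java.util.Locale
import org.apache.spark.sql.types.StructType

// Hypothetical validation helper: every data source could call this on the
// user-supplied schema so duplicate names always fail with the same message.
def checkColumnNameDuplication(schema: StructType, caseSensitive: Boolean): Unit = {
  val names =
    if (caseSensitive) schema.fieldNames.toSeq
    else schema.fieldNames.toSeq.map(_.toLowerCase(Locale.ROOT))
  val duplicates = names.groupBy(identity).collect {
    case (name, group) if group.size > 1 => name
  }
  if (duplicates.nonEmpty) {
    // Real Spark code would raise AnalysisException; a plain exception keeps
    // this sketch self-contained.
    throw new IllegalArgumentException(
      s"Found duplicate column(s) in the user-specified schema: ${duplicates.mkString(", ")}")
  }
}
{code}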



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11968:


Assignee: Nick Pentreath  (was: Apache Spark)

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20460) Make it more consistent to handle column name duplication

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20460:


Assignee: Apache Spark

> Make it more consistent to handle column name duplication
> -
>
> Key: SPARK-20460
> URL: https://issues.apache.org/jira/browse/SPARK-20460
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Trivial
>
> In the current master, error handling is different when hitting column name 
> duplication.
> {code}
> // json
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("a", IntegerType) :: Nil)
> scala> Seq("""{"a":1, 
> "a":1}"").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
> scala> spark.read.format("json").schema(schema).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
> a#12, a#13.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
> scala> spark.read.format("json").load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, 
> cannot save to JSON format;
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
> // csv
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("a", IntegerType) :: Nil)
> scala> Seq("a,a", 
> "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
> scala> spark.read.format("csv").schema(schema).option("header", 
> false).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
> a#41, a#42.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
> // If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
> scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
> +---+---+
> | a0| a1|
> +---+---+
> |  1|  1|
> +---+---+
> // parquet
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("a", IntegerType) :: Nil)
> scala> Seq((1, 1)).toDF("a", 
> "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
> scala> spark.read.format("parquet").schema(schema).option("header", 
> false).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
> a#110, a#111.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> {code}
> To make the reason for this error clearer, IMO we had better handle column name 
> duplication more consistently.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11968:


Assignee: Apache Spark  (was: Nick Pentreath)

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20460) Make it more consistent to handle column name duplication

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983079#comment-15983079
 ] 

Apache Spark commented on SPARK-20460:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17758

> Make it more consistent to handle column name duplication
> -
>
> Key: SPARK-20460
> URL: https://issues.apache.org/jira/browse/SPARK-20460
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>
> In the current master, error handling is different when hitting column name 
> duplication.
> {code}
> // json
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("a", IntegerType) :: Nil)
> scala> Seq("""{"a":1, 
> "a":1}"").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
> scala> spark.read.format("json").schema(schema).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
> a#12, a#13.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
> scala> spark.read.format("json").load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, 
> cannot save to JSON format;
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
> // csv
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("a", IntegerType) :: Nil)
> scala> Seq("a,a", 
> "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
> scala> spark.read.format("csv").schema(schema).option("header", 
> false).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
> a#41, a#42.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
> // If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
> scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
> +---+---+
> | a0| a1|
> +---+---+
> |  1|  1|
> +---+---+
> // parquet
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("a", IntegerType) :: Nil)
> scala> Seq((1, 1)).toDF("a", 
> "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
> scala> spark.read.format("parquet").schema(schema).option("header", 
> false).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
> a#110, a#111.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> {code}
> To make the reason for this error clearer, IMO we had better handle column name 
> duplication more consistently.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983074#comment-15983074
 ] 

Peng Meng commented on SPARK-11968:
---

Thanks [~mlnick], I will post more results here. 
My latest result: I changed the PriorityQueue to a BoundedPriorityQueue, which 
gives about a 30% improvement. I will update the PR and the results. 
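
For readers following along, the allocation-saving idea is essentially top-k 
selection with a bounded heap instead of building and sorting a full score array 
per user. A standalone sketch of that pattern (not the code in the PR):

{code}
import scala.collection.mutable

// Keep only the k best-scored items while streaming over candidate scores,
// so no full array of scores has to be materialized and sorted per user.
def topK(scores: Iterator[(Int, Double)], k: Int): Array[(Int, Double)] = {
  // Ordering reversed so the queue head is the *smallest* of the current top-k.
  val minHeap = mutable.PriorityQueue.empty[(Int, Double)](
    Ordering.by[(Int, Double), Double](_._2).reverse)
  scores.foreach { case candidate @ (_, score) =>
    if (minHeap.size < k) {
      minHeap.enqueue(candidate)
    } else if (score > minHeap.head._2) {
      minHeap.dequeue()
      minHeap.enqueue(candidate)
    }
  }
  // Dequeuing yields ascending scores; reverse to return the best first.
  minHeap.dequeueAll.reverse.toArray
}
{code}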

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20460) Make it more consistent to handle column name duplication

2017-04-25 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-20460:
-
Description: 
In the current master, error handling is different when hitting column name 
duplication.

{code}
// json
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, 
"a":1}"").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#12, a#13.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)

scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot 
save to JSON format;
  at 
org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
  at 
org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
  at 
org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)

// csv
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("a", IntegerType) :: Nil)
scala> Seq("a,a", 
"1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("csv").schema(schema).option("header", 
false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#41, a#42.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)

// If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
+---+---+
| a0| a1|
+---+---+
|  1|  1|
+---+---+

// parquet
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("a", IntegerType) :: Nil)
scala> Seq((1, 1)).toDF("a", 
"b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
scala> spark.read.format("parquet").schema(schema).option("header", 
false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#110, a#111.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
{code}

To make the reason for this error clearer, IMO we had better handle column name 
duplication more consistently.

  was:

In the current master, error handling is different when hitting column name 
duplication.

{code}
// json
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, 
"a":1}"").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#12, a#13.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)

scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot 
save to JSON format;
  at 
org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
  at 
org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
  at 
org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchem

[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri

2017-04-25 Thread surya pratap (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983069#comment-15983069
 ] 

surya pratap commented on SPARK-20445:
--

Hello Hyukjin Kwon,
Thanks for the fast reply.
I do not understand your answer about "current master".
What is the current master, and what should we check?
Please explain.

Note: I am running this code on MapR with Spark version 1.6.1.

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
> lp = ( row["SPECIES"], row["IND_SPECIES"], \
> Vectors.dense([row["SEPAL_LENGTH"],\
> row["SEPAL_WIDTH"], \
> row["PETAL_LENGTH"], \
> row["PETAL_WIDTH"]]))
> return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
>dtModel = dtClassifer.fit(trainingData)
>
>issue part:-
>
>dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): 
> File "", line 1, in File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", 
> line 69, in fit return self._fit(dataset) File 
> "/opt/mapr/spark/spark-1.6.1-bin-ha

[jira] [Assigned] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11968:


Assignee: Apache Spark  (was: Nick Pentreath)

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11968:


Assignee: Nick Pentreath  (was: Apache Spark)

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specifi

2017-04-25 Thread surya pratap (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

surya pratap updated SPARK-20445:
-
Comment: was deleted

(was: Hello Hyukjin Kwon,
Thanks for the fast reply.
Which version and which environment are you using?
I am running my code on MapR with Spark version 1.6.)

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
> lp = ( row["SPECIES"], row["IND_SPECIES"], \
> Vectors.dense([row["SEPAL_LENGTH"],\
> row["SEPAL_WIDTH"], \
> row["PETAL_LENGTH"], \
> row["PETAL_WIDTH"]]))
> return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
>dtModel = dtClassifer.fit(trainingData)
>
>issue part:-
>
>dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): 
> File "", line 1, in File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", 
> line 69, in fit return self._fit(dataset) File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/wrapper.py", 
> line 133, in _fit java_model = self._fit_java(dataset) F

[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983066#comment-15983066
 ] 

Apache Spark commented on SPARK-11968:
--

User 'mpjlu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17742

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20460) Make it more consistent to handle column name duplication

2017-04-25 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-20460:


 Summary: Make it more consistent to handle column name duplication
 Key: SPARK-20460
 URL: https://issues.apache.org/jira/browse/SPARK-20460
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Takeshi Yamamuro
Priority: Trivial



In the current master, error handling is different when hitting column name 
duplication.

{code}
// json
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, 
"a":1}"").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#12, a#13.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)

scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot 
save to JSON format;
  at 
org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
  at 
org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
  at 
org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)

// csv
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("a", IntegerType) :: Nil)
scala> Seq("a,a", 
"1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("csv").schema(schema).option("header", 
false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#41, a#42.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)

// If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
+---+---+
| a0| a1|
+---+---+
|  1|  1|
+---+---+

// parquet
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("a", IntegerType) :: Nil)
scala> Seq((1, 1)).toDF("a", 
"b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
scala> spark.read.format("parquet").schema(schema).option("header", 
false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#110, a#111.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
{code}

To make the error reason clearer, IMO we should handle column name duplication 
more consistently across these data sources.
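
For illustration, a minimal sketch of the kind of shared duplicate-name check the 
data sources could converge on (the object name, method name, and error text are 
assumptions, not existing Spark code; it would be called with the resolved column 
names and the session's case-sensitivity setting):

{code}
// Hypothetical shared helper; names and placement are illustrative only.
object SchemaCheckSketch {
  def checkColumnNameDuplication(
      columnNames: Seq[String],
      context: String,
      caseSensitive: Boolean): Unit = {
    val names = if (caseSensitive) columnNames else columnNames.map(_.toLowerCase)
    val duplicates = names.groupBy(identity).collect { case (n, ns) if ns.size > 1 => n }
    if (duplicates.nonEmpty) {
      // In Spark this would presumably surface as an AnalysisException.
      throw new IllegalArgumentException(
        s"Found duplicate column(s) $context: ${duplicates.mkString(", ")}")
    }
  }
}
{code}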






[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-04-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983059#comment-15983059
 ] 

Peng Meng commented on SPARK-20443:
---

Yes, based on my current tests, I agree.
But if the data size is large, there may still be a benefit to adjusting the block size.

> The blockSize of MLLIB ALS should be setting  by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLLIB ALS is very important for ALS performance.
> In our tests, when the blockSize is 128, performance is about 4X better than
> with the default blockSize of 4096.
> The following are our test results:
> BlockSize (recommendationForAll time)
> 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s), 8192 (OOM)
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items






[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983050#comment-15983050
 ] 

Nick Pentreath commented on SPARK-20443:


Your PR for SPARK-20446 / SPARK-11968 should largely remove the need to adjust 
the block size. Do you agree?

> The blockSize of MLLIB ALS should be setting  by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLLIB ALS is very important for ALS performance.
> In our tests, when the blockSize is 128, performance is about 4X better than
> with the default blockSize of 4096.
> The following are our test results:
> BlockSize (recommendationForAll time)
> 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s), 8192 (OOM)
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items






[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983048#comment-15983048
 ] 

Nick Pentreath commented on SPARK-11968:


[~peng.m...@intel.com] would you mind posting your comments here about the 
solution from SPARK-20446 as well as the experiment timings? You can rename 
your PR to include this JIRA (SPARK-11968) in the title instead, in order to 
link it.

Also please include the timings here of the {{ml DataFrame}} version for 
comparison. Your approach should also be much faster than the current {{ml}} 
SparkSQL approach, I think.

I just did some quick tests using MovieLens {{latest}} data (~24 million 
ratings, ~260,000 users, ~39,000 items) and found the following (note these are 
very rough timings):

Using default block sizes:

Current {{ml}} master: 262 sec
My approach: 58 sec
Your PR: 35 sec

You're correct that there is still +/- 20-25% GC time overhead using the BLAS 3 
+ sorting approach. Potentially it could be slightly improved through some form 
of pre-allocation, but even then it does look like any benefit of BLAS 3 is 
smaller than the GC cost.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]






[jira] [Closed] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-20446.
--
Resolution: Duplicate

> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method.
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no way to set it)
> so output is about 4k * 4k * (4 + 4 + 8) bytes = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; we only need 4k * topK 
> * (4 + 4 + 8) bytes of memory to save the temp result.
> I have written a solution for this method with the following test results.
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  






[jira] [Created] (SPARK-20459) JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException

2017-04-25 Thread Jessie Yu (JIRA)
Jessie Yu created SPARK-20459:
-

 Summary: JdbcUtils throws IllegalStateException: Cause already 
initialized after getting SQLException
 Key: SPARK-20459
 URL: https://issues.apache.org/jira/browse/SPARK-20459
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.0.1
Reporter: Jessie Yu
Priority: Minor


While testing some failure scenarios, I found that JdbcUtils throws an 
IllegalStateException instead of the expected SQLException:
{code}
scala> 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(prodtbl,url3,"DB2.D_ITEM_INFO",prop1)
 
17/04/03 17:19:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalStateException: Cause already initialized  
  at java.lang.Throwable.setCause(Throwable.java:365)
  at java.lang.Throwable.initCause(Throwable.java:341)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:241)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
  at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
  at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.lang.Thread.run(Thread.java:785)
{code}

The code in JdbcUtils.savePartition has 
{code}
} catch {
  case e: SQLException =>
val cause = e.getNextException
if (cause != null && e.getCause != cause) {
  if (e.getCause == null) {
e.initCause(cause)
  } else {
e.addSuppressed(cause)
  }
}
{code}

According to the Throwable Javadoc, {{initCause()}} throws an 
{{IllegalStateException}} "if this throwable was created with 
Throwable(Throwable) or Throwable(String,Throwable), or this method has already 
been called on this throwable". The code does check whether {{getCause()}} is 
{{null}} before calling {{initCause()}}. However, {{getCause()}} "returns the 
cause of this throwable or null if the cause is nonexistent or unknown." In 
other words, {{getCause()}} can return {{null}} even though the cause has 
already been initialized (for example, when the throwable was constructed with 
an explicit {{null}} cause), and in that case {{initCause()}} still throws 
{{IllegalStateException}}.
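
A small standalone sketch of that Throwable behavior (plain JDK semantics, not 
Spark code; the exception types used here are arbitrary):

{code}
// A (message, cause) constructor initializes the cause even when it is null,
// so getCause() returns null and a later initCause() still throws.
object InitCauseDemo extends App {
  val e = new RuntimeException("boom", null) // cause explicitly initialized to null
  assert(e.getCause == null)                 // looks like "no cause" to the caller

  try e.initCause(new IllegalArgumentException("next exception"))
  catch {
    case ise: IllegalStateException =>
      println(s"initCause failed as the Javadoc warns: $ise")
  }
}
{code}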






[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri

2017-04-25 Thread surya pratap (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983040#comment-15983040
 ] 

surya pratap commented on SPARK-20445:
--

Hello Hyukjin Kwon,
Thanks for the fast reply.
Which version and environment are you using?
I am running my code on MapR with Spark version 1.6.

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
> lp = ( row["SPECIES"], row["IND_SPECIES"], \
> Vectors.dense([row["SEPAL_LENGTH"],\
> row["SEPAL_WIDTH"], \
> row["PETAL_LENGTH"], \
> row["PETAL_WIDTH"]]))
> return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
>dtModel = dtClassifer.fit(trainingData)
>
>issue part:-
>
>dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): 
> File "", line 1, in File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", 
> line 69, in fit return self._fit(dataset) File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/wrapper.py", 
> line 133, in _fit java_model = 

[jira] [Comment Edited] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982251#comment-15982251
 ] 

Peng Meng edited comment on SPARK-20446 at 4/25/17 3:06 PM:


Yes, I compared with ML ALSModel.recommendAll. The data size is 480,000 * 17,000; 
compared with mllib recommendAll, there is about a 10% to 20% improvement.

Blockify is very important for the performance of our solution, because the 
compute of CartesianRDD is:
for (x <- rdd1.iterator(currSplit.s1, context); y <- 
rdd2.iterator(currSplit.s2, context)) yield (x, y)
Without blockify, rdd2.iterator is called once per record of rdd1.
With blockify, rdd2.iterator is called once per group (block) of rdd1.
The performance difference is very large.

The SparkSQL approach is quite different from my solution. It uses crossJoin to 
do the Cartesian product; the key optimization is in UnsafeCartesianRDD, which 
uses a rowArray to cache the right partition.

The key parts of our method:
1. Use blockify to optimize the Cartesian product computation, not to enable 
BLAS 3.
2. Use a priorityQueue on each group to save memory, not just to find the topK 
products for each user.
The SparkSQL approach doesn't work like this.
cc [~mengxr]
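
To make the blockify + priorityQueue idea above concrete, a rough sketch in 
plain RDD code (the blocking scheme, the dot-product loop, and all names are 
illustrative assumptions, not the actual PR):

{code}
import org.apache.spark.rdd.RDD
import scala.collection.mutable

def recommendAllSketch(
    userFeatures: RDD[(Int, Array[Double])],
    itemFeatures: RDD[(Int, Array[Double])],
    topK: Int,
    blockSize: Int): RDD[(Int, Array[(Int, Double)])] = {

  // Blockify: group records so that in the Cartesian product the right-hand
  // iterator is materialized once per block pair, not once per left record.
  val userBlocks = userFeatures.mapPartitions(_.grouped(blockSize))
  val itemBlocks = itemFeatures.mapPartitions(_.grouped(blockSize))

  val byScore = Ordering.by[(Int, Double), Double](_._2)

  userBlocks.cartesian(itemBlocks).flatMap { case (users, items) =>
    users.map { case (userId, userVec) =>
      // Bounded priority queue: memory per user is O(topK), not O(#items).
      val queue = mutable.PriorityQueue.empty[(Int, Double)](byScore.reverse)
      items.foreach { case (itemId, itemVec) =>
        var score = 0.0
        var i = 0
        while (i < userVec.length) { score += userVec(i) * itemVec(i); i += 1 }
        queue.enqueue((itemId, score))
        if (queue.size > topK) queue.dequeue() // drops the current lowest score
      }
      (userId, queue.toArray)
    }
  }.reduceByKey { (a, b) =>
    (a ++ b).sortBy(-_._2).take(topK) // merge partial top-Ks across item blocks
  }.mapValues(_.sortBy(-_._2))        // highest score first
}
{code}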




was (Author: peng.m...@intel.com):
Yes, I compared with ML ALSModel.recommendAll. The data size is 480,000*17,000, 
comparing with mllib recommendAll, there is about 10% to 20% improvement. 

Blockify is very important the performance of our solution. Because the compute 
of CartesianRDD is:
for (x <- rdd1.iterator(currSplit.s1, context); y <- 
rdd2.iterator(currSplit.s2, context)) yield (x, y)
If no blockify, rdd2.iterator will be called the numberOfRecords times of rdd1.
With blockify, rdd2.iterator will be called the numberOfGroups times of rdd1.
The performance difference is very large.

SparkSQL approach is much different from my solution. It uses crossJoin to do 
Cartesian product, the key optimization is in UnsafeCartesianRDD, which uses 
rowArray to cache the right partition. 
 
The key part of our method:
1. use blockify to optimize the Cartesian production computation, not use 
blockify for BLAS 3.
2. use priorityQueue to save memory (used on each group), not to find TopK 
product for each user.
SqarkSQL approach doesn't work like this.   
cc [~mengxr]



> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method.
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no way to set it)
> so output is about 4k * 4k * (4 + 4 + 8) bytes = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; we only need 4k * topK 
> * (4 + 4 + 8) bytes of memory to save the temp result.
> I have written a solution for this method with the following test results.
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  






[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983030#comment-15983030
 ] 

Peng Meng commented on SPARK-20446:
---

Thanks [~mlnick], I agree with you. I am OK with closing this ticket and moving the 
discussion to SPARK-11968.

> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method.
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no way to set it)
> so output is about 4k * 4k * (4 + 4 + 8) bytes = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; we only need 4k * topK 
> * (4 + 4 + 8) bytes of memory to save the temp result.
> I have written a solution for this method with the following test results.
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  






[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983026#comment-15983026
 ] 

Nick Pentreath commented on SPARK-11968:


Note, there is a solution proposed in SPARK-20446. I've redirected the 
discussion to this original JIRA.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]






[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983018#comment-15983018
 ] 

Nick Pentreath commented on SPARK-20446:


By the way, when I say it is a duplicate, I mean the JIRA ticket. I agree 
that PR 9980 was not the correct solution; JIRA tickets can have multiple PRs 
linked to them.

I'd prefer to close this ticket and move the discussion to SPARK-11968 (also, 
there are watchers on that ticket who may be interested in the outcome).



> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method.
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no way to set it)
> so output is about 4k * 4k * (4 + 4 + 8) bytes = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; we only need 4k * topK 
> * (4 + 4 + 8) bytes of memory to save the temp result.
> I have written a solution for this method with the following test results.
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  






[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Dmitry Naumenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982949#comment-15982949
 ] 

Dmitry Naumenko commented on SPARK-13747:
-

[~zsxwing] I did a similar test with a join and hit the same error in 2.2.0 (the 
actual query is here: 
https://github.com/dnaumenko/spark-realtime-analytics-sample/blob/master/samples/src/main/scala/com/github/sparksample/httpapp/SimpleServer.scala).

My test setup is a simple long-running akka-http application and a separate 
Gatling script that spawns multiple requests for the join query 
(https://github.com/dnaumenko/spark-realtime-analytics-sample/blob/master/loadtool/src/main/scala/com/github/sparksample/SimpleServerSimulation.scala
 is the test simulation script).

[~barrybecker4] I managed to fix the problem by switching akka's default 
executor to a thread pool. But I guess the root cause is that Spark relies on 
ThreadLocal variables and manages them incorrectly.
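
For reference, a minimal sketch of that workaround, running the concurrent 
actions on a dedicated fixed-size pool rather than the global ForkJoinPool (the 
pool size and the toy query are illustrative assumptions):

{code}
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

// `spark` is assumed to be an existing SparkSession.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

val futures = (1 to 100).map { _ =>
  Future {
    // Each concurrent action runs on its own pool thread, so the
    // spark.sql.execution.id local property is not shared between
    // suspended ForkJoin tasks running in the same thread.
    spark.range(1, 6).toDF("a").count()
  }
}
Await.result(Future.sequence(futures), 10.minutes)
{code}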

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.






[jira] [Created] (SPARK-20458) support getting Yarn Tracking URL in code

2017-04-25 Thread PJ Fanning (JIRA)
PJ Fanning created SPARK-20458:
--

 Summary: support getting Yarn Tracking URL in code
 Key: SPARK-20458
 URL: https://issues.apache.org/jira/browse/SPARK-20458
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.1.0
Reporter: PJ Fanning


org.apache.spark.deploy.yarn.Client logs the Yarn tracking URL but it would be 
useful to be able to access this in code, as opposed to mining log output.

I have an application where I monitor the health of the SparkContext and 
associated Executors using the Spark REST API.

Would it be feasible to add a listener API to listen for new ApplicationReports 
in org.apache.spark.deploy.yarn.Client? Alternatively, this URL could be 
exposed as a property associated with the SparkContext.
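
In the meantime, a minimal sketch of fetching the tracking URL directly through 
the YARN client API (the method name and the use of a plain Configuration are 
assumptions, not a proposed Spark API):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.ApplicationId
import org.apache.hadoop.yarn.client.api.YarnClient

// The ApplicationId is assumed to be known already (e.g. parsed from
// SparkContext.applicationId); error handling is omitted.
def trackingUrl(appId: ApplicationId): String = {
  val yarnClient = YarnClient.createYarnClient()
  yarnClient.init(new Configuration())
  yarnClient.start()
  try yarnClient.getApplicationReport(appId).getTrackingUrl
  finally yarnClient.stop()
}
{code}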







[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982852#comment-15982852
 ] 

Hyukjin Kwon commented on SPARK-20445:
--

Are you maybe able to try this against the current master or higher versions?

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
> lp = ( row["SPECIES"], row["IND_SPECIES"], \
> Vectors.dense([row["SEPAL_LENGTH"],\
> row["SEPAL_WIDTH"], \
> row["PETAL_LENGTH"], \
> row["PETAL_WIDTH"]]))
> return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
>dtModel = dtClassifer.fit(trainingData)
>
>issue part:-
>
>dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): 
> File "", line 1, in File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", 
> line 69, in fit return self._fit(dataset) File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/wrapper.py", 
> line 133, in _fit java_model = self._fit_java(dataset) File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoo

[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri

2017-04-25 Thread surya pratap (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982830#comment-15982830
 ] 

surya pratap commented on SPARK-20445:
--

Hello Hyukjin Kwon,
Thanks for the reply.
I have tried many times but keep getting the same issue.
I am using Spark version 1.6.1.
Please help me sort out this issue.

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
> lp = ( row["SPECIES"], row["IND_SPECIES"], \
> Vectors.dense([row["SEPAL_LENGTH"],\
> row["SEPAL_WIDTH"], \
> row["PETAL_LENGTH"], \
> row["PETAL_WIDTH"]]))
> return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
>dtModel = dtClassifer.fit(trainingData)
>
>issue part:-
>
>dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): 
> File "", line 1, in File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", 
> line 69, in fit return self._fit(dataset) File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/wrapper.py", 
> line 133, in _fit java_model 

[jira] [Updated] (SPARK-20457) Spark CSV is not able to Override Schema while reading data

2017-04-25 Thread Himanshu Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Himanshu Gupta updated SPARK-20457:
---
Description: 
I have a CSV file, test.csv:

{code:xml}
col
1
2
3
4
{code}

When I read it using Spark, it gets the schema of data correct:

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"true").csv("test.csv")

df.printSchema
root
|-- col: integer (nullable = true)
{code}

But when I override the `schema` of the CSV file and make `inferSchema` false, 
SparkSession picks up the custom schema only partially.

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"false").schema(StructType(List(StructField("custom", StringType, 
false)))).csv("test.csv")

df.printSchema
root
|-- custom: string (nullable = true)
{code}

I mean that only the column name (`custom`) and DataType (`StringType`) are 
getting picked up, but the `nullable` part is being ignored: it still comes back 
as `nullable = true`, which is incorrect.

I am not able to understand this behavior.

  was:
I have a CSV file, test.csv:

{code:csv}
col
1
2
3
4
{code}

When I read it using Spark, it gets the schema of data correct:

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"true").csv("test.csv")

df.printSchema
root
 |-- col: integer (nullable = true)
{code}

But when I override the `schema` of CSV file and make `inferSchema` false, then 
SparkSession is picking up custom schema partially.

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"false").schema(StructType(List(StructField("custom", StringType, 
false.csv("test.csv")

df.printSchema
root
 |-- custom: string (nullable = true)
{code}

I mean only column name (`custom`) and DataType (`StringType`) are getting 
picked up. But, `nullable` part is being ignored, as it is still coming 
`nullable = true`, which is incorrect.

I am not able to understand this behavior.


> Spark CSV is not able to Override Schema while reading data
> ---
>
> Key: SPARK-20457
> URL: https://issues.apache.org/jira/browse/SPARK-20457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Himanshu Gupta
>
> I have a CSV file, test.csv:
> {code:xml}
> col
> 1
> 2
> 3
> 4
> {code}
> When I read it using Spark, it gets the schema of data correct:
> {code:java}
> val df = spark.read.option("header", "true").option("inferSchema", 
> "true").csv("test.csv")
> 
> df.printSchema
> root
> |-- col: integer (nullable = true)
> {code}
> But when I override the `schema` of the CSV file and make `inferSchema` false, 
> SparkSession picks up the custom schema only partially.
> {code:java}
> val df = spark.read.option("header", "true").option("inferSchema", 
> "false").schema(StructType(List(StructField("custom", StringType, 
> false)))).csv("test.csv")
> df.printSchema
> root
> |-- custom: string (nullable = true)
> {code}
> I mean that only the column name (`custom`) and DataType (`StringType`) are 
> getting picked up, but the `nullable` part is being ignored: it still comes 
> back as `nullable = true`, which is incorrect.
> I am not able to understand this behavior.






[jira] [Created] (SPARK-20457) Spark CSV is not able to Override Schema while reading data

2017-04-25 Thread Himanshu Gupta (JIRA)
Himanshu Gupta created SPARK-20457:
--

 Summary: Spark CSV is not able to Override Schema while reading 
data
 Key: SPARK-20457
 URL: https://issues.apache.org/jira/browse/SPARK-20457
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Himanshu Gupta


I have a CSV file, test.csv:

{code:csv}
col
1
2
3
4
{code}

When I read it using Spark, it gets the schema of data correct:

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"true").csv("test.csv")

df.printSchema
root
 |-- col: integer (nullable = true)
{code}

But when I override the `schema` of the CSV file and make `inferSchema` false, 
SparkSession picks up the custom schema only partially.

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"false").schema(StructType(List(StructField("custom", StringType, 
false)))).csv("test.csv")

df.printSchema
root
 |-- custom: string (nullable = true)
{code}

I mean that only the column name (`custom`) and DataType (`StringType`) are 
getting picked up, but the `nullable` part is being ignored: it still comes back 
as `nullable = true`, which is incorrect.

I am not able to understand this behavior.
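
For what it's worth, a minimal sketch of a possible workaround, assuming the 
data genuinely contains no nulls (the re-applied schema is not validated against 
the values):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("nullable-workaround").getOrCreate()
val strictSchema = StructType(List(StructField("custom", StringType, nullable = false)))

val df = spark.read.option("header", "true").schema(strictSchema).csv("test.csv")
// The file source reports nullable = true regardless of the user schema, so
// re-apply the strict schema on top of the loaded rows.
val strictDf = spark.createDataFrame(df.rdd, strictSchema)
strictDf.printSchema()  // |-- custom: string (nullable = false)
{code}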






[jira] [Closed] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-25 Thread HanCheol Cho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HanCheol Cho closed SPARK-20336.

Resolution: Not A Bug

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load data 
> that has multi-line records.
> However, non-ASCII characters are not loaded properly.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> +----+----+----+
> {code}
> The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.





