[jira] [Comment Edited] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984173#comment-15984173 ] Liang-Chi Hsieh edited comment on SPARK-20392 at 4/26/17 6:44 AM: -- [~barrybecker4] Currently I think the performance degradation is caused by the cost of converting between the DataFrame/Dataset abstraction and logical plans. Some operations on DataFrames convert between DataFrames and logical plans. In typical SQL usage this cost can be ignored, but in ML it is not rare to chain dozens of pipeline stages. As the query plan grows incrementally while those stages run, the cost of the conversion grows too. In particular, the Analyzer walks the whole large query plan even though most of it has already been analyzed. By eliminating part of that cost, the time to run the example code locally is reduced from about 1 min to about 30 secs. The time spent applying the pipeline locally is mostly in the {{transform}} calls of the 137 {{Bucketizer}}s. Before the change, each {{Bucketizer}}'s {{transform}} call costs about 0.4 sec, so the total across all {{Bucketizer}}s is about 50 secs. After the change, each call costs only about 0.1 sec.
> Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
> This started as a [question on stack overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. This seems much too long for such a tiny dataset. Similar pipelines run quickly on datasets that have fewer columns and more rows. It's something about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
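The growing-plan cost described in the comment above suggests a user-side workaround: periodically truncate the lineage between pipeline stages so the Analyzer never has to re-walk a very deep plan. This is an illustrative sketch only (not from this ticket; the helper name and the truncation interval are made up), assuming a running Spark 2.x session:

{code:scala}
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.DataFrame

// Apply transformers in sequence, but every `every` stages re-create the
// DataFrame from its RDD. That resets the logical plan to a plain scan of
// the materialized rows, trading some serialization cost for bounded
// analysis time on each subsequent stage.
def applyWithTruncation(stages: Seq[Transformer], input: DataFrame, every: Int = 20): DataFrame =
  stages.zipWithIndex.foldLeft(input) { case (df, (stage, i)) =>
    val out = stage.transform(df)
    if ((i + 1) % every == 0) out.sparkSession.createDataFrame(out.rdd, out.schema)
    else out
  }
{code}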
[jira] [Comment Edited] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984161#comment-15984161 ] Yan Facai (颜发才) edited comment on SPARK-20199 at 4/26/17 6:11 AM: -- The work itself is easy; however, a **public method** is added and some adjustments are needed in the internal implementation. Hence, I suggest delaying it until at least one of the committers agrees to shepherd the issue. I have two questions: 1. Since both GBDT and RandomForest share this attribute, we can pull the `featureSubsetStrategy` parameter up to either TreeEnsembleParams or DecisionTreeParams. Which one is appropriate? 2. Is it right to add the new parameter `featureSubsetStrategy` to the Strategy class, or to DecisionTree's train method?
> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate parameter. This parameter is available in H2O and XGBoost.
> Sample from H2O.ai: gbmParams._col_sample_rate
> Please provide the parameter.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
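For context, this is roughly how column subsampling is already exposed on random forests in spark.ml, which is the setter this ticket would like gradient-boosted trees to gain as well; an illustrative sketch, assuming Spark 2.x:

{code:scala}
// Column (feature) subsampling on RandomForestClassifier; GBTs currently
// lack the corresponding parameter, which is what SPARK-20199 proposes.
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setNumTrees(100)
  // Features considered per split: "auto", "all", "sqrt", "log2",
  // "onethird", an integer "n", or a fraction such as "0.3".
  .setFeatureSubsetStrategy("sqrt")
{code}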
[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984194#comment-15984194 ] Liang-Chi Hsieh commented on SPARK-20392: - By also disabling the {{spark.sql.constraintPropagation.enabled}} flag, which eliminates the cost of constraint propagation on the big query plan, the running time can be further reduced to about 20 secs.
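The mitigation mentioned in this comment amounts to a single session-level switch; a sketch, assuming a Spark version that exposes this flag and a live {{spark}} session:

{code:scala}
// Disable constraint propagation so the optimizer does not recompute
// constraints over the very large query plan built by the pipeline.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
{code}

The same flag can be set at submit time with {{--conf spark.sql.constraintPropagation.enabled=false}}.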
[jira] [Assigned] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created
[ https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20465: Assignee: Apache Spark
> Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created
> ---
>
> Key: SPARK-20465
> URL: https://issues.apache.org/jira/browse/SPARK-20465
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, Spark Submit
> Affects Versions: 2.2.0
> Reporter: Hyukjin Kwon
> Assignee: Apache Spark
> Priority: Trivial
>
> If none of the temp directories can be created, it throws an {{ArrayIndexOutOfBoundsException}} as below:
> {code}
> ./bin/spark-shell --conf spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. Ignoring this directory.
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. Ignoring this directory.
> Exception in thread "main" java.lang.ExceptionInInitializerError
> 	at org.apache.spark.repl.Main.main(Main.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
> 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
> 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
> 	at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
> 	at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
> 	at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
> 	at scala.Option.getOrElse(Option.scala:121)
> 	at org.apache.spark.repl.Main$.<init>(Main.scala:37)
> 	at org.apache.spark.repl.Main$.<clinit>(Main.scala)
> ... 10 more
> {code}
> It seems we should throw a proper exception with a better message.
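The improvement being requested is essentially to fail fast with a descriptive error instead of letting the empty array of usable directories surface as an {{ArrayIndexOutOfBoundsException}}. A minimal, hypothetical sketch of that behavior (names are illustrative; the real change would live in {{Utils.getLocalDir}}):

{code:scala}
// Hypothetical sketch: pick the first usable local directory, or fail with a
// message naming the configured candidates, rather than indexing into an
// empty array and throwing ArrayIndexOutOfBoundsException.
import java.io.IOException

def getLocalDir(configuredDirs: Seq[String], usableDirs: Seq[String]): String =
  usableDirs.headOption.getOrElse {
    throw new IOException(
      s"Failed to get a temp directory under [${configuredDirs.mkString(",")}].")
  }
{code}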
[jira] [Assigned] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created
[ https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20465: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created
[ https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984180#comment-15984180 ] Apache Spark commented on SPARK-20465: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17768
[jira] [Resolved] (SPARK-20437) R wrappers for rollup and cube
[ https://issues.apache.org/jira/browse/SPARK-20437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20437. -- Resolution: Fixed Assignee: Maciej Szymkiewicz Fix Version/s: 2.3.0 Target Version/s: 2.3.0
> R wrappers for rollup and cube
> --
>
> Key: SPARK-20437
> URL: https://issues.apache.org/jira/browse/SPARK-20437
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Affects Versions: 2.2.0
> Reporter: Maciej Szymkiewicz
> Assignee: Maciej Szymkiewicz
> Priority: Minor
> Fix For: 2.3.0
>
> Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}.
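For reference, the Scala {{Dataset}} methods that the new SparkR wrappers mirror look like this (illustrative sketch; assumes a SparkSession {{spark}} with its implicits imported so {{toDF}} is available):

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.avg

val df = Seq(("sales", "a", 10), ("sales", "b", 20), ("eng", "a", 30))
  .toDF("dept", "group", "salary")

// rollup: hierarchical subtotals: (dept, group), (dept), and the grand total
df.rollup("dept", "group").agg(avg("salary")).show()

// cube: subtotals for every combination of the grouping columns,
// including (group) alone, which rollup does not produce
df.cube("dept", "group").agg(avg("salary")).show()
{code}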
> 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_buck
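The cost growth described in Liang-Chi's comment above can be illustrated with a small, self-contained model (an illustrative sketch only, not Spark's actual Analyzer code): if every stage re-analyzes the whole accumulated plan, total analysis work grows quadratically with the number of stages, while analyzing only the newly added part keeps it linear.

```python
# Illustrative cost model (not Spark code): each pipeline stage adds one
# operator to the query plan, and "analysis" visits plan nodes.
def run_pipeline(num_stages, incremental=False):
    visits = 0      # total node visits spent on analysis
    plan_size = 0   # number of operators in the accumulated plan
    for _ in range(num_stages):
        plan_size += 1
        if incremental:
            visits += 1          # analyze only the newly added operator
        else:
            visits += plan_size  # re-walk the entire plan at every stage
    return visits

# 137 stages, as with the Bucketizers in the reported pipeline:
full = run_pipeline(137)                    # quadratic growth in total work
incr = run_pipeline(137, incremental=True)  # linear growth in total work
```

This is only a model of the claimed behaviour; the real costs live in Catalyst's Analyzer, but it shows why each successive {{Bucketizer}} {{transform}} gets more expensive as the plan grows.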
[jira] [Updated] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created
[ https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20465: - Component/s: Spark Core > Throws a proper exception rather than ArrayIndexOutOfBoundsException when > temp directories could not be got/created > --- > > Key: SPARK-20465 > URL: https://issues.apache.org/jira/browse/SPARK-20465 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > If none of the temp directories can be created, it throws an > {{ArrayIndexOutOfBoundsException}} as below: > {code} > ./bin/spark-shell --conf > spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO > 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. > Ignoring this directory. > 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. > Ignoring this directory. > Exception in thread "main" java.lang.ExceptionInInitializerError > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743) > at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37) > at 
org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.repl.Main$.(Main.scala:37) > at org.apache.spark.repl.Main$.(Main.scala) > ... 10 more > {code} > It seems we should throw a proper exception with a better message. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
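The fix the issue asks for can be sketched as follows (a hypothetical Python analogue of {{Utils.getLocalDir}}; the function name and error message are illustrative, not Spark's actual code): validate the configured directories and raise a descriptive error instead of indexing into an empty list.

```python
import os

def get_local_dir(configured_dirs):
    # Try to create each configured directory, skipping failures the way
    # Spark logs "Failed to create dir ... Ignoring this directory."
    usable = []
    for d in configured_dirs:
        try:
            os.makedirs(d, exist_ok=True)
            usable.append(d)
        except OSError:
            pass
    if not usable:
        # Instead of letting usable[0] raise an opaque IndexError (the
        # Python analogue of ArrayIndexOutOfBoundsException), fail with a
        # message that names the bad configuration.
        raise OSError(
            "Failed to get a temp directory under %s" % ",".join(configured_dirs))
    return usable[0]
```

With this shape, a misconfigured `spark.local.dir` would produce a message pointing at the configured paths rather than `ArrayIndexOutOfBoundsException: 0`.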
[jira] [Created] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created
Hyukjin Kwon created SPARK-20465: Summary: Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created Key: SPARK-20465 URL: https://issues.apache.org/jira/browse/SPARK-20465 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 2.2.0 Reporter: Hyukjin Kwon Priority: Trivial If none of the temp directories can be created, it throws an {{ArrayIndexOutOfBoundsException}} (stack trace as in the update above). It seems we should throw a proper exception with a better message.
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984161#comment-15984161 ] Yan Facai (颜发才) commented on SPARK-20199: - The work itself is easy; however, a public method is added and some adjustments are needed in the inner implementation. Hence, I suggest delaying it until an expert agrees to shepherd the issue. I have two questions: 1. Since both GBDT and RandomForest share the attribute, we can pull the `featureSubsetStrategy` parameter up to either TreeEnsembleParams or DecisionTreeParams. Which one is appropriate? 2. Is it right to add the new parameter `featureSubsetStrategy` to the Strategy class? Or to add it to DecisionTree's train method? > GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have a Column sampling rate parameter. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter.
[jira] [Resolved] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
[ https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16548. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.0 > java.io.CharConversionException: Invalid UTF-32 character prevents me from > querying my data > > > Key: SPARK-16548 > URL: https://issues.apache.org/jira/browse/SPARK-16548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Egor Pahomov >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > > Basically, when I query my json data I get > {code} > java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above > 10) at char #192, byte #771) > at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) > at > org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) > {code} > I do not like it. If one json record among 100500 cannot be processed, please > return null; do not fail everything. I have a dirty one-line fix, and I understand how > to make it more reasonable. What is our position - what behaviour do we want > to get?
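The behaviour the reporter asks for - skip the bad record instead of failing the whole query - amounts to per-record lenient parsing. A minimal sketch of the principle (illustrative Python only; Spark's JSON datasource exposes similar leniency through its parse-mode options, and the actual fix lives in the expression code referenced in the stack trace):

```python
import json

def parse_records_leniently(raw_records):
    """Parse each record independently; a malformed record becomes None
    instead of aborting the whole batch (the behaviour requested above)."""
    parsed = []
    for raw in raw_records:
        try:
            parsed.append(json.loads(raw))
        except ValueError:  # includes json.JSONDecodeError
            parsed.append(None)  # corrupt record: keep going
    return parsed

rows = parse_records_leniently(['{"a": 1}', '\x00\x00 not json', '{"a": 2}'])
# rows -> [{'a': 1}, None, {'a': 2}]
```

One bad record among many then costs a null, not a failed query.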
[jira] [Updated] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables
[ https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20439: Fix Version/s: 2.1.1 > Catalog.listTables() depends on all libraries used to create tables > --- > > Key: SPARK-20439 > URL: https://issues.apache.org/jira/browse/SPARK-20439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.1, 2.2.0, 2.3.0 > > > spark.catalog.listTables() and getTable > You may get an error on the table serde library: > java.lang.RuntimeException: java.lang.ClassNotFoundException: > com.amazon.emr.kinesis.hive.KinesisHiveInputFormat > Or if the database contains any table (e.g., index) with a table type that is > not accessible by Spark SQL, it will fail the whole listTable API.
[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20456: - Component/s: PySpark > Add examples for functions collection for pyspark > - > > Key: SPARK-20456 > URL: https://issues.apache.org/jira/browse/SPARK-20456 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Priority: Minor > > Document `sql.functions.py`: > 1. Add examples for the common aggregate functions (`min`, `max`, `mean`, > `count`, `collect_set`, `collect_list`, `stddev`, `variance`) > 2. Rename columns in datetime examples. > 3. Add examples for `unix_timestamp` and `from_unixtime` > 4. Add note to all trigonometry functions that units are radians. > 5. Add example for `lit`
[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Patterson updated SPARK-20456: -- Summary: Add examples for functions collection for pyspark (was: Document major aggregation functions for pyspark) > Add examples for functions collection for pyspark > - > > Key: SPARK-20456 > URL: https://issues.apache.org/jira/browse/SPARK-20456 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Priority: Minor > > Document `sql.functions.py`: > 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, > `collect_set`, `collect_list`, `stddev`, `variance`) > 2. Rename columns in datetime examples. > 3. Add examples for `unix_timestamp` and `from_unixtime` > 4. Add note to all trigonometry functions that units are radians. > 5. Document `lit`
[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Patterson updated SPARK-20456: -- Description: Document `sql.functions.py`: 1. Add examples for the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`) 2. Rename columns in datetime examples. 3. Add examples for `unix_timestamp` and `from_unixtime` 4. Add note to all trigonometry functions that units are radians. 5. Add example for `lit` was: Document `sql.functions.py`: 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`) 2. Rename columns in datetime examples. 3. Add examples for `unix_timestamp` and `from_unixtime` 4. Add note to all trigonometry functions that units are radians. 5. Document `lit` > Add examples for functions collection for pyspark > - > > Key: SPARK-20456 > URL: https://issues.apache.org/jira/browse/SPARK-20456 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Priority: Minor > > Document `sql.functions.py`: > 1. Add examples for the common aggregate functions (`min`, `max`, `mean`, > `count`, `collect_set`, `collect_list`, `stddev`, `variance`) > 2. Rename columns in datetime examples. > 3. Add examples for `unix_timestamp` and `from_unixtime` > 4. Add note to all trigonometry functions that units are radians. > 5. Add example for `lit`
[jira] [Resolved] (SPARK-20457) Spark CSV is not able to Override Schema while reading data
[ https://issues.apache.org/jira/browse/SPARK-20457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20457. -- Resolution: Duplicate Currently, the nullability seems to be ignored. I am pretty sure that it duplicates SPARK-19950. I am resolving this. > Spark CSV is not able to Override Schema while reading data > --- > > Key: SPARK-20457 > URL: https://issues.apache.org/jira/browse/SPARK-20457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Himanshu Gupta > > I have a CSV file, test.csv: > {code:xml} > col > 1 > 2 > 3 > 4 > {code} > When I read it using Spark, it gets the schema of the data correct: > {code:java} > val df = spark.read.option("header", "true").option("inferSchema", > "true").csv("test.csv") > > df.printSchema > root > |-- col: integer (nullable = true) > {code} > But when I override the `schema` of the CSV file and make `inferSchema` false, > then SparkSession picks up the custom schema only partially. > {code:java} > val df = spark.read.option("header", "true").option("inferSchema", > "false").schema(StructType(List(StructField("custom", StringType, > false)))).csv("test.csv") > df.printSchema > root > |-- custom: string (nullable = true) > {code} > I mean only the column name (`custom`) and DataType (`StringType`) are getting > picked up. But the `nullable` part is being ignored, as it still comes back as > `nullable = true`, which is incorrect. > I am not able to understand this behavior.
[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983896#comment-15983896 ] Hyukjin Kwon commented on SPARK-20336: -- Thank you guys for confirming this. > spark.read.csv() with wholeFile=True option fails to read non ASCII unicode > characters > -- > > Key: SPARK-20336 > URL: https://issues.apache.org/jira/browse/SPARK-20336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2.0 (master branch is downloaded from Github) > PySpark >Reporter: HanCheol Cho > > I used the spark.read.csv() method with the wholeFile=True option to load data that > has multi-line records. > However, non-ASCII characters are not properly loaded. > The following is a sample data for the test: > {code:none} > col1,col2,col3 > 1,a,text > 2,b,テキスト > 3,c,텍스트 > 4,d,"text > テキスト > 텍스트" > 5,e,last > {code} > When it is loaded without the wholeFile=True option, non-ASCII characters are > shown correctly, although multi-line records are parsed incorrectly as follows: > {code:none} > testdf_default = spark.read.csv("test.encoding.csv", header=True) > testdf_default.show() > +----+----+----+ > |col1|col2|col3| > +----+----+----+ > | 1| a|text| > | 2| b|テキスト| > | 3| c| 텍스트| > | 4| d|text| > |テキスト|null|null| > | 텍스트"|null|null| > | 5| e|last| > +----+----+----+ > {code} > When the wholeFile=True option is used, non-ASCII characters are broken as > follows: > {code:none} > testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, > wholeFile=True) > testdf_wholefile.show() > +----+----+----+ > |col1|col2|col3| > +----+----+----+ > | 1| a|text| > | 2| b|| > | 3| c| �| > | 4| d|text > ...| > | 5| e|last| > +----+----+----+ > {code} > The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.
[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri
[ https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983892#comment-15983892 ] Hyukjin Kwon commented on SPARK-20445: -- I meant the current codebase, the latest build. Testing against 2.1.0 would probably be enough. In the guidelines - http://spark.apache.org/contributing.html > For issues that can’t be reproduced against master as reported, resolve as > Cannot Reproduce I could not reproduce this in the current master. So, I resolved this as {{Cannot Reproduce}}. I assume there is a JIRA fixing this issue. You (or anyone) can identify the JIRA and then make a backport if applicable. > pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was > given input with invalid label column label, without the number of classes > specified. See StringIndexer > > > Key: SPARK-20445 > URL: https://issues.apache.org/jira/browse/SPARK-20445 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: surya pratap > > #Load the CSV file into a RDD > irisData = sc.textFile("/home/infademo/surya/iris.csv") > irisData.cache() > irisData.count() > #Remove the first line (contains headers) > dataLines = irisData.filter(lambda x: "Sepal" not in x) > dataLines.count() > from pyspark.sql import Row > #Create a Data Frame from the data > parts = dataLines.map(lambda l: l.split(",")) > irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\ > SEPAL_WIDTH=float(p[1]), \ > PETAL_LENGTH=float(p[2]), \ > PETAL_WIDTH=float(p[3]), \ > SPECIES=p[4] )) > # Infer the schema, and register the DataFrame as a table. 
> irisDf = sqlContext.createDataFrame(irisMap) > irisDf.cache() > #Add a numeric indexer for the label/target column > from pyspark.ml.feature import StringIndexer > stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES") > si_model = stringIndexer.fit(irisDf) > irisNormDf = si_model.transform(irisDf) > irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect() > irisNormDf.cache() > > """-- > Perform Data Analytics > > -""" > #See standard parameters > irisNormDf.describe().show() > #Find correlation between predictors and target > for i in irisNormDf.columns: > if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) : > print( "Correlation to Species for ", i, \ > irisNormDf.stat.corr('IND_SPECIES',i)) > #Transform to a Data Frame for input to Machine Learing > #Drop columns that are not required (low correlation) > from pyspark.mllib.linalg import Vectors > from pyspark.mllib.linalg import SparseVector > from pyspark.mllib.regression import LabeledPoint > from pyspark.mllib.util import MLUtils > import org.apache.spark.mllib.linalg.{Matrix, Matrices} > from pyspark.mllib.linalg.distributed import RowMatrix > from pyspark.ml.linalg import Vectors > pyspark.mllib.linalg.Vector > def transformToLabeledPoint(row) : > lp = ( row["SPECIES"], row["IND_SPECIES"], \ > Vectors.dense([row["SEPAL_LENGTH"],\ > row["SEPAL_WIDTH"], \ > row["PETAL_LENGTH"], \ > row["PETAL_WIDTH"]])) > return lp > irisLp = irisNormDf.rdd.map(transformToLabeledPoint) > irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", > "features"]) > irisLpDf.select("species","label","features").show(10) > irisLpDf.cache() > > """-- > Perform Machine Learning > > -""" > #Split into training and testing data > (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1]) > trainingData.count() > testData.count() > testData.collect() > from pyspark.ml.classification import DecisionTreeClassifier > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > 
#Create the model > dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\ > featuresCol="features") >dtModel = dtClassifer.fit(trainingData) > >issue part:- > >dtModel =
[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983888#comment-15983888 ] Hyukjin Kwon commented on SPARK-20456: -- I simply left the comment above because the current status does not seem to match the description and the title. Let's fix the title and the description here. > Document major aggregation functions for pyspark > > > Key: SPARK-20456 > URL: https://issues.apache.org/jira/browse/SPARK-20456 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Priority: Minor > > Document `sql.functions.py`: > 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, > `collect_set`, `collect_list`, `stddev`, `variance`) > 2. Rename columns in datetime examples. > 3. Add examples for `unix_timestamp` and `from_unixtime` > 4. Add note to all trigonometry functions that units are radians. > 5. Document `lit`
[jira] [Commented] (SPARK-18127) Add hooks and extension points to Spark
[ https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983872#comment-15983872 ] Frederick Reiss commented on SPARK-18127: - Is there a design document or a public design and requirements discussion associated with this JIRA? > Add hooks and extension points to Spark > --- > > Key: SPARK-18127 > URL: https://issues.apache.org/jira/browse/SPARK-18127 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Srinath >Assignee: Sameer Agarwal > Fix For: 2.2.0 > > > As a Spark user I want to be able to customize my spark session. I currently > want to be able to do the following things: > # I want to be able to add custom analyzer rules. This allows me to implement > my own logical constructs; an example of this could be a recursive operator. > # I want to be able to add my own analysis checks. This allows me to catch > problems with spark plans early on. An example of this can be some datasource > specific checks. > # I want to be able to add my own optimizations. This allows me to optimize > plans in different ways, for instance when you use a very different cluster > (for example a one-node X1 instance). This supersedes the current > {{spark.experimental}} methods > # I want to be able to add my own planning strategies. This supersedes the > current {{spark.experimental}} methods. This allows me to plan my own > physical plan; an example of this would be to plan my own heavily integrated > data source (CarbonData for example). > # I want to be able to use my own customized SQL constructs. An example of > this would be supporting my own dialect, or being able to add constructs to the > current SQL language. I should not have to implement a complete parser, and > should be able to delegate to an underlying parser. > # I want to be able to track modifications and calls to the external catalog. > I want this API to be stable. This allows me to synchronize with other > systems. 
> This API should modify the SparkSession when the session gets started, and it > should NOT change the session in flight.
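The requirements listed above amount to a set of injection points that are fixed when the session starts. A toy model of that shape (hypothetical names only; not the actual Spark API that eventually implemented this issue) could look like:

```python
class SessionExtensions:
    """Toy model of session extension points: users inject rules before the
    session starts; the session then applies them, in order, to each plan."""

    def __init__(self):
        self.analyzer_rules = []   # plan -> plan rewrites
        self.checks = []           # callables that raise on invalid plans
        self.optimizer_rules = []  # plan -> plan rewrites

    def inject_analyzer_rule(self, rule):
        self.analyzer_rules.append(rule)

    def inject_check(self, check):
        self.checks.append(check)

    def inject_optimizer_rule(self, rule):
        self.optimizer_rules.append(rule)

    def process_plan(self, plan):
        # The hook lists are fixed once the session is running, matching
        # the "should NOT change the session in flight" requirement.
        for rule in self.analyzer_rules:
            plan = rule(plan)
        for check in self.checks:
            check(plan)
        for rule in self.optimizer_rules:
            plan = rule(plan)
        return plan

# A plan is modeled as a list of operator names; a custom optimizer rule
# strips hypothetical "noop" operators:
ext = SessionExtensions()
ext.inject_optimizer_rule(lambda ops: [o for o in ops if o != "noop"])
plan = ext.process_plan(["filter", "noop", "scan"])
```

Custom analyzer rules, checks, and planner strategies from the wish list each map onto one such injection point.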
[jira] [Resolved] (SPARK-18127) Add hooks and extension points to Spark
[ https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18127. - Resolution: Fixed
[jira] [Updated] (SPARK-18127) Add hooks and extension points to Spark
[ https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18127: Fix Version/s: 2.2.0
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
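The wish list above boils down to a registry of hooks that is populated once, applied when the session is built, and never mutated in flight. A minimal, hypothetical Python sketch of that pattern follows; this is not Spark's actual API, and the class and method names (`Extensions`, `SessionBuilder`, `inject_optimizer_rule`) are invented for illustration:

```python
class Extensions:
    """Collects user-supplied hooks before the session is created."""
    def __init__(self):
        self.analyzer_rules = []
        self.optimizer_rules = []

    def inject_analyzer_rule(self, rule):
        self.analyzer_rules.append(rule)

    def inject_optimizer_rule(self, rule):
        self.optimizer_rules.append(rule)


class SessionBuilder:
    def __init__(self):
        self._extensions = Extensions()

    def with_extensions(self, configure):
        # 'configure' receives the registry and registers its hooks
        configure(self._extensions)
        return self

    def build(self):
        # Hooks are frozen at build time: the session that comes out
        # cannot be changed in flight, matching the requirement above.
        return {
            "analyzer_rules": tuple(self._extensions.analyzer_rules),
            "optimizer_rules": tuple(self._extensions.optimizer_rules),
        }


session = (SessionBuilder()
           .with_extensions(lambda e: e.inject_optimizer_rule("PushDownMyFilter"))
           .build())
print(session["optimizer_rules"])  # ('PushDownMyFilter',)
```

The key design choice the ticket asks for is visible in `build()`: the registered hooks are copied into an immutable structure, so customization happens only at session construction.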
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983853#comment-15983853 ] Ismael Juma commented on SPARK-18057: - It's worth noting that no one is working on that ticket at the moment, so a fix may take some time. And even if it lands soon, it's likely to be in 0.11.0.0 first (0.10.2.1 is being voted on and will be out very soon). > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here: > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983832#comment-15983832 ] Helena Edelson commented on SPARK-18057: It is the timeout. I think waiting is better; I will be watching that ticket in Kafka.
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983820#comment-15983820 ] Michael Armbrust commented on SPARK-18057: -- I guess I'd like to understand more about what problems people are running into with the current version. Are there more pressing issues than hanging when topics are deleted?
[jira] [Resolved] (SPARK-20130) Flaky test: BlockManagerProactiveReplicationSuite
[ https://issues.apache.org/jira/browse/SPARK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20130. Resolution: Cannot Reproduce Seems a lot more stable now, so closing this until it becomes flaky again. > Flaky test: BlockManagerProactiveReplicationSuite > - > > Key: SPARK-20130 > URL: https://issues.apache.org/jira/browse/SPARK-20130 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > See the following page: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.storage.BlockManagerProactiveReplicationSuite&test_name=proactive+block+replication+-+5+replicas+-+4+block+manager+deletions > I have also seen it fail intermittently during local unit test runs.
[jira] [Assigned] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated
[ https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20421: Assignee: (was: Apache Spark) > Mark JobProgressListener (and related classes) as deprecated > > > Key: SPARK-20421 > URL: https://issues.apache.org/jira/browse/SPARK-20421 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > This class (and others) were made {{@DeveloperApi}} as part of https://github.com/apache/spark/pull/648. But as part of the work in SPARK-18085, I plan to get rid of a lot of that code, so we should mark these as deprecated in case anyone is still trying to use them.
[jira] [Assigned] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated
[ https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20421: Assignee: Apache Spark
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983809#comment-15983809 ] Shixiong Zhu commented on SPARK-18057: -- I prefer to just wait. The user can still use Kafka 0.10.2.0 with the current Spark Kafka source in their application. The APIs are compatible. Commenting tests out means we cannot prevent future changes from breaking them.
[jira] [Commented] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated
[ https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983808#comment-15983808 ] Apache Spark commented on SPARK-20421: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/17766
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983796#comment-15983796 ] Helena Edelson commented on SPARK-18057: I have a branch off branch-2.2 with the 0.10.2.0 upgrade and changes done. All the delete-topic-related tests fail (mainly just in streaming Kafka SQL). I can open a PR with those few tests commented out, but that doesn't sound right. Or should I wait to PR?
[jira] [Assigned] (SPARK-20464) Add a job group and an informative description for streaming queries
[ https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20464: Assignee: Apache Spark > Add a job group and an informative description for streaming queries > > > Key: SPARK-20464 > URL: https://issues.apache.org/jira/browse/SPARK-20464 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kunal Khamar >Assignee: Apache Spark
[jira] [Assigned] (SPARK-20464) Add a job group and an informative description for streaming queries
[ https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20464: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20464) Add a job group and an informative description for streaming queries
[ https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983751#comment-15983751 ] Apache Spark commented on SPARK-20464: -- User 'kunalkhamar' has created a pull request for this issue: https://github.com/apache/spark/pull/17765
[jira] [Updated] (SPARK-20464) Add a job group and an informative description for streaming queries
[ https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khamar updated SPARK-20464: - Summary: Add a job group and an informative description for streaming queries (was: Add a job group and an informative job description for streaming queries)
[jira] [Updated] (SPARK-20239) Improve HistoryServer ACL mechanism
[ https://issues.apache.org/jira/browse/SPARK-20239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-20239: --- Fix Version/s: 2.1.2 2.0.3 > Improve HistoryServer ACL mechanism > --- > > Key: SPARK-20239 > URL: https://issues.apache.org/jira/browse/SPARK-20239 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.0.3, 2.1.2, 2.2.0 > > > The current SHS (Spark History Server) has two different ACLs: > * ACL of the base URL. It is controlled by "spark.acls.enabled" or "spark.ui.acls.enabled". With this enabled, only users configured in "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started the SHS, can list all the applications; otherwise none can be listed. This also affects the REST APIs that list the summary of all apps and of one app. > * Per-application ACL. This is controlled by "spark.history.ui.acls.enabled". With this enabled, only the history admin user and the user/group who ran an app can access that app's details. > With these two ACLs, we may encounter several unexpected behaviors: > 1. If the base URL's ACL is enabled but user "A" has no view permission, user "A" cannot see the app list but can still access the details of their own app. > 2. If the base URL's ACL is disabled, user "A" can see the summary of all the apps, even ones not run by user "A", but cannot access their details. > 3. The history admin ACL has no permission to list all apps if the admin user is not added to the base URL's ACL. > These unexpected behaviors arise mainly because we have two different ACLs; ideally we should have only one that manages everything. > So to improve the SHS's ACL mechanism, we should: > * Unify the two ACLs into one, and always honor it (both for the base URL and for app details). > * Let a user partially list and display the apps they ran, according to the ACLs in the event log.
[jira] [Created] (SPARK-20464) Add a job group and an informative job description for streaming queries
Kunal Khamar created SPARK-20464: Summary: Add a job group and an informative job description for streaming queries Key: SPARK-20464 URL: https://issues.apache.org/jira/browse/SPARK-20464 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Kunal Khamar
[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20463: Assignee: Apache Spark > Expose SPARK SQL <=> operator in PySpark > > > Key: SPARK-20463 > URL: https://issues.apache.org/jira/browse/SPARK-20463 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Michael Styles >Assignee: Apache Spark > > Expose the Spark SQL '<=>' operator in PySpark as a column function called *isNotDistinctFrom*. For example: > {panel} > {noformat} > data = [(10, 20), (30, 30), (40, None), (None, None)] > df2 = sc.parallelize(data).toDF("c1", "c2") > df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect() > [Row(c1=30, c2=30), Row(c1=None, c2=None)] > {noformat} > {panel}
[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20463: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20463: Assignee: Apache Spark
[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20463: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983654#comment-15983654 ] Apache Spark commented on SPARK-20463: -- User 'ptkool' has created a pull request for this issue: https://github.com/apache/spark/pull/17764
[jira] [Created] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark
Michael Styles created SPARK-20463: -- Summary: Expose SPARK SQL <=> operator in PySpark Key: SPARK-20463 URL: https://issues.apache.org/jira/browse/SPARK-20463 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.1.0 Reporter: Michael Styles Expose the Spark SQL '<=>' operator in PySpark as a column function called *isNotDistinctFrom*. For example: {panel} {noformat} data = [(10, 20), (30, 30), (40, None), (None, None)] df2 = sc.parallelize(data).toDF("c1", "c2") df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect() [Row(c1=30, c2=30), Row(c1=None, c2=None)] {noformat} {panel}
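For readers unfamiliar with `<=>`, its null-safe equality semantics (SQL's IS NOT DISTINCT FROM) can be sketched in plain Python, independent of PySpark. `null_safe_eq` below is a hypothetical helper written only to illustrate the semantics, not part of any Spark API:

```python
def null_safe_eq(a, b):
    """SQL '<=>' (IS NOT DISTINCT FROM) semantics:
    NULL <=> NULL is true, NULL <=> non-NULL is false,
    otherwise ordinary equality. None stands in for SQL NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b


# Same rows as the ticket's example DataFrame
data = [(10, 20), (30, 30), (40, None), (None, None)]

# Filtering with null-safe equality keeps the (None, None) row,
# which plain '=' would drop because NULL = NULL evaluates to NULL.
kept = [row for row in data if null_safe_eq(*row)]
print(kept)  # [(30, 30), (None, None)]
```

This is exactly why the ticket's example returns both `Row(c1=30, c2=30)` and `Row(c1=None, c2=None)`.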
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983615#comment-15983615 ] Shixiong Zhu commented on SPARK-13747: -- [~dnaumenko] Unfortunately, Spark uses ThreadLocal variables a lot, but ForkJoinPool doesn't support that very well (it's easy to leak ThreadLocal variables to other tasks). Could you check if https://github.com/apache/spark/pull/17763 can fix your issue? > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Running the following code may fail: > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). So when SparkContext.runJob is suspended, ForkJoinPool will run another task in the same thread; however, the local properties have been polluted.
[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13747: Assignee: Apache Spark (was: Shixiong Zhu)
[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13747: Assignee: Shixiong Zhu (was: Apache Spark)
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983612#comment-15983612 ] Apache Spark commented on SPARK-13747: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17763
[jira] [Updated] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13747: - Fix Version/s: (was: 2.2.0)
[jira] [Reopened] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reopened SPARK-13747: --
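The thread-local pollution described in SPARK-13747 is not Spark-specific. A minimal Python sketch shows the same failure mode: state set by one task leaks into the next task scheduled on the same worker thread. This is an illustration only; `execution_id` is an invented stand-in for Spark's local properties, and a single worker thread is used so the thread reuse is deterministic:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-thread storage, analogous to the ThreadLocal-backed
# local properties that get polluted in the Spark bug.
local = threading.local()

def run_job(job_id):
    # A well-behaved task would clear this before returning; if it
    # does not, the next task reusing this worker thread observes
    # stale state from its predecessor.
    leaked = getattr(local, "execution_id", None)
    local.execution_id = job_id
    return leaked

# One worker thread guarantees every task lands on the same thread,
# mimicking ForkJoinPool running another task in a suspended thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    observed = list(pool.map(run_job, range(3)))

print(observed)  # [None, 0, 1] -- jobs 1 and 2 see a predecessor's id
```

The first job sees a clean slate; every later job on that thread inherits leftover state, which is the same mechanism behind the "spark.sql.execution.id is already set" error above.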
[jira] [Created] (SPARK-20462) Spark-Kinesis Direct Connector
Lauren Moos created SPARK-20462: --- Summary: Spark-Kinesis Direct Connector Key: SPARK-20462 URL: https://issues.apache.org/jira/browse/SPARK-20462 Project: Spark Issue Type: New Feature Components: Input/Output Affects Versions: 2.1.0 Reporter: Lauren Moos I'd like to propose and vet the design for a direct connector between Spark and Kinesis.
[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983590#comment-15983590 ] Michael Patterson commented on SPARK-20456: --- I saw that there are short docstrings for the aggregate functions, but I think they can be unclear for people new to Spark or relational algebra. For example, some of my coworkers didn't know you could call `df.agg(mean(col))` without doing a `groupby`. There are also no links to `groupby` in any of the aggregate functions. I also didn't know about `collect_set` for a long time. I think adding examples would help with visibility and understanding. The same thing applies to `lit`. It took me a while to learn what it did. For the datetime stuff, for example this line has a column named 'd': https://github.com/map222/spark/blob/master/python/pyspark/sql/functions.py#L926 I think it would be more informative to name it 'date' or 'time'. Do these sound reasonable? > Document major aggregation functions for pyspark > > > Key: SPARK-20456 > URL: https://issues.apache.org/jira/browse/SPARK-20456 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Priority: Minor > > Document `sql.functions.py`: > 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, > `collect_set`, `collect_list`, `stddev`, `variance`) > 2. Rename columns in datetime examples. > 3. Add examples for `unix_timestamp` and `from_unixtime` > 4. Add note to all trigonometry functions that units are radians. > 5. Document `lit`
[jira] [Commented] (SPARK-9103) Tracking spark's memory usage
[ https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983531#comment-15983531 ] Apache Spark commented on SPARK-9103: - User 'jsoltren' has created a pull request for this issue: https://github.com/apache/spark/pull/17762 > Tracking spark's memory usage > - > > Key: SPARK-9103 > URL: https://issues.apache.org/jira/browse/SPARK-9103 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Reporter: Zhang, Liye > Attachments: Tracking Spark Memory Usage - Phase 1.pdf > > > Currently Spark only provides little memory usage information (RDD cache on > the webUI) for the executors. Users have no idea of the memory consumption > when they are running Spark applications with a lot of memory used in Spark > executors. Especially when they encounter an OOM, it's really hard to know > what the cause of the problem is. So it would be helpful to give detailed > memory consumption information for each part of Spark, so that users can > have a clear picture of where the memory is actually used. > The memory usage info to expose should include, but not be limited to, shuffle, > cache, network, serializer, etc. > Users can optionally choose to enable this functionality since it is mainly > for debugging and tuning.
[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983490#comment-15983490 ] Kazuaki Ishizaki commented on SPARK-20392: -- Here are my observations: According to [Bucketizer.transform()|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L104-L122] * A {{Projection}} is generated by calling {{dataset.withColumn()}} at line 121. Since {{Bucketizer.transform}} is called multiple times in the pipeline, deeply-nested projections are generated in the plan. * {{bucketizer}} at line 119 is implemented as a {{UDF}}. Calling a {{UDF}} incurs some serialization/deserialization overhead. * If the number of fields is less than 101 (i.e. {{"spark.sql.codegen.maxFields"}}), whole-stage codegen is enabled. For fewCols, it is enabled. However, in the original case, code generation is not enabled for the nested projection. Is there a better approach to using {{Bucketizer}} effectively in this case? For example, can we reduce the number of calls of {{Bucketizer}}? cc:[~mlnick] > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much too long for such a tiny dataset. 
Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. > Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 
059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 > 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53
[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983440#comment-15983440 ] Xiao Li commented on SPARK-20427: - cc [~tsuresh] Are you interested in this? > Issue with Spark interpreting Oracle datatype NUMBER > > > Key: SPARK-20427 > URL: https://issues.apache.org/jira/browse/SPARK-20427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Alexander Andrushenko > > Oracle has a data type NUMBER. When defining a field of type NUMBER in a > table, the field has two components, precision and scale. > For example, NUMBER(p,s) has precision p and scale s. > Precision can range from 1 to 38. > Scale can range from -84 to 127. > When reading such a field, Spark can create numbers with precision exceeding > 38. In our case it has created fields with precision 44, > calculated as the sum of the precision (in our case 34 digits) and the scale (10): > "...java.lang.IllegalArgumentException: requirement failed: Decimal precision > 44 exceeds max precision 38...". > The result was that a data frame read from a table in one schema > could not be inserted into the identical table in another schema.
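The precision-44 arithmetic above can be checked with plain {{java.math.BigDecimal}}: a value with 34 digits before the decimal point and scale 10 carries 44 digits in its unscaled value, which exceeds the 38-digit maximum that Spark's Decimal enforces. A minimal sketch (the literal below is a made-up value, not from the reporter's data):

```java
import java.math.BigDecimal;

public class DecimalPrecisionDemo {
    // Spark's Decimal supports at most 38 digits of precision.
    static final int SPARK_MAX_PRECISION = 38;

    public static void main(String[] args) {
        // 34 digits before the decimal point plus scale 10
        // -> 44 digits in the unscaled value.
        BigDecimal v = new BigDecimal(
            "1234567890123456789012345678901234.0123456789");
        System.out.println(v.precision()); // 44
        System.out.println(v.scale());     // 10
        System.out.println(v.precision() <= SPARK_MAX_PRECISION); // false
    }
}
```

This is why such a value trips the "Decimal precision 44 exceeds max precision 38" requirement when Spark maps the Oracle NUMBER to its own decimal type.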
[jira] [Updated] (SPARK-20459) JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException
[ https://issues.apache.org/jira/browse/SPARK-20459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20459: Target Version/s: 2.2.0 > JdbcUtils throws IllegalStateException: Cause already initialized after > getting SQLException > > > Key: SPARK-20459 > URL: https://issues.apache.org/jira/browse/SPARK-20459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Jessie Yu >Priority: Minor > > Testing some failure scenarios, and JdbcUtils throws an IllegalStateException > instead of the expected SQLException: > {code} > scala> > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(prodtbl,url3,"DB2.D_ITEM_INFO",prop1) > > 17/04/03 17:19:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > > java.lang.IllegalStateException: Cause already initialized > > .at java.lang.Throwable.setCause(Throwable.java:365) > > .at java.lang.Throwable.initCause(Throwable.java:341) > > .at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:241) > .at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300) > .at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299) > .at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > .at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > .at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > > .at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > > .at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > > .at org.apache.spark.scheduler.Task.run(Task.scala:86) > > .at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > .at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153 > .at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628 > .at java.lang.Thread.run(Thread.java:785) > > {code} > The code in JdbcUtils.savePartition has > {code} > } catch { > case e: SQLException => > val cause = e.getNextException > if (cause != null && e.getCause != cause) { > if (e.getCause == null) { > e.initCause(cause) > } else { > e.addSuppressed(cause) > } > } > {code} > According to Throwable Java doc, {{initCause()}} throws an > {{IllegalStateException}} "if this throwable was created with > Throwable(Throwable) or Throwable(String,Throwable), or this method has > already been called on this throwable". The code does check whether {{cause}} > is {{null}} before initializing it. However, {{getCause()}} "returns the > cause of this throwable or null if the cause is nonexistent or unknown." In > other words, {{null}} is returned if {{cause}} already exists (which would > result in {{IllegalStateException}}) but is unknown.
[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983396#comment-15983396 ] Shixiong Zhu edited comment on SPARK-18057 at 4/25/17 6:29 PM: --- [~guozhang] We have a stress test to test Spark Kafka connector for various cases, and it will try to frequently delete / re-create topics. Topic deletion may not be common in production, but it will just hang forever and waste resources. It requires users to set up some monitor scripts to restart a job if it doesn't make any progress for a long time. was (Author: zsxwing): [~guozhang] We have a stress test to test Spark Kafka connector for various cases, and it will try to frequently delete / re-create topics. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983396#comment-15983396 ] Shixiong Zhu commented on SPARK-18057: -- [~guozhang] We have a stress test to test Spark Kafka connector for various cases, and it will try to frequently delete / re-create topics. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Assigned] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted
[ https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20461: Assignee: (was: Apache Spark) > CachedKafkaConsumer may hang forever when it's interrupted > -- > > Key: SPARK-20461 > URL: https://issues.apache.org/jira/browse/SPARK-20461 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > CachedKafkaConsumer may hang forever when it's interrupted because of > KAFKA-1894
[jira] [Assigned] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted
[ https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20461: Assignee: Apache Spark > CachedKafkaConsumer may hang forever when it's interrupted > -- > > Key: SPARK-20461 > URL: https://issues.apache.org/jira/browse/SPARK-20461 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > CachedKafkaConsumer may hang forever when it's interrupted because of > KAFKA-1894
[jira] [Resolved] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-5484. - Resolution: Fixed Assignee: dingding (was: Ankur Dave) Fix Version/s: 2.3.0 2.2.0 Target Version/s: 2.2.0, 2.3.0 > Pregel should checkpoint periodically to avoid StackOverflowError > - > > Key: SPARK-5484 > URL: https://issues.apache.org/jira/browse/SPARK-5484 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Ankur Dave >Assignee: dingding > Fix For: 2.2.0, 2.3.0 > > > Pregel-based iterative algorithms with more than ~50 iterations begin to slow > down and eventually fail with a StackOverflowError due to Spark's lack of > support for long lineage chains. Instead, Pregel should checkpoint the graph > periodically.
[jira] [Commented] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted
[ https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983387#comment-15983387 ] Apache Spark commented on SPARK-20461: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17761 > CachedKafkaConsumer may hang forever when it's interrupted > -- > > Key: SPARK-20461 > URL: https://issues.apache.org/jira/browse/SPARK-20461 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > CachedKafkaConsumer may hang forever when it's interrupted because of > KAFKA-1894
[jira] [Created] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted
Shixiong Zhu created SPARK-20461: Summary: CachedKafkaConsumer may hang forever when it's interrupted Key: SPARK-20461 URL: https://issues.apache.org/jira/browse/SPARK-20461 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Shixiong Zhu CachedKafkaConsumer may hang forever when it's interrupted because of KAFKA-1894
[jira] [Commented] (SPARK-20447) spark mesos scheduler suppress call
[ https://issues.apache.org/jira/browse/SPARK-20447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983314#comment-15983314 ] Michael Gummelt commented on SPARK-20447: - The scheduler doesn't support suppression, no, but it does reject offers for 120s: https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L375, and this is configurable. With Mesos' 1s offer cycle, this should allow offers to be sent to 119 other frameworks before being re-offered to Spark. Is this not sufficient? > spark mesos scheduler suppress call > --- > > Key: SPARK-20447 > URL: https://issues.apache.org/jira/browse/SPARK-20447 > Project: Spark > Issue Type: Wish > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Pavel Plotnikov >Priority: Minor > > The Spark Mesos scheduler never sends the suppress call to Mesos to exclude > the application from the Mesos batch allocation process (HierarchicalDRF allocator) > when the Spark application is idle and there are no tasks in the queue. As a > result, the application receives a 0% cluster share because of dynamic > resource allocation, while other applications that need additional resources > can't receive an offer because they have a bigger cluster share, > significantly more than 0%. > About the suppress call: > http://mesos.apache.org/documentation/latest/app-framework-development-guide/
[jira] [Resolved] (SPARK-20449) Upgrade breeze version to 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-20449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-20449. - Resolution: Fixed Fix Version/s: 2.2.0 3.0.0 Issue resolved by pull request 17746 [https://github.com/apache/spark/pull/17746] > Upgrade breeze version to 0.13.1 > > > Key: SPARK-20449 > URL: https://issues.apache.org/jira/browse/SPARK-20449 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 3.0.0, 2.2.0 > > > Upgrade breeze version to 0.13.1, which fixed some critical bugs in L-BFGS-B.
[jira] [Commented] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables
[ https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983224#comment-15983224 ] Apache Spark commented on SPARK-20439: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17760 > Catalog.listTables() depends on all libraries used to create tables > --- > > Key: SPARK-20439 > URL: https://issues.apache.org/jira/browse/SPARK-20439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0, 2.3.0 > > > With spark.catalog.listTables() and getTable, > you may get an error on the table serde library: > java.lang.RuntimeException: java.lang.ClassNotFoundException: > com.amazon.emr.kinesis.hive.KinesisHiveInputFormat > Or if the database contains any table (e.g., index) with a table type that is > not accessible by Spark SQL, it will fail the whole listTables API.
[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
[ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983094#comment-15983094 ] Furcy Pin commented on SPARK-13802: --- Hi, I ran into similar issues and found this Jira, so I would like to add some grist to the mill. I ran this in the pyspark shell (v2.1.0): {code} >>> from pyspark.sql.types import * >>> from pyspark.sql import Row >>> >>> rdd = spark.sparkContext.parallelize(range(1, 4)) >>> >>> schema = StructType([StructField('a', IntegerType()), StructField('b', >>> StringType())]) >>> spark.createDataFrame(rdd.map(lambda r: Row(a=1, b=None)), schema).collect() [Row(a=1, b=None), Row(a=1, b=None), Row(a=1, b=None)] >>> >>> schema = StructType([StructField('b', IntegerType()), StructField('a', >>> StringType())]) >>> spark.createDataFrame(rdd.map(lambda r: Row(b=1, a=None)), schema).collect() [Row(b=1, a=None), Row(b=1, a=None), Row(b=1, a=None)] {code} When applying a schema, it seems that the Row's field names are correctly matched and reordered according to the schema, which is quite nice, even though, when creating a Row alone, the fields are ordered differently: {code} >>> Row(b=1, a=None) Row(a=None, b=1) {code} However, I get inconsistent behavior when I start using structs: {code} >>> schema = StructType([StructField('a', IntegerType()), StructField('b', >>> StructType([StructField('c', StringType())]))]) >>> spark.createDataFrame(rdd.map(lambda r: Row(a=1, b=None)), schema).collect() [Row(a=1, b=None), Row(a=1, b=None), Row(a=1, b=None)] >>> >>> schema = StructType([StructField('b', IntegerType()), StructField('a', >>> StructType([StructField('c', StringType())]))]) >>> spark.createDataFrame(rdd.map(lambda r: Row(b=1, a=None)), schema).collect() ... 
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "spark/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main process() File "spark/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process serializer.dump_stream(func(split_index, iterator), outfile) File "spark/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream vs = list(itertools.islice(iterator, batch)) File "spark/python/pyspark/sql/types.py", line 576, in toInternal return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) File "spark/python/pyspark/sql/types.py", line 576, in return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) File "spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 436, in toInternal return self.dataType.toInternal(obj) File "spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 593, in toInternal raise ValueError("Unexpected tuple %r with StructType" % obj) ValueError: Unexpected tuple 1 with StructType {code} So it seems to me that pyspark can match a Row's field names with a schema, but only when no struct is involved. This doesn't seem very consistent so I believe that it should be considered a bug. > Fields order in Row(**kwargs) is not consistent with Schema.toInternal method > - > > Key: SPARK-13802 > URL: https://issues.apache.org/jira/browse/SPARK-13802 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Szymon Matejczyk > > When using Row constructor from kwargs, fields in the tuple underneath are > sorted by name. When Schema is reading the row, it is not using the fields in > this order. 
> {code} > from pyspark.sql import Row > from pyspark.sql.types import * > schema = StructType([ > StructField("id", StringType()), > StructField("first_name", StringType())]) > row = Row(id="39", first_name="Szymon") > schema.toInternal(row) > Out[5]: ('Szymon', '39') > {code} > {code} > df = sqlContext.createDataFrame([row], schema) > df.show(1) > +--+--+ > |id|first_name| > +--+--+ > |Szymon|39| > +--+--+ > {code}
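The mechanism behind the swap above can be seen without pyspark: {{Row(**kwargs)}} stores its fields sorted by name, while {{StructType.toInternal}} zips schema fields with the tuple by position. A minimal sketch of that logic (not pyspark code; the collections and names below are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class RowOrderDemo {
    public static void main(String[] args) {
        // Row(id="39", first_name="Szymon"): the kwargs end up
        // sorted by field name, as pyspark's Row sorts them.
        SortedMap<String, String> row = new TreeMap<>();
        row.put("id", "39");
        row.put("first_name", "Szymon");
        // Values now come out in name order: first_name, id.
        List<String> values = new ArrayList<>(row.values());

        // toInternal zips schema fields with the tuple positionally.
        List<String> schemaFields = Arrays.asList("id", "first_name");
        for (int i = 0; i < schemaFields.size(); i++) {
            System.out.println(schemaFields.get(i) + " -> " + values.get(i));
        }
        // prints: id -> Szymon, then first_name -> 39 (swapped, as reported)
    }
}
```

Because the schema order and the sorted-kwargs order disagree, the positional zip pairs each field with the wrong value, which is exactly the ('Szymon', '39') result in the issue.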
[jira] [Commented] (SPARK-20459) JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException
[ https://issues.apache.org/jira/browse/SPARK-20459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983092#comment-15983092 ] Sean Owen commented on SPARK-20459: --- Ugh, so there's no actual way to detect whether the exception has no cause initialized? I guess it's OK here to catch the IllegalStateException and resort to addSuppressed if it fails. Go ahead. > JdbcUtils throws IllegalStateException: Cause already initialized after > getting SQLException > > > Key: SPARK-20459 > URL: https://issues.apache.org/jira/browse/SPARK-20459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Jessie Yu >Priority: Minor > > Testing some failure scenarios, and JdbcUtils throws an IllegalStateException > instead of the expected SQLException: > {code} > scala> > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(prodtbl,url3,"DB2.D_ITEM_INFO",prop1) > > 17/04/03 17:19:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > > java.lang.IllegalStateException: Cause already initialized > > .at java.lang.Throwable.setCause(Throwable.java:365) > > .at java.lang.Throwable.initCause(Throwable.java:341) > > .at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:241) > .at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300) > .at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299) > .at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > .at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > .at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > > .at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > > .at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
> > .at org.apache.spark.scheduler.Task.run(Task.scala:86) > > .at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > .at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153 > .at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628 > .at java.lang.Thread.run(Thread.java:785) > > {code} > The code in JdbcUtils.savePartition has > {code} > } catch { > case e: SQLException => > val cause = e.getNextException > if (cause != null && e.getCause != cause) { > if (e.getCause == null) { > e.initCause(cause) > } else { > e.addSuppressed(cause) > } > } > {code} > According to Throwable Java doc, {{initCause()}} throws an > {{IllegalStateException}} "if this throwable was created with > Throwable(Throwable) or Throwable(String,Throwable), or this method has > already been called on this throwable". The code does check whether {{cause}} > is {{null}} before initializing it. However, {{getCause()}} "returns the > cause of this throwable or null if the cause is nonexistent or unknown." In > other words, {{null}} is returned if {{cause}} already exists (which would > result in {{IllegalStateException}}) but is unknown.
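The {{Throwable}} behavior under discussion can be reproduced in isolation: a throwable constructed via {{Throwable(String, Throwable)}} with an explicit {{null}} cause reports {{getCause() == null}}, yet {{initCause}} still throws {{IllegalStateException}} because the cause slot was already written by the constructor. A minimal demo, with the catch-and-{{addSuppressed}} fallback suggested above sketched outside of Spark:

```java
public class InitCauseDemo {
    public static void main(String[] args) {
        // Constructed via Throwable(String, Throwable) with an explicit
        // null cause: the cause slot is written, but getCause() reports null.
        Throwable e = new Throwable("outer", null);
        System.out.println(e.getCause()); // null - so the null check passes

        Throwable cause = new RuntimeException("real cause");
        try {
            // Throws IllegalStateException even though getCause() == null.
            e.initCause(cause);
        } catch (IllegalStateException ise) {
            // Fallback: attach the cause as a suppressed exception instead.
            e.addSuppressed(cause);
        }
        System.out.println(e.getSuppressed().length); // 1
    }
}
```

This shows why the {{e.getCause() == null}} check in JdbcUtils is not sufficient on its own: the null result cannot distinguish "never initialized" from "initialized to null by a constructor".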
[jira] [Assigned] (SPARK-20460) Make it more consistent to handle column name duplication
[ https://issues.apache.org/jira/browse/SPARK-20460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20460:

Assignee: (was: Apache Spark)

> Make it more consistent to handle column name duplication
> ---------------------------------------------------------
>
> Key: SPARK-20460
> URL: https://issues.apache.org/jira/browse/SPARK-20460
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Takeshi Yamamuro
> Priority: Trivial
>
> In the current master, error handling is different when hitting column name duplication.
> {code}
> // json
> scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
> scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
> scala> spark.read.format("json").schema(schema).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.;
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
> scala> spark.read.format("json").load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format;
>   at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
>   at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
>   at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
> // csv
> scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
> scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
> scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#41, a#42.;
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
> // If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
> scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
> +---+---+
> | a0| a1|
> +---+---+
> |  1|  1|
> +---+---+
> // parquet
> scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
> scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
> scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#110, a#111.;
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> {code}
> To make the error reason clearer, IMO we had better handle column name duplication more consistently.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
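The ticket asks that every data source fail the same way on duplicate column names. As a minimal sketch of such a single shared check (plain Python; the function name, signature, and message format are assumptions for illustration, not Spark's actual implementation):

```python
def check_duplicate_columns(names, case_sensitive=False):
    """Raise ValueError listing every duplicated column name.

    Illustrative only: one shared check that any reader/writer could
    call, so JSON, CSV, and Parquet would all report duplicates with
    the same error instead of each failing differently.
    """
    seen = {}
    for name in names:
        key = name if case_sensitive else name.lower()
        seen.setdefault(key, []).append(name)
    dups = sorted(v[0] for v in seen.values() if len(v) > 1)
    if dups:
        raise ValueError("Duplicate column(s) found: %s" % ", ".join(dups))

check_duplicate_columns(["a", "b"])  # unique names: passes silently
try:
    check_duplicate_columns(["a", "A", "b"])  # case-insensitive clash
except ValueError as e:
    print(e)  # Duplicate column(s) found: a
```

A case-sensitivity flag is included because Spark SQL resolution can be case-insensitive, so a consistent check would likely need to honor that setting too.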
[jira] [Assigned] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11968:

Assignee: Nick Pentreath (was: Apache Spark)

> ALS recommend all methods spend most of time in GC
> --------------------------------------------------
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Joseph K. Bradley
> Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS in spark-perf, I noticed that it takes much longer than ALS itself. Looking at the monitoring page, I can see it is spending about 8min doing GC for each 10min task. That sounds fixable. Looking at the implementation, there is clearly an opportunity to avoid extra allocations:
> https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283
> CC: [~mengxr]
[jira] [Assigned] (SPARK-20460) Make it more consistent to handle column name duplication
[ https://issues.apache.org/jira/browse/SPARK-20460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20460: Assignee: Apache Spark
[jira] [Assigned] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11968: Assignee: Apache Spark (was: Nick Pentreath)
[jira] [Commented] (SPARK-20460) Make it more consistent to handle column name duplication
[ https://issues.apache.org/jira/browse/SPARK-20460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983079#comment-15983079 ] Apache Spark commented on SPARK-20460: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/17758
[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983074#comment-15983074 ] Peng Meng commented on SPARK-11968: --- Thanks [~mlnick], I will post more results here. My latest result: after changing the PriorityQueue to a BoundedPriorityQueue, there is about a 30% improvement. I will update the PR and the results.
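The bounded-priority-queue idea behind that 30% improvement can be sketched in plain Python (illustrative only, not Spark's Scala implementation): keep a heap that never grows past k entries, so per-candidate garbage stays proportional to k rather than to the total number of scored candidates.

```python
import heapq

def recommend_top_k(scores, k):
    """Keep only the k best (score, item) pairs with a size-bounded heap,
    instead of materializing and sorting every candidate's score."""
    heap = []  # min-heap holding the current top-k
    for item, score in scores:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            # evict the current worst of the top-k in O(log k)
            heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)

top = recommend_top_k([("u1", 0.9), ("u2", 0.1), ("u3", 0.7), ("u4", 0.4)], 2)
# top == [(0.9, "u1"), (0.7, "u3")]
```

The same bound on live objects is what a BoundedPriorityQueue gives on the JVM, which is why it reduces GC pressure in the recommend-all code path.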
[jira] [Updated] (SPARK-20460) Make it more consistent to handle column name duplication
[ https://issues.apache.org/jira/browse/SPARK-20460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-20460: Description: In the current master, error handling is different when hitting column name duplication.
[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri
[ https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983069#comment-15983069 ] surya pratap commented on SPARK-20445:

Hello Hyukjin Kwon, thanks for the fast reply. I do not understand your answer about the "current master": what is the current master, and what should we check? Please explain. Note: I am running this code on MapR with Spark version 1.6.1.

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was
> given input with invalid label column label, without the number of classes
> specified. See StringIndexer
> ------------------------------------------------------------------------
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.6.1
> Reporter: surya pratap
>
> #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]), SEPAL_WIDTH=float(p[1]), PETAL_LENGTH=float(p[2]), PETAL_WIDTH=float(p[3]), SPECIES=p[4]))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> """--------------------------------------------------------
> Perform Data Analytics
> --------------------------------------------------------"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
>     if not(isinstance(irisNormDf.select(i).take(1)[0][0], basestring)):
>         print("Correlation to Species for ", i, irisNormDf.stat.corr('IND_SPECIES', i))
> #Transform to a Data Frame for input to Machine Learning
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row):
>     lp = (row["SPECIES"], row["IND_SPECIES"], Vectors.dense([row["SEPAL_LENGTH"], row["SEPAL_WIDTH"], row["PETAL_LENGTH"], row["PETAL_WIDTH"]]))
>     return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp, ["species", "label", "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> """--------------------------------------------------------
> Perform Machine Learning
> --------------------------------------------------------"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label", featuresCol="features")
> dtModel = dtClassifer.fit(trainingData)
>
> issue part:
> dtModel = dtClassifer.fit(trainingData)
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", line 69, in fit
>     return self._fit(dataset)
>   File "/opt/mapr/spark/spark-1.6.1-bin-ha
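The error message above points at the likely cause: `DecisionTreeClassifier` needs the number of classes, which it reads from ML attribute metadata on the label column. The script's `"label"` column is a plain double with no such metadata, while the StringIndexer output (`IND_SPECIES`) does carry it, so passing the indexer's output as `labelCol` is the likely fix. What StringIndexer computes can be sketched in plain Python (illustrative, not the pyspark API):

```python
from collections import Counter

def string_index(labels):
    """Map string labels to doubles 0.0..k-1, most frequent label first,
    and return the class count k. This mirrors what StringIndexer records
    as metadata, which is the piece DecisionTreeClassifier is missing
    when it is given a bare double column."""
    order = [lab for lab, _ in Counter(labels).most_common()]
    mapping = {lab: float(i) for i, lab in enumerate(order)}
    return [mapping[lab] for lab in labels], len(order)

idx, k = string_index(["setosa", "setosa", "virginica"])
# idx == [0.0, 0.0, 1.0], k == 2
```

In the reporter's script that would mean `DecisionTreeClassifier(maxDepth=2, labelCol="IND_SPECIES", featuresCol="features")`, keeping the indexer-produced column (and its metadata) rather than copying its values into a new column.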
[jira] [Issue Comment Deleted] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specifi
[ https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] surya pratap updated SPARK-20445: Comment: was deleted (was: Hello Hyukjin Kwon, Thxz for fast reply. You are using which version and which environment. I am running my code on MapR having spark version 1.6v)
[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983066#comment-15983066 ] Apache Spark commented on SPARK-11968: -- User 'mpjlu' has created a pull request for this issue: https://github.com/apache/spark/pull/17742
[jira] [Created] (SPARK-20460) Make it more consistent to handle column name duplication
Takeshi Yamamuro created SPARK-20460:
Summary: Make it more consistent to handle column name duplication
Key: SPARK-20460
URL: https://issues.apache.org/jira/browse/SPARK-20460
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.1.0
Reporter: Takeshi Yamamuro
Priority: Trivial

In the current master, error handling differs when hitting column name duplication.
{code}
// json
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)

scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format;
  at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
  at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
  at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)

// csv
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#41, a#42.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)

// If `inferSchema` is true, the CSV format is duplicate-safe (see SPARK-16896)
scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
+---+---+
| a0| a1|
+---+---+
|  1|  1|
+---+---+

// parquet
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#110, a#111.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
{code}
To make the error reason clearer, IMO we had better handle column name duplication more consistently.
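The duplicate-safe CSV behavior quoted above (renaming "a", "a" to "a0", "a1" when the schema is inferred) can be sketched in a few lines of plain Python. The helper name and the exact renaming rule are illustrative only, not Spark's internal API:

```python
from collections import Counter

def dedup_columns(names):
    # Rename duplicated column names by appending a running index, in the
    # spirit of Spark's CSV schema inference ("a", "a" -> "a0", "a1");
    # unique names are left untouched.
    counts = Counter(names)
    seen = Counter()
    out = []
    for n in names:
        if counts[n] > 1:
            out.append(f"{n}{seen[n]}")
            seen[n] += 1
        else:
            out.append(n)
    return out

print(dedup_columns(["a", "a", "b"]))  # ['a0', 'a1', 'b']
```

Applying one such rule (or one consistent error) uniformly across the json, csv, and parquet readers is the kind of consistency the ticket asks for.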
[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983059#comment-15983059 ] Peng Meng commented on SPARK-20443: --- Yes, based on my current test, I agree. But if the data size is large, maybe there is benefit to adjust block size. > The blockSize of MLLIB ALS should be setting by the User > - > > Key: SPARK-20443 > URL: https://issues.apache.org/jira/browse/SPARK-20443 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Minor > > The blockSize of MLLIB ALS is very important for ALS performance. > In our test, when the blockSize is 128, the performance is about 4X comparing > with the blockSize is 4096 (default value). > The following are our test results: > BlockSize(recommendationForAll time) > 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM) > The Test Environment: > 3 workers: each work 10 core, each work 30G memory, each work 1 executor. > The Data: User 480,000, and Item 17,000 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983050#comment-15983050 ] Nick Pentreath commented on SPARK-20443: Your PR for SPARK-20446 / SPARK11968 should largely remove the need to adjust the block size? Do you agree? > The blockSize of MLLIB ALS should be setting by the User > - > > Key: SPARK-20443 > URL: https://issues.apache.org/jira/browse/SPARK-20443 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Minor > > The blockSize of MLLIB ALS is very important for ALS performance. > In our test, when the blockSize is 128, the performance is about 4X comparing > with the blockSize is 4096 (default value). > The following are our test results: > BlockSize(recommendationForAll time) > 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM) > The Test Environment: > 3 workers: each work 10 core, each work 30G memory, each work 1 executor. > The Data: User 480,000, and Item 17,000 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983048#comment-15983048 ] Nick Pentreath commented on SPARK-11968: [~peng.m...@intel.com] would you mind posting your comments here about the solution from SPARK-20446 as well as the experiment timings? You can rename your PR to include this JIRA (SPARK-11968) in the title instead, in order to link it. Also please include the timings here of the {{ml DataFrame}} version for comparison. Your approach should also be much faster than the current {{ml}} SparkSQL approach, I think. I just did some quick tests using MovieLens {{latest}} data (~24 million ratings, ~260,000 users, ~39,000 items) and found the following (note these are very rough timings): Using default block sizes: Current {{ml}} master - 262 sec My approach: 58 sec Your PR: 35 sec You're correct that there is still +/- 20-25% GC time overhead using the BLAS 3 + sorting approach. Potentially it could be slightly improved through some form of pre-allocation, but even then it does look like any benefit of BLAS 3 is smaller than the GC cost. > ALS recommend all methods spend most of time in GC > -- > > Key: SPARK-11968 > URL: https://issues.apache.org/jira/browse/SPARK-11968 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.2, 1.6.0 >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath > > After adding recommendUsersForProducts and recommendProductsForUsers to ALS > in spark-perf, I noticed that it takes much longer than ALS itself. Looking > at the monitoring page, I can see it is spending about 8min doing GC for each > 10min task. That sounds fixable. 
Looking at the implementation, there is > clearly an opportunity to avoid extra allocations: > [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283] > CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-20446. -- Resolution: Duplicate > Optimize the process of MLLIB ALS recommendForAll > - > > Key: SPARK-20446 > URL: https://issues.apache.org/jira/browse/SPARK-20446 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > > The recommendForAll of MLLIB ALS is very slow. > GC is a key problem of the current method. > The task use the following code to keep temp result: > val output = new Array[(Int, (Int, Double))](m*n) > m = n = 4096 (default value, no method to set) > so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a large memory and > cause serious GC problem, and it is frequently OOM. > Actually, we don't need to save all the temp result. Suppose we recommend > topK (topK is about 10, or 20) product for each user, we only need 4k * topK > * (4 + 4 + 8) memory to save the temp result. > I have written a solution for this method with the following test result. > The Test Environment: > 3 workers: each work 10 core, each work 30G memory, each work 1 executor. > The Data: User 480,000, and Item 17,000 > BlockSize: 1024 2048 4096 8192 > Old method: 245s 332s 488s OOM > This solution: 121s 118s 117s 120s > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
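The memory argument in the description above (keep 4k * topK entries instead of the full 4k * 4k score array) boils down to a bounded min-heap per user. A minimal Python sketch of that idea, not the actual Scala patch:

```python
import heapq

def topk_recommend(user_block, item_block, k):
    # Score every (user, item) pair, but keep only the k best items per user
    # in a bounded min-heap, so peak memory is O(len(user_block) * k)
    # rather than O(len(user_block) * len(item_block)).
    recs = []
    for u in user_block:
        heap = []  # min-heap of (score, item_id); smallest kept score on top
        for item_id, v in enumerate(item_block):
            score = sum(a * b for a, b in zip(u, v))
            if len(heap) < k:
                heapq.heappush(heap, (score, item_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, item_id))
        recs.append(sorted(heap, reverse=True))  # best score first
    return recs
```

With block sizes around 4096 and topK around 10-20, this shrinks the per-task temporary from hundreds of megabytes to a few megabytes, which is why the GC pressure drops.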
[jira] [Created] (SPARK-20459) JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException
Jessie Yu created SPARK-20459:
Summary: JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException
Key: SPARK-20459
URL: https://issues.apache.org/jira/browse/SPARK-20459
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.0.1
Reporter: Jessie Yu
Priority: Minor

While testing some failure scenarios, JdbcUtils throws an IllegalStateException instead of the expected SQLException:
{code}
scala> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(prodtbl,url3,"DB2.D_ITEM_INFO",prop1)
17/04/03 17:19:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalStateException: Cause already initialized
  at java.lang.Throwable.setCause(Throwable.java:365)
  at java.lang.Throwable.initCause(Throwable.java:341)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:241)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
  at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
  at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.lang.Thread.run(Thread.java:785)
{code}
The code in JdbcUtils.savePartition has
{code}
} catch {
  case e: SQLException =>
    val cause = e.getNextException
    if (cause != null && e.getCause != cause) {
      if (e.getCause == null) {
        e.initCause(cause)
      } else {
        e.addSuppressed(cause)
      }
    }
{code}
According to the Throwable Javadoc, {{initCause()}} throws an {{IllegalStateException}} "if this throwable was created with Throwable(Throwable) or Throwable(String,Throwable), or this method has already been called on this throwable". The code does check whether {{e.getCause}} is {{null}} before initializing it. However, {{getCause()}} "returns the cause of this throwable or null if the cause is nonexistent or unknown." In other words, {{null}} is also returned when a cause was already installed at construction time but is unknown, in which case {{initCause()}} still throws {{IllegalStateException}}.
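The failure mode can be modeled without JDBC at all. The class below is a toy Python model of the relevant part of java.lang.Throwable's contract, not the JDK itself: getCause() returns None both when no cause was ever attached and when a (possibly null) cause was attached at construction, so a getCause() == null check does not prove initCause() is safe to call:

```python
class Throwable:
    # Toy model of java.lang.Throwable's cause contract (illustrative
    # Python, not the JDK). The cause slot starts as a sentinel meaning
    # "never set"; constructing with an explicit cause -- even a null
    # one -- consumes the slot, after which initCause must fail.
    _UNSET = object()

    def __init__(self, cause=_UNSET):
        self._cause = cause

    def get_cause(self):
        # Like getCause(): returns None both when the cause is nonexistent
        # (never set) and when it is unknown (set to null at construction)
        # -- the two cases the JdbcUtils check cannot tell apart.
        if self._cause is Throwable._UNSET or self._cause is None:
            return None
        return self._cause

    def init_cause(self, cause):
        # Like initCause(): legal only if no cause was ever set.
        if self._cause is not Throwable._UNSET:
            raise RuntimeError("Cause already initialized")  # IllegalStateException analog
        self._cause = cause
        return self
```

So an exception built with an explicit null cause reports getCause() as null, yet initCause() on it still throws, which matches the stack trace in the report.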
[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri
[ https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983040#comment-15983040 ] surya pratap commented on SPARK-20445: -- Hello Hyukjin Kwon, Thxz for fast reply. You are using which version and which environment. I am running my code on MapR having spark version 1.6v > pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was > given input with invalid label column label, without the number of classes > specified. See StringIndexer > > > Key: SPARK-20445 > URL: https://issues.apache.org/jira/browse/SPARK-20445 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: surya pratap > > #Load the CSV file into a RDD > irisData = sc.textFile("/home/infademo/surya/iris.csv") > irisData.cache() > irisData.count() > #Remove the first line (contains headers) > dataLines = irisData.filter(lambda x: "Sepal" not in x) > dataLines.count() > from pyspark.sql import Row > #Create a Data Frame from the data > parts = dataLines.map(lambda l: l.split(",")) > irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\ > SEPAL_WIDTH=float(p[1]), \ > PETAL_LENGTH=float(p[2]), \ > PETAL_WIDTH=float(p[3]), \ > SPECIES=p[4] )) > # Infer the schema, and register the DataFrame as a table. 
> irisDf = sqlContext.createDataFrame(irisMap) > irisDf.cache() > #Add a numeric indexer for the label/target column > from pyspark.ml.feature import StringIndexer > stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES") > si_model = stringIndexer.fit(irisDf) > irisNormDf = si_model.transform(irisDf) > irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect() > irisNormDf.cache() > > """-- > Perform Data Analytics > > -""" > #See standard parameters > irisNormDf.describe().show() > #Find correlation between predictors and target > for i in irisNormDf.columns: > if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) : > print( "Correlation to Species for ", i, \ > irisNormDf.stat.corr('IND_SPECIES',i)) > #Transform to a Data Frame for input to Machine Learing > #Drop columns that are not required (low correlation) > from pyspark.mllib.linalg import Vectors > from pyspark.mllib.linalg import SparseVector > from pyspark.mllib.regression import LabeledPoint > from pyspark.mllib.util import MLUtils > import org.apache.spark.mllib.linalg.{Matrix, Matrices} > from pyspark.mllib.linalg.distributed import RowMatrix > from pyspark.ml.linalg import Vectors > pyspark.mllib.linalg.Vector > def transformToLabeledPoint(row) : > lp = ( row["SPECIES"], row["IND_SPECIES"], \ > Vectors.dense([row["SEPAL_LENGTH"],\ > row["SEPAL_WIDTH"], \ > row["PETAL_LENGTH"], \ > row["PETAL_WIDTH"]])) > return lp > irisLp = irisNormDf.rdd.map(transformToLabeledPoint) > irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", > "features"]) > irisLpDf.select("species","label","features").show(10) > irisLpDf.cache() > > """-- > Perform Machine Learning > > -""" > #Split into training and testing data > (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1]) > trainingData.count() > testData.count() > testData.collect() > from pyspark.ml.classification import DecisionTreeClassifier > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > 
#Create the model > dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\ > featuresCol="features") >dtModel = dtClassifer.fit(trainingData) > >issue part:- > >dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): > File "", line 1, in File > "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", > line 69, in fit return self._fit(dataset) File > "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/wrapper.py", > line 133, in _fit java_model =
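The IllegalArgumentException in the traceback above means the label column fed to the tree carries no attribute metadata recording how many classes exist. StringIndexer's output column does carry it, but the script rebuilds the DataFrame by hand in transformToLabeledPoint, which strips that metadata. A toy sketch of what the indexer records (illustrative Python, not pyspark's implementation):

```python
from collections import Counter

class SimpleStringIndexer:
    # Toy stand-in for pyspark.ml.feature.StringIndexer, illustrative only:
    # maps label strings to 0-based doubles by descending frequency and
    # records the number of classes -- the piece of label metadata that
    # DecisionTreeClassifier complains is missing when the label column is
    # assembled by hand instead of coming out of StringIndexer.
    def fit(self, labels):
        ordered = [s for s, _ in Counter(labels).most_common()]
        self.label_to_index = {s: i for i, s in enumerate(ordered)}
        self.num_classes = len(ordered)
        return self

    def transform(self, labels):
        return [float(self.label_to_index[s]) for s in labels]
```

In the reported script the fix is to index the label on the final assembled DataFrame (or keep the StringIndexer stage ahead of the classifier in a Pipeline) so the label column the tree sees still carries this metadata.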
[jira] [Comment Edited] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982251#comment-15982251 ] Peng Meng edited comment on SPARK-20446 at 4/25/17 3:06 PM:
Yes, I compared with ML ALSModel.recommendAll. The data size is 480,000*17,000; comparing with mllib recommendAll, there is about 10% to 20% improvement.
Blockify is very important for the performance of our solution, because the compute of CartesianRDD is:
for (x <- rdd1.iterator(currSplit.s1, context); y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
Without blockify, rdd2.iterator will be called once per record of rdd1. With blockify, rdd2.iterator will be called once per group of rdd1. The performance difference is very large.
The SparkSQL approach is much different from my solution. It uses crossJoin to do the Cartesian product; the key optimization is in UnsafeCartesianRDD, which uses rowArray to cache the right partition. The key parts of our method: 1. use blockify to optimize the Cartesian product computation, not use blockify for BLAS 3. 2. use priorityQueue to save memory (used on each group), not to find TopK product for each user. The SparkSQL approach doesn't work like this. cc [~mengxr]
was (Author: peng.m...@intel.com): Yes, I compared with ML ALSModel.recommendAll. The data size is 480,000*17,000, comparing with mllib recommendAll, there is about 10% to 20% improvement. Blockify is very important the performance of our solution. Because the compute of CartesianRDD is: for (x <- rdd1.iterator(currSplit.s1, context); y <- rdd2.iterator(currSplit.s2, context)) yield (x, y) If no blockify, rdd2.iterator will be called the numberOfRecords times of rdd1. With blockify, rdd2.iterator will be called the numberOfGroups times of rdd1. The performance difference is very large. SparkSQL approach is much different from my solution.
It uses crossJoin to do Cartesian product, the key optimization is in UnsafeCartesianRDD, which uses rowArray to cache the right partition. The key part of our method: 1. use blockify to optimize the Cartesian production computation, not use blockify for BLAS 3. 2. use priorityQueue to save memory (used on each group), not to find TopK product for each user. SqarkSQL approach doesn't work like this. cc [~mengxr] > Optimize the process of MLLIB ALS recommendForAll > - > > Key: SPARK-20446 > URL: https://issues.apache.org/jira/browse/SPARK-20446 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > > The recommendForAll of MLLIB ALS is very slow. > GC is a key problem of the current method. > The task use the following code to keep temp result: > val output = new Array[(Int, (Int, Double))](m*n) > m = n = 4096 (default value, no method to set) > so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a large memory and > cause serious GC problem, and it is frequently OOM. > Actually, we don't need to save all the temp result. Suppose we recommend > topK (topK is about 10, or 20) product for each user, we only need 4k * topK > * (4 + 4 + 8) memory to save the temp result. > I have written a solution for this method with the following test result. > The Test Environment: > 3 workers: each work 10 core, each work 30G memory, each work 1 executor. > The Data: User 480,000, and Item 17,000 > BlockSize: 1024 2048 4096 8192 > Old method: 245s 332s 488s OOM > This solution: 121s 118s 117s 120s > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
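The iterator-count argument above can be checked with a small Python model of the Cartesian product (illustrative only; CartesianRDD's actual compute is the Scala for-comprehension quoted in the comment):

```python
def cartesian_scans(left, right, block_size=None):
    # Count how many times `right` has to be re-scanned while producing the
    # full Cartesian product: once per left record without blockify, once
    # per left block with blockify. The pair count is identical either way.
    if block_size is None:
        blocks = [[x] for x in left]  # no blockify: every record is its own group
    else:
        blocks = [left[i:i + block_size] for i in range(0, len(left), block_size)]
    scans, pairs = 0, 0
    for block in blocks:
        scans += 1  # one full pass over `right` per group
        for _ in right:
            pairs += len(block)
    return scans, pairs
```

With thousands of records per block, the right side is materialized thousands of times less often, which is where the speedup comes from.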
[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983030#comment-15983030 ] Peng Meng commented on SPARK-20446: --- Thanks [~mlnick] , I agree with you. I am ok to close this ticket and move the discussion to SPARK-11968. > Optimize the process of MLLIB ALS recommendForAll > - > > Key: SPARK-20446 > URL: https://issues.apache.org/jira/browse/SPARK-20446 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > > The recommendForAll of MLLIB ALS is very slow. > GC is a key problem of the current method. > The task use the following code to keep temp result: > val output = new Array[(Int, (Int, Double))](m*n) > m = n = 4096 (default value, no method to set) > so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a large memory and > cause serious GC problem, and it is frequently OOM. > Actually, we don't need to save all the temp result. Suppose we recommend > topK (topK is about 10, or 20) product for each user, we only need 4k * topK > * (4 + 4 + 8) memory to save the temp result. > I have written a solution for this method with the following test result. > The Test Environment: > 3 workers: each work 10 core, each work 30G memory, each work 1 executor. > The Data: User 480,000, and Item 17,000 > BlockSize: 1024 2048 4096 8192 > Old method: 245s 332s 488s OOM > This solution: 121s 118s 117s 120s > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983026#comment-15983026 ] Nick Pentreath commented on SPARK-11968: Note, there is a solution proposed in SPARK-20446. I've redirected the discussion to this original JIRA. > ALS recommend all methods spend most of time in GC > -- > > Key: SPARK-11968 > URL: https://issues.apache.org/jira/browse/SPARK-11968 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.2, 1.6.0 >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath > > After adding recommendUsersForProducts and recommendProductsForUsers to ALS > in spark-perf, I noticed that it takes much longer than ALS itself. Looking > at the monitoring page, I can see it is spending about 8min doing GC for each > 10min task. That sounds fixable. Looking at the implementation, there is > clearly an opportunity to avoid extra allocations: > [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283] > CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983018#comment-15983018 ] Nick Pentreath commented on SPARK-20446: By the way when I say it is a duplicate I mean for the JIRA ticket. I agree that PR9980 was not the correct solution - JIRA tickets can have multiple PRs linked to them. I'd prefer to close this ticket and move the discussion to SPARK-11968 (also there are watchers on that ticket that may be interested in the outcome). > Optimize the process of MLLIB ALS recommendForAll > - > > Key: SPARK-20446 > URL: https://issues.apache.org/jira/browse/SPARK-20446 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > > The recommendForAll of MLLIB ALS is very slow. > GC is a key problem of the current method. > The task use the following code to keep temp result: > val output = new Array[(Int, (Int, Double))](m*n) > m = n = 4096 (default value, no method to set) > so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a large memory and > cause serious GC problem, and it is frequently OOM. > Actually, we don't need to save all the temp result. Suppose we recommend > topK (topK is about 10, or 20) product for each user, we only need 4k * topK > * (4 + 4 + 8) memory to save the temp result. > I have written a solution for this method with the following test result. > The Test Environment: > 3 workers: each work 10 core, each work 30G memory, each work 1 executor. > The Data: User 480,000, and Item 17,000 > BlockSize: 1024 2048 4096 8192 > Old method: 245s 332s 488s OOM > This solution: 121s 118s 117s 120s > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982949#comment-15982949 ] Dmitry Naumenko commented on SPARK-13747: - [~zsxwing] I did a similar test with join and have the same error in 2.2.0 (actual query here - https://github.com/dnaumenko/spark-realtime-analytics-sample/blob/master/samples/src/main/scala/com/github/sparksample/httpapp/SimpleServer.scala). My test setup is a simple akka-http long-running application and separate Gatling script that spawns multiple requests for join query (https://github.com/dnaumenko/spark-realtime-analytics-sample/blob/master/loadtool/src/main/scala/com/github/sparksample/SimpleServerSimulation.scala is test simulation script). [~barrybecker4] I've managed to fix the problem by switching akka's default executor to thread-pool. But I guess the root cause is that Spark is relying on ThreadLocal variables and manages them incorrectly. > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > > Run the following codes may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). 
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties has been polluted. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
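The root cause sketched above (a pool reusing a thread whose thread-local state was never cleaned up) is easy to reproduce outside Spark. In the hedged Python sketch below, run_job stands in for SQLExecution.withNewExecutionId and is not real Spark code; the single-worker pool guarantees the second task lands on the polluted thread:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

local = threading.local()

def run_job(job_id):
    # Refuse to start if this thread already carries an execution id. A job
    # that suspends (or exits) without cleaning up leaves the id behind, and
    # the next task scheduled on the reused thread trips over it.
    if getattr(local, "execution_id", None) is not None:
        raise RuntimeError("spark.sql.execution.id is already set")
    local.execution_id = job_id  # deliberately never cleared, to model the leak
    return job_id

pool = ThreadPoolExecutor(max_workers=1)  # one worker thread, so it must be reused
first = pool.submit(run_job, 1).result()  # succeeds and pollutes the thread
second = pool.submit(run_job, 2)          # reuses the polluted thread
```

Calling second.result() re-raises the RuntimeError, mirroring the "spark.sql.execution.id is already set" failure; switching akka to a plain thread pool merely changes which thread gets polluted, not the underlying reliance on thread-local state.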
[jira] [Created] (SPARK-20458) support getting Yarn Tracking URL in code
PJ Fanning created SPARK-20458: -- Summary: support getting Yarn Tracking URL in code Key: SPARK-20458 URL: https://issues.apache.org/jira/browse/SPARK-20458 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.1.0 Reporter: PJ Fanning org.apache.spark.deploy.yarn.Client logs the Yarn tracking URL but it would be useful to be able to access this in code, as opposed to mining log output. I have an application where I monitor the health of the SparkContext and associated Executors using the Spark REST API. Would it be feasible to add a listener API to listen for new ApplicationReports in org.apache.spark.deploy.yarn.Client? Alternatively, this URL could be exposed as a property associated with the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri
[ https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982852#comment-15982852 ] Hyukjin Kwon commented on SPARK-20445: -- Are you maybe able to try this against the current master or higher versions? > pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was > given input with invalid label column label, without the number of classes > specified. See StringIndexer > > > Key: SPARK-20445 > URL: https://issues.apache.org/jira/browse/SPARK-20445 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: surya pratap > > #Load the CSV file into a RDD > irisData = sc.textFile("/home/infademo/surya/iris.csv") > irisData.cache() > irisData.count() > #Remove the first line (contains headers) > dataLines = irisData.filter(lambda x: "Sepal" not in x) > dataLines.count() > from pyspark.sql import Row > #Create a Data Frame from the data > parts = dataLines.map(lambda l: l.split(",")) > irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\ > SEPAL_WIDTH=float(p[1]), \ > PETAL_LENGTH=float(p[2]), \ > PETAL_WIDTH=float(p[3]), \ > SPECIES=p[4] )) > # Infer the schema, and register the DataFrame as a table. 
[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri
[ https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982830#comment-15982830 ] surya pratap commented on SPARK-20445: -- Hello Hyukjin Kwon, thanks for the reply. I have tried many times but I am getting the same issue. I am using Spark 1.6.1. Please help me sort out this issue. > pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was > given input with invalid label column label, without the number of classes > specified. See StringIndexer > > > Key: SPARK-20445 > URL: https://issues.apache.org/jira/browse/SPARK-20445 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: surya pratap > > #Load the CSV file into an RDD > irisData = sc.textFile("/home/infademo/surya/iris.csv") > irisData.cache() > irisData.count() > #Remove the first line (contains headers) > dataLines = irisData.filter(lambda x: "Sepal" not in x) > dataLines.count() > from pyspark.sql import Row > #Create a Data Frame from the data > parts = dataLines.map(lambda l: l.split(",")) > irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\ > SEPAL_WIDTH=float(p[1]), \ > PETAL_LENGTH=float(p[2]), \ > PETAL_WIDTH=float(p[3]), \ > SPECIES=p[4] )) > # Infer the schema, and register the DataFrame as a table. 
> irisDf = sqlContext.createDataFrame(irisMap) > irisDf.cache() > #Add a numeric indexer for the label/target column > from pyspark.ml.feature import StringIndexer > stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES") > si_model = stringIndexer.fit(irisDf) > irisNormDf = si_model.transform(irisDf) > irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect() > irisNormDf.cache() > > """-- > Perform Data Analytics > > -""" > #See standard parameters > irisNormDf.describe().show() > #Find correlation between predictors and target > for i in irisNormDf.columns: > if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) : > print( "Correlation to Species for ", i, \ > irisNormDf.stat.corr('IND_SPECIES',i)) > #Transform to a Data Frame for input to Machine Learning > #Drop columns that are not required (low correlation) > from pyspark.mllib.linalg import Vectors > def transformToLabeledPoint(row) : > lp = ( row["SPECIES"], row["IND_SPECIES"], \ > Vectors.dense([row["SEPAL_LENGTH"],\ > row["SEPAL_WIDTH"], \ > row["PETAL_LENGTH"], \ > row["PETAL_WIDTH"]])) > return lp > irisLp = irisNormDf.rdd.map(transformToLabeledPoint) > irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", > "features"]) > irisLpDf.select("species","label","features").show(10) > irisLpDf.cache() > > """-- > Perform Machine Learning > > -""" > #Split into training and testing data > (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1]) > trainingData.count() > testData.count() > testData.collect() > from pyspark.ml.classification import DecisionTreeClassifier > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > 
#Create the model > dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\ > featuresCol="features") > dtModel = dtClassifer.fit(trainingData) > > Issue part: > > dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): > File "", line 1, in File > "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", > line 69, in fit return self._fit(dataset) File > "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/wrapper.py", > line 133, in _fit java_model = self._fit_java(dataset) File > "/opt/mapr/spark/spark-1.6.1-bin-hadoo
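The traceback above ends in the exception from the subject line: the classifier is told the label column is invalid because "the number of classes" is not specified. On Spark 1.6.x, {{DecisionTreeClassifier}} reads the class count from ML attribute metadata attached to the label column, metadata which {{StringIndexer}} produces but which a hand-built column of doubles (as in {{transformToLabeledPoint}} above) does not carry. The following is a plain-Python sketch of that lookup, not Spark's actual API; the names {{get_num_classes}}, {{fit_decision_tree}}, and the metadata keys are hypothetical and only illustrate the mechanism:

```python
# Hypothetical sketch of how a tree classifier resolves the class count
# from label-column metadata. Names and dictionary keys are illustrative,
# not Spark's real internals.

def get_num_classes(label_metadata):
    """Return the class count recorded in the label column's metadata,
    or None when the column carries no ML attribute metadata."""
    ml_attr = label_metadata.get("ml_attr", {})
    # A StringIndexer-produced column records its distinct label values.
    values = ml_attr.get("vals")
    return len(values) if values is not None else None

def fit_decision_tree(label_metadata):
    """Fail the same way the report does when the metadata is missing."""
    num_classes = get_num_classes(label_metadata)
    if num_classes is None:
        raise ValueError(
            "DecisionTreeClassifier was given input with invalid label "
            "column, without the number of classes specified. "
            "See StringIndexer")
    return "model trained with %d classes" % num_classes

# A hand-built label column (as in transformToLabeledPoint) has no metadata:
plain_label = {}
# A StringIndexer output column records its labels:
indexed_label = {"ml_attr": {"vals": ["setosa", "versicolor", "virginica"]}}
```

The practical fix on 1.6.1 is therefore to keep the indexed column ({{IND_SPECIES}}) as the {{labelCol}}, or to run the {{StringIndexer}} and the classifier inside the same {{Pipeline}}, so the metadata survives through to {{fit}}.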
[jira] [Updated] (SPARK-20457) Spark CSV is not able to Override Schema while reading data
[ https://issues.apache.org/jira/browse/SPARK-20457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Himanshu Gupta updated SPARK-20457: --- Description: I have a CSV file, test.csv: {code:none} col 1 2 3 4 {code} When I read it using Spark, it infers the schema of the data correctly: {code:java} val df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv") df.printSchema root |-- col: integer (nullable = true) {code} But when I override the `schema` of the CSV file and set `inferSchema` to false, SparkSession picks up the custom schema only partially: {code:java} val df = spark.read.option("header", "true").option("inferSchema", "false").schema(StructType(List(StructField("custom", StringType, false)))).csv("test.csv") df.printSchema root |-- custom: string (nullable = true) {code} Only the column name (`custom`) and DataType (`StringType`) are picked up; the `nullable` flag is ignored, as it still comes back as `nullable = true`, which is incorrect. I am not able to understand this behavior. 
> Spark CSV is not able to Override Schema while reading data > --- > > Key: SPARK-20457 > URL: https://issues.apache.org/jira/browse/SPARK-20457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Himanshu Gupta > > I have a CSV file, test.csv: > {code:none} > col > 1 > 2 > 3 > 4 > {code} > When I read it using Spark, it infers the schema of the data correctly: > {code:java} > val df = spark.read.option("header", "true").option("inferSchema", > "true").csv("test.csv") > > df.printSchema > root > |-- col: integer (nullable = true) > {code} > But when I override the `schema` of the CSV file and set `inferSchema` to > false, SparkSession picks up the custom schema only partially: > {code:java} > val df = spark.read.option("header", "true").option("inferSchema", > "false").schema(StructType(List(StructField("custom", StringType, > false)))).csv("test.csv") > df.printSchema > root > |-- custom: string (nullable = true) > {code} > Only the column name (`custom`) and DataType (`StringType`) are getting > picked up; the `nullable` flag is being ignored, as it still comes back as > `nullable = true`, which is incorrect. > I am not able to understand this behavior. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
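One common explanation for the behavior reported above is that Spark treats user-supplied nullability as a hint for file-based sources: since a file cannot guarantee that a column never contains nulls, the reader relaxes every field to `nullable = true` rather than enforcing the user's `false`. Whether that should count as a bug or a design choice is exactly what this issue questions. A hypothetical plain-Python sketch of such a relaxation step (the function name and the tuple-based schema representation are illustrative, not Spark's API):

```python
# Hypothetical sketch of a file-source reader relaxing user-specified
# nullability, which would reproduce the behavior reported above.
# A schema is modeled as a list of (name, data_type, nullable) tuples.

def as_nullable(fields):
    """Return a copy of the schema with every field marked nullable,
    since a file cannot guarantee non-null values for any column."""
    return [(name, data_type, True) for name, data_type, _ in fields]

# The reporter's custom schema: name and type survive, nullability does not.
user_schema = [("custom", "string", False)]
applied_schema = as_nullable(user_schema)
```

Under this reading, the column name and type from the custom schema are honored while the `nullable = false` flag is silently dropped, which matches the `printSchema` output in the report.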
[jira] [Created] (SPARK-20457) Spark CSV is not able to Override Schema while reading data
Himanshu Gupta created SPARK-20457: -- Summary: Spark CSV is not able to Override Schema while reading data Key: SPARK-20457 URL: https://issues.apache.org/jira/browse/SPARK-20457 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Himanshu Gupta I have a CSV file, test.csv: {code:none} col 1 2 3 4 {code} When I read it using Spark, it infers the schema of the data correctly: {code:java} val df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv") df.printSchema root |-- col: integer (nullable = true) {code} But when I override the `schema` of the CSV file and set `inferSchema` to false, SparkSession picks up the custom schema only partially: {code:java} val df = spark.read.option("header", "true").option("inferSchema", "false").schema(StructType(List(StructField("custom", StringType, false)))).csv("test.csv") df.printSchema root |-- custom: string (nullable = true) {code} Only the column name (`custom`) and DataType (`StringType`) are picked up; the `nullable` flag is ignored, as it still comes back as `nullable = true`, which is incorrect. I am not able to understand this behavior.
[jira] [Closed] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HanCheol Cho closed SPARK-20336. Resolution: Not A Bug > spark.read.csv() with wholeFile=True option fails to read non ASCII unicode > characters > -- > > Key: SPARK-20336 > URL: https://issues.apache.org/jira/browse/SPARK-20336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2.0 (master branch is downloaded from Github) > PySpark >Reporter: HanCheol Cho > > I used the spark.read.csv() method with the wholeFile=True option to load data that > has multi-line records. > However, non-ASCII characters are not properly loaded. > The following is a sample data set for the test: > {code:none} > col1,col2,col3 > 1,a,text > 2,b,テキスト > 3,c,텍스트 > 4,d,"text > テキスト > 텍스트" > 5,e,last > {code} > When it is loaded without the wholeFile=True option, non-ASCII characters are > shown correctly, although multi-line records are parsed incorrectly, as follows: > {code:none} > testdf_default = spark.read.csv("test.encoding.csv", header=True) > testdf_default.show() > +----+----+----+ > |col1|col2|col3| > +----+----+----+ > | 1| a|text| > | 2| b|テキスト| > | 3| c| 텍스트| > | 4| d|text| > |テキスト|null|null| > | 텍스트"|null|null| > | 5| e|last| > +----+----+----+ > {code} > When the wholeFile=True option is used, non-ASCII characters are broken, as > follows: > {code:none} > testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, > wholeFile=True) > testdf_wholefile.show() > +----+----+----+ > |col1|col2|col3| > +----+----+----+ > | 1| a|text| > | 2| b|| > | 3| c| �| > | 4| d|text > ...| > | 5| e|last| > +----+----+----+ > {code} > The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.
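The sample above is valid CSV: the quoted field in record 4 legitimately spans three lines, and the non-ASCII text is plain UTF-8. A quick check with Python's standard-library csv module (independent of Spark) parses both the multi-line record and the non-ASCII characters correctly, which localizes the reported breakage to the wholeFile code path in that Spark build rather than to the data itself:

```python
import csv
import io

# The multi-line, non-ASCII sample from the report; record 4's quoted
# col3 value spans three physical lines.
sample = (
    "col1,col2,col3\n"
    "1,a,text\n"
    "2,b,テキスト\n"
    "3,c,텍스트\n"
    '4,d,"text\nテキスト\n텍스트"\n'
    "5,e,last\n"
)

# DictReader handles quoted multi-line fields per the CSV conventions.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    print(row["col1"], repr(row["col3"]))
```

This yields five records, with record 4's col3 containing the embedded newlines and both Japanese and Korean text intact, matching what the (correctly behaving) non-wholeFile Spark reader shows for the single-line rows.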