[jira] [Comment Edited] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984161#comment-15984161 ] Yan Facai (颜发才) edited comment on SPARK-20199 at 4/26/17 6:11 AM:
--
The work itself is easy; however, a new **public method** is added, and some adjustments are needed in the inner implementation. Hence, I suggest delaying it until at least one of the committers agrees to shepherd the issue.
I have two questions:
1. Since both GBDT and RandomForest share this attribute, we can pull the `featureSubsetStrategy` parameter up to either TreeEnsembleParams or DecisionTreeParams. Which one is appropriate?
2. Is it right to add the new parameter `featureSubsetStrategy` to the Strategy class? Or should it be added to DecisionTree's train method?

was (Author: facai):
The work itself is easy; however, a new public method is added, and some adjustments are needed in the inner implementation. Hence, I suggest delaying it until one expert agrees to shepherd the issue.
I have two questions:
1. Since both GBDT and RandomForest share this attribute, we can pull the `featureSubsetStrategy` parameter up to either TreeEnsembleParams or DecisionTreeParams. Which one is appropriate?
2. Is it right to add the new parameter `featureSubsetStrategy` to the Strategy class? Or should it be added to DecisionTree's train method?

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate parameter. This parameter is available in H2O and XGBoost.
> Sample from H2O.ai: gbmParams._col_sample_rate
> Please provide the parameter.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
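To make question 1 above concrete, the following is a minimal, self-contained sketch of what "pulling the parameter up into a shared trait" looks like. This is plain Scala, not Spark's actual Param machinery; the trait and class names only mirror (and are not) the real TreeEnsembleParams, RandomForest, and GBT classes, and the set of valid strategy values is an assumption for illustration.

```scala
// Stand-in for Spark's Param machinery: a named parameter with a validator.
final case class ParamSketch[T](name: String, isValid: T => Boolean)

// Shared trait: both ensemble learners mix this in, so the parameter,
// its validation, and its accessors are defined exactly once.
trait TreeEnsembleParamsSketch {
  val featureSubsetStrategy: ParamSketch[String] =
    ParamSketch("featureSubsetStrategy",
      value => Set("auto", "all", "onethird", "sqrt", "log2").contains(value))

  private var strategy: String = "auto"

  def setFeatureSubsetStrategy(value: String): this.type = {
    require(featureSubsetStrategy.isValid(value),
      s"Unsupported featureSubsetStrategy: $value")
    strategy = value
    this
  }

  def getFeatureSubsetStrategy: String = strategy
}

// Both learners now share the setter/getter with no duplication.
class RandomForestLike extends TreeEnsembleParamsSketch
class GBTLike extends TreeEnsembleParamsSketch
```

Whether the trait should sit at the TreeEnsembleParams or DecisionTreeParams level is exactly the open question in the comment; the mechanics of the pull-up are the same either way.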
[jira] [Comment Edited] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984173#comment-15984173 ] Liang-Chi Hsieh edited comment on SPARK-20392 at 4/26/17 6:43 AM:
--
[~barrybecker4] Currently I think the performance degradation is caused by the cost of the exchange between the DataFrame/Dataset abstraction and logical plans. Some operations on DataFrames exchange between DataFrames and logical plans. This cost is negligible in typical SQL usage. However, it's not rare to chain dozens of pipeline stages in ML. As the query plan grows incrementally while those stages run, the cost spent on the exchange grows too. In particular, the Analyzer goes through the whole big query plan even when most of it has already been analyzed.
By eliminating part of that cost, the time to run the example code locally is reduced from about 1 min to about 30 secs.
In particular, the time applying the pipeline locally is mostly spent on calling {{transform}} of the 137 {{Bucketizer}}s. Before the change, each call of a {{Bucketizer}}'s {{transform}} can cost about 0.4 sec, so the total time spent on all {{Bucketizer}}s' {{transform}} is about 50 secs. After the change, each call only costs about 0.1 sec.

was (Author: viirya):
[~barrybecker4] Currently I think the performance degradation is caused by the cost of the exchange between the DataFrame/Dataset abstraction and logical plans. Some operations on DataFrames exchange between DataFrames and logical plans. This cost is negligible in typical SQL usage. However, it's not rare to chain dozens of pipeline stages in ML. As the query plan grows incrementally while those stages run, the cost spent on the exchange grows too. In particular, the Analyzer goes through the whole big query plan even when most of it has already been analyzed.
By eliminating part of that cost, the time to run the example code locally is reduced from about 1 min to about 30 secs.
> Slow performance when calling fit on ML pipeline for dataset with many
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
> This started as a [question on stack overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. This seems much too long for such a tiny dataset. Similar pipelines run quickly on datasets that have fewer columns and more rows. It's something about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
>
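For context, the shape of the reported pipeline is easy to reproduce. The sketch below (an illustration, not the reporter's actual code: it assumes a DataFrame with numeric input columns `c0` through `c136` and uses an arbitrary set of splits) builds 137 Bucketizer stages. Each stage's transform adds a projection to the logical plan, which is why plan size, and analysis cost, grows with the number of stages.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer

// One Bucketizer per numeric column, mirroring the reported 137-stage
// pipeline. Column names c0..c136 and the splits are assumptions.
val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)
val bucketizers: Seq[PipelineStage] = (0 until 137).map { i =>
  new Bucketizer()
    .setInputCol(s"c$i")
    .setOutputCol(s"c${i}_binned")
    .setSplits(splits)
}
val pipeline = new Pipeline().setStages(bucketizers.toArray)
// Each fitted stage's transform extends the logical plan, so analysis work
// compounds across stages:
// val model = pipeline.fit(df)   // df: DataFrame with columns c0..c136
```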
[jira] [Created] (SPARK-20466) HadoopRDD#addLocalConfiguration throws NPE
liyunzhang_intel created SPARK-20466:
Summary: HadoopRDD#addLocalConfiguration throws NPE
Key: SPARK-20466
URL: https://issues.apache.org/jira/browse/SPARK-20466
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 2.0.2
Reporter: liyunzhang_intel

In Spark 2.0.2, it throws an NPE:
{code}
17/04/23 08:19:55 ERROR executor.Executor: Exception in task 439.0 in stage 16.0 (TID 986)
java.lang.NullPointerException
	at org.apache.spark.rdd.HadoopRDD$.addLocalConfiguration(HadoopRDD.scala:373)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:243)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
Suggestion: guard against a null JobConf to avoid the NPE:
{code}
/** Add Hadoop configuration specific to a single partition and attempt. */
def addLocalConfiguration(jobTrackerId: String, jobId: Int, splitId: Int, attemptId: Int,
    conf: JobConf) {
  val jobID = new JobID(jobTrackerId, jobId)
  val taId = new TaskAttemptID(new TaskID(jobID, TaskType.MAP, splitId), attemptId)
  if (conf != null) {
    conf.set("mapred.tip.id", taId.getTaskID.toString)
    conf.set("mapred.task.id", taId.toString)
    conf.setBoolean("mapred.task.is.map", true)
    conf.setInt("mapred.task.partition", splitId)
    conf.set("mapred.job.id", jobID.toString)
  }
}
{code}
[jira] [Assigned] (SPARK-20467) sbt-launch-lib.bash lacks the ASF header.
[ https://issues.apache.org/jira/browse/SPARK-20467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20467:
Assignee: Apache Spark

> sbt-launch-lib.bash lacks the ASF header.
> ---
>
> Key: SPARK-20467
> URL: https://issues.apache.org/jira/browse/SPARK-20467
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.1.0
> Reporter: liuzhaokun
> Assignee: Apache Spark
>
> When I used this script, I found that sbt-launch-lib.bash lacks the ASF header. This is not permitted by the Apache Foundation according to the Apache License 2.0.
[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984270#comment-15984270 ] Liang-Chi Hsieh commented on SPARK-20392:
-
[~barrybecker4] Btw, the time applying the model_9756 pipeline on blockbuster_fewCols.csv is less than 1 sec; this pipeline only has 5 {{Bucketizer}}s, so it doesn't cost as much time as the larger pipeline.

> Slow performance when calling fit on ML pipeline for dataset with many
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
> This started as a [question on stack overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. This seems much too long for such a tiny dataset. Similar pipelines run quickly on datasets that have fewer columns and more rows. It's something about the number of columns that is causing the slow performance.
[jira] [Assigned] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20468: Assignee: Apache Spark > Refactor the ALS code > - > > Key: SPARK-20468 > URL: https://issues.apache.org/jira/browse/SPARK-20468 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Assignee: Apache Spark >Priority: Minor > Labels: documentation, readability, refactoring > > The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is > quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all > in one file. Here are some things I think could improve the clarity and > maintainability of the code: > * The file can be split into more manageable parts. In particular, {{ALS}}, > {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files > for better readability. > * Certain parts can be encapsulated or moved to clarify the intent. For > example: > ** The {{ALS.train}} method is currently defined in the {{ALS}} companion > object, and it seems to take 12 individual parameters that are all members of > the {{ALS}} class. This method can be made an instance method. > ** The code that creates in-blocks and out-blocks in the body of > {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods > in the {{ALS}} companion object, can be moved into a separate case class that > holds the blocks. This has the added benefit of allowing us to write > specific Scaladoc to explain the logic behind these block objects, as their > usage is certainly nontrivial yet is fundamental to the implementation. > ** The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be > hidden within {{UncompressedInBlock}} to clarify the scope of their usage. > ** Various builder classes could be encapsulated in the companion objects of > the classes they build. > * The code can be formatted more clearly. 
For example:
> ** Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be formatted more clearly and have comments added explaining the reasoning behind key parts. That these methods form the core of the ALS logic makes this doubly important for maintainability.
> ** Parts of the code that use {{while}} loops with manually incremented counters can be rewritten as {{for}} loops.
> ** Where non-idiomatic Scala code is used that doesn't improve performance much, clearer code can be substituted. (This in particular should be done very carefully if at all, as it's apparent the original author spent much time and pains optimizing the code to significantly improve its runtime profile.)
> * The documentation (both Scaladocs and inline comments) can be clarified where needed and expanded where incomplete. This is especially important for parts of the code that are written imperatively for performance, as these parts don't benefit from the intuitive self-documentation of Scala's higher-level language abstractions. Specifically, I'd like to add documentation fully explaining the key functionality of the in-block and out-block objects, their purpose, how they relate to the overall ALS algorithm, and how they are calculated in such a way that new maintainers can ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level survey. All of these improvements will, I believe, add up to make the code easier to understand, extend, and maintain. This issue will track the progress of this refactoring so that going forward, authors will have an easier time maintaining this part of the project.
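As an aside, the while-to-for rewrite suggested in the issue can be sketched with a hypothetical example (this is illustrative plain Scala, not code taken from the ALS file). The while form is what the hot paths use today; the for form is clearer, though, as the issue itself cautions, it can be slower on performance-critical paths, so the rewrite should be applied selectively.

```scala
// Manual counter, as in the performance-sensitive inner loops.
def dotWhile(xs: Array[Double], ys: Array[Double]): Double = {
  var acc = 0.0
  var i = 0
  while (i < xs.length) {
    acc += xs(i) * ys(i)
    i += 1
  }
  acc
}

// Equivalent for-loop: no counter bookkeeping, same result.
def dotFor(xs: Array[Double], ys: Array[Double]): Double = {
  var acc = 0.0
  for (i <- xs.indices) {
    acc += xs(i) * ys(i)
  }
  acc
}
```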
[jira] [Assigned] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20468:
Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-20469) Add a method to display DataFrame schema in PipelineStage
darion yaphet created SPARK-20469: - Summary: Add a method to display DataFrame schema in PipelineStage Key: SPARK-20469 URL: https://issues.apache.org/jira/browse/SPARK-20469 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 2.1.0, 2.0.2, 1.6.3 Reporter: darion yaphet Priority: Minor Sometimes we apply Transformers and Estimators in a pipeline. A PipelineStage that could display the DataFrame schema would be a big help for understanding and checking the dataset. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
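A minimal, Spark-free sketch of the idea proposed above. All names here ({{StageWithSchema}}, {{explain_schema}}) are illustrative only, not an actual Spark API; in real PySpark the schema would come from the stage's {{transformSchema}} result rather than a list of pairs.

```python
# Spark-free sketch of SPARK-20469's idea: let a pipeline stage print the
# schema of the dataset passing through it. StageWithSchema and
# explain_schema are hypothetical names, not part of any Spark API.

class StageWithSchema:
    """A toy pipeline stage that can display its input schema."""

    def __init__(self, name):
        self.name = name

    def explain_schema(self, schema):
        # schema: list of (field_name, field_type) pairs, mimicking StructType
        lines = [f"{self.name} input schema:"]
        for field, dtype in schema:
            lines.append(f" |-- {field}: {dtype}")
        return "\n".join(lines)

stage = StageWithSchema("tokenizer")
schema = [("feature", "string"), ("histogram", "array<struct>")]
print(stage.explain_schema(schema))
```

The output mirrors the tree layout of DataFrame.printSchema, which is what makes such a method useful for eyeballing a dataset mid-pipeline.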
[jira] [Assigned] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20392: Assignee: (was: Apache Spark) > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much too long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. 
> Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 
> 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c17d5ef9 > 109_bucketizer_f22df23ef56f > 110_bucketizer_bad04578bd20 > 111_bucketizer_35cfbde7e28f > 112_bucketizer_cf89177a528b > 113_bucketizer_183a0d393ef0 > 114_bucketizer_467c78156a67 >
[jira] [Assigned] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20392: Assignee: Apache Spark > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Apache Spark > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much too long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. 
> Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 
> 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c17d5ef9 > 109_bucketizer_f22df23ef56f > 110_bucketizer_bad04578bd20 > 111_bucketizer_35cfbde7e28f > 112_bucketizer_cf89177a528b > 113_bucketizer_183a0d393ef0 >
[jira] [Created] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
Philip Adetiloye created SPARK-20470: Summary: Invalid json converting RDD row with Array of struct to json Key: SPARK-20470 URL: https://issues.apache.org/jira/browse/SPARK-20470 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.3 Reporter: Philip Adetiloye Trying to convert an RDD in pyspark containing an Array of struct doesn't generate the right json. It looks trivial, but I can't get good json out. rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict()))) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984350#comment-15984350 ] Daniel Li commented on SPARK-20468: --- I'd appreciate if someone with permissions could assign this issue to me. > Refactor the ALS code > - > > Key: SPARK-20468 > URL: https://issues.apache.org/jira/browse/SPARK-20468 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Priority: Minor > Labels: documentation, readability, refactoring > > The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is > quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all > in one file. Here are some things I think could improve the clarity and > maintainability of the code: > * The file can be split into more manageable parts. In particular, {{ALS}}, > {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files > for better readability. > * Certain parts can be encapsulated or moved to clarify the intent. For > example: > ** The {{ALS.train}} method is currently defined in the {{ALS}} companion > object, and it seems to take 12 individual parameters that are all members of > the {{ALS}} class. This method can be made an instance method. > ** The code that creates in-blocks and out-blocks in the body of > {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods > in the {{ALS}} companion object, can be moved into a separate case class that > holds the blocks. This has the added benefit of allowing us to write > specific Scaladoc to explain the logic behind these block objects, as their > usage is certainly nontrivial yet is fundamental to the implementation. > ** The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be > hidden within {{UncompressedInBlock}} to clarify the scope of their usage. 
> ** Various builder classes could be encapsulated in the companion objects of > the classes they build. > * The code can be formatted more clearly. For example: > ** Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be > formatted more clearly and have comments added explaining the reasoning > behind key parts. That these methods form the core of the ALS logic makes > this doubly important for maintainability. > ** Parts of the code that use {{while}} loops with manually incremented > counters can be rewritten as {{for}} loops. > ** Where non-idiomatic Scala code is used that doesn't improve performance > much, clearer code can be substituted. (This in particular should be done > very carefully if at all, as it's apparent the original author spent much > time and pains optimizing the code to significantly improve its runtime > profile.) > * The documentation (both Scaladocs and inline comments) can be clarified > where needed and expanded where incomplete. This is especially important for > parts of the code that are written imperatively for performance, as these > parts don't benefit from the intuitive self-documentation of Scala's > higher-level language abstractions. Specifically, I'd like to add > documentation fully explaining the key functionality of the in-block and > out-block objects, their purpose, how they relate to the overall ALS > algorithm, and how they are calculated in such a way that new maintainers can > ramp up much more quickly. > The above is not a complete enumeration of improvements but a high-level > survey. All of these improvements will, I believe, add up to make the code > easier to understand, extend, and maintain. This issue will track the > progress of this refactoring so that going forward, authors will have an > easier time maintaining this part of the project. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.
[ https://issues.apache.org/jira/browse/SPARK-20471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20471: Assignee: Apache Spark > Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled > but vectorized hashmap is enabled. > - > > Key: SPARK-20471 > URL: https://issues.apache.org/jira/browse/SPARK-20471 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: caoxuewen >Assignee: Apache Spark > > Remove the AggregateBenchmark testsuite warning, > such as '14:26:33.220 WARN > org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap > is disabled but vectorized hashmap is enabled.' > Unit tests: AggregateBenchmark > Change the 'ignore' function to the 'test' function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.
[ https://issues.apache.org/jira/browse/SPARK-20471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20471: Assignee: (was: Apache Spark) > Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled > but vectorized hashmap is enabled. > - > > Key: SPARK-20471 > URL: https://issues.apache.org/jira/browse/SPARK-20471 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: caoxuewen > > Remove the AggregateBenchmark testsuite warning, > such as '14:26:33.220 WARN > org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap > is disabled but vectorized hashmap is enabled.' > Unit tests: AggregateBenchmark > Change the 'ignore' function to the 'test' function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20466) HadoopRDD#addLocalConfiguration throws NPE
[ https://issues.apache.org/jira/browse/SPARK-20466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-20466: - Attachment: NPE_log NPE_log describes the detailed info. > HadoopRDD#addLocalConfiguration throws NPE > -- > > Key: SPARK-20466 > URL: https://issues.apache.org/jira/browse/SPARK-20466 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.2 >Reporter: liyunzhang_intel > Attachments: NPE_log > > > in spark2.0.2, it throws NPE > {code} > 17/04/23 08:19:55 ERROR executor.Executor: Exception in task 439.0 in stage > 16.0 (TID 986)$ > java.lang.NullPointerException$ > ^Iat > org.apache.spark.rdd.HadoopRDD$.addLocalConfiguration(HadoopRDD.scala:373)$ > ^Iat org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:243)$ > ^Iat org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)$ > ^Iat org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)$ > ^Iat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)$ > ^Iat org.apache.spark.rdd.RDD.iterator(RDD.scala:283)$ > ^Iat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)$ > ^Iat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)$ > ^Iat org.apache.spark.rdd.RDD.iterator(RDD.scala:283)$ > ^Iat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)$ > ^Iat org.apache.spark.scheduler.Task.run(Task.scala:86)$ > ^Iat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)$ > ^Iat > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)$ > ^Iat > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)$ > ^Iat java.lang.Thread.run(Thread.java:745)$ > {code} > suggestion to add some code to avoid NPE > {code} >/** Add Hadoop configuration specific to a single partition and attempt. 
*/ > def addLocalConfiguration(jobTrackerId: String, jobId: Int, splitId: Int, > attemptId: Int, > conf: JobConf) { > val jobID = new JobID(jobTrackerId, jobId) > val taId = new TaskAttemptID(new TaskID(jobID, TaskType.MAP, splitId), > attemptId) > if (conf != null) { > conf.set("mapred.tip.id", taId.getTaskID.toString) > conf.set("mapred.task.id", taId.toString) > conf.setBoolean("mapred.task.is.map", true) > conf.setInt("mapred.task.partition", splitId) > conf.set("mapred.job.id", jobID.toString) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20467) sbt-launch-lib.bash has lacked the ASF header.
liuzhaokun created SPARK-20467: -- Summary: sbt-launch-lib.bash has lacked the ASF header. Key: SPARK-20467 URL: https://issues.apache.org/jira/browse/SPARK-20467 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.1.0 Reporter: liuzhaokun When I used this script, I found that sbt-launch-lib.bash lacks the ASF header, which is not permitted by the Apache Software Foundation under the Apache License 2.0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20468) Refactor the ALS code
Daniel Li created SPARK-20468: - Summary: Refactor the ALS code Key: SPARK-20468 URL: https://issues.apache.org/jira/browse/SPARK-20468 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Daniel Li Priority: Minor The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all in one file. Here are some things I think could improve the clarity and maintainability of the code: * The file can be split into more manageable parts. In particular, {{ALS}}, {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files for better readability. * Certain parts can be encapsulated or moved to clarify the intent. For example: ** The {{ALS.train}} method is currently defined in the {{ALS}} companion object, and it seems to take 12 individual parameters that are all members of the {{ALS}} class. This method can be made an instance method. ** The code that creates in-blocks and out-blocks in the body of {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods in the {{ALS}} companion object, can be moved into a separate case class that holds the blocks. This has the added benefit of allowing us to write specific Scaladoc to explain the logic behind these block objects, as their usage is certainly nontrivial yet is fundamental to the implementation. ** The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be hidden within {{UncompressedInBlock}} to clarify the scope of their usage. ** Various builder classes could be encapsulated in the companion objects of the classes they build. * The code can be formatted more clearly. For example: ** Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be formatted more clearly and have comments added explaining the reasoning behind key parts. That these methods form the core of the ALS logic makes this doubly important for maintainability. 
** Parts of the code that use {{while}} loops with manually incremented counters can be rewritten as {{for}} loops. ** Where non-idiomatic Scala code is used that doesn't improve performance much, clearer code can be substituted. (This in particular should be done very carefully if at all, as it's apparent the original author spent much time and pains optimizing the code to significantly improve its runtime profile.) * The documentation (both Scaladocs and inline comments) can be clarified where needed and expanded where incomplete. This is especially important for parts of the code that are written imperatively for performance, as these parts don't benefit from the intuitive self-documentation of Scala's higher-level language abstractions. Specifically, I'd like to add documentation fully explaining the key functionality of the in-block and out-block objects, their purpose, how they relate to the overall ALS algorithm, and how they are calculated in such a way that new maintainers can ramp up much more quickly. The above is not a complete enumeration of improvements but a high-level survey. All of these improvements will, I believe, add up to make the code easier to understand, extend, and maintain. This issue will track the progress of this refactoring so that going forward, authors will have an easier time maintaining this part of the project. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
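One of the cleanups proposed above, replacing {{while}} loops with manually incremented counters by {{for}} loops, can be illustrated as follows. This is a sketch in Python for brevity (the actual ALS code is Scala); the data is made up and the point is only the loop shape, not ALS itself.

```python
# Sketch (in Python; ALS itself is Scala) of the readability cleanup the
# issue proposes: a while loop with a hand-managed counter vs. the
# equivalent loop over the same range.
ratings = [4.0, 3.5, 5.0]

# Before: manual counter, easy to get the increment or bound wrong.
total = 0.0
i = 0
while i < len(ratings):
    total += ratings[i]
    i += 1

# After: same work, but the loop structure itself carries the intent.
total2 = sum(ratings[i] for i in range(len(ratings)))
```

As the issue notes, such rewrites should be done carefully in hot paths, since the original while loops were chosen for performance.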
[jira] [Commented] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984346#comment-15984346 ] Apache Spark commented on SPARK-20468: -- User 'danielyli' has created a pull request for this issue: https://github.com/apache/spark/pull/17767 > Refactor the ALS code > - > > Key: SPARK-20468 > URL: https://issues.apache.org/jira/browse/SPARK-20468 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Priority: Minor > Labels: documentation, readability, refactoring > > The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is > quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all > in one file. Here are some things I think could improve the clarity and > maintainability of the code: > * The file can be split into more manageable parts. In particular, {{ALS}}, > {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files > for better readability. > * Certain parts can be encapsulated or moved to clarify the intent. For > example: > ** The {{ALS.train}} method is currently defined in the {{ALS}} companion > object, and it seems to take 12 individual parameters that are all members of > the {{ALS}} class. This method can be made an instance method. > ** The code that creates in-blocks and out-blocks in the body of > {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods > in the {{ALS}} companion object, can be moved into a separate case class that > holds the blocks. This has the added benefit of allowing us to write > specific Scaladoc to explain the logic behind these block objects, as their > usage is certainly nontrivial yet is fundamental to the implementation. > ** The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be > hidden within {{UncompressedInBlock}} to clarify the scope of their usage. 
> ** Various builder classes could be encapsulated in the companion objects of > the classes they build. > * The code can be formatted more clearly. For example: > ** Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be > formatted more clearly and have comments added explaining the reasoning > behind key parts. That these methods form the core of the ALS logic makes > this doubly important for maintainability. > ** Parts of the code that use {{while}} loops with manually incremented > counters can be rewritten as {{for}} loops. > ** Where non-idiomatic Scala code is used that doesn't improve performance > much, clearer code can be substituted. (This in particular should be done > very carefully if at all, as it's apparent the original author spent much > time and pains optimizing the code to significantly improve its runtime > profile.) > * The documentation (both Scaladocs and inline comments) can be clarified > where needed and expanded where incomplete. This is especially important for > parts of the code that are written imperatively for performance, as these > parts don't benefit from the intuitive self-documentation of Scala's > higher-level language abstractions. Specifically, I'd like to add > documentation fully explaining the key functionality of the in-block and > out-block objects, their purpose, how they relate to the overall ALS > algorithm, and how they are calculated in such a way that new maintainers can > ramp up much more quickly. > The above is not a complete enumeration of improvements but a high-level > survey. All of these improvements will, I believe, add up to make the code > easier to understand, extend, and maintain. This issue will track the > progress of this refactoring so that going forward, authors will have an > easier time maintaining this part of the project. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984403#comment-15984403 ] Sean Owen commented on SPARK-20470: --- What JSON do you expect? what JSON do you get? There's no relevant detail here. > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. 
I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. 
> I read the json below into a dataframe: > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > Df schema looks good > {code} > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > {code} > Need to convert each row to json now and save to HBase > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Output JSON (Wrong) > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], > [ > 2.2925065264270024, > 814.0, >
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: {code} { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } {code} Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) {code} { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } {code} was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. 
I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. 
> I read the json below into a dataframe: > {code} > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > {code} > Df schema looks good > {code} > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > {code} > Need to convert each row to json now and save to HBase > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Output JSON (Wrong) > {code} > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], >
[jira] [Created] (SPARK-20472) Support for Dynamic Configuration
Shahbaz Hussain created SPARK-20472: --- Summary: Support for Dynamic Configuration Key: SPARK-20472 URL: https://issues.apache.org/jira/browse/SPARK-20472 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.1.0 Reporter: Shahbaz Hussain Currently, Spark configuration cannot be changed dynamically. It requires the Spark job to be killed and started again for a new configuration to take effect. This issue is to enhance Spark so that configuration changes can be applied dynamically without requiring an application restart. Ex: if the batch interval in a streaming job is 20 seconds and the user wants to reduce it to 5 seconds, this currently requires a re-submit of the job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.
[ https://issues.apache.org/jira/browse/SPARK-20471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984451#comment-15984451 ] Apache Spark commented on SPARK-20471: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/17771 > Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled > but vectorized hashmap is enabled. > - > > Key: SPARK-20471 > URL: https://issues.apache.org/jira/browse/SPARK-20471 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: caoxuewen > > Remove the AggregateBenchmark testsuite warning, > such as '14:26:33.220 WARN > org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap > is disabled but vectorized hashmap is enabled.' > unit tests: AggregateBenchmark > Change the 'ignore' functions to 'test' functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984533#comment-15984533 ] Nick Pentreath commented on SPARK-11968: Thanks - in the meantime I will take a look at the PR. > ALS recommend all methods spend most of time in GC > -- > > Key: SPARK-11968 > URL: https://issues.apache.org/jira/browse/SPARK-11968 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.2, 1.6.0 >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath > > After adding recommendUsersForProducts and recommendProductsForUsers to ALS > in spark-perf, I noticed that it takes much longer than ALS itself. Looking > at the monitoring page, I can see it is spending about 8min doing GC for each > 10min task. That sounds fixable. Looking at the implementation, there is > clearly an opportunity to avoid extra allocations: > [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283] > CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984326#comment-15984326 ] Christian Reiniger commented on SPARK-20081: Thank you. All (potential) labels *are* in fact known in advance, so constructing a StringIndexerModel directly proved to be an excellent (and performant) solution. > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RandomForestClassifier#fit() > (from Java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. 
However, RandomForestClassifier#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My Scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20467) sbt-launch-lib.bash has lacked the ASF header.
[ https://issues.apache.org/jira/browse/SPARK-20467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20467. --- Resolution: Not A Problem This file should not have an ASF header. > sbt-launch-lib.bash has lacked the ASF header. > --- > > Key: SPARK-20467 > URL: https://issues.apache.org/jira/browse/SPARK-20467 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: liuzhaokun > > When I use this script, I found that sbt-launch-lib.bash lacks the ASF header. > That is not permitted by the Apache Foundation according to the Apache License 2.0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation
[ https://issues.apache.org/jira/browse/SPARK-20400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20400: - Assignee: Bill Chambers > Remove References to Third Party Vendors from Spark ASF Documentation > - > > Key: SPARK-20400 > URL: https://issues.apache.org/jira/browse/SPARK-20400 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Bill Chambers >Assignee: Bill Chambers >Priority: Trivial > Fix For: 2.2.0 > > > Similar to SPARK-17445, vendors should probably not be referenced on the ASF > documentation. > Related: > https://github.com/apache/spark/commit/dc0a4c916151c795dc41b5714e9d23b4937f4636 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation
[ https://issues.apache.org/jira/browse/SPARK-20400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20400. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17695 [https://github.com/apache/spark/pull/17695] > Remove References to Third Party Vendors from Spark ASF Documentation > - > > Key: SPARK-20400 > URL: https://issues.apache.org/jira/browse/SPARK-20400 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Bill Chambers >Priority: Trivial > Fix For: 2.2.0 > > > Similar to SPARK-17445, vendors should probably not be referenced on the ASF > documentation. > Related: > https://github.com/apache/spark/commit/dc0a4c916151c795dc41b5714e9d23b4937f4636 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: {code} { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } {code} Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase {code} rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( {code} Output JSON (Wrong) {code} { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } {code} was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. 
I read the json below into a dataframe: {code} { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } {code} Df schema looks good {code} root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) {code} Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) {code} { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } {code} > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. 
> I read the json below into a dataframe: > {code} > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > {code} > Df schema looks good > {code} > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > {code} > Need to convert each row to json now and save to HBase > {code} > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > {code} > Output JSON (Wrong) > {code} > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ >
[jira] [Commented] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984428#comment-15984428 ] Philip Adetiloye commented on SPARK-20470: -- [~srowen] I'm sorry, I added an example. > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. > I read the json below into a dataframe: > {code} > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > {code} > Df schema looks good > {code} > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > {code} > Need to convert each row to json now and save to HBase > {code} > rdd1 = rdd.map(lambda row: Row(x=json.dumps(row.asDict()))) > {code} > Output JSON (Wrong) > {code} > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], > [ > 2.2925065264270024, > 814.0, > 0.15644850561960366 > ], > [ > 2.448955032046606, > 690.0, > 0.1564485056196041 > ] > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
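The root cause reported above can be reproduced without Spark: PySpark's Row is a tuple subclass, so a shallow row.asDict() leaves nested rows as tuples, which json.dumps renders as JSON arrays with the field names dropped. Below is a pure-Python sketch of the mechanism, using namedtuple as a stand-in for Row; the recursive helper mirrors what Row.asDict(recursive=True) does in later PySpark versions (the helper name to_dict is invented for this sketch).

```python
import json
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which is also a tuple subclass.
Bucket = namedtuple("Bucket", ["start", "y", "width"])

row = {"feature": "feature_id_001",
       "histogram": [Bucket(1.9796095151877942, 968.0, 0.1564485056196041)]}

# Shallow serialization: the nested Bucket is still a tuple, so json.dumps
# emits a bare JSON array and the field names are lost (the reported bug).
shallow = json.dumps(row)

def to_dict(obj):
    """Recursively convert namedtuples to dicts before serializing."""
    if hasattr(obj, "_asdict"):
        return {k: to_dict(v) for k, v in obj._asdict().items()}
    if isinstance(obj, dict):
        return {k: to_dict(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_dict(v) for v in obj]
    return obj

deep = json.dumps(to_dict(row))
print(shallow)  # histogram entries appear as nameless arrays
print(deep)     # histogram entries keep their "start"/"y"/"width" keys
```

In real PySpark code the equivalent one-liner would be json.dumps(row.asDict(recursive=True)), or simply df.toJSON(), which serializes nested structs correctly.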
[jira] [Created] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.
caoxuewen created SPARK-20471: - Summary: Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled. Key: SPARK-20471 URL: https://issues.apache.org/jira/browse/SPARK-20471 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.1.0 Reporter: caoxuewen Remove the AggregateBenchmark testsuite warning, such as '14:26:33.220 WARN org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap is disabled but vectorized hashmap is enabled.' unit tests: AggregateBenchmark Change the 'ignore' functions to 'test' functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20472) Support for Dynamic Configuration
[ https://issues.apache.org/jira/browse/SPARK-20472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984461#comment-15984461 ] Sean Owen commented on SPARK-20472: --- I don't think this is generally possible, because some config is global, it is needed and has effect at startup, and can't be changed even if you wanted (think: JVM heap size). I doubt this is achievable. You will probably have to narrow this down much further. > Support for Dynamic Configuration > - > > Key: SPARK-20472 > URL: https://issues.apache.org/jira/browse/SPARK-20472 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.1.0 >Reporter: Shahbaz Hussain > > Currently, Spark configuration cannot be changed dynamically. > It requires the Spark job to be killed and started again for a new configuration to > take effect. > This issue is to enhance Spark so that configuration changes can be > applied dynamically without requiring an application restart. > Ex: if the batch interval in a streaming job is 20 seconds and the user wants to > reduce it to 5 seconds, this currently requires a re-submit of the job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20473) ColumnVector.Array is missing accessors for some types
Michal Szafranski created SPARK-20473: - Summary: ColumnVector.Array is missing accessors for some types Key: SPARK-20473 URL: https://issues.apache.org/jira/browse/SPARK-20473 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Michal Szafranski ColumnVector implementations originally did not support some Catalyst types (float, short, and boolean). Now that they do, those types should be also added to the ColumnVector.Array. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984540#comment-15984540 ] Calin Cocan commented on SPARK-12965: - I have encountered the same problem using the StringIndexer and VectorAssembler components. In my opinion this particular issue can be fixed directly in these ML classes. Replacing, in StringIndexer's fit method, val counts = dataset.select(col($(inputCol)).cast(StringType)) with val counts = dataset.select(col(s"`${$(inputCol)}`").cast(StringType)) should do the trick. Also, a change must be made as well in StringIndexerModel's transform method. The call dataset.where(filterer(dataset($(inputCol)))) must be replaced with dataset.where(filterer(dataset(s"`${$(inputCol)}`"))) BTW a similar problem can be encountered in VectorAssembler's transform method at this line (105 in Spark 2.1): case _: NumericType | BooleanType => dataset(c).cast(DoubleType).as(s"${c}_double_$uid") Changing dataset(c) to use the backquoted column name should fix the problem: case _: NumericType | BooleanType => dataset(s"`${c}`").cast(DoubleType).as(s"${c}_double_$uid") > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > --- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., Given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, > {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting > and transforming seem to have no effect. 
Running the following code produces: > {noformat} > +---+---++ > |a.b|a_b|a_bIndex| > +---+---++ > |foo|foo| 0.0| > |bar|bar| 1.0| > +---+---++ > {noformat} > but I think it should have another column, {{abIndex}} with the same contents > as a_bIndex. > {code} > public class SparkMLDotColumn { > public static void main(String[] args) { > // Get the contexts > SparkConf conf = new SparkConf() > .setMaster("local[*]") > .setAppName("test") > .set("spark.ui.enabled", "false"); > JavaSparkContext sparkContext = new JavaSparkContext(conf); > SQLContext sqlContext = new SQLContext(sparkContext); > > // Create a schema with a single string column named "a.b" > StructType schema = new StructType(new StructField[] { > DataTypes.createStructField("a.b", > DataTypes.StringType, false) > }); > // Create an empty RDD and DataFrame > List rows = Arrays.asList(RowFactory.create("foo"), > RowFactory.create("bar")); > JavaRDD rdd = sparkContext.parallelize(rows); > DataFrame df = sqlContext.createDataFrame(rdd, schema); > > df = df.withColumn("a_b", df.col("`a.b`")); > > StringIndexer indexer0 = new StringIndexer(); > indexer0.setInputCol("a_b"); > indexer0.setOutputCol("a_bIndex"); > df = indexer0.fit(df).transform(df); > > StringIndexer indexer1 = new StringIndexer(); > indexer1.setInputCol("`a.b`"); > indexer1.setOutputCol("abIndex"); > df = indexer1.fit(df).transform(df); > > df.show(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
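The fix proposed in the comment above comes down to backtick-quoting the raw column name before handing it to the resolver. As a reference for the quoting rule itself, here is a small, hypothetical pure-Python helper (the function name is invented) mirroring Spark's identifier-quoting convention: plain identifiers are left alone, anything else is wrapped in backticks, and any embedded backtick is doubled.

```python
import re

def quote_if_needed(name: str) -> str:
    """Backtick-quote a column name the way Spark SQL expects.

    Plain identifiers (letters, digits, underscores) pass through
    unchanged; names containing dots or other special characters are
    wrapped in backticks, with embedded backticks doubled.
    """
    if re.fullmatch(r"[A-Za-z0-9_]+", name):
        return name
    return "`" + name.replace("`", "``") + "`"

print(quote_if_needed("a.b"))   # -> `a.b`
print(quote_if_needed("a_b"))   # -> a_b (unchanged)
```

With such a helper, a component like StringIndexer could resolve dataset(quote_if_needed(input_col)) instead of dataset(input_col), which is the shape of the fix sketched in the comment.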
[jira] [Commented] (SPARK-20473) ColumnVector.Array is missing accessors for some types
[ https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984576#comment-15984576 ] Apache Spark commented on SPARK-20473: -- User 'michal-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/17772 > ColumnVector.Array is missing accessors for some types > -- > > Key: SPARK-20473 > URL: https://issues.apache.org/jira/browse/SPARK-20473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski > > ColumnVector implementations originally did not support some Catalyst types > (float, short, and boolean). Now that they do, those types should be also > added to the ColumnVector.Array. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20475) Whether use "broadcast join" depends on hive configuration
Lijia Liu created SPARK-20475: - Summary: Whether use "broadcast join" depends on hive configuration Key: SPARK-20475 URL: https://issues.apache.org/jira/browse/SPARK-20475 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Lijia Liu Currently, broadcast join in Spark only works when: 1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 0 (default is 10485760). 2. The size of one of the Hive tables is less than "spark.sql.autoBroadcastJoinThreshold". To get the size information of the Hive table from the Hive metastore, "hive.stats.autogather" should be set to true in Hive or the command "ANALYZE TABLE COMPUTE STATISTICS noscan" must have been run. But Hive calculates the size of the file or directory corresponding to the Hive table to determine whether to use the map-side join, and does not depend on the Hive metastore. This leads to two problems: 1. Spark will not use "broadcast join" when the Hive parameter "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE COMPUTE STATISTICS noscan" has not been run, because the information about the Hive table has not been saved in the Hive metastore. The mode of work in Spark depends on the configuration of Hive. 2. For some reason, we set "hive.stats.autogather" to false in our Hive. For the same SQL, Hive is 4 times faster than Spark because Hive used "map side join" but Spark did not use "broadcast join". Is it possible to use the same mechanism as Hive's to look up the size of a Hive table in Spark? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
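The discrepancy the report describes is two different size estimators feeding the same threshold check: Spark consults precomputed metastore statistics (which may be absent), while Hive sums the actual file sizes. A hypothetical pure-Python sketch of the two decision rules (function names invented for illustration; 10485760 is the default threshold quoted in the report):

```python
DEFAULT_THRESHOLD = 10485760  # default spark.sql.autoBroadcastJoinThreshold

def spark_will_broadcast(metastore_size, threshold=DEFAULT_THRESHOLD):
    """Spark-style check: relies on table stats stored in the Hive
    metastore; if stats were never gathered, metastore_size is None
    and no broadcast join is chosen."""
    return (threshold > 0
            and metastore_size is not None
            and metastore_size <= threshold)

def hive_will_mapjoin(file_sizes, threshold=DEFAULT_THRESHOLD):
    """Hive-style check: sums the actual file sizes on disk, so it
    needs no precomputed statistics."""
    return sum(file_sizes) <= threshold

# A small table whose stats were never gathered: Hive map-joins it,
# while Spark falls back to a shuffle join.
files = [4 * 1024 * 1024]          # one 4 MB file
print(hive_will_mapjoin(files))    # True
print(spark_will_broadcast(None))  # False: no metastore stats available
```

This is why, with hive.stats.autogather disabled, the same SQL can run much faster in Hive than in Spark on identical data.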
[jira] [Created] (SPARK-20474) OnHeapColumnVector realocation may not copy existing data
Michal Szafranski created SPARK-20474: - Summary: OnHeapColumnVector realocation may not copy existing data Key: SPARK-20474 URL: https://issues.apache.org/jira/browse/SPARK-20474 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Michal Szafranski OnHeapColumnVector reallocation copies data to the new storage only up to 'elementsAppended'. This variable is only updated when using the ColumnVector.appendX API, while ColumnVector.putX is more commonly used. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20474) OnHeapColumnVector realocation may not copy existing data
[ https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984577#comment-15984577 ] Apache Spark commented on SPARK-20474: -- User 'michal-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/17773 > OnHeapColumnVector realocation may not copy existing data > - > > Key: SPARK-20474 > URL: https://issues.apache.org/jira/browse/SPARK-20474 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski > > OnHeapColumnVector reallocation copies to the new storage data up to > 'elementsAppended'. This variable is only updated when using the > ColumnVector.appendX API, while ColumnVector.putX is more commonly used. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
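The failure mode described in this issue can be illustrated without Spark. Below is a hypothetical pure-Python toy (class and method names invented) showing why copying only the first elementsAppended slots during reallocation silently loses values written through a put-style API that never bumps the counter:

```python
class ToyColumnVector:
    """Toy model of a growable column vector with append/put APIs."""

    def __init__(self, capacity=4):
        self.data = [0] * capacity
        self.elements_appended = 0  # only the append path updates this

    def append(self, value):
        self.reserve(self.elements_appended + 1)
        self.data[self.elements_appended] = value
        self.elements_appended += 1

    def put(self, index, value):
        # Mirrors ColumnVector.putX: writes in place, counter untouched.
        self.data[index] = value

    def reserve(self, needed):
        if needed > len(self.data):
            new_data = [0] * (2 * needed)
            # The bug: only the appended prefix survives reallocation.
            new_data[:self.elements_appended] = self.data[:self.elements_appended]
            self.data = new_data

v = ToyColumnVector(capacity=4)
v.put(3, 42)        # written via put(); elements_appended stays 0
v.reserve(8)        # reallocation copies 0 elements
print(v.data[3])    # -> 0, the value 42 was lost

w = ToyColumnVector(capacity=4)
for x in (1, 2, 3, 4, 5):   # fifth append triggers reallocation
    w.append(x)
print(w.data[:5])           # -> [1, 2, 3, 4, 5], appended data survives
```

The fix direction implied by the report is for reallocation to copy up to the vector's capacity (or highest written index) rather than only up to elements_appended.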
[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types
[ https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20473: Assignee: Apache Spark > ColumnVector.Array is missing accessors for some types > -- > > Key: SPARK-20473 > URL: https://issues.apache.org/jira/browse/SPARK-20473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski >Assignee: Apache Spark > > ColumnVector implementations originally did not support some Catalyst types > (float, short, and boolean). Now that they do, those types should be also > added to the ColumnVector.Array. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types
[ https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20473: Assignee: (was: Apache Spark) > ColumnVector.Array is missing accessors for some types > -- > > Key: SPARK-20473 > URL: https://issues.apache.org/jira/browse/SPARK-20473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski > > ColumnVector implementations originally did not support some Catalyst types > (float, short, and boolean). Now that they do, those types should be also > added to the ColumnVector.Array. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types
[ https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20473: Assignee: Apache Spark > ColumnVector.Array is missing accessors for some types > -- > > Key: SPARK-20473 > URL: https://issues.apache.org/jira/browse/SPARK-20473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski >Assignee: Apache Spark > > ColumnVector implementations originally did not support some Catalyst types > (float, short, and boolean). Now that they do, those types should be also > added to the ColumnVector.Array. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20474) OnHeapColumnVector realocation may not copy existing data
[ https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20474: Assignee: (was: Apache Spark) > OnHeapColumnVector realocation may not copy existing data > - > > Key: SPARK-20474 > URL: https://issues.apache.org/jira/browse/SPARK-20474 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski > > OnHeapColumnVector reallocation copies to the new storage data up to > 'elementsAppended'. This variable is only updated when using the > ColumnVector.appendX API, while ColumnVector.putX is more commonly used. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types
[ https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20473: Assignee: (was: Apache Spark) > ColumnVector.Array is missing accessors for some types > -- > > Key: SPARK-20473 > URL: https://issues.apache.org/jira/browse/SPARK-20473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski > > ColumnVector implementations originally did not support some Catalyst types > (float, short, and boolean). Now that they do, those types should be also > added to the ColumnVector.Array. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20474) OnHeapColumnVector realocation may not copy existing data
[ https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20474: Assignee: Apache Spark > OnHeapColumnVector realocation may not copy existing data > - > > Key: SPARK-20474 > URL: https://issues.apache.org/jira/browse/SPARK-20474 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski >Assignee: Apache Spark > > OnHeapColumnVector reallocation copies to the new storage data up to > 'elementsAppended'. This variable is only updated when using the > ColumnVector.appendX API, while ColumnVector.putX is more commonly used. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776 ] Sebastian Arzt commented on SPARK-18371: I dug deep into it and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution in which rounding is omitted altogether. This has the nice side effect that the backpressure is more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon. > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. 
This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776 ] Sebastian Arzt edited comment on SPARK-18371 at 4/26/17 1:24 PM: - I dug deep into it and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution in which rounding is omitted altogether. This has the nice side effect that backpressure is more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon. was (Author: seb.arzt): I deep dived into it and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. 
However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution, where rounding is omitted at all. I has the nice side-effect that the backpressure is more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon. > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
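The rounding-to-zero failure mode described in the comment above can be sketched with a hypothetical model (illustrative names and arithmetic only, not the actual DirectKafkaInputDStream code): a small total backpressure rate split across many partitions and truncated to a long can zero every per-partition rate, which is then indistinguishable from "no lag".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the degenerate case: splitting totalRate across
// partitions proportionally to lag, then truncating to long, can round every
// per-partition rate down to zero.
class BackpressureModel {
    // Truncating per-partition rates, analogous to the rounding being criticized.
    static Map<Integer, Long> roundedRates(Map<Integer, Long> lag, double totalRate) {
        long totalLag = lag.values().stream().mapToLong(Long::longValue).sum();
        Map<Integer, Long> rates = new HashMap<>();
        lag.forEach((p, l) -> rates.put(p, (long) (totalRate * l / totalLag)));
        return rates;
    }

    // The problematic check: "all rates are zero" looks exactly like "no lag".
    static boolean looksLikeNoLag(Map<Integer, Long> rates) {
        return rates.values().stream().allMatch(r -> r == 0L);
    }
}
```

For example, a total rate of 5 records/s spread over 10 equally lagged partitions truncates to 0 per partition, tripping the "no lag" branch; keeping the rates as doubles (omitting the rounding, as the comment proposes) avoids the degenerate case.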
[jira] [Resolved] (SPARK-19812) YARN shuffle service fails to relocate recovery DB across NFS directories
[ https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-19812. --- Resolution: Fixed Fix Version/s: 2.3.0 2.2.0 > YARN shuffle service fails to relocate recovery DB across NFS directories > - > > Key: SPARK-19812 > URL: https://issues.apache.org/jira/browse/SPARK-19812 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.1 >Reporter: Thomas Graves >Assignee: Thomas Graves > Fix For: 2.2.0, 2.3.0 > > > The yarn shuffle service tries to switch from the yarn local directories to > the real recovery directory but can fail to move the existing recovery db's. > It fails because Files.move cannot move directories that have contents. > 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move > recovery file sparkShuffleRecovery.ldb to the path > /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle > java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb > at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498) > at > sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) > at java.nio.file.Files.move(Files.java:1395) > at > org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369) > at > org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200) > at > org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262) > at > 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684) > This used to use f.renameTo; we switched it in the PR due to review > comments, and it looks like we didn't do a final real test. The tests are using > files rather than directories, so they didn't catch it. We need to fix the tests > as well. > history: > https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
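A minimal sketch of the fix direction (an illustration under stated assumptions, not the actual Spark patch): java.nio.file.Files.move throws DirectoryNotEmptyException when the source is a non-empty directory and the move cannot be done as a plain rename (e.g. across NFS mount points), so a recursive copy-then-delete fallback is one way to handle it. The class and method names here are hypothetical.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper: move a recovery DB directory, falling back to a
// recursive copy-then-delete when a plain rename is not possible.
final class MoveDir {
    static void moveRecovery(Path src, Path dst) throws IOException {
        try {
            Files.move(src, dst);
        } catch (DirectoryNotEmptyException | AtomicMoveNotSupportedException e) {
            // Copy the tree: Files.walk yields parents before children, so
            // directories are created before their contents are copied.
            try (Stream<Path> paths = Files.walk(src)) {
                for (Path p : (Iterable<Path>) paths::iterator) {
                    Files.copy(p, dst.resolve(src.relativize(p)),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
            // Delete the source tree, children before parents.
            try (Stream<Path> paths = Files.walk(src)) {
                paths.sorted(Comparator.reverseOrder())
                     .forEach(p -> p.toFile().delete());
            }
        }
    }
}
```

Plain File.renameTo (the earlier approach mentioned above) fails silently across filesystems, which is why the NIO API plus an explicit fallback is the safer shape; testing it against a directory with contents, not just a file, is what the reporter says the original tests missed.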
[jira] [Comment Edited] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776 ] Sebastian Arzt edited comment on SPARK-18371 at 4/26/17 1:23 PM: - I deep dived into it and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution, where rounding is omitted at all. I has the nice side-effect that the backpressure is more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon. was (Author: seb.arzt): I deep dived into it and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. 
However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution, where rounding is omitted at all. I has the nice side-effect that the backpressure in more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon. > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API
[ https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-20391: Assignee: Saisai Shao > Properly rename the memory related fields in ExecutorSummary REST API > - > > Key: SPARK-20391 > URL: https://issues.apache.org/jira/browse/SPARK-20391 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Blocker > Fix For: 2.2.0 > > > Currently in Spark we can get the executor summary through the REST API > {{/api/v1/applications//executors}}. The format of the executor summary > is: > {code} > class ExecutorSummary private[spark]( > val id: String, > val hostPort: String, > val isActive: Boolean, > val rddBlocks: Int, > val memoryUsed: Long, > val diskUsed: Long, > val totalCores: Int, > val maxTasks: Int, > val activeTasks: Int, > val failedTasks: Int, > val completedTasks: Int, > val totalTasks: Int, > val totalDuration: Long, > val totalGCTime: Long, > val totalInputBytes: Long, > val totalShuffleRead: Long, > val totalShuffleWrite: Long, > val isBlacklisted: Boolean, > val maxMemory: Long, > val executorLogs: Map[String, String], > val onHeapMemoryUsed: Option[Long], > val offHeapMemoryUsed: Option[Long], > val maxOnHeapMemory: Option[Long], > val maxOffHeapMemory: Option[Long]) > {code} > Here are 6 memory-related fields: {{memoryUsed}}, {{maxMemory}}, > {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}}, > {{maxOffHeapMemory}}. > All 6 of these fields reflect the *storage* memory usage in Spark, but from the > names of these 6 fields, a user doesn't really know whether they refer to *storage* > memory or the total memory (storage memory + execution memory). This is > misleading. > So I think we should properly rename these fields to reflect their real > meanings, or at least document it. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984771#comment-15984771 ] Steve Loughran commented on SPARK-7481: --- I think we ended up going in circles on that PR. Sean has actually been very tolerant of me; however, it's been hampered by my full-time focus on other things. I've only had time to work on the Spark PR intermittently, and that's been hard for everyone: me in the rebase/retest, the one reviewer in having to catch up again. Now, anyone who does manage to get that CP right will discover that S3A absolutely flies with Spark: in partitioning (list file improvements), data input (set fadvise=true for ORC and Parquet), and output (set fast.output=true, play with the pool options). It delivers that performance because this patch sets things up for the integration tests downstream of it, so I and others can be confident that things actually work, at speed, at scale. Indeed, much of the S3A performance work was actually based on Hive and Spark workloads: the data formats & their seek patterns, directory layouts, file generation. All that's left is the little problem of getting the classpath right. Oh, and the committer. For now, for people's enjoyment, here are some videos from Spark Summit East on the topic: * [Spark and object stores|https://youtu.be/8F2Jqw5_OnI]. 
* [Robust and Scalable etl over Cloud Storage With Spark|https://spark-summit.org/east-2017/events/robust-and-scalable-etl-over-cloud-storage-with-spark/] > Add spark-hadoop-cloud module to pull in object store support > - > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Steve Loughran > > To keep the s3n classpath right, and to add s3a, swift & azure, the dependencies > of Spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API
[ https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-20391. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17700 [https://github.com/apache/spark/pull/17700] > Properly rename the memory related fields in ExecutorSummary REST API > - > > Key: SPARK-20391 > URL: https://issues.apache.org/jira/browse/SPARK-20391 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Blocker > Fix For: 2.2.0 > > > Currently in Spark we can get the executor summary through the REST API > {{/api/v1/applications//executors}}. The format of the executor summary > is: > {code} > class ExecutorSummary private[spark]( > val id: String, > val hostPort: String, > val isActive: Boolean, > val rddBlocks: Int, > val memoryUsed: Long, > val diskUsed: Long, > val totalCores: Int, > val maxTasks: Int, > val activeTasks: Int, > val failedTasks: Int, > val completedTasks: Int, > val totalTasks: Int, > val totalDuration: Long, > val totalGCTime: Long, > val totalInputBytes: Long, > val totalShuffleRead: Long, > val totalShuffleWrite: Long, > val isBlacklisted: Boolean, > val maxMemory: Long, > val executorLogs: Map[String, String], > val onHeapMemoryUsed: Option[Long], > val offHeapMemoryUsed: Option[Long], > val maxOnHeapMemory: Option[Long], > val maxOffHeapMemory: Option[Long]) > {code} > Here are 6 memory-related fields: {{memoryUsed}}, {{maxMemory}}, > {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}}, > {{maxOffHeapMemory}}. > All 6 of these fields reflect the *storage* memory usage in Spark, but from the > names of these 6 fields, a user doesn't really know whether they refer to *storage* > memory or the total memory (storage memory + execution memory). This is > misleading. > So I think we should properly rename these fields to reflect their real > meanings, or at least document it. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984962#comment-15984962 ] Apache Spark commented on SPARK-18371: -- User 'arzt' has created a pull request for this issue: https://github.com/apache/spark/pull/17774 > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Arzt updated SPARK-18371: --- Comment: was deleted (was: After fix) > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Arzt updated SPARK-18371: --- Comment: was deleted (was: Before fix) > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-20476: -- Description: I encountered this problem when I wanted to create a table as select get_json_object from xxx. This fails: {code} create table spark_json_object as select get_json_object(deliver_geojson,'$.') from dw.dw_prd_order where dt='2017-04-24' limit 10; {code} This is OK: {code} create table spark_json_object as select * from dw.dw_prd_order where dt='2017-04-24' limit 10; {code} This is also OK: {code} select get_json_object(deliver_geojson,'$.') from dw.dw_prd_order where dt='2017-04-24' limit 10; {code} {code} 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements! org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements! 
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146) at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53) at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276) at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298) at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248) at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72) at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48) at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:179) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
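The "columns has 2 elements while columns.types has 1 elements!" failure is consistent with the unaliased CTAS output column getting an auto-generated name that itself contains a comma, which the Hive SerDe then splits into two column names. A minimal sketch of that reading (the generated name below is an assumption; the exact name Spark produces may differ):

```python
# Hive stores column names in a comma-separated "columns" table property.
# If the auto-generated name for the unaliased expression contains a comma
# (hypothetical form below), the SerDe sees more names than types.
columns_prop = "get_json_object(deliver_geojson, $.)"  # hypothetical generated name
column_types_prop = "string"                           # one real output column

names = columns_prop.split(",")
types = column_types_prop.split(",")
print(f"columns has {len(names)} elements while columns.types has {len(types)} elements!")
# -> columns has 2 elements while columns.types has 1 elements!
```

Aliasing the expression yields a single comma-free name, which then matches the one entry in `columns.types`.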
[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Arzt updated SPARK-18371: --- Attachment: 01.png Before fix > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Arzt updated SPARK-18371: --- Attachment: 02.png After fix
[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984975#comment-15984975 ] Sebastian Arzt commented on SPARK-18371: Screenshots: [before|https://issues.apache.org/jira/secure/attachment/12865156/01.png] [after|https://issues.apache.org/jira/secure/attachment/12865158/02.png]
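For context on what backpressure is supposed to do here: Spark Streaming's rate controller is PID-based (`PIDRateEstimator`). A simplified Python sketch of that control loop, with constants and structure assumed rather than copied from Spark, shows why the computed ingestion rate should shrink once processing time and scheduling delay grow; the bug reported above is that a giant batch can still be generated before any clamped rate takes effect:

```python
# Simplified PID-style rate update in the spirit of Spark Streaming's
# PIDRateEstimator (gains and structure are illustrative, not Spark's code).
def new_rate(latest_rate, scheduling_delay_ms, processing_delay_ms,
             num_elements, batch_interval_ms,
             proportional=1.0, integral=0.2, min_rate=100.0):
    # Rate actually achieved during the last batch (elements/sec).
    processing_rate = num_elements / (processing_delay_ms / 1000.0)
    # Error: how far the current rate exceeds what was processed.
    error = latest_rate - processing_rate
    # Accumulated error: elements backlogged due to scheduling delay.
    historical_error = scheduling_delay_ms * processing_rate / batch_interval_ms
    updated = latest_rate - proportional * error - integral * historical_error
    return max(updated, min_rate)

# A batch that kept up leaves the rate unchanged:
print(new_rate(1000.0, 0.0, 1000.0, 1000, 1000.0))    # -> 1000.0
# A slow batch plus scheduling delay pushes the rate down:
print(new_rate(1000.0, 2000.0, 1000.0, 500, 1000.0))  # -> 300.0
```

The point of the fix discussed in this issue is to ensure the batch generated while the estimator is catching up is bounded as well, instead of absorbing the whole backlog at once.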
[jira] [Comment Edited] (SPARK-17403) Fatal Error: Scan cached strings
[ https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982595#comment-15982595 ] Paul Lysak edited comment on SPARK-17403 at 4/26/17 3:23 PM: - It looks like we have the same issue with Spark 2.1 on YARN (Amazon EMR release emr-5.4.0). A workaround that solves the issue for us (at the cost of some performance) is to use df.persist(StorageLevel.DISK_ONLY) instead of df.cache(). Depending on the node types, memory settings, storage level, and some other factors I couldn't clearly identify, it may appear as {noformat} User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 158 in stage 504.0 failed 4 times, most recent failure: Lost task 158.3 in stage 504.0 (TID 427365, ip-10-35-162-171.ec2.internal, executor 83): java.lang.NegativeArraySizeException at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) at org.apache.spark.unsafe.types.UTF8String.clone(UTF8String.java:826) at org.apache.spark.sql.execution.columnar.StringColumnStats.gatherStats(ColumnStats.scala:217) at org.apache.spark.sql.execution.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:55) at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.org$apache$spark$sql$execution$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:97) at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78) at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:97) at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:122) at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) {noformat} or as {noformat} User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 27 in stage 61.0 failed 4 times, most recent failure: Lost task 27.3 in stage 61.0 (TID 36167, ip-10-35-162-149.ec2.internal, executor 1): java.lang.OutOfMemoryError: Java heap space at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_38$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:107) at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948) at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) {noformat} or as {noformat} 2017-04-24 19:02:45,951 ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main] java.lang.OutOfMemoryError: Requested array size exceeds VM limit at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_37$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:217) at
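The NegativeArraySizeException at UTF8String.getBytes is the classic signature of a 32-bit size computation going negative. As an illustration only (this is not the actual Spark arithmetic), here is how two lengths that each fit in a Java int can wrap negative when combined:

```python
def jint(x):
    """Wrap a Python int to Java's 32-bit signed int semantics."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

# Two buffer sizes that individually fit in an int but overflow when summed,
# as can happen while sizing storage for many long cached strings.
a, b = 1_500_000_000, 1_000_000_000
length = jint(a + b)
print(length)  # -1794967296; Java would throw NegativeArraySizeException on `new byte[length]`
```

This matches the observed behavior that shrinking the cached data (fewer string columns, or DISK_ONLY persistence, which avoids the in-memory columnar builders) makes the error disappear.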
[jira] [Assigned] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18371: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18371: Assignee: Apache Spark
[jira] [Commented] (SPARK-17403) Fatal Error: Scan cached strings
[ https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984987#comment-15984987 ] Paul Lysak commented on SPARK-17403: Hope that helps - finally managed to reproduce it without using production data:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val leftRows = spark.sparkContext.parallelize(numSlices = 3000,
    seq = for (k <- 1 to 1000; j <- 1 to 1000) yield Row(k, j))
  .flatMap(r => (1 to 1000).map(l => Row.fromSeq(r.toSeq :+ l)))
val leftDf = spark.createDataFrame(leftRows, StructType(Seq(
    StructField("k", IntegerType),
    StructField("j", IntegerType),
    StructField("l", IntegerType)
  )))
  .withColumn("combinedKey", expr("k*100 + j*1000 + l"))
  .withColumn("fixedCol", lit("sampleVal"))
  .withColumn("combKeyStr", format_number(col("combinedKey"), 0))
  .withColumn("k100", expr("k*100"))
  .withColumn("j100", expr("j*100"))
  .withColumn("l100", expr("l*100"))
  .withColumn("k_200", expr("k+200"))
  .withColumn("j_200", expr("j+200"))
  .withColumn("l_200", expr("l+200"))
  .withColumn("strCol1_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol1_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol1_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  .withColumn("strCol2_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol2_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol2_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  .withColumn("strCol3_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol3_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol3_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
// if further columns are commented out, the error disappears

leftDf.cache()
println("= leftDf count:" + leftDf.count())
leftDf.show(10)

val rightRows = spark.sparkContext.parallelize((1 to 800).map(i => Row(i, "k_" + i, "sampleVal")))
val rightDf = spark.createDataFrame(rightRows, StructType(Seq(
  StructField("k", IntegerType),
  StructField("kStr", StringType),
  StructField("sampleCol", StringType)
)))
rightDf.cache()
println("= rightDf count:" + rightDf.count())
rightDf.show(10)

val joinedDf = leftDf.join(broadcast(rightDf), usingColumns = Seq("k"), joinType = "left")
joinedDf.cache()
println("= joinedDf count:" + joinedDf.count())
joinedDf.show(10)
{code}
ApplicationMaster fails with this exception:
{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 949 in stage 8.0 failed 4 times, most recent failure: Lost task 949.3 in stage 8.0 (TID 4922, ip-10-35-162-219.ec2.internal, executor 139): java.lang.NegativeArraySizeException at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) at org.apache.spark.unsafe.types.UTF8String.clone(UTF8String.java:826) at org.apache.spark.sql.execution.columnar.StringColumnStats.gatherStats(ColumnStats.scala:216) at org.apache.spark.sql.execution.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:55) at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.org$apache$spark$sql$execution$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:97) at 
org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78) at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:97) at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:122) at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948) at
[jira] [Created] (SPARK-20476) Exception between "create table as" and "get_json_object"
cen yuhai created SPARK-20476: - Summary: Exception between "create table as" and "get_json_object" Key: SPARK-20476 URL: https://issues.apache.org/jira/browse/SPARK-20476 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: cen yuhai {code} create table spark_json_object as select get_json_object(deliver_geojson,'') from dw.dw_prd_restaurant where dt='2017-04-24' limit 10; {code} {code} 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements! org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements! at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146) at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53) at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276) at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298) at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248) at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72) at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48) at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:179) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593) at 
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at
[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985175#comment-15985175 ] Xiao Li commented on SPARK-20476: - You can bypass it by {noformat} create table spark_json_object as select get_json_object(deliver_geojson,'$.') as col1 from dw.dw_prd_order where dt='2017-04-24' limit 10; {noformat} Will fix it later. > Exception between "create table as" and "get_json_object" > - > > Key: SPARK-20476 > URL: https://issues.apache.org/jira/browse/SPARK-20476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: cen yuhai > > I encounter this problem when I want to create a table as select , > get_json_object from xxx; > It is wrong. > {code} > create table spark_json_object as > select get_json_object(deliver_geojson,'$.') > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > It is ok. > {code} > create table spark_json_object as > select * > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > It is ok > {code} > select get_json_object(deliver_geojson,'$.') > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > {code} > 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: > org.apache.hadoop.hive.serde2.SerDeException > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements > while columns.types has 1 elements! > org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements > while columns.types has 1 elements! 
> at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146) > at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85) > at > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) > at > org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53) > at > org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521) > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391) > at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276) > at > org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at >
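One plausible reading of why the alias workaround above helps (an assumption, not confirmed in the thread): Hive's LazySimpleSerDe receives column names as a comma-separated list, and the auto-generated column name {{get_json_object(deliver_geojson,'$.')}} itself contains a comma, so one logical column parses back as two names against a single recorded type. A small pure-Python sketch of that mismatch:

```python
# Hive stores column names as one comma-separated string; a generated name
# that itself contains a comma therefore splits into two names, while only
# one column type was recorded -- matching the "columns has 2 elements while
# columns.types has 1 elements" error above.
generated_name = "get_json_object(deliver_geojson,'$.')"
column_types = ["string"]                  # one type for one logical column

names = generated_name.split(",")          # how a comma-separated list parses
assert len(names) == 2 and len(column_types) == 1

# With an explicit alias, as in the workaround, the name round-trips cleanly.
aliased_name = "col1"
assert len(aliased_name.split(",")) == len(column_types)
```

If this reading is right, any expression whose generated name contains a comma would trip the same SerDe check, and aliasing every computed column before a CTAS is the general workaround.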
[jira] [Created] (SPARK-20478) Document LinearSVC in R programming guide
Felix Cheung created SPARK-20478: Summary: Document LinearSVC in R programming guide Key: SPARK-20478 URL: https://issues.apache.org/jira/browse/SPARK-20478 Project: Spark Issue Type: Documentation Components: SparkR Affects Versions: 2.2.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20477) Document R bisecting k-means in R programming guide
Felix Cheung created SPARK-20477: Summary: Document R bisecting k-means in R programming guide Key: SPARK-20477 URL: https://issues.apache.org/jira/browse/SPARK-20477 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20478) Document LinearSVC in R programming guide
[ https://issues.apache.org/jira/browse/SPARK-20478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985227#comment-15985227 ] Felix Cheung commented on SPARK-20478: -- [~wangmiao1981] Would you like to add this? > Document LinearSVC in R programming guide > - > > Key: SPARK-20478 > URL: https://issues.apache.org/jira/browse/SPARK-20478 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20477) Document R bisecting k-means in R programming guide
[ https://issues.apache.org/jira/browse/SPARK-20477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985226#comment-15985226 ] Felix Cheung commented on SPARK-20477: -- [~wangmiao1981] Would you like to add this? > Document R bisecting k-means in R programming guide > --- > > Key: SPARK-20477 > URL: https://issues.apache.org/jira/browse/SPARK-20477 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20477) Document R bisecting k-means in R programming guide
[ https://issues.apache.org/jira/browse/SPARK-20477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20477: - Issue Type: Documentation (was: Bug) > Document R bisecting k-means in R programming guide > --- > > Key: SPARK-20477 > URL: https://issues.apache.org/jira/browse/SPARK-20477 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20473) ColumnVector.Array is missing accessors for some types
[ https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20473. - Resolution: Fixed Assignee: Michal Szafranski Fix Version/s: 2.2.0 > ColumnVector.Array is missing accessors for some types > -- > > Key: SPARK-20473 > URL: https://issues.apache.org/jira/browse/SPARK-20473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski >Assignee: Michal Szafranski > Fix For: 2.2.0 > > > ColumnVector implementations originally did not support some Catalyst types > (float, short, and boolean). Now that they do, those types should be also > added to the ColumnVector.Array. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985309#comment-15985309 ] Apache Spark commented on SPARK-20208: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17775 > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20208: Assignee: Apache Spark (was: Maciej Szymkiewicz) > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Apache Spark > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20208: Assignee: Maciej Szymkiewicz (was: Apache Spark) > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20470: Component/s: SQL > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. > I read the json below into a dataframe: > {code} > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > {code} > Df schema looks good > {code} > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > {code} > Need to convert each row to json now and save to HBase > {code} > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > {code} > Output JSON (Wrong) > {code} > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], > [ > 2.2925065264270024, > 814.0, > 0.15644850561960366 > ], > [ > 2.448955032046606, > 690.0, > 0.1564485056196041 > ] > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
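A plausible explanation for the wrong JSON above: PySpark's {{Row}} is a tuple subclass and {{Row.asDict()}} is shallow, so nested structs stay tuples and {{json.dumps}} renders them as JSON arrays. The sketch below is pure Python, using {{collections.namedtuple}} as a stand-in for {{Row}} (an analogy, not PySpark itself); where available, {{row.asDict(recursive=True)}} in PySpark is worth trying for the same effect.

```python
import json
from collections import namedtuple

# namedtuple mimics a PySpark Row here: both are tuple subclasses, so
# json.dumps serializes them as JSON arrays, not objects.
Bin = namedtuple("Bin", ["start", "y", "width"])

row = {"feature": "feature_id_001",
       "histogram": [Bin(1.9796, 968.0, 0.1564),
                     Bin(2.1360, 892.0, 0.1564)]}

shallow = json.dumps(row)         # nested structs become bare arrays
assert '"start"' not in shallow

def to_dict(obj):
    """Recursively convert tuple-like records to dicts before dumping."""
    if isinstance(obj, tuple) and hasattr(obj, "_asdict"):
        obj = obj._asdict()
    if isinstance(obj, dict):
        return {k: to_dict(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_dict(v) for v in obj]
    return obj

fixed = json.dumps(to_dict(row))  # nested structs are now JSON objects
assert '"start"' in fixed
```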
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985040#comment-15985040 ] Steve Loughran commented on SPARK-7481: --- (This is a fairly long comment, but it tries to summarise the entire state of interaction with object stores, esp. S3A on Hadoop 2.8+. Azure is simpler; GCS is Google's problem; Swift is not used very much.) If you look at object stores & Spark (or indeed, any code which uses a filesystem as the source and dest of work), there are problems which can generally be grouped into various categories.

h3. Foundational: talking to the object stores

Classpath & execution: can you wire the JARs up? A longstanding issue in ASF Spark releases (SPARK-5348, SPARK-12557). This was exacerbated by the movement of S3n:// to the hadoop-aws package (FWIW, I hadn't noticed that move; I'd have blocked it if I'd been paying attention). This includes transitive problems (SPARK-11413). Credential propagation: Spark's env var propagation is pretty cute here; SPARK-19739 picks up {{AWS_SESSION_TOKEN}} too. Diagnostics on failure are a real pain.

h3. Observable inconsistencies leading to data loss

Generally where the metaphor "it's just a filesystem" fails. These are bad because they often "just work", especially in dev & test with small datasets, and when they go wrong, they can fail by generating bad results *and nobody notices*.
* Expectations of consistent listing of "directories". S3Guard deals with this (HADOOP-13345), as can Netflix's S3mper and AWS's premium Dynamo-backed S3 storage.
* Expectations on the transacted nature of directory renames, the core atomic commit operation against full filesystems.
* Expectations that when things are deleted they go away. This does become visible sometimes, usually in checks for a destination not existing (SPARK-19013).
* Expectations that write-in-progress data is visible/flushed, and that {{close()}} is low cost. SPARK-19111.
Committing pretty much combines all of these; see below for more details.

h3. Aggressively bad performance

That's the mismatch between what the object store offers, what the apps expect, and the metaphor-preserving work in the Hadoop FileSystem implementations, which, in trying to hide the conceptual mismatch, can actually amplify the problem. Example: directory tree scanning at the start of a query. The mock directory structure allows callers to do treewalks, when really a full list of all children can be done as a direct O(1) call. SPARK-17159 covers some of this for scanning directories in Spark Streaming, but there's a hidden tree walk in every call to {{FileSystem.globStatus()}} (HADOOP-13371). Given how S3Guard transforms this treewalk, and you need it for consistency, that's probably the best solution for now. Although I have a PoC which does a full List **/* followed by a filter, that's not viable when you have a wide, deep tree and do need to prune aggressively. Checkpointing to object stores is similar: it's generally not dangerous to do the write+rename, it just adds the copy overhead, consistency issues notwithstanding.

h3. Suboptimal code

There are opportunities for speedup, but if it's not on the critical path, it's not worth the hassle. That said, as every call to {{getFileStatus()}} can take hundreds of millis, they get onto the critical path quite fast. Examples: checks for a file existing before calling {{fs.delete(path)}} (this is always a no-op if the dest path isn't there), and the equivalent on mkdirs: {{if (!fs.exists(dir)) fs.mkdirs(path)}}. Hadoop 3.0 will help steer people on the path of righteousness there by deprecating a couple of methods which encourage inefficiencies (isFile/isDir).

h3. The commit problem

The full commit problem combines all of these: you need a consistent list of source data, your deleted destination path mustn't appear in listings, the commit of each task must promote a task's work to the pending output of the job, and an abort must leave no trace of it. The final job commit must place data into the final destination; again, a job abort must not make any output visible. There's some ambiguity about what happens if task and job commits fail; generally the safest is "abort everything". Furthermore, nobody has any idea what to do if an {{abort()}} raises exceptions. Oh, and all of this must be fast. Spark is no better or worse than the core MapReduce committers here, or that of Hive. Spark generally uses the Hadoop {{FileOutputFormat}} via the {{HadoopMapReduceCommitProtocol}}, directly or indirectly (e.g. {{ParquetOutputFormat}}), extracting its committer and casting it to {{FileOutputCommitter}}, primarily to get a working directory. This committer assumes the destination is a consistent FS, and uses renames when promoting task and job output, assuming that is so fast it doesn't even bother to log a message "about to rename". Hence the recurrent Stack Overflow
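The cost of the exists-before-delete anti-pattern is easy to quantify with a toy model (purely illustrative Python, not the Hadoop API): if every metadata call is one round trip of a few hundred milliseconds, the guarded form simply doubles the cost for the same end state.

```python
# Toy model (illustrative only, not the Hadoop API): every metadata call
# against the store costs one HTTP round trip; count them per access pattern.
class FakeObjectStore:
    def __init__(self, keys):
        self.keys = set(keys)
        self.calls = 0               # one increment = one round trip

    def exists(self, path):
        self.calls += 1
        return path in self.keys

    def delete(self, path):
        # like FileSystem.delete(): already a no-op if the path is absent
        self.calls += 1
        self.keys.discard(path)

# Anti-pattern: check, then act -- two round trips for the same outcome.
guarded = FakeObjectStore({"out/part-0"})
if guarded.exists("out/part-0"):
    guarded.delete("out/part-0")
assert guarded.calls == 2

# Direct call -- one round trip, identical end state.
direct = FakeObjectStore({"out/part-0"})
direct.delete("out/part-0")
assert direct.calls == 1 and direct.keys == guarded.keys
```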
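The rename-based commit idea can be sketched in a few lines of Python (a simplified illustration of the protocol, not the actual {{FileOutputCommitter}}): task attempts write under a temporary directory and commit by renaming it into place. On HDFS that rename is a cheap metadata operation; on an object store the same "rename" is a copy-and-delete of every object, which is why the protocol is both slow and non-atomic there.

```python
import os
import shutil
import tempfile

# Task attempts write under a temporary directory; commit promotes the
# attempt directory by renaming it to the destination, abort deletes it.
base = tempfile.mkdtemp()
final = os.path.join(base, "output")
attempt = os.path.join(base, "_temporary", "attempt_0")
os.makedirs(attempt)
with open(os.path.join(attempt, "part-00000"), "w") as f:
    f.write("task output")

def commit_task(attempt_dir, final_dir):
    os.rename(attempt_dir, final_dir)   # O(1) metadata op on a real FS

def abort_task(attempt_dir):
    shutil.rmtree(attempt_dir, ignore_errors=True)  # leave no trace

commit_task(attempt, final)
assert os.path.exists(os.path.join(final, "part-00000"))

# A second, aborted attempt leaves nothing behind.
attempt2 = os.path.join(base, "_temporary", "attempt_1")
os.makedirs(attempt2)
abort_task(attempt2)
assert not os.path.exists(attempt2)
```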
[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985096#comment-15985096 ] Xiao Li commented on SPARK-20476: - This sounds like a bug to me. Let me double-check it. > Exception between "create table as" and "get_json_object" > - > > Key: SPARK-20476 > URL: https://issues.apache.org/jira/browse/SPARK-20476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: cen yuhai > > I encounter this problem when I want to create a table as select , > get_json_object from xxx; > It is wrong. > {code} > create table spark_json_object as > select get_json_object(deliver_geojson,'$.') > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > It is ok. > {code} > create table spark_json_object as > select * > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > It is ok > {code} > select get_json_object(deliver_geojson,'$.') > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > {code} > 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: > org.apache.hadoop.hive.serde2.SerDeException > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements > while columns.types has 1 elements! > org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements > while columns.types has 1 elements! 
> at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146) > at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85) > at > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) > at > org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53) > at > org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521) > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391) > at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276) > at > org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:179) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at
[jira] [Commented] (SPARK-20475) Whether use "broadcast join" depends on hive configuration
[ https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985100#comment-15985100 ] Xiao Li commented on SPARK-20475: - cc [~ZenWzh] > Whether use "broadcast join" depends on hive configuration > -- > > Key: SPARK-20475 > URL: https://issues.apache.org/jira/browse/SPARK-20475 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Lijia Liu > > Currently, broadcast join in Spark only works while: > 1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than > 0 (the default is 10485760). > 2. The size of one of the hive tables is less than > "spark.sql.autoBroadcastJoinThreshold". To get the size information of the > hive table from the hive metastore, "hive.stats.autogather" should be set to > true in hive, or the command "ANALYZE TABLE COMPUTE STATISTICS > noscan" has been run. > But Hive calculates the size of the file or directory corresponding to > the hive table to determine whether to use the map-side join, and does not > depend on the hive metastore. > This leads to two problems: > 1. Spark will not use "broadcast join" when the hive parameter > "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE > COMPUTE STATISTICS noscan" has not been run, because the > information of the hive table has not been saved in the hive metastore. The mode of > work in Spark depends on the configuration of Hive. > 2. For some reason, we set "hive.stats.autogather" to false in our Hive. > For the same SQL, Hive is 4 times faster than Spark because Hive used "map > side join" but Spark did not use "broadcast join". > Is it possible to use the same mechanism as Hive's to look up the size of a > hive table in Spark? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
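The decision rule the reporter describes reduces to a small predicate (a schematic sketch of the described behavior, not Spark's actual implementation): broadcast only when the threshold is positive and the metastore actually has a size estimate at or under it, so missing statistics silently disable broadcast joins.

```python
# Schematic sketch (simplified, hypothetical helper): a table side is
# broadcast only when the threshold is positive AND the catalog has a size
# estimate at or under it. Missing statistics (no ANALYZE TABLE run,
# hive.stats.autogather=false) mean no broadcast, per the report above.
DEFAULT_THRESHOLD = 10485760            # bytes, the documented default

def would_broadcast(table_size_bytes, threshold=DEFAULT_THRESHOLD):
    if threshold <= 0:
        return False                    # feature disabled
    if table_size_bytes is None:
        return False                    # no stats in the metastore
    return table_size_bytes <= threshold

assert would_broadcast(5 * 1024 * 1024)           # small table with stats
assert not would_broadcast(None)                  # stats never gathered
assert not would_broadcast(5 * 1024 * 1024, 0)    # threshold set to 0
```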
[jira] [Commented] (SPARK-20467) sbt-launch-lib.bash has lacked the ASF header.
[ https://issues.apache.org/jira/browse/SPARK-20467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984298#comment-15984298 ] Apache Spark commented on SPARK-20467: -- User 'liu-zhaokun' has created a pull request for this issue: https://github.com/apache/spark/pull/17769 > sbt-launch-lib.bash has lacked the ASF header. > --- > > Key: SPARK-20467 > URL: https://issues.apache.org/jira/browse/SPARK-20467 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: liuzhaokun > > When I used this script, I found that sbt-launch-lib.bash lacks the ASF header. > This is not permitted by the Apache Foundation according to the Apache License 2.0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20467) sbt-launch-lib.bash has lacked the ASF header.
[ https://issues.apache.org/jira/browse/SPARK-20467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20467: Assignee: (was: Apache Spark) > sbt-launch-lib.bash has lacked the ASF header. > --- > > Key: SPARK-20467 > URL: https://issues.apache.org/jira/browse/SPARK-20467 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: liuzhaokun > > When I used this script, I found that sbt-launch-lib.bash lacks the ASF header. > This is not permitted by the Apache Foundation according to the Apache License 2.0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984365#comment-15984365 ] Sean Owen commented on SPARK-20468: --- Please read http://spark.apache.org/contributing.html first, we don't assign issues at this stage. A lot of this is just taste, so I'm not sure it clearly improves readability. A major problem with some of these suggestions, if I understand correctly, is that you'd be breaking an API. I think the while loops are on purpose to avoid a little overhead of a for loop. I'd start significantly smaller with things like a proposed doc cleanup, or clear improvements to internal-only elements. Please also weigh the cost: reviewer time, obstacle to back-ports. > Refactor the ALS code > - > > Key: SPARK-20468 > URL: https://issues.apache.org/jira/browse/SPARK-20468 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Priority: Minor > Labels: documentation, readability, refactoring > > The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is > quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all > in one file. Here are some things I think could improve the clarity and > maintainability of the code: > * The file can be split into more manageable parts. In particular, {{ALS}}, > {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files > for better readability. > * Certain parts can be encapsulated or moved to clarify the intent. For > example: > ** The {{ALS.train}} method is currently defined in the {{ALS}} companion > object, and it seems to take 12 individual parameters that are all members of > the {{ALS}} class. This method can be made an instance method. 
> ** The code that creates in-blocks and out-blocks in the body of > {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods > in the {{ALS}} companion object, can be moved into a separate case class that > holds the blocks. This has the added benefit of allowing us to write > specific Scaladoc to explain the logic behind these block objects, as their > usage is certainly nontrivial yet is fundamental to the implementation. > ** The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be > hidden within {{UncompressedInBlock}} to clarify the scope of their usage. > ** Various builder classes could be encapsulated in the companion objects of > the classes they build. > * The code can be formatted more clearly. For example: > ** Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be > formatted more clearly and have comments added explaining the reasoning > behind key parts. That these methods form the core of the ALS logic makes > this doubly important for maintainability. > ** Parts of the code that use {{while}} loops with manually incremented > counters can be rewritten as {{for}} loops. > ** Where non-idiomatic Scala code is used that doesn't improve performance > much, clearer code can be substituted. (This in particular should be done > very carefully if at all, as it's apparent the original author spent much > time and pains optimizing the code to significantly improve its runtime > profile.) > * The documentation (both Scaladocs and inline comments) can be clarified > where needed and expanded where incomplete. This is especially important for > parts of the code that are written imperatively for performance, as these > parts don't benefit from the intuitive self-documentation of Scala's > higher-level language abstractions. 
Specifically, I'd like to add > documentation fully explaining the key functionality of the in-block and > out-block objects, their purpose, how they relate to the overall ALS > algorithm, and how they are calculated in such a way that new maintainers can > ramp up much more quickly. > The above is not a complete enumeration of improvements but a high-level > survey. All of these improvements will, I believe, add up to make the code > easier to understand, extend, and maintain. This issue will track the > progress of this refactoring so that going forward, authors will have an > easier time maintaining this part of the project. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984399#comment-15984399 ] Apache Spark commented on SPARK-20392: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/17770 > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. > This seems much too long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. 
> Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 
> 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c17d5ef9 > 109_bucketizer_f22df23ef56f > 110_bucketizer_bad04578bd20 > 111_bucketizer_35cfbde7e28f
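One hedged way to see why the column count, rather than the row count, dominates here: the (truncated) stage list above contains one Bucketizer per column, so the number of pipeline stages grows with the number of columns, and each stage re-processes the full schema when the pipeline is fit. The toy sketch below is plain Python, not Spark's actual fit path, and deliberately simplifies each stage's work to a per-column schema pass; it only illustrates the quadratic scaling intuition.

```python
# Toy model (plain Python, NOT Spark): one pipeline stage per column, and
# each stage's schema validation touches every column in the schema.
def fit_pipeline(n_stages, n_cols):
    schema = [f"c{i}" for i in range(n_cols)]
    checks = 0
    for _ in range(n_stages):       # e.g. one Bucketizer per column
        for _col in schema:         # per-stage transformSchema-style pass
            checks += 1
    return checks

# With one stage per column, doubling the column count quadruples the work,
# even though no extra rows were added.
narrow = fit_pipeline(156, 156)
wide = fit_pipeline(312, 312)       # 312 columns, as in the attached dataset
```

Under this simplification, `wide` is four times `narrow`, which matches the observation that adding columns (not rows) is what makes `fit` slow.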
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good root |-- feature: string (nullable = true) |-- histogram: array (nullable = true) ||-- element: struct (containsNull = true) |||-- start: double (nullable = true) |||-- width: double (nullable = true) |||-- y: double (nullable = true) Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. 
I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good # root # |-- feature: string (nullable = true) # |-- histogram: array (nullable = true) # ||-- element: struct (containsNull = true) # |||-- start: double (nullable = true) # |||-- width: double (nullable = true) # |||-- y: double (nullable = true) Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. 
> I read the json below into a dataframe: > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > Df schema looks good > root > |-- feature: string (nullable = true) > |-- histogram: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- start: double (nullable = true) > |||-- width: double (nullable = true) > |||-- y: double (nullable = true) > Need to convert each row to json now and save to HBase > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Output JSON (Wrong) > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], > [ > 2.2925065264270024, > 814.0, > 0.15644850561960366 > ], >
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-20470: - Description: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. I read the json below into a dataframe: { "feature": "feature_id_001", "histogram": [ { "start": 1.9796095151877942, "y": 968.0, "width": 0.1564485056196041 }, { "start": 2.1360580208073983, "y": 892.0, "width": 0.1564485056196041 }, { "start": 2.2925065264270024, "y": 814.0, "width": 0.15644850561960366 }, { "start": 2.448955032046606, "y": 690.0, "width": 0.1564485056196041 }] } Df schema looks good # root # |-- feature: string (nullable = true) # |-- histogram: array (nullable = true) # ||-- element: struct (containsNull = true) # |||-- start: double (nullable = true) # |||-- width: double (nullable = true) # |||-- y: double (nullable = true) Need to convert each row to json now and save to HBase rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( Output JSON (Wrong) { "feature": "feature_id_001", "histogram": [ [ 1.9796095151877942, 968.0, 0.1564485056196041 ], [ 2.1360580208073983, 892.0, 0.1564485056196041 ], [ 2.2925065264270024, 814.0, 0.15644850561960366 ], [ 2.448955032046606, 690.0, 0.1564485056196041 ] } was: Trying to convert an RDD in pyspark containing Array of struct doesn't generate the right json. It looks trivial but can't get a good json out. rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Invalid json converting RDD row with Array of struct to json > > > Key: SPARK-20470 > URL: https://issues.apache.org/jira/browse/SPARK-20470 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.3 >Reporter: Philip Adetiloye > > Trying to convert an RDD in pyspark containing Array of struct doesn't > generate the right json. It looks trivial but can't get a good json out. 
> I read the json below into a dataframe: > { > "feature": "feature_id_001", > "histogram": [ > { > "start": 1.9796095151877942, > "y": 968.0, > "width": 0.1564485056196041 > }, > { > "start": 2.1360580208073983, > "y": 892.0, > "width": 0.1564485056196041 > }, > { > "start": 2.2925065264270024, > "y": 814.0, > "width": 0.15644850561960366 > }, > { > "start": 2.448955032046606, > "y": 690.0, > "width": 0.1564485056196041 > }] > } > Df schema looks good > # root > # |-- feature: string (nullable = true) > # |-- histogram: array (nullable = true) > # ||-- element: struct (containsNull = true) > # |||-- start: double (nullable = true) > # |||-- width: double (nullable = true) > # |||-- y: double (nullable = true) > Need to convert each row to json now and save to HBase > rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict( > Output JSON (Wrong) > { > "feature": "feature_id_001", > "histogram": [ > [ > 1.9796095151877942, > 968.0, > 0.1564485056196041 > ], > [ > 2.1360580208073983, > 892.0, > 0.1564485056196041 > ], > [ > 2.2925065264270024, > 814.0, > 0.15644850561960366 > ], > [ > 2.448955032046606, > 690.0, > 0.1564485056196041 > ] > }
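A likely explanation, worth verifying against the PySpark version in use: `Row.asDict()` is shallow by default, so nested `Row` values remain tuple subclasses, and `json.dumps` renders a tuple as a JSON array with the field names lost. In PySpark versions that support it, `row.asDict(recursive=True)` converts the nested rows to dicts first. A minimal sketch of the mechanism, using `collections.namedtuple` as a stand-in for `Row` so no Spark installation is required:

```python
import json
from collections import namedtuple

# Stand-in for pyspark.sql.Row: a namedtuple is also a tuple subclass.
Bin = namedtuple("Bin", ["start", "y", "width"])

row = {"feature": "feature_id_001",
       "histogram": [Bin(1.98, 968.0, 0.156), Bin(2.14, 892.0, 0.156)]}

# json.dumps treats the nested tuples as sequences -> JSON arrays,
# reproducing the "wrong" output above (field names dropped).
wrong = json.dumps(row)

# Converting nested records to dicts first preserves the keys; in PySpark,
# row.asDict(recursive=True) performs this conversion for nested Rows.
fixed = json.dumps({"feature": row["feature"],
                    "histogram": [b._asdict() for b in row["histogram"]]})
```

If `asDict(recursive=True)` is not available in the PySpark version at hand, an explicit per-field conversion of the nested structs, as sketched above, produces the same result.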
[jira] [Reopened] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reopened SPARK-20208: -- Actually, would you mind updating the R programming guide too? > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0 > >
[jira] [Updated] (SPARK-20015) Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example
[ https://issues.apache.org/jira/browse/SPARK-20015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20015: - Summary: Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example (was: Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide) > Document R Structured Streaming (experimental) in R vignettes and R & SS > programming guide, R example > - > > Key: SPARK-20015 > URL: https://issues.apache.org/jira/browse/SPARK-20015 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Felix Cheung >
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985319#comment-15985319 ] Kazuaki Ishizaki commented on SPARK-20184: -- When the number of aggregated columns gets large, I saw complicated Java code for HashAggregation. I will create another JIRA to simplify the generated code. > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang > > The performance of the following SQL gets much worse in Spark 2.x with whole-stage > codegen on than with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on (spark.sql.codegen.wholeStage = true): 40s > whole stage codegen off (spark.sql.codegen.wholeStage = false): 6s > After some analysis, I think this is related to the huge Java method (a method > thousands of lines long) generated by codegen. > And if I set -XX:-DontCompileHugeMethods, the performance gets much > better (about 7s).
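The -XX:-DontCompileHugeMethods workaround mentioned above has to reach the JVMs that execute the generated code. One possible way to pass it through `spark-submit` is sketched below; the configuration keys and the HotSpot flag are real, but the job script name is a placeholder.

```shell
# Disable HotSpot's refusal to JIT-compile methods over the huge-method
# threshold (~8000 bytecodes), so the large codegen'd method still gets
# compiled instead of staying interpreted. my_job.py is a placeholder.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  --conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  my_job.py
```

Note the trade-off: compiling huge methods costs JIT time and code-cache space, which is presumably why HotSpot skips them by default.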
[jira] [Created] (SPARK-20479) Performance degradation for large number of hash-aggregated columns
Kazuaki Ishizaki created SPARK-20479: Summary: Performance degradation for large number of hash-aggregated columns Key: SPARK-20479 URL: https://issues.apache.org/jira/browse/SPARK-20479 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki In a comment on SPARK-20184, [~maropu] revealed that performance is degraded when the number of aggregated columns gets large with whole-stage codegen. {code} ./bin/spark-shell --master local[1] --conf spark.driver.memory=2g --conf spark.sql.shuffle.partitions=1 -v def timer[R](f: => R): Unit = { val count = 9 val iters = (0 until count).map { i => val t0 = System.nanoTime() f val t1 = System.nanoTime() val elapsed = t1 - t0 + 0.0 println(s"#$i: ${elapsed / 1000000000.0}") elapsed } println("Elapsed time: " + ((iters.sum / count) / 1000000000.0) + "s") } val numCols = 80 val t = s"(SELECT id AS key1, id AS key2, ${((0 until numCols).map(i => s"id AS c$i")).mkString(", ")} FROM range(0, 10, 1, 1))" val sqlStr = s"SELECT key1, key2, ${((0 until numCols).map(i => s"SUM(c$i)")).mkString(", ")} FROM $t GROUP BY key1, key2 LIMIT 100" // Elapsed time: 2.308440490553s sql("SET spark.sql.codegen.wholeStage=true") timer { sql(sqlStr).collect } // Elapsed time: 0.527486733s sql("SET spark.sql.codegen.wholeStage=false") timer { sql(sqlStr).collect } {code}
[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985567#comment-15985567 ] Thomas Graves commented on SPARK-20480: --- The exception in the TaskSetManager log looks like: 17/04/26 20:09:21 INFO TaskSetManager: Lost task 3516.0 in stage 4.0 (TID 103691) on gsbl521n33.blue.ygrid.yahoo.com, executor 4516: org.apache.spark.SparkException (Task failed while writing rows) [duplicate 22] > FileFormatWriter hides FetchFailedException from scheduler > -- > > Key: SPARK-20480 > URL: https://issues.apache.org/jira/browse/SPARK-20480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a large job where it was getting failures; I noticed they were > listed as "SparkException: Task failed while writing rows", but when I looked > further they were really caused by FetchFailure exceptions. This is a > problem because the scheduler handles fetch failures differently than normal > exceptions. This can affect things like blacklisting. 
> {noformat} > 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID > 102902) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect > to host1.com:7337 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at >
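The mechanism behind the bug can be sketched without Spark: the scheduler dispatches on the class of the exception a task throws, so once `FileFormatWriter` wraps the fetch failure in a generic `SparkException`, the special fetch-failure handling is lost unless someone inspects the cause. A minimal Python analogy follows; the class names mirror the Spark ones, but the code is illustrative only, not Spark's implementation.

```python
# Illustrative stand-ins for the Spark exception classes.
class FetchFailedException(Exception):
    pass

class SparkException(Exception):
    pass

def write_rows():
    try:
        # The shuffle read inside the write path fails.
        raise FetchFailedException("Failed to connect to host1.com:7337")
    except Exception as e:
        # FileFormatWriter-style wrapping: the original exception type is
        # now only reachable as the cause of a generic SparkException.
        raise SparkException("Task failed while writing rows") from e

try:
    write_rows()
except SparkException as e:
    outer = type(e).__name__            # what the scheduler sees
    cause = type(e.__cause__).__name__  # what it would need to see
```

A natural fix, hedged as one option rather than what Spark ultimately did, is to check the cause chain before wrapping and re-throw fetch failures unwrapped so the scheduler still recognizes them.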
[jira] [Resolved] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
[ https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-12868. Resolution: Fixed Assignee: Weiqing Yang Fix Version/s: 2.2.0 > ADD JAR via sparkSQL JDBC will fail when using a HDFS URL > - > > Key: SPARK-12868 > URL: https://issues.apache.org/jira/browse/SPARK-12868 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Trystan Leftwich >Assignee: Weiqing Yang > Fix For: 2.2.0 > > > When trying to add a jar with a HDFS URI, i.E > {code:sql} > ADD JAR hdfs:///tmp/foo.jar > {code} > Via the spark sql JDBC interface it will fail with: > {code:sql} > java.net.MalformedURLException: unknown protocol: hdfs > at java.net.URL.(URL.java:593) > at java.net.URL.(URL.java:483) > at java.net.URL.(URL.java:432) > at java.net.URI.toURL(URI.java:1089) > at > org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578) > at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652) > at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at 
org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Created] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
Thomas Graves created SPARK-20480: - Summary: FileFormatWriter hides FetchFailedException from scheduler Key: SPARK-20480 URL: https://issues.apache.org/jira/browse/SPARK-20480 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Thomas Graves Priority: Critical I was running a large job where it was getting faiures, noticed they were listed as "SparkException: Task failed while writing rows", but when I looked further they were really caused by FetchFailure exceptions. This is a problem because the scheduler handles Fetch Failures differently then normal exception. This can affect things like blacklisting. {noformat} 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 102902) org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to gsbl546n07.blue.ygrid.yahoo.com/10.213.43.94:7337 at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more Caused by: java.io.IOException: Failed to connect to gsbl546n07.blue.ygrid.yahoo.com/10.213.43.94:7337 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) at
[jira] [Closed] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves closed SPARK-20480. - Resolution: Duplicate > FileFormatWriter hides FetchFailedException from scheduler > -- > > Key: SPARK-20480 > URL: https://issues.apache.org/jira/browse/SPARK-20480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a large job where it was getting faiures, noticed they were > listed as "SparkException: Task failed while writing rows", but when I looked > further they were really caused by FetchFailure exceptions. This is a > problem because the scheduler handles Fetch Failures differently then normal > exception. This can affect things like blacklisting. > {noformat} > 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID > 102902) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect > to host1.com:7337 > at > 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > ... 8 more > Caused by: java.io.IOException: Failed to connect to host1.com:7337 > at >
[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures
[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985447#comment-15985447 ] Thomas Graves commented on SPARK-20178: --- Another thing we should tie in here is handling preempted containers better. This largely matches my point above, "Improve logic around deciding which node is actually bad when you get a fetch failure," but it is a bit of a special case. If containers get preempted on the YARN side, we need to properly detect that and not count it as a normal fetch failure. Right now that seems pretty difficult with the way we handle stage failures, but I guess you would detect that case specially and not count it as a normal stage failure. > Improve Scheduler fetch failures > > > Key: SPARK-20178 > URL: https://issues.apache.org/jira/browse/SPARK-20178 > Project: Spark > Issue Type: Epic > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > We have been having a lot of discussions around improving the handling of > fetch failures. There are 4 JIRAs currently related to this. > We should try to get a list of things we want to improve and come up with one > cohesive design: > SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753. > I will put my initial thoughts in a follow-on comment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-20480: -- Description: I was running a large job where it was getting failures; I noticed they were listed as "SparkException: Task failed while writing rows", but when I looked further they were really caused by FetchFailedException. This is a problem because the scheduler handles fetch failures differently than normal exceptions. This can affect things like blacklisting. {noformat} 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 102902) org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to host1.com:7337 at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more Caused by: java.io.IOException: Failed to connect to host1.com:7337 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
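The failure mode quoted above can be illustrated outside Spark: a scheduler that dispatches only on the top-level exception type misclassifies a fetch failure that a task wrapper has re-thrown inside a generic error, while a handler that walks the cause chain still recovers it. A minimal Python sketch (all class and function names here are stand-ins for illustration, not Spark's actual API):

```python
# Minimal sketch (stand-in names, not Spark's API) of how wrapping an
# exception hides its type from a handler that only inspects the top level,
# and how walking the cause chain recovers it.

class FetchFailedError(Exception):
    """Stand-in for org.apache.spark.shuffle.FetchFailedException."""

class TaskFailedError(Exception):
    """Stand-in for the generic 'Task failed while writing rows' wrapper."""

def classify_naive(exc):
    # Mirrors a scheduler that dispatches only on the top-level type:
    # a wrapped fetch failure is misclassified as an ordinary task failure.
    return "fetch-failure" if isinstance(exc, FetchFailedError) else "task-failure"

def classify_unwrapping(exc):
    # Walk the cause chain so a wrapped fetch failure is still recognized.
    while exc is not None:
        if isinstance(exc, FetchFailedError):
            return "fetch-failure"
        exc = exc.__cause__
    return "task-failure"

def run_task():
    try:
        raise FetchFailedError("Failed to connect to host1.com:7337")
    except Exception as e:
        # Mirrors FileFormatWriter wrapping every task error in a generic one.
        raise TaskFailedError("Task failed while writing rows") from e

try:
    run_task()
except TaskFailedError as e:
    print(classify_naive(e))       # task-failure  (fetch failure hidden)
    print(classify_unwrapping(e))  # fetch-failure (cause chain inspected)
```

This is why the distinction matters for blacklisting: the naive classifier charges the failure to the task/executor, while the unwrapping one correctly attributes it to a lost shuffle output.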
[jira] [Assigned] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-20476: --- Assignee: Xiao Li > Exception between "create table as" and "get_json_object" > - > > Key: SPARK-20476 > URL: https://issues.apache.org/jira/browse/SPARK-20476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: cen yuhai >Assignee: Xiao Li > > I encountered this problem when I wanted to run a create table as select > get_json_object from xxx. > This fails: > {code} > create table spark_json_object as > select get_json_object(deliver_geojson,'$.') > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > This is OK: > {code} > create table spark_json_object as > select * > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > This is also OK: > {code} > select get_json_object(deliver_geojson,'$.') > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > {code} > 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: > org.apache.hadoop.hive.serde2.SerDeException > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements > while columns.types has 1 elements! > org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements > while columns.types has 1 elements! 
> at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146) > at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85) > at > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) > at > org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53) > at > org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521) > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391) > at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276) > at > org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:179) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at
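One plausible reading of the "columns has 2 elements while columns.types has 1 elements" error above: the unaliased output column is named after the full expression, `get_json_object(deliver_geojson, $.)`, whose embedded comma Hive's metadata then splits into two column names against a single type. Under that assumption, a workaround sketch (not the committed fix) is to alias the expression so the generated column name contains no comma:

```sql
-- Workaround sketch (assumption, not the committed fix): give the expression
-- an explicit alias so the generated column name contains no comma for the
-- Hive SerDe to split on.
create table spark_json_object as
select get_json_object(deliver_geojson, '$.') as geojson
from dw.dw_prd_order where dt = '2017-04-24' limit 10;
```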
[jira] [Updated] (SPARK-20454) Improvement of ShortestPaths in Spark GraphX
[ https://issues.apache.org/jira/browse/SPARK-20454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Dai updated SPARK-20454: --- Target Version/s: (was: 2.1.0) > Improvement of ShortestPaths in Spark GraphX > > > Key: SPARK-20454 > URL: https://issues.apache.org/jira/browse/SPARK-20454 > Project: Spark > Issue Type: Improvement > Components: GraphX, MLlib >Affects Versions: 2.1.0 >Reporter: Ji Dai > Labels: patch > > The output of ShortestPaths is not sufficient. ShortestPaths in Graph/lib is > currently a simple version that can only return the distance to the source > vertex. However, the shortest path with the intermediate nodes on the path is > needed, and if two or more paths hold the same shortest distance from source > to destination, all of these paths need to be returned. In this way, > ShortestPaths would be more functional and useful. > I think I have resolved the concern above with an improved version of > ShortestPaths, which is also based on the "pregel" function in GraphOps. > Can I get my code reviewed and merged? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985575#comment-15985575 ] Mridul Muralidharan commented on SPARK-20480: - Shouldn't the fix for SPARK-19276 by [~imranr] handle this? > FileFormatWriter hides FetchFailedException from scheduler > -- > > Key: SPARK-20480 > URL: https://issues.apache.org/jira/browse/SPARK-20480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a large job where it was getting failures; I noticed they were > listed as "SparkException: Task failed while writing rows", but when I looked > further they were really caused by FetchFailedException. This is a > problem because the scheduler handles fetch failures differently than normal > exceptions. This can affect things like blacklisting. > {noformat} > 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID > 102902) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect > to 
host1.com:7337 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > ... 8 more > Caused by: java.io.IOException: Failed to connect to