[jira] [Comment Edited] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter

2017-04-26 Thread Yan Facai (颜发才) (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984161#comment-15984161
 ] 

Yan Facai (颜发才) edited comment on SPARK-20199 at 4/26/17 6:11 AM:
--

The work is easy; however, a **public method** would be added and some 
adjustments are needed in the inner implementation. Hence, I suggest delaying it 
until at least one committer agrees to shepherd the issue.

I have two questions:

1. Since both GBDT and RandomForest share this attribute, we can pull the 
`featureSubsetStrategy` parameter up to either TreeEnsembleParams or 
DecisionTreeParams. Which one is appropriate? (A rough sketch of the 
TreeEnsembleParams option follows below.)

2. Is it right to add the new parameter `featureSubsetStrategy` to the Strategy 
class, or to add it to DecisionTree's train method?
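
To make question 1 concrete, here is a minimal sketch of what pulling the 
parameter up into TreeEnsembleParams might look like. This is only an 
illustrative assumption, not a proposed patch: the name, validation, default, 
and placement (alongside the existing traits in org.apache.spark.ml.tree) are 
exactly the open questions above.

{code}
import org.apache.spark.ml.param.Param

// Sketch only: assumes the existing DecisionTreeParams trait in org.apache.spark.ml.tree.
private[ml] trait TreeEnsembleParams extends DecisionTreeParams {

  /**
   * The number of features to consider for splits at each tree node.
   * Supported options: "auto", "all", "onethird", "sqrt", "log2", or a fraction in (0, 1].
   */
  final val featureSubsetStrategy: Param[String] = new Param[String](this,
    "featureSubsetStrategy",
    "The number of features to consider for splits at each tree node",
    (value: String) =>
      Array("auto", "all", "onethird", "sqrt", "log2").contains(value.toLowerCase) ||
        scala.util.Try(value.toDouble).toOption.exists(f => f > 0.0 && f <= 1.0))

  setDefault(featureSubsetStrategy -> "auto")

  /** @group getParam */
  final def getFeatureSubsetStrategy: String = $(featureSubsetStrategy).toLowerCase
}
{code}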


was (Author: facai):
The work is easy, however Public method is added and some adjustments are 
needed in inner implementation. Hence, I suggest to delay it until one expert 
agree to shepherd the issue.

I have two questions:

1. For both GBDT and RandomForest share the attribute,  we can pull 
`featureSubsetStrategy` parameter up to either TreeEnsembleParams or 
DecisionTreeParams. Which one is appropriate?

2. Is it right to add new parameter `featureSubsetStrategy` to Strategy class? 
Or add it to DecisionTree's train method?

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate parameter. 
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter.






[jira] [Comment Edited] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-26 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984173#comment-15984173
 ] 

Liang-Chi Hsieh edited comment on SPARK-20392 at 4/26/17 6:43 AM:
--

[~barrybecker4] Currently I think the performance degradation is caused by the 
cost of the exchange between the DataFrame/Dataset abstraction and logical plans. 
Some operations on DataFrames convert back and forth between DataFrames and 
logical plans. That cost is negligible in ordinary SQL usage; however, it's not 
rare to chain dozens of pipeline stages in ML. As the query plan grows 
incrementally while running those stages, the cost spent on the exchange grows 
too. In particular, the Analyzer walks the whole big query plan even though most 
of it has already been analyzed.

By eliminating part of that cost, the time to run the example code locally is 
reduced from about 1 min to about 30 secs.

In particular, when applying the pipeline locally, most of the time is spent 
calling {{transform}} on the 137 {{Bucketizer}}s. Before the change, each 
{{Bucketizer}} {{transform}} call can cost about 0.4 sec, so the total time spent 
in all the {{Bucketizer}} {{transform}} calls is about 50 secs. After the change, 
each call costs only about 0.1 sec.
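
For reference, a minimal standalone sketch of the pattern (not the attached 
model_9754 pipeline; the column count, names, and splits below are made up) 
that chains one {{Bucketizer}} per column and times each {{transform}} call. 
Each call analyzes the plan built so far, so the per-call cost should grow as 
the chain gets longer:

{code}
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bucketizer-chain").getOrCreate()
import spark.implicits._

// A small frame with many numeric columns (a stand-in for the attached 312-column dataset).
val numCols = 50
val df = spark.range(0, 421).select(
  (0 until numCols).map(i => ($"id" * i).cast("double").as(s"c$i")): _*)

// Chain one Bucketizer per column and time each transform call.
val splits = Array(Double.NegativeInfinity, 100.0, 1000.0, Double.PositiveInfinity)
var cur = df
for (i <- 0 until numCols) {
  val bucketizer = new Bucketizer()
    .setInputCol(s"c$i")
    .setOutputCol(s"c${i}_bucket")
    .setSplits(splits)
  val start = System.nanoTime()
  cur = bucketizer.transform(cur)
  println(f"stage $i%3d: ${(System.nanoTime() - start) / 1e9}%.3f s")
}
cur.count()  // force execution of the fully chained plan
{code}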







was (Author: viirya):
[~barrybecker4] Currently I think the performance downgrade is caused by the 
cost of exchange between DataFrame/Dataset abstraction and logical plans. Some 
operations on DataFrames exchange between DataFrame and logical plans. It can 
be ignored in the usage of SQL. However, it's not rare to chain dozens of 
pipeline stages in ML. When the query plan grows incrementally during running 
those stages, the cost spent on the exchange grows too. In particular, the 
Analyzer will go through the big query plan even most part of it is analyzed.

By eliminating part of the cost, the time to run the example code locally is 
reduced from about 1min to about 30 secs.


> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 

[jira] [Comment Edited] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-26 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984173#comment-15984173
 ] 

Liang-Chi Hsieh edited comment on SPARK-20392 at 4/26/17 6:43 AM:
--

[~barrybecker4] Currently I think the performance degradation is caused by the 
cost of the exchange between the DataFrame/Dataset abstraction and logical plans. 
Some operations on DataFrames convert back and forth between DataFrames and 
logical plans. That cost is negligible in ordinary SQL usage; however, it's not 
rare to chain dozens of pipeline stages in ML. As the query plan grows 
incrementally while running those stages, the cost spent on the exchange grows 
too. In particular, the Analyzer walks the whole big query plan even though most 
of it has already been analyzed.

By eliminating part of that cost, the time to run the example code locally is 
reduced from about 1 min to about 30 secs.

In particular, when applying the pipeline locally, most of the time is spent 
calling {{transform}} on the 137 {{Bucketizer}}s. Before the change, each 
{{Bucketizer}} {{transform}} call can cost about 0.4 sec, so the total time spent 
in all the {{Bucketizer}} {{transform}} calls is about 50 secs. After the change, 
each call costs only about 0.1 sec.







was (Author: viirya):
[~barrybecker4] Currently I think the performance downgrade is caused by the 
cost of exchange between DataFrame/Dataset abstraction and logical plans. Some 
operations on DataFrames exchange between DataFrame and logical plans. It can 
be ignored in the usage of SQL. However, it's not rare to chain dozens of 
pipeline stages in ML. When the query plan grows incrementally during running 
those stages, the cost spent on the exchange grows too. In particular, the 
Analyzer will go through the big query plan even most part of it is analyzed.

By eliminating part of the cost, the time to run the example code locally is 
reduced from about 1min to about 30 secs.

In particular, the time applying the pipeline locally is mostly spent on 
calling {{transform}} of the 137 {{Bucketizer}}s. Before the change, each call 
of {{Bucketizer}}'s {{transform}} can cost about 0.4 sec. So the total time 
spent on all {{Bucketizer}}s' {{transform}} is about 50 secs. After the change, 
each call only costs about 0.1 sec.






> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 

[jira] [Comment Edited] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-26 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984173#comment-15984173
 ] 

Liang-Chi Hsieh edited comment on SPARK-20392 at 4/26/17 6:44 AM:
--

[~barrybecker4] Currently I think the performance degradation is caused by the 
cost of the exchange between the DataFrame/Dataset abstraction and logical plans. 
Some operations on DataFrames convert back and forth between DataFrames and 
logical plans. That cost is negligible in ordinary SQL usage; however, it's not 
rare to chain dozens of pipeline stages in ML. As the query plan grows 
incrementally while running those stages, the cost spent on the exchange grows 
too. In particular, the Analyzer walks the whole big query plan even though most 
of it has already been analyzed.

By eliminating part of that cost, the time to run the example code locally is 
reduced from about 1 min to about 30 secs.

In particular, when applying the pipeline locally, most of the time is spent 
calling {{transform}} on the 137 {{Bucketizer}}s. Before the change, each 
{{Bucketizer}} {{transform}} call can cost about 0.4 sec, so the total time spent 
in all the {{Bucketizer}} {{transform}} calls is about 50 secs. After the change, 
each call costs only about 0.1 sec.







was (Author: viirya):
[~barrybecker4] Currently I think the performance downgrade is caused by the 
cost of exchange between DataFrame/Dataset abstraction and logical plans. Some 
operations on DataFrames exchange between DataFrame and logical plans. It can 
be ignored in the usage of SQL. However, it's not rare to chain dozens of 
pipeline stages in ML. When the query plan grows incrementally during running 
those stages, the cost spent on the exchange grows too. In particular, the 
Analyzer will go through the big query plan even most part of it is analyzed.

By eliminating part of the cost, the time to run the example code locally is 
reduced from about 1min to about 30 secs.

In particular, the time applying the pipeline locally is mostly spent on 
calling {{transform}} of the 137 {{Bucketizer}}. Before the change, each call 
of {{Bucketizer}} 's {{transform}} can cost about 0.4 sec. So the total time 
spent on all {{Bucketizer}} 's {{transform}} is about 50 secs. After the 
change, each call only costs about 0.1 sec.






> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 

[jira] [Created] (SPARK-20466) HadoopRDD#addLocalConfiguration throws NPE

2017-04-26 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created SPARK-20466:


 Summary: HadoopRDD#addLocalConfiguration throws NPE
 Key: SPARK-20466
 URL: https://issues.apache.org/jira/browse/SPARK-20466
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.2
Reporter: liyunzhang_intel


In Spark 2.0.2, it throws an NPE:
{code}
17/04/23 08:19:55 ERROR executor.Executor: Exception in task 439.0 in stage 16.0 (TID 986)
java.lang.NullPointerException
    at org.apache.spark.rdd.HadoopRDD$.addLocalConfiguration(HadoopRDD.scala:373)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:243)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}

Suggestion: add a null check to avoid the NPE:

{code}
  /** Add Hadoop configuration specific to a single partition and attempt. */
  def addLocalConfiguration(jobTrackerId: String, jobId: Int, splitId: Int, attemptId: Int,
      conf: JobConf) {
    val jobID = new JobID(jobTrackerId, jobId)
    val taId = new TaskAttemptID(new TaskID(jobID, TaskType.MAP, splitId), attemptId)
    // Guard against a null JobConf to avoid the NPE above.
    if (conf != null) {
      conf.set("mapred.tip.id", taId.getTaskID.toString)
      conf.set("mapred.task.id", taId.toString)
      conf.setBoolean("mapred.task.is.map", true)
      conf.setInt("mapred.task.partition", splitId)
      conf.set("mapred.job.id", jobID.toString)
    }
  }
{code}






[jira] [Assigned] (SPARK-20467) sbt-launch-lib.bash lacks the ASF header

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20467:


Assignee: Apache Spark

> sbt-launch-lib.bash lacks the ASF header
> ---
>
> Key: SPARK-20467
> URL: https://issues.apache.org/jira/browse/SPARK-20467
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: liuzhaokun
>Assignee: Apache Spark
>
> When I used this script, I found that sbt-launch-lib.bash lacks the ASF header. 
> This is not permitted by the Apache Foundation according to the Apache License 2.0.






[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-26 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984270#comment-15984270
 ] 

Liang-Chi Hsieh commented on SPARK-20392:
-

[~barrybecker4] Btw, the time applying the model_9756 pipeline on 
blockbuster_fewCols.csv is less than 1 sec; however, this pipeline only has 5 
{{Bucketizer}}s, so it doesn't cost nearly as much time as the model_9754 
pipeline.

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 

[jira] [Assigned] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20468:


Assignee: Apache Spark

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Assignee: Apache Spark
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.






[jira] [Assigned] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20468:


Assignee: (was: Apache Spark)

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.






[jira] [Created] (SPARK-20469) Add a method to display DataFrame schema in PipelineStage

2017-04-26 Thread darion yaphet (JIRA)
darion yaphet created SPARK-20469:
-

 Summary: Add a method to display DataFrame schema in PipelineStage
 Key: SPARK-20469
 URL: https://issues.apache.org/jira/browse/SPARK-20469
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.1.0, 2.0.2, 1.6.3
Reporter: darion yaphet
Priority: Minor


Sometimes we apply Transformers and Estimators in a pipeline. It would be a big 
help for understanding and checking the dataset if a PipelineStage could display 
the DataFrame schema.
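
A rough sketch of the kind of helper being requested (the {{printSchemas}} name 
and its placement outside PipelineStage are hypothetical); with the current API, 
a similar effect can be approximated using each stage's public {{transformSchema}}:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.types.StructType

// Print the schema the DataFrame would have after each stage of a pipeline.
def printSchemas(pipeline: Pipeline, inputSchema: StructType): Unit = {
  var schema = inputSchema
  pipeline.getStages.zipWithIndex.foreach { case (stage, i) =>
    schema = stage.transformSchema(schema)
    println(s"schema after stage $i (${stage.uid}):")
    schema.printTreeString()
  }
}

// Usage (assuming a pipeline and a training DataFrame are in scope):
// printSchemas(pipeline, trainingData.schema)
{code}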






[jira] [Assigned] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20392:


Assignee: (was: Apache Spark)

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c17d5ef9
> 109_bucketizer_f22df23ef56f
> 110_bucketizer_bad04578bd20
> 111_bucketizer_35cfbde7e28f
> 112_bucketizer_cf89177a528b
> 113_bucketizer_183a0d393ef0
> 114_bucketizer_467c78156a67
> 

[jira] [Assigned] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20392:


Assignee: Apache Spark

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Apache Spark
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c17d5ef9
> 109_bucketizer_f22df23ef56f
> 110_bucketizer_bad04578bd20
> 111_bucketizer_35cfbde7e28f
> 112_bucketizer_cf89177a528b
> 113_bucketizer_183a0d393ef0
> 

[jira] [Created] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)
Philip Adetiloye created SPARK-20470:


 Summary: Invalid json converting RDD row with Array of struct to 
json
 Key: SPARK-20470
 URL: https://issues.apache.org/jira/browse/SPARK-20470
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.3
Reporter: Philip Adetiloye


Trying to convert an RDD containing an array of structs to JSON in PySpark 
doesn't generate valid JSON. It looks trivial, but I can't get good JSON out.


rdd1 = rdd.map(lambda row: Row(x=json.dumps(row.asDict())))






[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984350#comment-15984350
 ] 

Daniel Li commented on SPARK-20468:
---

I'd appreciate it if someone with permissions could assign this issue to me.

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.






[jira] [Assigned] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20471:


Assignee: Apache Spark

> Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled 
> but vectorized hashmap is enabled.
> -
>
> Key: SPARK-20471
> URL: https://issues.apache.org/jira/browse/SPARK-20471
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>
> Remove the AggregateBenchmark test suite warning, such as:
> '14:26:33.220 WARN 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap 
> is disabled but vectorized hashmap is enabled.'
> Unit tests: AggregateBenchmark.
> Modify the 'ignore' function for the 'test' function.






[jira] [Assigned] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20471:


Assignee: (was: Apache Spark)

> Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled 
> but vectorized hashmap is enabled.
> -
>
> Key: SPARK-20471
> URL: https://issues.apache.org/jira/browse/SPARK-20471
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: caoxuewen
>
> Remove the AggregateBenchmark test suite warning, such as:
> '14:26:33.220 WARN 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap 
> is disabled but vectorized hashmap is enabled.'
> Unit tests: AggregateBenchmark.
> Modify the 'ignore' function for the 'test' function.






[jira] [Updated] (SPARK-20466) HadoopRDD#addLocalConfiguration throws NPE

2017-04-26 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated SPARK-20466:
-
Attachment: NPE_log

The attached NPE_log contains the detailed info.

> HadoopRDD#addLocalConfiguration throws NPE
> --
>
> Key: SPARK-20466
> URL: https://issues.apache.org/jira/browse/SPARK-20466
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.2
>Reporter: liyunzhang_intel
> Attachments: NPE_log
>
>
> In Spark 2.0.2, it throws an NPE:
> {code}
> 17/04/23 08:19:55 ERROR executor.Executor: Exception in task 439.0 in stage 16.0 (TID 986)
> java.lang.NullPointerException
>     at org.apache.spark.rdd.HadoopRDD$.addLocalConfiguration(HadoopRDD.scala:373)
>     at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:243)
>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>     at org.apache.spark.scheduler.Task.run(Task.scala:86)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> Suggestion: add a null check to avoid the NPE:
> {code}
>   /** Add Hadoop configuration specific to a single partition and attempt. */
>   def addLocalConfiguration(jobTrackerId: String, jobId: Int, splitId: Int, attemptId: Int,
>       conf: JobConf) {
>     val jobID = new JobID(jobTrackerId, jobId)
>     val taId = new TaskAttemptID(new TaskID(jobID, TaskType.MAP, splitId), attemptId)
>     // Guard against a null JobConf to avoid the NPE above.
>     if (conf != null) {
>       conf.set("mapred.tip.id", taId.getTaskID.toString)
>       conf.set("mapred.task.id", taId.toString)
>       conf.setBoolean("mapred.task.is.map", true)
>       conf.setInt("mapred.task.partition", splitId)
>       conf.set("mapred.job.id", jobID.toString)
>     }
>   }
> {code}






[jira] [Created] (SPARK-20467) sbt-launch-lib.bash lacks the ASF header

2017-04-26 Thread liuzhaokun (JIRA)
liuzhaokun created SPARK-20467:
--

 Summary: sbt-launch-lib.bash lacks the ASF header
 Key: SPARK-20467
 URL: https://issues.apache.org/jira/browse/SPARK-20467
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.1.0
Reporter: liuzhaokun


When I used this script, I found that sbt-launch-lib.bash lacks the ASF header. 
This is not permitted by the Apache Foundation according to the Apache License 2.0.






[jira] [Created] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Daniel Li (JIRA)
Daniel Li created SPARK-20468:
-

 Summary: Refactor the ALS code
 Key: SPARK-20468
 URL: https://issues.apache.org/jira/browse/SPARK-20468
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Daniel Li
Priority: Minor


The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all in 
one file.  Here are some things I think could improve the clarity and 
maintainability of the code:

*  The file can be split into more manageable parts.  In particular, {{ALS}}, 
{{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
for better readability.

*  Certain parts can be encapsulated or moved to clarify the intent.  For 
example:
**  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
object, and it seems to take 12 individual parameters that are all members of 
the {{ALS}} class.  This method can be made an instance method.
**  The code that creates in-blocks and out-blocks in the body of 
{{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
in the {{ALS}} companion object, can be moved into a separate case class that 
holds the blocks.  This has the added benefit of allowing us to write specific 
Scaladoc to explain the logic behind these block objects, as their usage is 
certainly nontrivial yet is fundamental to the implementation.
**  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be hidden 
within {{UncompressedInBlock}} to clarify the scope of their usage.
**  Various builder classes could be encapsulated in the companion objects of 
the classes they build.

*  The code can be formatted more clearly.  For example:
**  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
formatted more clearly and have comments added explaining the reasoning behind 
key parts.  That these methods form the core of the ALS logic makes this doubly 
important for maintainability.
**  Parts of the code that use {{while}} loops with manually incremented 
counters can be rewritten as {{for}} loops (a toy sketch of this rewrite appears 
after this description).
**  Where non-idiomatic Scala code is used that doesn't improve performance 
much, clearer code can be substituted.  (This in particular should be done very 
carefully if at all, as it's apparent the original author spent much time and 
pains optimizing the code to significantly improve its runtime profile.)

*  The documentation (both Scaladocs and inline comments) can be clarified 
where needed and expanded where incomplete.  This is especially important for 
parts of the code that are written imperatively for performance, as these parts 
don't benefit from the intuitive self-documentation of Scala's higher-level 
language abstractions.  Specifically, I'd like to add documentation fully 
explaining the key functionality of the in-block and out-block objects, their 
purpose, how they relate to the overall ALS algorithm, and how they are 
calculated in such a way that new maintainers can ramp up much more quickly.

The above is not a complete enumeration of improvements but a high-level 
survey.  All of these improvements will, I believe, add up to make the code 
easier to understand, extend, and maintain.  This issue will track the progress 
of this refactoring so that going forward, authors will have an easier time 
maintaining this part of the project.
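
As an aside, a minimal sketch of the while-to-for rewrite mentioned above 
(hypothetical variables, not code taken from the ALS file):

{code}
// Hypothetical sketch of the kind of rewrite meant above, not actual ALS code.
val ratings = Array(1.0, 2.5, 3.0)

// Before: a while loop with a manually incremented counter.
var i = 0
var sum = 0.0
while (i < ratings.length) {
  sum += ratings(i)
  i += 1
}

// After: an equivalent for loop without the mutable counter.
var total = 0.0
for (r <- ratings) {
  total += r
}
{code}

(As one of the bullets above cautions, hot paths that were deliberately 
optimized should be left alone.)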



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984346#comment-15984346
 ] 

Apache Spark commented on SPARK-20468:
--

User 'danielyli' has created a pull request for this issue:
https://github.com/apache/spark/pull/17767

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984403#comment-15984403
 ] 

Sean Owen commented on SPARK-20470:
---

What JSON do you expect? what JSON do you get?
There's no relevant detail here.

> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing Array of struct doesn't 
> generate the right json. It looks trivial but can't get a good json out.
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}
Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}



  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)

Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}




> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing Array of struct doesn't 
> generate the right json. It looks trivial but can't get a good json out.
> I read the json below into a dataframe:
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> Output JSON (Wrong)
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   

[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{code}
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}
{code}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}
Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)
{code}
{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}
{code}


  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}
Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}




> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing Array of struct doesn't 
> generate the right json. It looks trivial but can't get a good json out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> {code}
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
>   

[jira] [Created] (SPARK-20472) Support for Dynamic Configuration

2017-04-26 Thread Shahbaz Hussain (JIRA)
Shahbaz Hussain created SPARK-20472:
---

 Summary: Support for Dynamic Configuration
 Key: SPARK-20472
 URL: https://issues.apache.org/jira/browse/SPARK-20472
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.1.0
Reporter: Shahbaz Hussain


Currently, Spark configuration cannot be changed dynamically.
It requires the Spark job to be killed and started again for a new 
configuration to take effect.
This issue is to enhance Spark so that configuration changes can be applied 
dynamically without requiring an application restart.

Ex: If the batch interval of a streaming job is 20 seconds and the user wants 
to reduce it to 5 seconds, currently this requires a re-submit of the job.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984451#comment-15984451
 ] 

Apache Spark commented on SPARK-20471:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/17771

> Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled 
> but vectorized hashmap is enabled.
> -
>
> Key: SPARK-20471
> URL: https://issues.apache.org/jira/browse/SPARK-20471
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: caoxuewen
>
> Remove the AggregateBenchmark test suite warning, such as:
> '14:26:33.220 WARN 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap 
> is disabled but vectorized hashmap is enabled.'
> Unit tests: AggregateBenchmark 
> Modify the 'ignore' function to the 'test' function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984533#comment-15984533
 ] 

Nick Pentreath commented on SPARK-11968:


Thanks - in the meantime I will take a look at the PR.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels

2017-04-26 Thread Christian Reiniger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984326#comment-15984326
 ] 

Christian Reiniger commented on SPARK-20081:


Thank you. All (potential) labels *are* in fact known in advance, so 
constructing a StringIndexerModel directly proved to be an excellent (and 
performant) solution.
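
For reference, a minimal Scala sketch of that approach (the label values and 
column names here are hypothetical):

{code}
// Minimal sketch: build the StringIndexerModel directly from the known label
// set instead of inferring it with StringIndexer.fit(), so the output column
// carries complete label metadata. Names below are made up for illustration.
import org.apache.spark.ml.feature.StringIndexerModel

val knownLabels: Array[String] = (0 until 143).map(i => s"label_$i").toArray
val indexerModel = new StringIndexerModel(knownLabels)
  .setInputCol("labelStr")
  .setOutputCol("indexedLabel")
val indexed = indexerModel.transform(trainingDf)  // trainingDf: the raw input DataFrame
{code}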

> RandomForestClassifier doesn't seem to support more than 100 labels
> ---
>
> Key: SPARK-20081
> URL: https://issues.apache.org/jira/browse/SPARK-20081
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
> Environment: Java
>Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RanfomForestClassifer#fit() 
> (from java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column 
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) 
> allowed to be inferred from values.  
>   To avoid this error for labels with > 100 classes, specify numClasses 
> explicitly in the metadata; this can be done by applying StringIndexer to the 
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a 
> difference. Looking at the code, this is not surprising, since 
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
>   def getNumClasses(labelSchema: StructField): Option[Int] = {
> Attribute.fromStructField(labelSchema) match {
>   case binAttr: BinaryAttribute => Some(2)
>   case nomAttr: NominalAttribute => nomAttr.getNumValues
>   case _: NumericAttribute | UnresolvedAttribute => None
> }
>   }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the 
> classifier, so that Classifier#getNumClasses() allows a larger number of 
> auto-detected labels. However, RandomForestClassifer#train() calls 
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the 
> default of 100:
> {code:language=scala}
>   override protected def train(dataset: Dataset[_]): 
> RandomForestClassificationModel = {
> val categoricalFeatures: Map[Int, Int] =
>   MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
> val numClasses: Int = getNumClasses(dataset)
> // ...
> {code}
> My scala skills are pretty sketchy, so please correct me if I misinterpreted 
> something. But as it seems right now, there is no way to learn from data with 
> more than 100 labels via RandomForestClassifier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20467) sbt-launch-lib.bash lacks the ASF header.

2017-04-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20467.
---
Resolution: Not A Problem

This file should not have an ASF header.

> sbt-launch-lib.bash lacks the ASF header.
> ---
>
> Key: SPARK-20467
> URL: https://issues.apache.org/jira/browse/SPARK-20467
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: liuzhaokun
>
> When I used this script, I found that sbt-launch-lib.bash lacks the ASF 
> header. According to the Apache License 2.0, this is not permitted by the 
> Apache Foundation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation

2017-04-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20400:
-

Assignee: Bill Chambers

> Remove References to Third Party Vendors from Spark ASF Documentation
> -
>
> Key: SPARK-20400
> URL: https://issues.apache.org/jira/browse/SPARK-20400
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Similar to SPARK-17445, vendors should probably not be referenced on the ASF 
> documentation.
> Related:
> https://github.com/apache/spark/commit/dc0a4c916151c795dc41b5714e9d23b4937f4636



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation

2017-04-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20400.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17695
[https://github.com/apache/spark/pull/17695]

> Remove References to Third Party Vendors from Spark ASF Documentation
> -
>
> Key: SPARK-20400
> URL: https://issues.apache.org/jira/browse/SPARK-20400
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Similar to SPARK-17445, vendors should probably not be referenced on the ASF 
> documentation.
> Related:
> https://github.com/apache/spark/commit/dc0a4c916151c795dc41b5714e9d23b4937f4636



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{code}
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}
{code}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}

Need to convert each row to json now and save to HBase 
{code}
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
{code}

Output JSON (Wrong)
{code}
{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}
{code}
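
For comparison, a minimal Scala sketch (assuming the DataFrame above is 
available as {{df}}) of serializing whole rows through {{toJSON}}, which keeps 
the nested struct field names; PySpark's {{Row.asDict(recursive=True)}}, where 
the installed version supports it, addresses the same point on the Python side:

{code}
// Minimal sketch, illustration only; assumes the DataFrame above is bound to df.
// DataFrame.toJSON keeps nested struct field names, unlike json.dumps over a
// non-recursive Row.asDict().
val jsonPerRow = df.toJSON            // one JSON document per row
jsonPerRow.take(1).foreach(println)
// {"feature":"feature_id_001","histogram":[{"start":1.9796...,"width":0.1564...,"y":968.0}, ...]}
{code}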


  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{code}
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}
{code}

Df schema looks good 

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)
{code}
Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)
{code}
{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}
{code}



> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing Array of struct doesn't 
> generate the right json. It looks trivial but can't get a good json out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> {code}
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> {code}
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> {code}
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   

[jira] [Commented] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984428#comment-15984428
 ] 

Philip Adetiloye commented on SPARK-20470:
--

[~srowen] I'm sorry, I added an example. 

> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing Array of struct doesn't 
> generate the right json. It looks trivial but can't get a good json out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> {code}
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> {code}
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> {code}
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.15644850561960366
> ],
> [
>   2.448955032046606,
>   690.0,
>   0.1564485056196041
> ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.

2017-04-26 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20471:
-

 Summary: Remove AggregateBenchmark testsuite warning: Two level 
hashmap is disabled but vectorized hashmap is enabled.
 Key: SPARK-20471
 URL: https://issues.apache.org/jira/browse/SPARK-20471
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.1.0
Reporter: caoxuewen


Remove the AggregateBenchmark test suite warning, such as:
'14:26:33.220 WARN 
org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap 
is disabled but vectorized hashmap is enabled.'

Unit tests: AggregateBenchmark 
Modify the 'ignore' function to the 'test' function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20472) Support for Dynamic Configuration

2017-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984461#comment-15984461
 ] 

Sean Owen commented on SPARK-20472:
---

I don't think this is generally possible, because some config is global, it is 
needed and has effect at startup, and can't be changed even if you wanted 
(think: JVM heap size). I doubt this is achievable. You will probably have to 
narrow this down much further. 

> Support for Dynamic Configuration
> -
>
> Key: SPARK-20472
> URL: https://issues.apache.org/jira/browse/SPARK-20472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.0
>Reporter: Shahbaz Hussain
>
> Currently, Spark configuration cannot be changed dynamically.
> It requires the Spark job to be killed and started again for a new 
> configuration to take effect.
> This issue is to enhance Spark so that configuration changes can be applied 
> dynamically without requiring an application restart.
> Ex: If the batch interval of a streaming job is 20 seconds and the user wants 
> to reduce it to 5 seconds, currently this requires a re-submit of the job.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Michal Szafranski (JIRA)
Michal Szafranski created SPARK-20473:
-

 Summary: ColumnVector.Array is missing accessors for some types
 Key: SPARK-20473
 URL: https://issues.apache.org/jira/browse/SPARK-20473
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Michal Szafranski


ColumnVector implementations originally did not support some Catalyst types 
(float, short, and boolean). Now that they do, those types should be also added 
to the ColumnVector.Array.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2017-04-26 Thread Calin Cocan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984540#comment-15984540
 ] 

Calin Cocan commented on SPARK-12965:
-

I have encountered the same problem using the StringIndexer and VectorAssembler 
components.
In my opinion this particular issue can be fixed directly in these ML classes.

Replacing, in StringIndexer's fit method,
  val counts = dataset.select(col($(inputCol)).cast(StringType))
with 
  val counts = dataset.select(col(s"`${$(inputCol)}`").cast(StringType))

should do the trick.
A change must also be made in StringIndexerModel's transform method. The 
call
   dataset.where(filterer(dataset($(inputCol))))
must be replaced with
   dataset.where(filterer(dataset(s"`${$(inputCol)}`")))

BTW, a similar problem can be encountered in VectorAssembler's transform method 
at this line (105 in Spark 2.1):
case _: NumericType | BooleanType => 
dataset(c).cast(DoubleType).as(s"${c}_double_$uid")

Replacing dataset(c) with its backquoted column name should fix the 
problem:

case _: NumericType | BooleanType => 
dataset(s"`$c`").cast(DoubleType).as(s"${c}_double_$uid")

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., Given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---++
> |a.b|a_b|a_bIndex|
> +---+---++
> |foo|foo| 0.0|
> |bar|bar| 1.0|
> +---+---++
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   
>   df = df.withColumn("a_b", df.col("`a.b`"));
>   
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984576#comment-15984576
 ] 

Apache Spark commented on SPARK-20473:
--

User 'michal-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/17772

> ColumnVector.Array is missing accessors for some types
> --
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>
> ColumnVector implementations originally did not support some Catalyst types 
> (float, short, and boolean). Now that they do, those types should be also 
> added to the ColumnVector.Array.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20475) Whether to use "broadcast join" depends on Hive configuration

2017-04-26 Thread Lijia Liu (JIRA)
Lijia Liu created SPARK-20475:
-

 Summary: Whether to use "broadcast join" depends on Hive configuration
 Key: SPARK-20475
 URL: https://issues.apache.org/jira/browse/SPARK-20475
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Lijia Liu


Currently, broadcast join in Spark only works when:
  1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 0 (the 
default is 10485760).
  2. The size of one of the Hive tables is less than 
"spark.sql.autoBroadcastJoinThreshold". To get the size information of the Hive 
table from the Hive metastore, "hive.stats.autogather" should be set to true in 
Hive, or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" must 
have been run.

But Hive calculates the size of the file or directory corresponding to the Hive 
table to determine whether to use a map-side join, and does not depend on the 
Hive metastore.

This leads to two problems:
  1. Spark will not use "broadcast join" when the Hive parameter 
"hive.stats.autogather" is not set to true, or when the command "ANALYZE TABLE 
<tableName> COMPUTE STATISTICS noscan" has not been run, because the size 
information of the Hive table has not been saved in the Hive metastore. How 
Spark behaves therefore depends on the configuration of Hive.
  2. For some reason, we set "hive.stats.autogather" to false in our Hive. For 
the same SQL, Hive is 4 times faster than Spark because Hive used a "map-side 
join" but Spark did not use a "broadcast join".

Is it possible to use the same mechanism as Hive's to look up the size of a 
Hive table in Spark?
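
For context, a minimal Scala sketch of the two ways a broadcast join can be 
obtained today (table and column names are hypothetical, {{spark}} is the 
SparkSession):

{code}
// Hypothetical sketch of the current workarounds, not a fix for this issue.
import org.apache.spark.sql.functions.broadcast

// 1. Populate table-size statistics in the metastore so Spark can decide to
//    broadcast automatically (or rely on hive.stats.autogather=true in Hive).
spark.sql("ANALYZE TABLE small_dim_table COMPUTE STATISTICS noscan")

// 2. Or bypass the statistics check entirely with an explicit broadcast hint.
val joined = spark.table("big_fact_table")
  .join(broadcast(spark.table("small_dim_table")), Seq("join_key"))
{code}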



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20474) OnHeapColumnVector reallocation may not copy existing data

2017-04-26 Thread Michal Szafranski (JIRA)
Michal Szafranski created SPARK-20474:
-

 Summary: OnHeapColumnVector reallocation may not copy existing data
 Key: SPARK-20474
 URL: https://issues.apache.org/jira/browse/SPARK-20474
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Michal Szafranski


OnHeapColumnVector reallocation copies data to the new storage only up to 
'elementsAppended'. This variable is only updated when using the 
ColumnVector.appendX API, while ColumnVector.putX is more commonly used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20474) OnHeapColumnVector reallocation may not copy existing data

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984577#comment-15984577
 ] 

Apache Spark commented on SPARK-20474:
--

User 'michal-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/17773

> OnHeapColumnVector reallocation may not copy existing data
> -
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>
> OnHeapColumnVector reallocation copies data to the new storage only up to 
> 'elementsAppended'. This variable is only updated when using the 
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20473:


Assignee: Apache Spark

> ColumnVector.Array is missing accessors for some types
> --
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>Assignee: Apache Spark
>
> ColumnVector implementations originally did not support some Catalyst types 
> (float, short, and boolean). Now that they do, those types should be also 
> added to the ColumnVector.Array.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20473:


Assignee: (was: Apache Spark)

> ColumnVector.Array is missing accessors for some types
> --
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>
> ColumnVector implementations originally did not support some Catalyst types 
> (float, short, and boolean). Now that they do, those types should be also 
> added to the ColumnVector.Array.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20473:


Assignee: Apache Spark

> ColumnVector.Array is missing accessors for some types
> --
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>Assignee: Apache Spark
>
> ColumnVector implementations originally did not support some Catalyst types 
> (float, short, and boolean). Now that they do, those types should be also 
> added to the ColumnVector.Array.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20474) OnHeapColumnVector reallocation may not copy existing data

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20474:


Assignee: (was: Apache Spark)

> OnHeapColumnVector reallocation may not copy existing data
> -
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>
> OnHeapColumnVector reallocation copies data to the new storage only up to 
> 'elementsAppended'. This variable is only updated when using the 
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20473:


Assignee: (was: Apache Spark)

> ColumnVector.Array is missing accessors for some types
> --
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>
> ColumnVector implementations originally did not support some Catalyst types 
> (float, short, and boolean). Now that they do, those types should be also 
> added to the ColumnVector.Array.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20474) OnHeapColumnVector reallocation may not copy existing data

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20474:


Assignee: Apache Spark

> OnHeapColumnVector reallocation may not copy existing data
> -
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>Assignee: Apache Spark
>
> OnHeapColumnVector reallocation copies data to the new storage only up to 
> 'elementsAppended'. This variable is only updated when using the 
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776
 ] 

Sebastian Arzt commented on SPARK-18371:


I deep dived into it and found a simple solution. The problem is that 
[maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94]
 returns {{None}} for an unintended case. {{None}} should only be returned if 
there is no lag as indicated by this 
[condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114].
 However, this condition is also true if all backpressureRates are 
[rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107]
 to zero. I propose a solution, where rounding is omitted at all. I has the 
nice side-effect that the backpressure in more fine-grained and not only an 
integral multiple of the 
[batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117]
 in seconds. I will open a pull request for it soon.
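
A small numeric sketch (made-up numbers) of the rounding effect described 
above:

{code}
// Made-up numbers illustrating the rounding problem described in the comment above.
val batchDurationSec = 10L
val ratePerPartition = 0.4                                // records/sec from the rate estimator

val roundedRate = Math.round(ratePerPartition)            // 0 -> indistinguishable from "no limit"
val unroundedBatch = ratePerPartition * batchDurationSec  // 4.0 records would fit in the batch
{code}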

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776
 ] 

Sebastian Arzt edited comment on SPARK-18371 at 4/26/17 1:24 PM:
-

I deep dived into it and found a simple solution. The problem is that 
[maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94]
 returns {{None}} for an unintended case. {{None}} should only be returned if 
there is no lag as indicated by this 
[condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114].
 However, this condition is also true if all backpressureRates are 
[rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107]
 to zero. I propose a solution where rounding is omitted altogether. This has 
the nice side-effect that backpressure is more fine-grained and not restricted 
to an integral multiple of the 
[batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117]
 in seconds. I will open a pull request for it soon.


was (Author: seb.arzt):
I deep dived into it and found a simple solution. The problem is that 
[maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94]
 returns {{None}} for an unintended case. {{None}} should only be returned if 
there is no lag as indicated by this 
[condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114].
 However, this condition is also true if all backpressureRates are 
[rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107]
 to zero. I propose a solution, where rounding is omitted at all. I has the 
nice side-effect that the backpressure is more fine-grained and not only an 
integral multiple of the 
[batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117]
 in seconds. I will open a pull request for it soon.

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19812) YARN shuffle service fails to relocate recovery DB across NFS directories

2017-04-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-19812.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.0

> YARN shuffle service fails to relocate recovery DB across NFS directories
> -
>
> Key: SPARK-19812
> URL: https://issues.apache.org/jira/browse/SPARK-19812
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 2.2.0, 2.3.0
>
>
> The YARN shuffle service tries to switch from the YARN local directories to 
> the real recovery directory but can fail to move the existing recovery DBs.  
> It fails because Files.move does not move directories that have contents.
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move 
> recovery file sparkShuffleRecovery.ldb to the path 
> /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
> at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
> at 
> sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
> at java.nio.file.Files.move(Files.java:1395)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This used to use f.renameTo and we switched it in the PR due to review 
> comments, and it looks like we didn't do a final real test. The tests are 
> using files rather than directories, so they didn't catch this. We need to 
> fix the test as well.
> history: 
> https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440
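For illustration, a minimal sketch of one way to relocate a non-empty directory 
when a plain Files.move/renameTo fails. This is only an assumption-laden sketch, 
not necessarily what the linked PR does:

{code}
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Sketch: copy the tree child by child, then delete the source. Regular files
// are moved with Files.move, which falls back to copy+delete when the target
// is on a different file store, so the cross-directory case keeps working.
def moveDirRecursively(src: File, dst: File): Unit = {
  if (src.isDirectory) {
    if (!dst.exists()) dst.mkdirs()
    src.listFiles().foreach { child =>
      moveDirRecursively(child, new File(dst, child.getName))
    }
    src.delete()  // the source directory is empty at this point
  } else {
    Files.move(src.toPath, dst.toPath, StandardCopyOption.REPLACE_EXISTING)
  }
}
{code}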



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776
 ] 

Sebastian Arzt edited comment on SPARK-18371 at 4/26/17 1:23 PM:
-

I dug into it and found a simple solution. The problem is that 
[maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94]
 returns {{None}} for an unintended case. {{None}} should only be returned if 
there is no lag, as indicated by this 
[condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114].
 However, this condition is also true if all backpressure rates are 
[rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107]
 to zero. I propose a solution where rounding is omitted altogether. It has the 
nice side effect that the backpressure rate is more fine-grained and not just an 
integral multiple of the 
[batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117]
 in seconds. I will open a pull request for it soon.
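For illustration, a minimal sketch of the proposed behaviour with hypothetical 
helper names (not the actual DirectKafkaInputDStream code): keep the 
per-partition rate fractional and only return {{None}} when there really is no 
lag.

{code}
// Sketch only: distribute a total backpressure rate over partitions in
// proportion to their lag, without rounding per-partition rates to whole
// records per second. Names are hypothetical, not Spark's internals.
def perPartitionRates(
    lagPerPartition: Map[Int, Long],  // partition -> unconsumed offsets
    totalRate: Double                 // records/sec allowed by backpressure
  ): Option[Map[Int, Double]] = {
  val totalLag = lagPerPartition.values.sum
  if (totalLag <= 0 || totalRate <= 0) {
    None  // genuinely no lag (or no rate budget): no per-partition limit
  } else {
    // A fractional rate never collapses to zero, so a small positive rate
    // can no longer be mistaken for "no limit".
    Some(lagPerPartition.map { case (partition, lag) =>
      partition -> totalRate * lag.toDouble / totalLag
    })
  }
}
{code}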


was (Author: seb.arzt):
I deep dived into it and found a simple solution. The problem is that 
[maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94]
 returns {{None}} for an unintended case. {{None}} should only be returned if 
there is no lag as indicated by this 
[condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114].
 However, this condition is also true if all backpressureRates are 
[rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107]
 to zero. I propose a solution, where rounding is omitted at all. I has the 
nice side-effect that the backpressure in more fine-grained and not only an 
integral multiple of the 
[batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117]
 in seconds. I will open a pull request for it soon.

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API

2017-04-26 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-20391:


Assignee: Saisai Shao

> Properly rename the memory related fields in ExecutorSummary REST API
> -
>
> Key: SPARK-20391
> URL: https://issues.apache.org/jira/browse/SPARK-20391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Currently in Spark we can get an executor summary through the REST API 
> {{/api/v1/applications//executors}}. The format of the executor summary 
> is:
> {code}
> class ExecutorSummary private[spark](
> val id: String,
> val hostPort: String,
> val isActive: Boolean,
> val rddBlocks: Int,
> val memoryUsed: Long,
> val diskUsed: Long,
> val totalCores: Int,
> val maxTasks: Int,
> val activeTasks: Int,
> val failedTasks: Int,
> val completedTasks: Int,
> val totalTasks: Int,
> val totalDuration: Long,
> val totalGCTime: Long,
> val totalInputBytes: Long,
> val totalShuffleRead: Long,
> val totalShuffleWrite: Long,
> val isBlacklisted: Boolean,
> val maxMemory: Long,
> val executorLogs: Map[String, String],
> val onHeapMemoryUsed: Option[Long],
> val offHeapMemoryUsed: Option[Long],
> val maxOnHeapMemory: Option[Long],
> val maxOffHeapMemory: Option[Long])
> {code}
> Here are 6 memory related fields: {{memoryUsed}}, {{maxMemory}}, 
> {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}}, 
> {{maxOffHeapMemory}}.
> All 6 of these fields reflect the *storage* memory usage in Spark, but from 
> the names of these fields, users don't really know whether they refer to 
> *storage* memory or the total memory (storage memory + execution memory). 
> This is misleading.
> So I think we should properly rename these fields to reflect their real 
> meanings, or we should document it.
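For context, a minimal sketch of how a user might read these fields from the 
REST endpoint above (host, port, and application id are placeholders; the JSON 
handling uses json4s, which Spark already bundles):

{code}
import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Placeholders: substitute the real driver host/port and application id.
val url = "http://<driver-host>:4040/api/v1/applications/<app-id>/executors"
implicit val formats = DefaultFormats

val executors = parse(Source.fromURL(url).mkString)
for (e <- executors.children) {
  val id   = (e \ "id").extract[String]
  val used = (e \ "memoryUsed").extract[Long]  // storage memory used
  val max  = (e \ "maxMemory").extract[Long]   // storage memory available
  println(s"executor $id: $used of $max bytes of storage memory used")
}
{code}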



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984771#comment-15984771
 ] 

Steve Loughran commented on SPARK-7481:
---

I think we ended up going in circles on that PR. Sean has actually been very 
tolerant of me; however, it's been hampered by my full-time focus on other 
things. I've only had time to work on the Spark PR intermittently, and that's 
been hard for all: me in the rebase/retest, the one reviewer in having to catch 
up again.

Now, anyone who does manage to get that classpath right will discover that S3A 
absolutely flies with Spark: in partitioning (list-file improvements), data 
input (set fadvise=true for ORC and Parquet), and output (set fast.output=true, 
play with the pool options). It delivers that performance because this patch 
sets things up for the integration tests downstream of it, so I and others can 
be confident that these things actually work, at speed, at scale. Indeed, much 
of the S3A performance work was actually based on Hive and Spark workloads: the 
data formats and their seek patterns, directory layouts, and file generation. 
All that's left is the little problem of getting the classpath right. Oh, and 
the committer.
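As a rough illustration of the tuning above (a sketch only: the property names 
assume the Hadoop 2.8+ S3A client, passed through Spark's spark.hadoop.* 
prefix, and the values are just examples):

{code}
import org.apache.spark.sql.SparkSession

// Sketch only: the shorthand above ("fadvise", "fast output", "pool options")
// maps roughly to these Hadoop S3A properties; exact names and defaults depend
// on the Hadoop version on the classpath.
val spark = SparkSession.builder()
  .appName("s3a-tuning-sketch")
  // random-access reads help columnar formats such as ORC and Parquet
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  // buffered "fast" upload path for output
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  // thread/connection pool sizes ("pool options")
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  .config("spark.hadoop.fs.s3a.connection.maximum", "64")
  .getOrCreate()

// hypothetical path, just to show the s3a:// scheme in use
val df = spark.read.parquet("s3a://some-bucket/some/table")
println(df.count())
{code}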


For now, for people's enjoyment, here are some videos from Spark Summit East on 
the topic:

* [Spark and object stores|https://youtu.be/8F2Jqw5_OnI]
* [Robust and Scalable ETL over Cloud Storage with 
Spark|https://spark-summit.org/east-2017/events/robust-and-scalable-etl-over-cloud-storage-with-spark/]



> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API

2017-04-26 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20391.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17700
[https://github.com/apache/spark/pull/17700]

> Properly rename the memory related fields in ExecutorSummary REST API
> -
>
> Key: SPARK-20391
> URL: https://issues.apache.org/jira/browse/SPARK-20391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Currently in Spark we can get an executor summary through the REST API 
> {{/api/v1/applications//executors}}. The format of the executor summary 
> is:
> {code}
> class ExecutorSummary private[spark](
> val id: String,
> val hostPort: String,
> val isActive: Boolean,
> val rddBlocks: Int,
> val memoryUsed: Long,
> val diskUsed: Long,
> val totalCores: Int,
> val maxTasks: Int,
> val activeTasks: Int,
> val failedTasks: Int,
> val completedTasks: Int,
> val totalTasks: Int,
> val totalDuration: Long,
> val totalGCTime: Long,
> val totalInputBytes: Long,
> val totalShuffleRead: Long,
> val totalShuffleWrite: Long,
> val isBlacklisted: Boolean,
> val maxMemory: Long,
> val executorLogs: Map[String, String],
> val onHeapMemoryUsed: Option[Long],
> val offHeapMemoryUsed: Option[Long],
> val maxOnHeapMemory: Option[Long],
> val maxOffHeapMemory: Option[Long])
> {code}
> Here are 6 memory related fields: {{memoryUsed}}, {{maxMemory}}, 
> {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}}, 
> {{maxOffHeapMemory}}.
> All 6 of these fields reflect the *storage* memory usage in Spark, but from 
> the names of these fields, users don't really know whether they refer to 
> *storage* memory or the total memory (storage memory + execution memory). 
> This is misleading.
> So I think we should properly rename these fields to reflect their real 
> meanings, or we should document it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984962#comment-15984962
 ] 

Apache Spark commented on SPARK-18371:
--

User 'arzt' has created a pull request for this issue:
https://github.com/apache/spark/pull/17774

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Arzt updated SPARK-18371:
---
Comment: was deleted

(was: After fix)

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Arzt updated SPARK-18371:
---
Comment: was deleted

(was: Before fix)

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-20476:
--
Description: 
I encountered this problem when I wanted to create a table as select , 
get_json_object from xxx.
This one is wrong:
{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
This one is OK:
{code}
create table spark_json_object as
select *
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
This one is also OK:
{code}
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}

{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
org.apache.hadoop.hive.serde2.SerDeException 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.(Dataset.scala:179)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)

[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-20476:
--
Description: 
I encountered this problem when I wanted to create a table as select , 
get_json_object from xxx.
This one is wrong:
{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
This one is OK:
{code}
create table spark_json_object as
select *
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
This one is also OK:
{code}
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
org.apache.hadoop.hive.serde2.SerDeException 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.(Dataset.scala:179)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at 

[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Arzt updated SPARK-18371:
---
Attachment: 01.png

Before fix

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Arzt updated SPARK-18371:
---
Attachment: 02.png

After fix

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984975#comment-15984975
 ] 

Sebastian Arzt commented on SPARK-18371:


Screenshots:

[before|https://issues.apache.org/jira/secure/attachment/12865156/01.png]
[after|https://issues.apache.org/jira/secure/attachment/12865158/02.png]

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17403) Fatal Error: Scan cached strings

2017-04-26 Thread Paul Lysak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982595#comment-15982595
 ] 

Paul Lysak edited comment on SPARK-17403 at 4/26/17 3:23 PM:
-

Looks like we have the same issue with Spark 2.1 on YARN (Amazon EMR release 
emr-5.4.0).
A workaround that solves the issue for us (at the cost of some performance) is 
to use df.persist(StorageLevel.DISK_ONLY) instead of df.cache(); see the sketch 
below.
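A minimal sketch of that swap (hypothetical helper name; {{cache()}} on a 
Dataset defaults to MEMORY_AND_DISK):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Workaround sketch: keep the cached data on local disk only, instead of the
// default MEMORY_AND_DISK storage level used by df.cache().
def cacheOnDiskOnly(df: DataFrame): DataFrame =
  df.persist(StorageLevel.DISK_ONLY)
{code}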

Depending on the node types, memory settings, storage level, and some other 
factors I couldn't clearly identify, it may appear as 

{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 158 in stage 504.0 failed 4 times, most recent failure: 
Lost task 158.3 in stage 504.0 (TID 427365, ip-10-35-162-171.ec2.internal, 
executor 83): java.lang.NegativeArraySizeException
at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
at org.apache.spark.unsafe.types.UTF8String.clone(UTF8String.java:826)
at 
org.apache.spark.sql.execution.columnar.StringColumnStats.gatherStats(ColumnStats.scala:217)
at 
org.apache.spark.sql.execution.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:55)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.org$apache$spark$sql$execution$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:122)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
{noformat}

or as 

{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 27 in stage 61.0 failed 4 times, most recent failure: Lost 
task 27.3 in stage 61.0 (TID 36167, ip-10-35-162-149.ec2.internal, executor 1): 
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_38$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:107)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
{noformat}

or as

{noformat}
2017-04-24 19:02:45,951 ERROR 
org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in 
thread Thread[Executor task launch worker-3,5,main]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_37$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:217)
at 

[jira] [Assigned] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18371:


Assignee: (was: Apache Spark)

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18371:


Assignee: Apache Spark

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
>Assignee: Apache Spark
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17403) Fatal Error: Scan cached strings

2017-04-26 Thread Paul Lysak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984987#comment-15984987
 ] 

Paul Lysak commented on SPARK-17403:


Hope that helps - finally managed to reproduce it without using production data:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val leftRows = spark.sparkContext.parallelize(numSlices = 3000, seq = for (k <- 1 to 1000; j <- 1 to 1000) yield Row(k, j))
  .flatMap(r => (1 to 1000).map(l => Row.fromSeq(r.toSeq :+ l)))

val leftDf = spark.createDataFrame(leftRows, StructType(Seq(
  StructField("k", IntegerType),
  StructField("j", IntegerType),
  StructField("l", IntegerType)
)))
  .withColumn("combinedKey", expr("k*100 + j*1000 + l"))
  .withColumn("fixedCol", lit("sampleVal"))
  .withColumn("combKeyStr", format_number(col("combinedKey"), 0))
  .withColumn("k100", expr("k*100"))
  .withColumn("j100", expr("j*100"))
  .withColumn("l100", expr("l*100"))
  .withColumn("k_200", expr("k+200"))
  .withColumn("j_200", expr("j+200"))
  .withColumn("l_200", expr("l+200"))
  .withColumn("strCol1_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol1_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol1_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  .withColumn("strCol2_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol2_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol2_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  .withColumn("strCol3_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol3_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol3_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0))) // if these additional string columns are commented out, the error disappears

leftDf.cache()
println("= leftDf count:" + leftDf.count())
leftDf.show(10)

val rightRows = spark.sparkContext.parallelize((1 to 800).map(i => Row(i, "k_" + i, "sampleVal")))
val rightDf = spark.createDataFrame(rightRows, StructType(Seq(
  StructField("k", IntegerType),
  StructField("kStr", StringType),
  StructField("sampleCol", StringType)
)))

rightDf.cache()
println("= rightDf count:" + rightDf.count())
rightDf.show(10)

val joinedDf = leftDf.join(broadcast(rightDf), usingColumns = Seq("k"), joinType = "left")

joinedDf.cache()
println("= joinedDf count:" + joinedDf.count())
joinedDf.show(10)
{code}

ApplicationMaster fails with an exception like this:
{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 949 in stage 8.0 failed 4 times, most recent failure: Lost 
task 949.3 in stage 8.0 (TID 4922, ip-10-35-162-219.ec2.internal, executor 
139): java.lang.NegativeArraySizeException
at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
at org.apache.spark.unsafe.types.UTF8String.clone(UTF8String.java:826)
at 
org.apache.spark.sql.execution.columnar.StringColumnStats.gatherStats(ColumnStats.scala:216)
at 
org.apache.spark.sql.execution.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:55)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.org$apache$spark$sql$execution$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:122)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at 

[jira] [Created] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread cen yuhai (JIRA)
cen yuhai created SPARK-20476:
-

 Summary: Exception between "create table as" and "get_json_object"
 Key: SPARK-20476
 URL: https://issues.apache.org/jira/browse/SPARK-20476
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: cen yuhai


{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'')
from dw.dw_prd_restaurant where dt='2017-04-24' limit 10;
{code}

{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
org.apache.hadoop.hive.serde2.SerDeException 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.(Dataset.scala:179)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 

[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-20476:
--
Description: 
{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}



{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
org.apache.hadoop.hive.serde2.SerDeException 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.(Dataset.scala:179)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 

[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985175#comment-15985175
 ] 

Xiao Li commented on SPARK-20476:
-

You can bypass it with:
{noformat}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.') as col1
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{noformat}

I will fix it later.
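For reference, a rough DataFrame API equivalent of that workaround (a sketch 
only: table, column, and the '$.' path are taken from the report above, and 
saveAsTable stands in for the CTAS):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Naming the extracted column explicitly ("col1") avoids the auto-generated
// column name that the Hive SerDe trips over in the CTAS above.
spark.table("dw.dw_prd_order")
  .where("dt = '2017-04-24'")
  .select(get_json_object(col("deliver_geojson"), "$.").as("col1"))
  .limit(10)
  .write
  .saveAsTable("spark_json_object")
{code}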

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> I encountered this problem when I wanted to create a table as select , 
> get_json_object from xxx.
> This one is wrong:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> This one is OK:
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> This one is also OK:
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> 

[jira] [Created] (SPARK-20478) Document LinearSVC in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-20478:


 Summary: Document LinearSVC in R programming guide
 Key: SPARK-20478
 URL: https://issues.apache.org/jira/browse/SPARK-20478
 Project: Spark
  Issue Type: Documentation
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20477) Document R bisecting k-means in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-20477:


 Summary: Document R bisecting k-means in R programming guide
 Key: SPARK-20477
 URL: https://issues.apache.org/jira/browse/SPARK-20477
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20478) Document LinearSVC in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985227#comment-15985227
 ] 

Felix Cheung commented on SPARK-20478:
--

[~wangmiao1981] Would you like to add this?

> Document LinearSVC in R programming guide
> -
>
> Key: SPARK-20478
> URL: https://issues.apache.org/jira/browse/SPARK-20478
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20477) Document R bisecting k-means in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985226#comment-15985226
 ] 

Felix Cheung commented on SPARK-20477:
--

[~wangmiao1981] Would you like to add this?

> Document R bisecting k-means in R programming guide
> ---
>
> Key: SPARK-20477
> URL: https://issues.apache.org/jira/browse/SPARK-20477
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20477) Document R bisecting k-means in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20477:
-
Issue Type: Documentation  (was: Bug)

> Document R bisecting k-means in R programming guide
> ---
>
> Key: SPARK-20477
> URL: https://issues.apache.org/jira/browse/SPARK-20477
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20473.
-
   Resolution: Fixed
 Assignee: Michal Szafranski
Fix Version/s: 2.2.0

> ColumnVector.Array is missing accessors for some types
> --
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>Assignee: Michal Szafranski
> Fix For: 2.2.0
>
>
> ColumnVector implementations originally did not support some Catalyst types 
> (float, short, and boolean). Now that they do, those types should also be 
> added to ColumnVector.Array.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985309#comment-15985309
 ] 

Apache Spark commented on SPARK-20208:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17775

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20208:


Assignee: Apache Spark  (was: Maciej Szymkiewicz)

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20208:


Assignee: Maciej Szymkiewicz  (was: Apache Spark)

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20470:

Component/s: SQL

> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing Array of struct doesn't 
> generate the right json. It looks trivial but can't get a good json out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> {code}
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> {code}
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> {code}
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.15644850561960366
> ],
> [
>   2.448955032046606,
>   690.0,
>   0.1564485056196041
> ]
> }
> {code}
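
A possible way around it (a sketch in Scala; PySpark has the same df.toJSON
method, and the file path below is illustrative): let the engine serialize each
row instead of calling json.dumps on Row.asDict() by hand. asDict() leaves the
nested struct Rows as tuples, so json.dumps writes them as plain JSON arrays
and the field names are lost, which matches the wrong output above.

{code}
// Sketch: letting the engine serialize rows keeps the struct field names.
val df = spark.read.json("histograms.json")   // illustrative path
val jsonStrings = df.toJSON.rdd               // RDD[String], one JSON document per row
jsonStrings.take(1).foreach(println)
{code}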



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985040#comment-15985040
 ] 

Steve Loughran commented on SPARK-7481:
---

(This is a fairly long comment, but it tries to summarise the entire state of 
interaction with object stores, esp. S3A on Hadoop 2.8+. Azure is simpler; GCS is 
Google's problem; Swift is not used very much.)

If you look at object store & Spark (or indeed, any code which uses a 
filesystem as the source and dest of work), there are problems which can 
generally be grouped into various categories.

h3. Foundational: talking to the object stores

classpath & execution: can you wire the JARs up? Longstanding issue in ASF 
Spark releases (SPARK-5348, SPARK-12557). This was exacerbated by the movement 
of S3n:// to the hadoop-aws-package (FWIW, I hadn't noticed that move, I'd have 
blocked it if I'd been paying attention). This includes transitive problems 
(SPARK-11413)

Credential propagation. Spark's env var propagation is pretty cute here; 
SPARK-19739 picks up {{AWS_SESSION_TOKEN}} too. Diagnostics on failure is a 
real pain.


h3. Observable Inconsistencies leading to Data loss

Generally where the metaphor "it's just a filesystem" fails. These are bad 
because they often "just work", especially in dev & Test with small datasets, 
and when they go wrong, they can fail by generating bad results *and nobody 
notices*.

* Expectations of consistent listing of "directories". S3Guard (HADOOP-13345) deals 
with this, as can Netflix's S3mper and AWS's premium Dynamo-backed S3 
storage.
* Expectations on the transacted nature of Directory renames, the core atomic 
commit operations against full filesystems.
* Expectations that when things are deleted they go away. This does become 
visible sometimes, usually in checks for a destination not existing 
(SPARK-19013)
* Expectations that write-in-progress data is visible/flushed, that {{close()}} 
is low cost. SPARK-19111.

Committing pretty much combines all of these, see below for more details.

h3. Aggressively bad performance

That's the mismatch between what the object store offers, what the apps expect, 
and the work the Hadoop FileSystem implementations do to maintain the metaphor, 
which, in trying to hide the conceptual mismatch, can actually amplify the problem. 
Example: Directory tree scanning at the start of a query. The mock directory 
structure allows callers to do treewalks, when really a full list of all 
children can be done as a direct O(1) call. SPARK-17159 covers some of this for 
scanning directories in Spark Streaming, but there's a hidden tree walk in 
every call to {{FileSystem.globStatus()}} (HADOOP-13371). Given how S3Guard 
transforms this treewalk, and you need it for consistency, that's probably the 
best solution for now. Although I have a PoC which does a full List **/* 
followed by a filter, that's not viable when you have a wide deep tree and do 
need to prune aggressively.

Checkpointing to object stores is similar: it's generally not dangerous to do 
the write+rename, just adds the copy overhead, consistency issues 
notwithstanding.


h3. Suboptimal code. 

There are opportunities for speedup, but if it's not on the critical path, not 
worth the hassle. That said, as every call to {{getFileStatus()}} can take 
hundreds of millis, they get onto the critical path quite fast. Examples: checks 
for a file existing before calling {{fs.delete(path)}} (this is always a no-op 
if the dest path isn't there), and the equivalent on mkdirs: {{if 
(!fs.exists(dir)) fs.mkdirs(path)}}. Hadoop 3.0 will help steer people on the 
path of righteousness there by deprecating a couple of methods which encourage 
inefficiencies (isFile/isDir).
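
A tiny illustration of that pattern (not Spark code, just the shape of the
problem against {{org.apache.hadoop.fs.FileSystem}}):

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Each exists()/getFileStatus() is an extra round trip against the store,
// so the guarded forms below roughly double the cost for no benefit.
def resetSlow(fs: FileSystem, dir: Path): Unit = {
  if (fs.exists(dir)) {          // extra HEAD/LIST request
    fs.delete(dir, true)
  }
  if (!fs.exists(dir)) {         // another extra request
    fs.mkdirs(dir)
  }
}

// delete() is already a no-op when the path is absent and mkdirs() is
// idempotent, so the checks can simply be dropped.
def resetFast(fs: FileSystem, dir: Path): Unit = {
  fs.delete(dir, true)
  fs.mkdirs(dir)
}
{code}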


h3. The commit problem 

The full commit problem combines all of these: you need a consistent list of 
source data, your deleted destination path mustn't appear in listings, the 
commit of each task must promote a task's work to the pending output of the 
job; an abort must leave no trace of it. The final job commit must place data 
into the final destination, and again, a job abort must not make any output visible. 
There's some ambiguity about what happens if task and job commits fail; 
generally the safest is "abort everything". Furthermore, nobody has any idea what 
to do if an {{abort()}} raises exceptions. Oh, and all of this must be fast. 
Spark is no better or worse here than the core MapReduce committers, or Hive's.

Spark generally uses the Hadoop {{FileOutputFormat}} via the 
{{HadoopMapReduceCommitProtocol}}, directly or indirectly (e.g 
{{ParquetOutputFormat}}), extracting its committer and casting it to 
{{FileOutputCommitter}}, primarily to get a working directory. This committer 
assumes the destination is a consistent FS, uses renames when promoting task 
and job output, assuming that is so fast it doesn't even bother to log a 
message "about to rename". Hence the recurrent Stack Overflow 

[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985096#comment-15985096
 ] 

Xiao Li commented on SPARK-20476:
-

This sounds like a bug to me. Let me double-check it.

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> I encountered this problem when I wanted to create a table as select 
> get_json_object from xxx;
> It is wrong.
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
> at org.apache.spark.sql.Dataset.(Dataset.scala:179)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Commented] (SPARK-20475) Whether use "broadcast join" depends on hive configuration

2017-04-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985100#comment-15985100
 ] 

Xiao Li commented on SPARK-20475:
-

cc [~ZenWzh]

> Whether use "broadcast join" depends on hive configuration
> --
>
> Key: SPARK-20475
> URL: https://issues.apache.org/jira/browse/SPARK-20475
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Lijia Liu
>
> Currently, broadcast join in Spark only works when:
>   1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 
> 0 (default is 10485760).
>   2. The size of one of the hive tables is less than 
> "spark.sql.autoBroadcastJoinThreshold". To get the size information of the 
> hive table from the hive metastore, "hive.stats.autogather" should be set to 
> true in hive, or the command "ANALYZE TABLE  COMPUTE STATISTICS 
> noscan" must have been run.
> Hive, however, calculates the size of the file or directory corresponding to 
> the hive table to determine whether to use the map side join, and does not 
> depend on the hive metastore.
> This leads to two problems:
>   1. Spark will not use "broadcast join" when the hive parameter 
> "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE 
>  COMPUTE STATISTICS noscan" has not been run, because the 
> information about the hive table has not been saved in the hive metastore. How 
> Spark behaves therefore depends on the configuration of Hive.
>   2. For some reason, we set "hive.stats.autogather" to false in our Hive. 
> For the same SQL, Hive is 4 times faster than Spark because Hive used "map 
> side join" but Spark did not use "broadcast join".
> Is it possible to use the same mechanism as Hive's to look up the size of a 
> hive table in Spark?
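
For reference, a minimal sketch of making the table size visible to Spark's
planner today (the table name is a placeholder; this only restates the
mechanism described above, not a fix for the request):

{code}
// Without statistics in the metastore, Spark cannot tell the table is small
// enough to broadcast; computing them by hand works around that.
spark.sql("ANALYZE TABLE dw.some_table COMPUTE STATISTICS noscan")

// The broadcast decision also depends on this threshold (bytes; default 10485760).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760L)
{code}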



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20467) sbt-launch-lib.bash has lacked the ASF header.

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984298#comment-15984298
 ] 

Apache Spark commented on SPARK-20467:
--

User 'liu-zhaokun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17769

> sbt-launch-lib.bash has lacked the  ASF header.
> ---
>
> Key: SPARK-20467
> URL: https://issues.apache.org/jira/browse/SPARK-20467
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: liuzhaokun
>
> When I used this script, I found that sbt-launch-lib.bash lacks the ASF header. This 
> is not permitted by the Apache Foundation according to the Apache License 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20467) sbt-launch-lib.bash has lacked the ASF header.

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20467:


Assignee: (was: Apache Spark)

> sbt-launch-lib.bash has lacked the  ASF header.
> ---
>
> Key: SPARK-20467
> URL: https://issues.apache.org/jira/browse/SPARK-20467
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: liuzhaokun
>
> When I used this script, I found that sbt-launch-lib.bash lacks the ASF header. This 
> is not permitted by the Apache Foundation according to the Apache License 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984365#comment-15984365
 ] 

Sean Owen commented on SPARK-20468:
---

Please read http://spark.apache.org/contributing.html first; we don't assign 
issues at this stage.

A lot of this is just taste, so I'm not sure it clearly improves readability.
A major problem with some of these suggestions, if I understand correctly, is 
that you'd be breaking an API.
I think the while loops are there on purpose, to avoid the small overhead of a 
for loop (see the sketch below).

I'd start significantly smaller, with things like a proposed doc cleanup or 
clear improvements to internal-only elements.
Please also weigh the cost: reviewer time, and obstacles to back-ports.
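
For context, a minimal sketch (not ALS code) of the trade-off being discussed:

{code}
// Not from ALS: just the shape of the trade-off. The manual while loop
// avoids allocating a closure and iterating a Range on a hot inner path.
def sumWhile(xs: Array[Double]): Double = {
  var i = 0
  var acc = 0.0
  while (i < xs.length) {
    acc += xs(i)
    i += 1
  }
  acc
}

// The for-comprehension reads more clearly but desugars to Range.foreach
// with a closure, which may not JIT down to the same tight loop.
def sumFor(xs: Array[Double]): Double = {
  var acc = 0.0
  for (i <- xs.indices) acc += xs(i)
  acc
}
{code}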

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984399#comment-15984399
 ] 

Apache Spark commented on SPARK-20392:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/17770

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here are a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c17d5ef9
> 109_bucketizer_f22df23ef56f
> 110_bucketizer_bad04578bd20
> 111_bucketizer_35cfbde7e28f

[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- start: double (nullable = true)
  |||-- width: double (nullable = true)
  |||-- y: double (nullable = true)

Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}



  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

# root
#  |-- feature: string (nullable = true)
#  |-- histogram: array (nullable = true)
#  ||-- element: struct (containsNull = true)
#  |||-- start: double (nullable = true)
#  |||-- width: double (nullable = true)
#  |||-- y: double (nullable = true)

Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}




> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing Array of struct doesn't 
> generate the right json. It looks trivial but can't get a good json out.
> I read the json below into a dataframe:
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> Df schema looks good 
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> Need to convert each row to json now and save to HBase 
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> Output JSON (Wrong)
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.15644850561960366
> ],
> 

[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-20470:
-
Description: 
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.

I read the json below into a dataframe:
{
  "feature": "feature_id_001",
  "histogram": [
{
  "start": 1.9796095151877942,
  "y": 968.0,
  "width": 0.1564485056196041
},
{
  "start": 2.1360580208073983,
  "y": 892.0,
  "width": 0.1564485056196041
},
{
  "start": 2.2925065264270024,
  "y": 814.0,
  "width": 0.15644850561960366
},
{
  "start": 2.448955032046606,
  "y": 690.0,
  "width": 0.1564485056196041
}]
}

Df schema looks good 

# root
#  |-- feature: string (nullable = true)
#  |-- histogram: array (nullable = true)
#  ||-- element: struct (containsNull = true)
#  |||-- start: double (nullable = true)
#  |||-- width: double (nullable = true)
#  |||-- y: double (nullable = true)

Need to convert each row to json now and save to HBase 
rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))

Output JSON (Wrong)

{
  "feature": "feature_id_001",
  "histogram": [
[
  1.9796095151877942,
  968.0,
  0.1564485056196041
],
[
  2.1360580208073983,
  892.0,
  0.1564485056196041
],
[
  2.2925065264270024,
  814.0,
  0.15644850561960366
],
[
  2.448955032046606,
  690.0,
  0.1564485056196041
]
}



  was:
Trying to convert an RDD in pyspark containing Array of struct doesn't generate 
the right json. It looks trivial but can't get a good json out.


rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))


> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in pyspark containing Array of struct doesn't 
> generate the right json. It looks trivial but can't get a good json out.
> I read the json below into a dataframe:
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> Df schema looks good 
> # root
> #  |-- feature: string (nullable = true)
> #  |-- histogram: array (nullable = true)
> #  ||-- element: struct (containsNull = true)
> #  |||-- start: double (nullable = true)
> #  |||-- width: double (nullable = true)
> #  |||-- y: double (nullable = true)
> Need to convert each row to json now and save to HBase 
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> Output JSON (Wrong)
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.15644850561960366
> ],
> [
>   2.448955032046606,
>   690.0,
>   0.1564485056196041
> ]
> }



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reopened SPARK-20208:
--

Actually, would you mind updating the R programming guide too?

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20015) Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

2017-04-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20015:
-
Summary: Document R Structured Streaming (experimental) in R vignettes and 
R & SS programming guide, R example  (was: Document R Structured Streaming 
(experimental) in R vignettes and R & SS programming guide)

> Document R Structured Streaming (experimental) in R vignettes and R & SS 
> programming guide, R example
> -
>
> Key: SPARK-20015
> URL: https://issues.apache.org/jira/browse/SPARK-20015
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-26 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985319#comment-15985319
 ] 

Kazuaki Ishizaki commented on SPARK-20184:
--

When the number of aggregated columns gets large, I saw complicated Java code 
generated for HashAggregation. I will create another JIRA to simplify the 
generated code.

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with 
> whole-stage codegen on than with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> whole stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> whole stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis I think this is related to the huge Java method (a method 
> thousands of lines long) generated by codegen.
> And if I configure -XX:-DontCompileHugeMethods the performance gets much 
> better (about 7s).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20479) Performance degradation for large number of hash-aggregated columns

2017-04-26 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20479:


 Summary: Performance degradation for large number of 
hash-aggregated columns
 Key: SPARK-20479
 URL: https://issues.apache.org/jira/browse/SPARK-20479
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


In a comment on SPARK-20184, [~maropu] showed that performance degrades when 
the number of aggregated columns gets large with whole-stage codegen.

{code}
./bin/spark-shell --master local[1] --conf spark.driver.memory=2g --conf 
spark.sql.shuffle.partitions=1 -v

def timer[R](f: => {}): Unit = {
  val count = 9
  val iters = (0 until count).map { i =>
val t0 = System.nanoTime()
f
val t1 = System.nanoTime()
val elapsed = t1 - t0 + 0.0
println(s"#$i: ${elapsed / 10.0}")
elapsed
  }
  println("Elapsed time: " + ((iters.sum / count) / 10.0) + "s")
}

val numCols = 80
val t = s"(SELECT id AS key1, id AS key2, ${((0 until numCols).map(i => s"id AS 
c$i")).mkString(", ")} FROM range(0, 10, 1, 1))"
val sqlStr = s"SELECT key1, key2, ${((0 until numCols).map(i => 
s"SUM(c$i)")).mkString(", ")} FROM $t GROUP BY key1, key2 LIMIT 100"

// Elapsed time: 2.308440490553s
sql("SET spark.sql.codegen.wholeStage=true")
timer { sql(sqlStr).collect }

// Elapsed time: 0.527486733s
sql("SET spark.sql.codegen.wholeStage=false")
timer { sql(sqlStr).collect }
{code}




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985567#comment-15985567
 ] 

Thomas Graves commented on SPARK-20480:
---

The exception in the TaskSetManager looks like:
17/04/26 20:09:21 INFO TaskSetManager: Lost task 3516.0 in stage 4.0 (TID 
103691) on gsbl521n33.blue.ygrid.yahoo.com, executor 4516: 
org.apache.spark.SparkException (Task failed while writing rows) [duplicate 22]

> FileFormatWriter hides FetchFailedException from scheduler
> --
>
> Key: SPARK-20480
> URL: https://issues.apache.org/jira/browse/SPARK-20480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a large job where it was getting failures. I noticed they were 
> listed as "SparkException: Task failed while writing rows", but when I looked 
> further they were really caused by FetchFailure exceptions. This is a 
> problem because the scheduler handles fetch failures differently than normal 
> exceptions. This can affect things like blacklisting.
> {noformat}
> 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
> 102902)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to host1.com:7337
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> 

[jira] [Resolved] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL

2017-04-26 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12868.

   Resolution: Fixed
 Assignee: Weiqing Yang
Fix Version/s: 2.2.0

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Trystan Leftwich
>Assignee: Weiqing Yang
> Fix For: 2.2.0
>
>
> When trying to add a jar with an HDFS URI, e.g.
> {code:sql}
> ADD JAR hdfs:///tmp/foo.jar
> {code}
> via the Spark SQL JDBC interface, it will fail with:
> {code:sql}
> java.net.MalformedURLException: unknown protocol: hdfs
> at java.net.URL.(URL.java:593)
> at java.net.URL.(URL.java:483)
> at java.net.URL.(URL.java:432)
> at java.net.URI.toURL(URI.java:1089)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
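
The underlying issue is that plain java.net.URL only knows the JVM's built-in
protocols, so parsing an hdfs:// string throws before the jar is ever fetched.
A small sketch of one generic JVM-level workaround (not necessarily how this
ticket was actually fixed):

{code}
import java.net.URL
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// new URL("hdfs:///tmp/foo.jar")   // java.net.MalformedURLException: unknown protocol: hdfs

// Registering Hadoop's stream handler factory teaches the JVM about hdfs://.
// Note: setURLStreamHandlerFactory may only be called once per JVM.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(new Configuration()))
val jarUrl = new URL("hdfs:///tmp/foo.jar")
{code}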



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-20480:
-

 Summary: FileFormatWriter hides FetchFailedException from scheduler
 Key: SPARK-20480
 URL: https://issues.apache.org/jira/browse/SPARK-20480
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Thomas Graves
Priority: Critical


I was running a large job that was getting failures. The tasks were listed as 
"SparkException: Task failed while writing rows", but when I looked further they 
were really caused by FetchFailedException. This is a problem because the 
scheduler handles fetch failures differently than normal exceptions, which can 
affect things like blacklisting.
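
To make the concern concrete, here is a minimal sketch (illustrative names, not 
the actual FileFormatWriter code) of letting fetch failures propagate instead of 
wrapping them, so the scheduler can still recognize them:

{code}
// Sketch only: let a FetchFailedException propagate unchanged so the scheduler
// still sees a fetch failure, and only wrap genuinely unknown errors.
import org.apache.spark.SparkException
import org.apache.spark.shuffle.FetchFailedException

def runWriteTask(writeRows: () => Unit): Unit = {
  try {
    writeRows()
  } catch {
    case f: FetchFailedException =>
      // Re-throwing keeps the failure visible to the scheduler (stage retry,
      // blacklisting) instead of burying it inside a generic task failure.
      throw f
    case t: Throwable =>
      throw new SparkException("Task failed while writing rows", t)
  }
}
{code}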



{noformat}
17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
102902)
org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
gsbl546n07.blue.ygrid.yahoo.com/10.213.43.94:7337
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
... 8 more
Caused by: java.io.IOException: Failed to connect to 
gsbl546n07.blue.ygrid.yahoo.com/10.213.43.94:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at 

[jira] [Closed] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves closed SPARK-20480.
-
Resolution: Duplicate

> FileFormatWriter hides FetchFailedException from scheduler
> --
>
> Key: SPARK-20480
> URL: https://issues.apache.org/jira/browse/SPARK-20480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a large job that was getting failures. The tasks were listed as 
> "SparkException: Task failed while writing rows", but when I looked further 
> they were really caused by FetchFailedException. This is a problem because the 
> scheduler handles fetch failures differently than normal exceptions, which can 
> affect things like blacklisting.
> {noformat}
> 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
> 102902)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to host1.com:7337
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>   ... 8 more
> Caused by: java.io.IOException: Failed to connect to host1.com:7337
>   at 
> 

[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-04-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985447#comment-15985447
 ] 

Thomas Graves commented on SPARK-20178:
---

Another thing we should tie in here is handling preempted containers better. 
This roughly matches my point above, "Improve logic around deciding which node 
is actually bad when you get fetch failures", but it is a bit of a special case. 
If a container gets preempted on the YARN side we need to detect that properly 
and not count it as a normal fetch failure. Right now that seems pretty 
difficult with the way we handle stage failures, but I guess you would line 
that up and not count it as a normal stage failure.
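
To sketch what detecting preemption could look like (purely illustrative, using 
the public YARN ContainerExitStatus API rather than any existing Spark code):

{code}
// Sketch only: recognize YARN preemption via the container exit status and skip
// the usual failure bookkeeping for those containers.
import org.apache.hadoop.yarn.api.records.{ContainerExitStatus, ContainerStatus}

def isPreempted(status: ContainerStatus): Boolean =
  status.getExitStatus == ContainerExitStatus.PREEMPTED

def handleCompletedContainer(status: ContainerStatus): Unit = {
  if (isPreempted(status)) {
    // Preempted by the cluster, not the application's fault: do not count it
    // towards blacklisting or stage failure limits.
    println(s"container ${status.getContainerId} preempted by YARN, ignoring as a failure")
  } else {
    println(s"container ${status.getContainerId} exited with status ${status.getExitStatus}")
  }
}
{code}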

> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures.  There are 4 jira currently related to this.  
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163,  SPARK-20091,  SPARK-14649 , and SPARK-19753
> I will put my initial thoughts in a follow on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-20480:
--
Description: 
I was running a large job that was getting failures. The tasks were listed as 
"SparkException: Task failed while writing rows", but when I looked further they 
were really caused by FetchFailedException. This is a problem because the 
scheduler handles fetch failures differently than normal exceptions, which can 
affect things like blacklisting.



{noformat}
17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
102902)
org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
host1.com:7337
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
... 8 more
Caused by: java.io.IOException: Failed to connect to host1.com:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
  

[jira] [Assigned] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20476:
---

Assignee: Xiao Li

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Assignee: Xiao Li
>
> I encounter this problem when I run a "create table ... as select 
> get_json_object(...) from xxx" statement.
> This one fails:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
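
A hedged guess based on the "columns has 2 elements while columns.types has 1 
elements" message below: the auto-generated column name for get_json_object(...) 
contains a comma, which LazySimpleSerDe splits on, so giving the expression an 
explicit alias may avoid the mismatch. A sketch of that workaround (the alias 
name is illustrative; assumes the spark-shell `spark` session):

{code}
// Possible workaround sketch (not a confirmed fix): alias the get_json_object()
// column so the generated column name no longer contains a comma.
spark.sql(
  """
    |create table spark_json_object as
    |select get_json_object(deliver_geojson, '$.') as deliver_geojson_value
    |from dw.dw_prd_order where dt = '2017-04-24' limit 10
  """.stripMargin)
{code}

For reference, the failure reported for the un-aliased statement is: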
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
> at org.apache.spark.sql.Dataset.(Dataset.scala:179)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Updated] (SPARK-20454) Improvement of ShortestPaths in Spark GraphX

2017-04-26 Thread Ji Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Dai updated SPARK-20454:
---
Target Version/s:   (was: 2.1.0)

> Improvement of ShortestPaths in Spark GraphX
> 
>
> Key: SPARK-20454
> URL: https://issues.apache.org/jira/browse/SPARK-20454
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, MLlib
>Affects Versions: 2.1.0
>Reporter: Ji Dai
>  Labels: patch
>
> The output of ShortestPaths is limited. ShortestPaths in Graph/lib is 
> currently a simple version that can only return the distance to the source 
> vertex. However, the shortest path with its intermediate nodes is needed, and 
> if two or more paths share the same shortest distance from source to 
> destination, all of these paths need to be returned. In this way, 
> ShortestPaths will be more functional and useful.
> I think I have resolved the concern above with an improved version of 
> ShortestPaths, which is also based on the "pregel" function in GraphOps.
> Can I get my code reviewed and merged?
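
For illustration only (not the patch referenced above): a minimal Pregel-based 
sketch that carries a route back to a single landmark alongside a hop-count 
distance. Names and the hop-count weighting are assumptions; extending the 
vertex attribute to a set of routes would cover the equal-distance ties 
described above.

{code}
// Sketch only: a Pregel-based variant that records the vertex sequence back to
// one landmark, not just the distance.
import scala.reflect.ClassTag

import org.apache.spark.graphx._

def shortestPathWithRoute[VD, ED: ClassTag](
    graph: Graph[VD, ED],
    landmark: VertexId): Graph[(Double, List[VertexId]), ED] = {
  // Each vertex carries (distance from the landmark, reversed path back to it).
  val init = graph.mapVertices { (id, _) =>
    if (id == landmark) (0.0, List(id))
    else (Double.PositiveInfinity, List.empty[VertexId])
  }
  init.pregel((Double.PositiveInfinity, List.empty[VertexId]))(
    // Keep whichever of the current value and the incoming message is shorter.
    (_, attr, msg) => if (msg._1 < attr._1) msg else attr,
    // Offer the destination a candidate path that goes through this edge.
    triplet => {
      val candidate = (triplet.srcAttr._1 + 1.0, triplet.dstId :: triplet.srcAttr._2)
      if (candidate._1 < triplet.dstAttr._1) Iterator((triplet.dstId, candidate))
      else Iterator.empty
    },
    // Merge two candidate messages by keeping the shorter one.
    (a, b) => if (a._1 <= b._1) a else b
  )
}
{code}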



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985575#comment-15985575
 ] 

Mridul Muralidharan commented on SPARK-20480:
-

Shouldn't the fix for SPARK-19276 by [~imranr] handle this?

> FileFormatWriter hides FetchFailedException from scheduler
> --
>
> Key: SPARK-20480
> URL: https://issues.apache.org/jira/browse/SPARK-20480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a large job that was getting failures. The tasks were listed as 
> "SparkException: Task failed while writing rows", but when I looked further 
> they were really caused by FetchFailedException. This is a problem because the 
> scheduler handles fetch failures differently than normal exceptions, which can 
> affect things like blacklisting.
> {noformat}
> 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
> 102902)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to host1.com:7337
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>   ... 8 more
> Caused by: java.io.IOException: Failed to connect to 
