[jira] [Assigned] (SPARK-20425) Support an extended display mode to print one column of data per line

2017-04-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20425:
---

Assignee: Takeshi Yamamuro

> Support an extended display mode to print one column of data per line
> 
>
> Key: SPARK-20425
> URL: https://issues.apache.org/jira/browse/SPARK-20425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> In the master branch, when printing a Dataset with many columns, the output 
> is very hard to read, e.g.:
> {code}
> scala> val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() AS c$i"): _*)
> scala> df.show(3, 0)
> +--+--+--+---+--+--+---+--+--+--+--+---+ ... (separator row continues across all 100 columns)
> |c0|c1|c2|c3 |c4|c5|c6 |c7 |c8|c9|c10 |c11 ... |c97|c98 |c99|
> (the single header row wraps across dozens of terminal lines before any data 
> is printed, and the data rows are just as unreadable)
> {code}
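
A minimal sketch of the vertical display mode this issue asks for, as it ended
up being exposed on {{Dataset.show}} in 2.3.0 (the exact overload used here is
an assumption of this note, not text from the issue):

{code}
// Sketch only: print each row as a "-RECORD n-" block, one column per line,
// instead of a single very wide table.
val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() AS c$i"): _*)

// show(numRows, truncate, vertical): truncate = 0 keeps full values,
// vertical = true switches to the one-column-per-line layout.
df.show(3, 0, vertical = true)
{code}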

[jira] [Resolved] (SPARK-20425) Support an extended display mode to print one column of data per line

2017-04-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20425.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Support an extended display mode to print one column of data per line
> 
>
> Key: SPARK-20425
> URL: https://issues.apache.org/jira/browse/SPARK-20425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> In the master branch, when printing a Dataset with many columns, the output 
> is very hard to read, e.g.:
> {code}
> scala> val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() AS c$i"): _*)
> scala> df.show(3, 0)
> +--+--+--+---+--+--+---+--+--+--+--+---+ ... (separator row continues across all 100 columns)
> |c0|c1|c2|c3 |c4|c5|c6 |c7 |c8|c9|c10 |c11 ... |c97|c98 |c99|
> (the single header row wraps across dozens of terminal lines before any data 
> is printed, and the data rows are just as unreadable)
> {code}

[jira] [Issue Comment Deleted] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-26 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20435:

Comment: was deleted

(was: Thanks Marcelo!)

> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 and SPARK-19720 added redaction of sensitive information (e.g. 
> Hadoop credential provider passwords, AWS access/secret keys) from event logs 
> + YARN logs + UI and from the console output, respectively.
> While some unit tests were added along with these changes, they only asserted 
> that, when a sensitive key was found, redaction took place for that key. They 
> didn't assert globally that, when running a full-fledged Spark app (whether on 
> YARN or locally), sensitive information was not present in any of the logs or 
> UI. Such a test would also prevent regressions from happening in the future if 
> someone unknowingly adds extra logging that publishes sensitive information to 
> disk or the UI.
> Indeed, it was found that in some Java configurations, sensitive information 
> was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the redaction logic previously only checked whether the key matched 
> the secret regex pattern and, if so, redacted its value. That worked for most 
> cases. However, in the above case the key (sun.java.command) doesn't reveal 
> much, so the value needs to be searched. The check therefore needs to be 
> expanded to match against values as well.
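
As a rough illustration of the expanded check (a sketch, not the actual patch;
the regex and the redaction placeholder below are assumptions of this note):

{code}
import scala.util.matching.Regex

// Sketch: redact a config entry when either the key or the value matches the
// secret pattern, instead of checking the key alone.
object RedactionSketch {
  // Assumed pattern, in the spirit of spark.redaction.regex-style defaults.
  val secretPattern: Regex = "(?i)secret|password|token|credential".r

  def redact(kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (key, value) =>
      if (secretPattern.findFirstIn(key).isDefined ||
          secretPattern.findFirstIn(value).isDefined) {
        (key, "*********(redacted)")
      } else {
        (key, value)
      }
    }
}
{code}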



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20475) Whether to use "broadcast join" depends on Hive configuration

2017-04-26 Thread Lijia Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijia Liu closed SPARK-20475.
-
   Resolution: Fixed
Fix Version/s: 2.0.3

> Whether use "broadcast join" depends on hive configuration
> --
>
> Key: SPARK-20475
> URL: https://issues.apache.org/jira/browse/SPARK-20475
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Lijia Liu
> Fix For: 2.0.3
>
>
> Currently, broadcast join in Spark only works when:
>   1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 0 
> (the default is 10485760).
>   2. The size of one of the Hive tables is less than 
> "spark.sql.autoBroadcastJoinThreshold". To get the size information of the 
> Hive table from the Hive metastore, "hive.stats.autogather" should be set to 
> true in Hive, or the command "ANALYZE TABLE  COMPUTE STATISTICS 
> noscan" must have been run.
> But Hive calculates the size of the file or directory corresponding to the 
> Hive table to determine whether to use map-side join, and does not depend on 
> the Hive metastore.
> This leads to two problems:
>   1. Spark will not use "broadcast join" when the Hive parameter 
> "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE 
>  COMPUTE STATISTICS noscan" has not been run, because the size 
> information of the Hive table has not been saved in the Hive metastore. How 
> Spark behaves thus depends on the configuration of Hive.
>   2. For some reason, we set "hive.stats.autogather" to false in our Hive. 
> For the same SQL, Hive is 4 times faster than Spark because Hive used 
> "map-side join" but Spark did not use "broadcast join".
> Is it possible for Spark to use the same mechanism as Hive to look up the 
> size of a Hive table?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20475) Whether to use "broadcast join" depends on Hive configuration

2017-04-26 Thread Lijia Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985995#comment-15985995
 ] 

Lijia Liu commented on SPARK-20475:
---

[~ZenWzh] 
This issue is a duplicate of https://issues.apache.org/jira/browse/SPARK-15365
So setting spark.sql.statistics.fallBackToHdfs=true solves this problem.
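
A minimal sketch of the two workarounds discussed in this thread (the HDFS size
fallback from SPARK-15365 and the explicit broadcast hint); the table names are
placeholders:

{code}
import org.apache.spark.sql.functions.broadcast

// Fall back to the size on the filesystem when no table statistics exist in
// the Hive metastore (see SPARK-15365).
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")

// Or force the small side to be broadcast explicitly with the broadcast hint.
val large = spark.table("large_table")   // placeholder table names
val small = spark.table("small_table")
val joined = large.join(broadcast(small), Seq("key"))
{code}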


> Whether use "broadcast join" depends on hive configuration
> --
>
> Key: SPARK-20475
> URL: https://issues.apache.org/jira/browse/SPARK-20475
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Lijia Liu
>
> Currently, broadcast join in Spark only works when:
>   1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 0 
> (the default is 10485760).
>   2. The size of one of the Hive tables is less than 
> "spark.sql.autoBroadcastJoinThreshold". To get the size information of the 
> Hive table from the Hive metastore, "hive.stats.autogather" should be set to 
> true in Hive, or the command "ANALYZE TABLE  COMPUTE STATISTICS 
> noscan" must have been run.
> But Hive calculates the size of the file or directory corresponding to the 
> Hive table to determine whether to use map-side join, and does not depend on 
> the Hive metastore.
> This leads to two problems:
>   1. Spark will not use "broadcast join" when the Hive parameter 
> "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE 
>  COMPUTE STATISTICS noscan" has not been run, because the size 
> information of the Hive table has not been saved in the Hive metastore. How 
> Spark behaves thus depends on the configuration of Hive.
>   2. For some reason, we set "hive.stats.autogather" to false in our Hive. 
> For the same SQL, Hive is 4 times faster than Spark because Hive used 
> "map-side join" but Spark did not use "broadcast join".
> Is it possible for Spark to use the same mechanism as Hive to look up the 
> size of a Hive table?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20470.
--
Resolution: Invalid

This is because {{recursive}} should be set to {{True}} in {{asDict}}.

With given data stored in {{tmp.json}},

{code}
import json

from pyspark.sql.types import Row


df = spark.read.json("tmp.json", wholeFile=True)
df.printSchema()

rdd = df.rdd.map(lambda row: json.dumps(row.asDict(recursive=True), indent=2))
print rdd.collect()[0]
{code}

prints as below:

{code}
root
 |-- feature: string (nullable = true)
 |-- histogram: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- start: double (nullable = true)
 |||-- width: double (nullable = true)
 |||-- y: double (nullable = true)


{
  "feature": "feature_id_001",
  "histogram": [
{
  "y": 968.0,
  "start": 1.9796095151877942,
  "width": 0.1564485056196041
},
{
  "y": 892.0,
  "start": 2.1360580208073983,
  "width": 0.1564485056196041
},
{
  "y": 814.0,
  "start": 2.2925065264270024,
  "width": 0.15644850561960366
},
{
  "y": 690.0,
  "start": 2.448955032046606,
  "width": 0.1564485056196041
}
  ]
}
{code}
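
As a side note (not part of the resolution above), {{toJSON}} also produces one
JSON string per row directly, including nested structs, which avoids going
through {{asDict}} at all. A minimal Scala sketch, assuming the same
{{tmp.json}} input and the {{wholeFile}} option used in the snippet above:

{code}
// Sketch only: serialize each row to a JSON string via Dataset.toJSON.
val df = spark.read.option("wholeFile", "true").json("tmp.json")
df.toJSON.take(1).foreach(println)
{code}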

> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in PySpark containing an array of structs doesn't 
> generate the right JSON. It looks trivial, but I can't get good JSON out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> {code}
> Df schema looks good 
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase 
> {code}
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> {code}
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.15644850561960366
> ],
> [
>   2.448955032046606,
>   690.0,
>   0.1564485056196041
> ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20475) Whether to use "broadcast join" depends on Hive configuration

2017-04-26 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985838#comment-15985838
 ] 

Zhenhua Wang edited comment on SPARK-20475 at 4/27/17 3:53 AM:
---

Spark also collects the table size whether or not the "analyze" command is run.
Did you check the threshold in Hive for map-side join? Is it the same as in 
Spark?
Besides, if you want to use broadcast join, Spark has a broadcast hint.


was (Author: zenwzh):
Spark also collects the table size whether or not the "analyze" command is run.
Did you check the threshold in Hive for map-side join? Is it the same as in Spark?
Besides, if you want to use broadcast join, Spark has a broadcast hint.

> Whether use "broadcast join" depends on hive configuration
> --
>
> Key: SPARK-20475
> URL: https://issues.apache.org/jira/browse/SPARK-20475
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Lijia Liu
>
> Currently, broadcast join in Spark only works when:
>   1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 0 
> (the default is 10485760).
>   2. The size of one of the Hive tables is less than 
> "spark.sql.autoBroadcastJoinThreshold". To get the size information of the 
> Hive table from the Hive metastore, "hive.stats.autogather" should be set to 
> true in Hive, or the command "ANALYZE TABLE  COMPUTE STATISTICS 
> noscan" must have been run.
> But Hive calculates the size of the file or directory corresponding to the 
> Hive table to determine whether to use map-side join, and does not depend on 
> the Hive metastore.
> This leads to two problems:
>   1. Spark will not use "broadcast join" when the Hive parameter 
> "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE 
>  COMPUTE STATISTICS noscan" has not been run, because the size 
> information of the Hive table has not been saved in the Hive metastore. How 
> Spark behaves thus depends on the configuration of Hive.
>   2. For some reason, we set "hive.stats.autogather" to false in our Hive. 
> For the same SQL, Hive is 4 times faster than Spark because Hive used 
> "map-side join" but Spark did not use "broadcast join".
> Is it possible for Spark to use the same mechanism as Hive to look up the 
> size of a Hive table?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20487) `HiveTableScan` node is quite verbose in explained plan

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20487:


Assignee: (was: Apache Spark)

> `HiveTableScan` node is quite verbose in explained plan
> ---
>
> Key: SPARK-20487
> URL: https://issues.apache.org/jira/browse/SPARK-20487
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Trivial
>
> For Hive tables, `explain()` prints a lot of information. This makes it hard 
> to read the plan (especially for large SQL strings with numerous tables), 
> e.g.:
> {noformat}
> scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true)
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter ('name = foo)
>+- 'UnresolvedRelation `my_table`
> == Analyzed Logical Plan ==
> user_id: bigint, name: string, ds: string
> Project [user_id#13L, name#14, ds#15]
> +- Filter (name#14 = foo)
>+- SubqueryAlias my_table
>   +- CatalogRelation CatalogTable(
> Database: default
> Table: my_table
> Owner: tejasp
> Created: Fri Apr 14 17:05:50 PDT 2017
> Last Access: Wed Dec 31 16:00:00 PST 1969
> Type: MANAGED
> Provider: hive
> Properties: [serialization.format=1]
> Statistics: 9223372036854775807 bytes
> Location: file:/tmp/warehouse/my_table
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.TextInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider: Catalog
> Partition Columns: [`ds`]
> Schema: root
> -- user_id: long (nullable = true)
> -- name: string (nullable = true)
> -- ds: string (nullable = true)
> ), [user_id#13L, name#14], [ds#15]
> == Optimized Logical Plan ==
> Filter (isnotnull(name#14) && (name#14 = foo))
> +- CatalogRelation CatalogTable(
> Database: default
> Table: my_table
> Owner: tejasp
> Created: Fri Apr 14 17:05:50 PDT 2017
> Last Access: Wed Dec 31 16:00:00 PST 1969
> Type: MANAGED
> Provider: hive
> Properties: [serialization.format=1]
> Statistics: 9223372036854775807 bytes
> Location: file:/tmp/warehouse/my_table
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.TextInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider: Catalog
> Partition Columns: [`ds`]
> Schema: root
> -- user_id: long (nullable = true)
> -- name: string (nullable = true)
> -- ds: string (nullable = true)
> ), [user_id#13L, name#14], [ds#15]
> == Physical Plan ==
> *Filter (isnotnull(name#14) && (name#14 = foo))
> +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation CatalogTable(
> Database: default
> Table: my_table
> Owner: tejasp
> Created: Fri Apr 14 17:05:50 PDT 2017
> Last Access: Wed Dec 31 16:00:00 PST 1969
> Type: MANAGED
> Provider: hive
> Properties: [serialization.format=1]
> Statistics: 9223372036854775807 bytes
> Location: file:/tmp/warehouse/my_table
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.TextInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider: Catalog
> Partition Columns: [`ds`]
> Schema: root
> -- user_id: long (nullable = true)
> -- name: string (nullable = true)
> -- ds: string (nullable = true)
> ), [user_id#13L, name#14], [ds#15]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20487) `HiveTableScan` node is quite verbose in explained plan

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20487:


Assignee: Apache Spark

> `HiveTableScan` node is quite verbose in explained plan
> ---
>
> Key: SPARK-20487
> URL: https://issues.apache.org/jira/browse/SPARK-20487
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Trivial
>
> For Hive tables, `explain()` prints a lot of information. This makes it hard 
> to read the plan (especially for large SQL strings with numerous tables), 
> e.g.:
> {noformat}
> scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true)
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter ('name = foo)
>+- 'UnresolvedRelation `my_table`
> == Analyzed Logical Plan ==
> user_id: bigint, name: string, ds: string
> Project [user_id#13L, name#14, ds#15]
> +- Filter (name#14 = foo)
>+- SubqueryAlias my_table
>   +- CatalogRelation CatalogTable(
> Database: default
> Table: my_table
> Owner: tejasp
> Created: Fri Apr 14 17:05:50 PDT 2017
> Last Access: Wed Dec 31 16:00:00 PST 1969
> Type: MANAGED
> Provider: hive
> Properties: [serialization.format=1]
> Statistics: 9223372036854775807 bytes
> Location: file:/tmp/warehouse/my_table
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.TextInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider: Catalog
> Partition Columns: [`ds`]
> Schema: root
> -- user_id: long (nullable = true)
> -- name: string (nullable = true)
> -- ds: string (nullable = true)
> ), [user_id#13L, name#14], [ds#15]
> == Optimized Logical Plan ==
> Filter (isnotnull(name#14) && (name#14 = foo))
> +- CatalogRelation CatalogTable(
> Database: default
> Table: my_table
> Owner: tejasp
> Created: Fri Apr 14 17:05:50 PDT 2017
> Last Access: Wed Dec 31 16:00:00 PST 1969
> Type: MANAGED
> Provider: hive
> Properties: [serialization.format=1]
> Statistics: 9223372036854775807 bytes
> Location: file:/tmp/warehouse/my_table
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.TextInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider: Catalog
> Partition Columns: [`ds`]
> Schema: root
> -- user_id: long (nullable = true)
> -- name: string (nullable = true)
> -- ds: string (nullable = true)
> ), [user_id#13L, name#14], [ds#15]
> == Physical Plan ==
> *Filter (isnotnull(name#14) && (name#14 = foo))
> +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation CatalogTable(
> Database: default
> Table: my_table
> Owner: tejasp
> Created: Fri Apr 14 17:05:50 PDT 2017
> Last Access: Wed Dec 31 16:00:00 PST 1969
> Type: MANAGED
> Provider: hive
> Properties: [serialization.format=1]
> Statistics: 9223372036854775807 bytes
> Location: file:/tmp/warehouse/my_table
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.TextInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider: Catalog
> Partition Columns: [`ds`]
> Schema: root
> -- user_id: long (nullable = true)
> -- name: string (nullable = true)
> -- ds: string (nullable = true)
> ), [user_id#13L, name#14], [ds#15]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20487) `HiveTableScan` node is quite verbose in explained plan

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985907#comment-15985907
 ] 

Apache Spark commented on SPARK-20487:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17780

> `HiveTableScan` node is quite verbose in explained plan
> ---
>
> Key: SPARK-20487
> URL: https://issues.apache.org/jira/browse/SPARK-20487
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Trivial
>
> For Hive tables, `explain()` prints a lot of information. This makes it hard 
> to read the plan (especially for large SQL strings with numerous tables), 
> e.g.:
> {noformat}
> scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true)
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter ('name = foo)
>+- 'UnresolvedRelation `my_table`
> == Analyzed Logical Plan ==
> user_id: bigint, name: string, ds: string
> Project [user_id#13L, name#14, ds#15]
> +- Filter (name#14 = foo)
>+- SubqueryAlias my_table
>   +- CatalogRelation CatalogTable(
> Database: default
> Table: my_table
> Owner: tejasp
> Created: Fri Apr 14 17:05:50 PDT 2017
> Last Access: Wed Dec 31 16:00:00 PST 1969
> Type: MANAGED
> Provider: hive
> Properties: [serialization.format=1]
> Statistics: 9223372036854775807 bytes
> Location: file:/tmp/warehouse/my_table
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.TextInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider: Catalog
> Partition Columns: [`ds`]
> Schema: root
> -- user_id: long (nullable = true)
> -- name: string (nullable = true)
> -- ds: string (nullable = true)
> ), [user_id#13L, name#14], [ds#15]
> == Optimized Logical Plan ==
> Filter (isnotnull(name#14) && (name#14 = foo))
> +- CatalogRelation CatalogTable(
> Database: default
> Table: my_table
> Owner: tejasp
> Created: Fri Apr 14 17:05:50 PDT 2017
> Last Access: Wed Dec 31 16:00:00 PST 1969
> Type: MANAGED
> Provider: hive
> Properties: [serialization.format=1]
> Statistics: 9223372036854775807 bytes
> Location: file:/tmp/warehouse/my_table
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.TextInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider: Catalog
> Partition Columns: [`ds`]
> Schema: root
> -- user_id: long (nullable = true)
> -- name: string (nullable = true)
> -- ds: string (nullable = true)
> ), [user_id#13L, name#14], [ds#15]
> == Physical Plan ==
> *Filter (isnotnull(name#14) && (name#14 = foo))
> +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation CatalogTable(
> Database: default
> Table: my_table
> Owner: tejasp
> Created: Fri Apr 14 17:05:50 PDT 2017
> Last Access: Wed Dec 31 16:00:00 PST 1969
> Type: MANAGED
> Provider: hive
> Properties: [serialization.format=1]
> Statistics: 9223372036854775807 bytes
> Location: file:/tmp/warehouse/my_table
> Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat: org.apache.hadoop.mapred.TextInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Partition Provider: Catalog
> Partition Columns: [`ds`]
> Schema: root
> -- user_id: long (nullable = true)
> -- name: string (nullable = true)
> -- ds: string (nullable = true)
> ), [user_id#13L, name#14], [ds#15]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20487) `HiveTableScan` node is quite verbose in explained plan

2017-04-26 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-20487:
---

 Summary: `HiveTableScan` node is quite verbose in explained plan
 Key: SPARK-20487
 URL: https://issues.apache.org/jira/browse/SPARK-20487
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Tejas Patil
Priority: Trivial


For Hive tables, `explain()` prints a lot of information. This makes it hard to 
read the plan (especially for large SQL strings with numerous tables).

e.g.:
{noformat}
scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('name = foo)
   +- 'UnresolvedRelation `my_table`

== Analyzed Logical Plan ==
user_id: bigint, name: string, ds: string
Project [user_id#13L, name#14, ds#15]
+- Filter (name#14 = foo)
   +- SubqueryAlias my_table
  +- CatalogRelation CatalogTable(
Database: default
Table: my_table
Owner: tejasp
Created: Fri Apr 14 17:05:50 PDT 2017
Last Access: Wed Dec 31 16:00:00 PST 1969
Type: MANAGED
Provider: hive
Properties: [serialization.format=1]
Statistics: 9223372036854775807 bytes
Location: file:/tmp/warehouse/my_table
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider: Catalog
Partition Columns: [`ds`]
Schema: root
-- user_id: long (nullable = true)
-- name: string (nullable = true)
-- ds: string (nullable = true)
), [user_id#13L, name#14], [ds#15]

== Optimized Logical Plan ==
Filter (isnotnull(name#14) && (name#14 = foo))
+- CatalogRelation CatalogTable(
Database: default
Table: my_table
Owner: tejasp
Created: Fri Apr 14 17:05:50 PDT 2017
Last Access: Wed Dec 31 16:00:00 PST 1969
Type: MANAGED
Provider: hive
Properties: [serialization.format=1]
Statistics: 9223372036854775807 bytes
Location: file:/tmp/warehouse/my_table
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider: Catalog
Partition Columns: [`ds`]
Schema: root
-- user_id: long (nullable = true)
-- name: string (nullable = true)
-- ds: string (nullable = true)
), [user_id#13L, name#14], [ds#15]

== Physical Plan ==
*Filter (isnotnull(name#14) && (name#14 = foo))
+- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation CatalogTable(
Database: default
Table: my_table
Owner: tejasp
Created: Fri Apr 14 17:05:50 PDT 2017
Last Access: Wed Dec 31 16:00:00 PST 1969
Type: MANAGED
Provider: hive
Properties: [serialization.format=1]
Statistics: 9223372036854775807 bytes
Location: file:/tmp/warehouse/my_table
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider: Catalog
Partition Columns: [`ds`]
Schema: root
-- user_id: long (nullable = true)
-- name: string (nullable = true)
-- ds: string (nullable = true)
), [user_id#13L, name#14], [ds#15]
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20403) The usage descriptions of some functions are wrong, such as boolean, tinyint, smallint, int, bigint, float, double, decimal, date, timestamp, binary, string

2017-04-26 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-20403:

Priority: Minor  (was: Trivial)

> The usage descriptions of some functions are wrong, such as 
> boolean, tinyint, smallint, int, bigint, float, double, decimal, date, timestamp, binary, string
> 
>
> Key: SPARK-20403
> URL: https://issues.apache.org/jira/browse/SPARK-20403
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.1.0
>Reporter: liuxian
>Priority: Minor
>
> spark-sql>desc function boolean;
> Function: boolean
> Class: org.apache.spark.sql.catalyst.expressions.Cast
> Usage: boolean(expr AS type) - Casts the value `expr` to the target data type 
> `type`.
> spark-sql>desc function int;
> Function: int
> Class: org.apache.spark.sql.catalyst.expressions.Cast
> Usage: int(expr AS type) - Casts the value `expr` to the target data type 
> `type`.
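
For context (my reading of the report, not text from the issue): these
functions are cast aliases, so they are invoked as {{boolean(expr)}} or
{{int(expr)}}, which is why a usage string of the form "int(expr AS type)" is
misleading. A quick way to see both the actual call form and the reported
usage text, assuming a running SparkSession:

{code}
// Sketch only: the cast-alias functions take a single expression.
spark.sql("SELECT int('42'), boolean('true'), date('2017-04-26')").show()

// Print the usage text reported in this issue.
spark.sql("DESC FUNCTION int").show(truncate = false)
{code}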



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-20483:
---

Assignee: Davis Shepherd

> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.2, 2.1.0, 2.1.1
>Reporter: Davis Shepherd
>Assignee: Davis Shepherd
>Priority: Minor
>
> If {{spark.cores.max = 10}}, for example, and {{spark.executor.cores = 4}}, 2 
> executors will get launched, thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.
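
To make the arithmetic concrete (a sketch of the two checks as described above,
using the example numbers; not the actual scheduler code):

{code}
// maxCores = 10, executorCores = 4, two executors already launched.
val maxCores = 10
val executorCores = 4
val totalCoresAcquired = 2 * executorCores            // 8

// Launch check: 4 + 8 <= 10 is false, so a third executor is never launched.
val canLaunchMore = executorCores + totalCoresAcquired <= maxCores

// Long-decline check: 8 >= 10 is also false, so offers are only declined for
// the short Mesos default, starving other frameworks.
val declineForLong = totalCoresAcquired >= maxCores

println(s"canLaunchMore=$canLaunchMore, declineForLong=$declineForLong")
{code}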



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-20483:

Shepherd: DB Tsai

> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.2, 2.1.0, 2.1.1
>Reporter: Davis Shepherd
>Priority: Minor
>
> If {{spark.cores.max = 10}}, for example, and {{spark.executor.cores = 4}}, 2 
> executors will get launched, thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-20483:

Affects Version/s: 2.1.1
   2.0.2

> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.2, 2.1.0, 2.1.1
>Reporter: Davis Shepherd
>Priority: Minor
>
> If {{spark.cores.max = 10}}, for example, and {{spark.executor.cores = 4}}, 2 
> executors will get launched, thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985888#comment-15985888
 ] 

Daniel Li commented on SPARK-20468:
---

Thanks for the guidance, [~srowen].  I've read the Contributing guide now; my 
apologies for not reading it through sooner.

I've closed the PR and created a few new Jira issues to propose documentation 
additions and internal code clarity improvements: SPARK-20484, SPARK-20485, and 
SPARK-20486.  How do you want to proceed?

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20486) Encapsulate ALS in-block and out-block data structures and methods into a separate class

2017-04-26 Thread Daniel Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Li updated SPARK-20486:
--
Description: 
The in-block and out-block data structures in the ALS code are currently 
calculated within the {{ALS.train}} method itself.  I propose to move this 
code, along with its helper functions, into a separate class to encapsulate the 
creation of the blocks.  This has the added benefit of allowing us to include a 
comprehensive Scaladoc to this new class to explain in detail how this core 
part of the algorithm works.

Proposal:

{code}
private[recommendation] final case class RatingBlocks[ID](
  userIn: RDD[(Int, InBlock[ID])],
  userOut: RDD[(Int, OutBlock)],
  itemIn: RDD[(Int, InBlock[ID])],
  itemOut: RDD[(Int, OutBlock)]
)

private[recommendation] object RatingBlocks {
  def create[ID: ClassTag: Ordering](
  ratings: RDD[Rating[ID]],
  numUserBlocks: Int,
  numItemBlocks: Int,
  storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): 
RatingBlocks[ID] = {

val userPart = new ALSPartitioner(numUserBlocks)
val itemPart = new ALSPartitioner(numItemBlocks)

val blockRatings =
  partitionRatings(ratings, userPart, itemPart)
.persist(storageLevel)

val (userInBlocks, userOutBlocks) =
  makeBlocks("user", blockRatings, userPart, itemPart, storageLevel)

userOutBlocks.count()   // materialize `blockRatings` and user blocks

val swappedBlockRatings = blockRatings.map {
  case ((userBlockId, itemBlockId), RatingBlock(userIds, itemIds, 
localRatings)) =>
((itemBlockId, userBlockId), RatingBlock(itemIds, userIds, 
localRatings))
}

val (itemInBlocks, itemOutBlocks) =
  makeBlocks("item", swappedBlockRatings, itemPart, userPart, storageLevel)

itemOutBlocks.count()   // materialize item blocks
blockRatings.unpersist()
new RatingBlocks(userInBlocks, userOutBlocks, itemInBlocks, itemOutBlocks)
  }

  private[this] def partitionRatings[ID: ClassTag](...) = {
// existing code goes here verbatim
  }

  private[this] def makeBlocks[ID: ClassTag](...) = {
// existing code goes here verbatim
  }
}
{code}

  was:
The in-block and out-block data structures in the ALS code are currently 
calculated within the {{ALS.train}} method itself.  I propose to move this 
code, along with its helper functions, into a separate class to encapsulate the 
creation of the blocks.  This has the added benefit of allowing us to include a 
comprehensive Scaladoc to this new class to explain in detail how this core 
part of the algorithm works.

Proposal:

{code}
private[recommendation] final case class RatingBlocks[ID](
  userIn: RDD[(Int, InBlock[ID])],
  userOut: RDD[(Int, OutBlock)],
  itemIn: RDD[(Int, InBlock[ID])],
  itemOut: RDD[(Int, OutBlock)]
)

private[recommendation] object RatingBlocks {
  def create[ID: ClassTag: Ordering](
  ratings: RDD[Rating[ID]],
  numUserBlocks: Int,
  numItemBlocks: Int,
  storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): 
RatingBlocks[ID] = {
// In-block and out-block code currently in `ALS.train` goes here
  }

  private[this] def partitionRatings[ID: ClassTag](...) = { ... }

  private[this] def makeBlocks[ID: ClassTag](...) = { ... }
}
{code}


> Encapsulate ALS in-block and out-block data structures and methods into a 
> separate class
> 
>
> Key: SPARK-20486
> URL: https://issues.apache.org/jira/browse/SPARK-20486
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Trivial
>
> The in-block and out-block data structures in the ALS code are currently 
> calculated within the {{ALS.train}} method itself.  I propose to move this 
> code, along with its helper functions, into a separate class to encapsulate 
> the creation of the blocks.  This has the added benefit of allowing us to 
> include a comprehensive Scaladoc to this new class to explain in detail how 
> this core part of the algorithm works.
> Proposal:
> {code}
> private[recommendation] final case class RatingBlocks[ID](
>   userIn: RDD[(Int, InBlock[ID])],
>   userOut: RDD[(Int, OutBlock)],
>   itemIn: RDD[(Int, InBlock[ID])],
>   itemOut: RDD[(Int, OutBlock)]
> )
> private[recommendation] object RatingBlocks {
>   def create[ID: ClassTag: Ordering](
>   ratings: RDD[Rating[ID]],
>   numUserBlocks: Int,
>   numItemBlocks: Int,
>   storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): 
> RatingBlocks[ID] = {
> val userPart = new ALSPartitioner(numUserBlocks)
> val itemPart = new ALSPartitioner(numItemBlocks)
> val blockRatings =
>   partitionRatings(ratings, userPart, itemPart)
> .persist(storageLevel)
> val 

[jira] [Created] (SPARK-20486) Encapsulate ALS in-block and out-block data structures and methods into a separate class

2017-04-26 Thread Daniel Li (JIRA)
Daniel Li created SPARK-20486:
-

 Summary: Encapsulate ALS in-block and out-block data structures 
and methods into a separate class
 Key: SPARK-20486
 URL: https://issues.apache.org/jira/browse/SPARK-20486
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Daniel Li
Priority: Trivial


The in-block and out-block data structures in the ALS code are currently 
calculated within the {{ALS.train}} method itself.  I propose to move this 
code, along with its helper functions, into a separate class to encapsulate the 
creation of the blocks.  This has the added benefit of allowing us to include a 
comprehensive Scaladoc to this new class to explain in detail how this core 
part of the algorithm works.

Proposal:

{code}
private[recommendation] final case class RatingBlocks[ID](
  userIn: RDD[(Int, InBlock[ID])],
  userOut: RDD[(Int, OutBlock)],
  itemIn: RDD[(Int, InBlock[ID])],
  itemOut: RDD[(Int, OutBlock)]
)

private[recommendation] object RatingBlocks {
  def create[ID: ClassTag: Ordering](
  ratings: RDD[Rating[ID]],
  numUserBlocks: Int,
  numItemBlocks: Int,
  storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): 
RatingBlocks[ID] = {
// In-block and out-block code currently in `ALS.train` goes here
  }

  private[this] def partitionRatings[ID: ClassTag](...) = { ... }

  private[this] def makeBlocks[ID: ClassTag](...) = { ... }
}
{code}
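
If this proposal were adopted, the call site in {{ALS.train}} would reduce to
something like the following (a sketch against the proposed API above, not
existing code; the variable names are taken from the proposal):

{code}
// Sketch only: RatingBlocks is the class proposed in this issue.
val blocks = RatingBlocks.create(ratings, numUserBlocks, numItemBlocks, storageLevel)
val (userInBlocks, userOutBlocks) = (blocks.userIn, blocks.userOut)
val (itemInBlocks, itemOutBlocks) = (blocks.itemIn, blocks.itemOut)
{code}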



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20485) Split ALS.scala into multiple files

2017-04-26 Thread Daniel Li (JIRA)
Daniel Li created SPARK-20485:
-

 Summary: Split ALS.scala into multiple files
 Key: SPARK-20485
 URL: https://issues.apache.org/jira/browse/SPARK-20485
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Daniel Li
Priority: Trivial


The file {{/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala}} 
currently contains four top-level classes that span 1,500+ lines.  To make the 
file easier to navigate, and to bring it in line with the [Scala style 
guide|http://docs.scala-lang.org/style/files.html], I propose we split it into 
four files holding their respective classes: {{ALS.scala}}, 
{{ALSParams.scala}}, {{ALSModel.scala}}, and {{ALSModelParams.scala}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20484) Add documentation to ALS code

2017-04-26 Thread Daniel Li (JIRA)
Daniel Li created SPARK-20484:
-

 Summary: Add documentation to ALS code
 Key: SPARK-20484
 URL: https://issues.apache.org/jira/browse/SPARK-20484
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Daniel Li
Priority: Trivial


The documentation (both Scaladocs and inline comments) for the ALS code (in 
package {{org.apache.spark.ml.recommendation}}) can be clarified where needed 
and expanded where incomplete. This is especially important for parts of the 
code that are written imperatively for performance, as these parts don't 
benefit from the intuitive self-documentation of Scala's higher-level language 
abstractions. Specifically, I'd like to add documentation fully explaining the 
key functionality of the in-block and out-block objects, their purpose, how 
they relate to the overall ALS algorithm, and how they are calculated in such a 
way that new maintainers can ramp up much more quickly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20475) Whether to use "broadcast join" depends on Hive configuration

2017-04-26 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985838#comment-15985838
 ] 

Zhenhua Wang commented on SPARK-20475:
--

Spark also collects the table size whether or not the "analyze" command is run.
Did you check the threshold in Hive for map-side join? Is it the same as in Spark?
Besides, if you want to use broadcast join, Spark has a broadcast hint.

> Whether use "broadcast join" depends on hive configuration
> --
>
> Key: SPARK-20475
> URL: https://issues.apache.org/jira/browse/SPARK-20475
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Lijia Liu
>
> Currently, broadcast join in Spark only works when:
>   1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 0 
> (the default is 10485760).
>   2. The size of one of the Hive tables is less than 
> "spark.sql.autoBroadcastJoinThreshold". To get the size information of the 
> Hive table from the Hive metastore, "hive.stats.autogather" should be set to 
> true in Hive, or the command "ANALYZE TABLE  COMPUTE STATISTICS 
> noscan" must have been run.
> But Hive calculates the size of the file or directory corresponding to the 
> Hive table to determine whether to use map-side join, and does not depend on 
> the Hive metastore.
> This leads to two problems:
>   1. Spark will not use "broadcast join" when the Hive parameter 
> "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE 
>  COMPUTE STATISTICS noscan" has not been run, because the size 
> information of the Hive table has not been saved in the Hive metastore. How 
> Spark behaves thus depends on the configuration of Hive.
>   2. For some reason, we set "hive.stats.autogather" to false in our Hive. 
> For the same SQL, Hive is 4 times faster than Spark because Hive used 
> "map-side join" but Spark did not use "broadcast join".
> Is it possible for Spark to use the same mechanism as Hive to look up the 
> size of a Hive table?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-26 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20435:


Thanks Marcelo!




> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 and SPARK-19720 added redaction of sensitive information (e.g. 
> Hadoop credential provider passwords, AWS access/secret keys) from event logs 
> + YARN logs + UI and from the console output, respectively.
> While some unit tests were added along with these changes, they only asserted 
> that, when a sensitive key was found, redaction took place for that key. They 
> didn't assert globally that, when running a full-fledged Spark app (whether on 
> YARN or locally), sensitive information was not present in any of the logs or 
> UI. Such a test would also prevent regressions from happening in the future if 
> someone unknowingly adds extra logging that publishes sensitive information to 
> disk or the UI.
> Indeed, it was found that in some Java configurations, sensitive information 
> was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the redaction logic previously only checked whether the key matched 
> the secret regex pattern and, if so, redacted its value. That worked for most 
> cases. However, in the above case the key (sun.java.command) doesn't reveal 
> much, so the value needs to be searched. The check therefore needs to be 
> expanded to match against values as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-26 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20435:


Thanks Marcelo!




> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 and SPARK-19720 were works to redact sensitive information (e.g. 
> hadoop credential provider password, AWS access/secret keys) from event logs 
> + YARN logs + UI and from the console output, respectively.
> While some unit tests were added along with these changes - they asserted 
> when a sensitive key was found, that redaction took place for that key. They 
> didn't assert globally that when running a full-fledged Spark app (whether or 
> YARN or locally), that sensitive information was not present in any of the 
> logs or UI. Such a test would also prevent regressions from happening in the 
> future if someone unknowingly adds extra logging that publishes out sensitive 
> information to disk or UI.
> Consequently, it was found that in some Java configurations, sensitive 
> information was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the previous redaction logic only checked whether the key matched the 
> secret regex pattern and, if it did, redacted its value. That worked for most 
> cases. However, in the above case, the key (sun.java.command) doesn't tell much, 
> so the value needs to be searched. The check therefore needs to be expanded to 
> match against values as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [jira] [Resolved] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-26 Thread Mark Grover
Thanks Marcelo!

On Apr 26, 2017 5:07 PM, "Marcelo Vanzin (JIRA)"  wrote:

>
>  [ https://issues.apache.org/jira/browse/SPARK-20435?page=
> com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Marcelo Vanzin resolved SPARK-20435.
> 
>Resolution: Fixed
>  Assignee: Mark Grover
> Fix Version/s: 2.2.0
>
> > More thorough redaction of sensitive information from logs/UI, more unit
> tests
> > 
> --
> >
> > Key: SPARK-20435
> > URL: https://issues.apache.org/jira/browse/SPARK-20435
> > Project: Spark
> >  Issue Type: Bug
> >  Components: Spark Core
> >Affects Versions: 2.2.0
> >Reporter: Mark Grover
> >Assignee: Mark Grover
> > Fix For: 2.2.0
> >
> >
> > SPARK-18535 and SPARK-19720 were efforts to redact sensitive information
> (e.g. hadoop credential provider password, AWS access/secret keys) from
> event logs + YARN logs + UI and from the console output, respectively.
> > While some unit tests were added along with these changes, they only
> asserted that, when a sensitive key was found, redaction took place for that
> key. They didn't assert globally that, when running a full-fledged Spark app
> (whether on YARN or locally), sensitive information was not present in
> any of the logs or UI. Such a test would also prevent regressions from
> happening in the future if someone unknowingly adds extra logging that
> publishes sensitive information to disk or the UI.
> > Consequently, it was found that in some Java configurations, sensitive
> information was still being leaked in the event logs under the
> {{SparkListenerEnvironmentUpdate}} event, like so:
> > {code}
> > "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> > {code}
> > "secret_password" should have been redacted.
> > Moreover, the previous redaction logic only checked whether the key
> matched the secret regex pattern and, if it did, redacted its value. That
> worked for most cases. However, in the above case, the key (sun.java.command)
> doesn't tell much, so the value needs to be searched. The check therefore
> needs to be expanded to match against values as well.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.15#6346)
>
> -
> To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
> For additional commands, e-mail: issues-h...@spark.apache.org
>
>


[jira] [Resolved] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-26 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20435.

   Resolution: Fixed
 Assignee: Mark Grover
Fix Version/s: 2.2.0

> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 and SPARK-19720 were efforts to redact sensitive information (e.g. 
> hadoop credential provider password, AWS access/secret keys) from event logs 
> + YARN logs + UI and from the console output, respectively.
> While some unit tests were added along with these changes, they only asserted 
> that, when a sensitive key was found, redaction took place for that key. They 
> didn't assert globally that, when running a full-fledged Spark app (whether on 
> YARN or locally), sensitive information was not present in any of the 
> logs or UI. Such a test would also prevent regressions from happening in the 
> future if someone unknowingly adds extra logging that publishes sensitive 
> information to disk or the UI.
> Consequently, it was found that in some Java configurations, sensitive 
> information was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the previous redaction logic only checked whether the key matched the 
> secret regex pattern and, if it did, redacted its value. That worked for most 
> cases. However, in the above case, the key (sun.java.command) doesn't tell much, 
> so the value needs to be searched. The check therefore needs to be expanded to 
> match against values as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20482) Resolving Casts is too strict on having time zone set

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985720#comment-15985720
 ] 

Apache Spark commented on SPARK-20482:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/1

> Resolving Casts is too strict on having time zone set
> -
>
> Key: SPARK-20482
> URL: https://issues.apache.org/jira/browse/SPARK-20482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-20483:
---
Description: if {{spark.cores.max = 10}} for example and 
{{spark.executor.cores = 4}}, 2 executors will get launched thus 
{{totalCoresAcquired = 8}}. All future Mesos offers will not get tasks launched 
because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.  (was: if {{spark.cores.max = 10}} for 
example and {{spark.executor.cores = 4}}, 2 executors will get launched thus 
{{totalCoresAcquired = 8}}. All future Mesos offers will not get tasks launched 
because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.

Relates to: SPARK-12554, SPARK-19702)

> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Davis Shepherd
>Priority: Minor
>
> if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
> executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.
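
To spell out the arithmetic in the description, here is a small sketch (the literal values come from the example above; the names only mirror the prose, not the actual scheduler backend code):

{code}
val maxCores = 10            // spark.cores.max
val executorCores = 4        // spark.executor.cores
val totalCoresAcquired = 8   // two executors already launched

// Launch check: 8 + 4 <= 10 is false, so no further executor is ever launched.
val canLaunchAnother = executorCores + totalCoresAcquired <= maxCores   // false

// "Decline for a long time" check: 8 >= 10 is also false, so offers are only
// declined for the short Mesos default, starving other frameworks.
val declineLonger = totalCoresAcquired >= maxCores                      // false
{code}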



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-20483:
---
Description: 
if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
offers will not get tasks launched because 
{{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.

Relates to: SPARK-12554, SPARK-19702

  was:
if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
offers will not get tasks launched because 
{{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.

Relates to: SPARK-12554


> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Davis Shepherd
>Priority: Minor
>
> if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
> executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.
> Relates to: SPARK-12554, SPARK-19702



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20481) Wrong mapping for BooleanType in Spark SQL

2017-04-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985716#comment-15985716
 ] 

Xiao Li commented on SPARK-20481:
-

We are unable to change this default data type mapping; doing so could cause 
external behavior changes.

In Spark 2.2, you can manually override it by using the option 
`createTableColumnTypes`. For details, see the following PR: 
https://github.com/apache/spark/pull/16209
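
As a rough sketch of how that option can be used (the JDBC URL, table name, connection properties, and a suitable JDBC driver on the classpath are assumptions for this example):

{code}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("boolean-mapping-example").getOrCreate()
val df = spark.range(10).selectExpr("id", "id % 2 = 0 AS flag")

df.write
  // Ask for BOOLEAN in the generated DDL instead of the default BIT(1) mapping.
  .option("createTableColumnTypes", "flag BOOLEAN")
  .jdbc("jdbc:redshift://examplehost:5439/dev", "target_table", new Properties())
{code}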

> Wrong mapping for BooleanType in Spark SQL
> --
>
> Key: SPARK-20481
> URL: https://issues.apache.org/jira/browse/SPARK-20481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Talat UYARER
>Priority: Minor
>
> BooleanType is mapped to the BIT type on the SQL side. 
> https://github.com/apache/spark/blob/a26e3ed5e414d0a350cfe65dd511b154868b9f1d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L159
> Is there any specific reason it shouldn't be mapping to Boolean?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20481) Wrong mapping for BooleanType in Spark SQL

2017-04-26 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985695#comment-15985695
 ] 

Talat UYARER commented on SPARK-20481:
--

I am using Redshift.

> Wrong mapping for BooleanType in Spark SQL
> --
>
> Key: SPARK-20481
> URL: https://issues.apache.org/jira/browse/SPARK-20481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Talat UYARER
>Priority: Minor
>
> BooleanType is mapped to the BIT type on the SQL side. 
> https://github.com/apache/spark/blob/a26e3ed5e414d0a350cfe65dd511b154868b9f1d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L159
> Is there any specific reason it shouldn't be mapping to Boolean?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-20483:
---
Description: 
if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
offers will not get tasks launched because 
{{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.

Relates to: SPARK-12554

  was:
if `spark.cores.max = 10` for example and `spark.executor.cores = 4`, 2 
executors will get launched thus `totalCoresAcquired = 8`. All future Mesos 
offers will not get tasks launched because 
`sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores` 
will always evaluate to false.  However, in `handleMatchedOffers` we check if 
`totalCoresAcquired >= maxCores` to determine if we should decline the offer 
"for a configurable amount of time to avoid starving other frameworks", and 
this will always evaluate to false in the above scenario. This leaves the 
framework in a state of limbo where it will never launch any new executors, but 
only decline offers for the Mesos default of 5 seconds, thus starving other 
frameworks of offers.

Relates to: SPARK-12554


> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Davis Shepherd
>Priority: Minor
>
> if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
> executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.
> Relates to: SPARK-12554



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread Davis Shepherd (JIRA)
Davis Shepherd created SPARK-20483:
--

 Summary: Mesos Coarse mode may starve other Mesos frameworks if 
max cores is not a multiple of executor cores
 Key: SPARK-20483
 URL: https://issues.apache.org/jira/browse/SPARK-20483
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Davis Shepherd
Priority: Minor


if `spark.cores.max = 10` for example and `spark.executor.cores = 4`, 2 
executors will get launched thus `totalCoresAcquired = 8`. All future Mesos 
offers will not get tasks launched because 
`sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores` 
will always evaluate to false.  However, in `handleMatchedOffers` we check if 
`totalCoresAcquired >= maxCores` to determine if we should decline the offer 
"for a configurable amount of time to avoid starving other frameworks", and 
this will always evaluate to false in the above scenario. This leaves the 
framework in a state of limbo where it will never launch any new executors, but 
only decline offers for the Mesos default of 5 seconds, thus starving other 
frameworks of offers.

Relates to: SPARK-12554



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20481) Wrong mapping for BooleanType in Spark SQL

2017-04-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985690#comment-15985690
 ] 

Xiao Li commented on SPARK-20481:
-

Which DB are you using?

> Wrong mapping for BooleanType in Spark SQL
> --
>
> Key: SPARK-20481
> URL: https://issues.apache.org/jira/browse/SPARK-20481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Talat UYARER
>Priority: Minor
>
> BooleanType is mapped to the BIT type on the SQL side. 
> https://github.com/apache/spark/blob/a26e3ed5e414d0a350cfe65dd511b154868b9f1d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L159
> Is there any specific reason it shouldn't be mapping to Boolean?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20482) Resolving Casts is too strict on having time zone set

2017-04-26 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-20482:
-

Assignee: Kris Mok

> Resolving Casts is too strict on having time zone set
> -
>
> Key: SPARK-20482
> URL: https://issues.apache.org/jira/browse/SPARK-20482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20482) Resolving Casts is too strict on having time zone set

2017-04-26 Thread Kris Mok (JIRA)
Kris Mok created SPARK-20482:


 Summary: Resolving Casts is too strict on having time zone set
 Key: SPARK-20482
 URL: https://issues.apache.org/jira/browse/SPARK-20482
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kris Mok






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20454) Improvement of ShortestPaths in Spark GraphX

2017-04-26 Thread Ji Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Dai updated SPARK-20454:
---
Labels:   (was: patch)

> Improvement of ShortestPaths in Spark GraphX
> 
>
> Key: SPARK-20454
> URL: https://issues.apache.org/jira/browse/SPARK-20454
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, MLlib
>Affects Versions: 2.1.0
>Reporter: Ji Dai
>
> The output of ShortestPaths is not enough. ShortestPaths in Graph/lib is 
> currently a simple version that can only return the distance to the source 
> vertex. However, the shortest path with its intermediate nodes is also 
> needed, and if two or more paths hold the same shortest distance from source 
> to destination, all of these paths need to be returned. In this way, 
> ShortestPaths will be more functional and useful.
> I think I have resolved the concern above with an improved version of 
> ShortestPaths, which is also based on the "pregel" function in GraphOps.
> Can I get my code reviewed and merged?
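
For reference, a minimal sketch of what the current API returns (the random graph and landmark ids are made up for the example, and a spark-shell session is assumed so that sc is in scope):

{code}
import org.apache.spark.graphx.lib.ShortestPaths
import org.apache.spark.graphx.util.GraphGenerators

val graph = GraphGenerators.logNormalGraph(sc, 100)   // small random test graph
val landmarks = Seq(1L, 42L)                          // arbitrary target vertex ids
val result = ShortestPaths.run(graph, landmarks)
// Each vertex attribute is just Map(landmarkId -> hopDistance); the path itself is not returned.
result.vertices.take(5).foreach { case (id, dists) => println(s"$id -> $dists") }
{code}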



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20454) Improvement of ShortestPaths in Spark GraphX

2017-04-26 Thread Ji Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Dai updated SPARK-20454:
---
Flags: Important  (was: Patch,Important)

> Improvement of ShortestPaths in Spark GraphX
> 
>
> Key: SPARK-20454
> URL: https://issues.apache.org/jira/browse/SPARK-20454
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, MLlib
>Affects Versions: 2.1.0
>Reporter: Ji Dai
>
> The output of ShortestPaths is not enough. ShortestPaths in Graph/lib is 
> currently a simple version that can only return the distance to the source 
> vertex. However, the shortest path with its intermediate nodes is also 
> needed, and if two or more paths hold the same shortest distance from source 
> to destination, all of these paths need to be returned. In this way, 
> ShortestPaths will be more functional and useful.
> I think I have resolved the concern above with an improved version of 
> ShortestPaths, which is also based on the "pregel" function in GraphOps.
> Can I get my code reviewed and merged?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20454) Improvement of ShortestPaths in Spark GraphX

2017-04-26 Thread Ji Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Dai updated SPARK-20454:
---
Target Version/s:   (was: 2.1.0)

> Improvement of ShortestPaths in Spark GraphX
> 
>
> Key: SPARK-20454
> URL: https://issues.apache.org/jira/browse/SPARK-20454
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, MLlib
>Affects Versions: 2.1.0
>Reporter: Ji Dai
>  Labels: patch
>
> The output of ShortestPaths is not enough. ShortestPaths in Graph/lib is 
> currently a simple version that can only return the distance to the source 
> vertex. However, the shortest path with its intermediate nodes is also 
> needed, and if two or more paths hold the same shortest distance from source 
> to destination, all of these paths need to be returned. In this way, 
> ShortestPaths will be more functional and useful.
> I think I have resolved the concern above with an improved version of 
> ShortestPaths, which is also based on the "pregel" function in GraphOps.
> Can I get my code reviewed and merged?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves closed SPARK-20480.
-
Resolution: Duplicate

> FileFormatWriter hides FetchFailedException from scheduler
> --
>
> Key: SPARK-20480
> URL: https://issues.apache.org/jira/browse/SPARK-20480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a large job where it was getting failures; I noticed they were 
> listed as "SparkException: Task failed while writing rows", but when I looked 
> further they were really caused by FetchFailure exceptions.  This is a 
> problem because the scheduler handles Fetch Failures differently than normal 
> exceptions. This can affect things like blacklisting.
> {noformat}
> 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
> 102902)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to host1.com:7337
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>   ... 8 more
> Caused by: java.io.IOException: Failed to connect to host1.com:7337
>   at 
> 

[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985590#comment-15985590
 ] 

Thomas Graves commented on SPARK-20480:
---

Ah, it looks like it should; I hadn't seen that JIRA. I'll give that fix a try.

> FileFormatWriter hides FetchFailedException from scheduler
> --
>
> Key: SPARK-20480
> URL: https://issues.apache.org/jira/browse/SPARK-20480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a large job where it was getting failures; I noticed they were 
> listed as "SparkException: Task failed while writing rows", but when I looked 
> further they were really caused by FetchFailure exceptions.  This is a 
> problem because the scheduler handles Fetch Failures differently than normal 
> exceptions. This can affect things like blacklisting.
> {noformat}
> 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
> 102902)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to host1.com:7337
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>   ... 8 more
> Caused by: java.io.IOException: Failed to 

[jira] [Created] (SPARK-20481) Wrong mapping for BooleanType in Spark SQL

2017-04-26 Thread Talat UYARER (JIRA)
Talat UYARER created SPARK-20481:


 Summary: Wrong mapping for BooleanType in Spark SQL
 Key: SPARK-20481
 URL: https://issues.apache.org/jira/browse/SPARK-20481
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Talat UYARER
Priority: Minor


BooleanType is mapped to the BIT type on the SQL side. 
https://github.com/apache/spark/blob/a26e3ed5e414d0a350cfe65dd511b154868b9f1d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L159

Is there any specific reason it shouldn't be mapping to Boolean?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985575#comment-15985575
 ] 

Mridul Muralidharan commented on SPARK-20480:
-

Shouldn't the fix for SPARK-19276 by [~imranr] handle this?

> FileFormatWriter hides FetchFailedException from scheduler
> --
>
> Key: SPARK-20480
> URL: https://issues.apache.org/jira/browse/SPARK-20480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a large job where it was getting failures; I noticed they were 
> listed as "SparkException: Task failed while writing rows", but when I looked 
> further they were really caused by FetchFailure exceptions.  This is a 
> problem because the scheduler handles Fetch Failures differently than normal 
> exceptions. This can affect things like blacklisting.
> {noformat}
> 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
> 102902)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to host1.com:7337
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>   ... 8 more
> Caused by: java.io.IOException: Failed to connect to 

[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985571#comment-15985571
 ] 

Thomas Graves commented on SPARK-20480:
---

Note: with blacklisting on, this caused the job to fail:
Job aborted due to stage failure: Aborting TaskSet 4.0 because task 21 
(partition 21) cannot run anywhere due to node and executor blacklist. 
Blacklisting behavior can be configured via spark.blacklist.*.

> FileFormatWriter hides FetchFailedException from scheduler
> --
>
> Key: SPARK-20480
> URL: https://issues.apache.org/jira/browse/SPARK-20480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a large job where it was getting failures; I noticed they were 
> listed as "SparkException: Task failed while writing rows", but when I looked 
> further they were really caused by FetchFailure exceptions.  This is a 
> problem because the scheduler handles Fetch Failures differently than normal 
> exceptions. This can affect things like blacklisting.
> {noformat}
> 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
> 102902)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to host1.com:7337
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> 

[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985567#comment-15985567
 ] 

Thomas Graves commented on SPARK-20480:
---

The exception in the TaskSetManager looks like:
17/04/26 20:09:21 INFO TaskSetManager: Lost task 3516.0 in stage 4.0 (TID 
103691) on gsbl521n33.blue.ygrid.yahoo.com, executor 4516: 
org.apache.spark.SparkException (Task failed while writing rows) [duplicate 22]

> FileFormatWriter hides FetchFailedException from scheduler
> --
>
> Key: SPARK-20480
> URL: https://issues.apache.org/jira/browse/SPARK-20480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a large job where it was getting failures; I noticed they were 
> listed as "SparkException: Task failed while writing rows", but when I looked 
> further they were really caused by FetchFailure exceptions.  This is a 
> problem because the scheduler handles Fetch Failures differently than normal 
> exceptions. This can affect things like blacklisting.
> {noformat}
> 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
> 102902)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to host1.com:7337
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> 

[jira] [Updated] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-20480:
--
Description: 
I was running a large job where it was getting failures; I noticed they were 
listed as "SparkException: Task failed while writing rows", but when I looked 
further they were really caused by FetchFailure exceptions.  This is a problem 
because the scheduler handles Fetch Failures differently than normal exceptions. 
This can affect things like blacklisting.



{noformat}
17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
102902)
org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
host1.com:7337
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
... 8 more
Caused by: java.io.IOException: Failed to connect to host1.com:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
  

[jira] [Created] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler

2017-04-26 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-20480:
-

 Summary: FileFormatWriter hides FetchFailedException from scheduler
 Key: SPARK-20480
 URL: https://issues.apache.org/jira/browse/SPARK-20480
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Thomas Graves
Priority: Critical


I was running a large job where it was getting failures; I noticed they were 
listed as "SparkException: Task failed while writing rows", but when I looked 
further they were really caused by FetchFailure exceptions.  This is a problem 
because the scheduler handles Fetch Failures differently than normal exceptions. 
This can affect things like blacklisting.
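
A minimal sketch of the general idea (illustrative only, not the actual Spark change): re-throw fetch failures instead of wrapping them in a generic SparkException, so the scheduler still sees them. The wrapper name below is made up.

{code}
import org.apache.spark.SparkException
import org.apache.spark.shuffle.FetchFailedException

def executeTaskWrapped[T](body: => T): T = {
  try {
    body
  } catch {
    case ffe: FetchFailedException =>
      // Propagate as-is so the scheduler can treat it as a fetch failure
      // (map-stage retry) rather than a plain task failure.
      throw ffe
    case t: Throwable =>
      throw new SparkException("Task failed while writing rows", t)
  }
}
{code}

The reported stack trace follows: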



{noformat}
17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 
102902)
org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
gsbl546n07.blue.ygrid.yahoo.com/10.213.43.94:7337
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
... 8 more
Caused by: java.io.IOException: Failed to connect to 
gsbl546n07.blue.ygrid.yahoo.com/10.213.43.94:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at 

[jira] [Resolved] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL

2017-04-26 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12868.

   Resolution: Fixed
 Assignee: Weiqing Yang
Fix Version/s: 2.2.0

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Trystan Leftwich
>Assignee: Weiqing Yang
> Fix For: 2.2.0
>
>
> When trying to add a jar with an HDFS URI, i.e.
> {code:sql}
> ADD JAR hdfs:///tmp/foo.jar
> {code}
> via the Spark SQL JDBC interface, it will fail with:
> {code:sql}
> java.net.MalformedURLException: unknown protocol: hdfs
> at java.net.URL.(URL.java:593)
> at java.net.URL.(URL.java:483)
> at java.net.URL.(URL.java:432)
> at java.net.URI.toURL(URI.java:1089)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
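
For background, java.net.URL rejects a scheme unless a stream handler is 
registered for it, which is why the hdfs protocol fails above. A minimal 
sketch of one way to make the JVM accept hdfs URLs (it assumes the Hadoop 
client JARs are on the classpath, and it is not the patch that actually fixed 
this issue):

{code}
// Hypothetical demo, not the actual fix: register Hadoop's URL stream handler
// factory so java.net.URL understands hdfs:// (allowed at most once per JVM).
import java.net.URL
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

object HdfsUrlDemo {
  def main(args: Array[String]): Unit = {
    // new URL("hdfs:///tmp/foo.jar") would throw
    // java.net.MalformedURLException: unknown protocol: hdfs at this point.
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(new Configuration()))
    val url = new URL("hdfs:///tmp/foo.jar")
    println(url.getProtocol)  // prints "hdfs"
  }
}
{code}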



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20474) OnHeapColumnVector reallocation may not copy existing data

2017-04-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20474.
-
   Resolution: Fixed
 Assignee: Michal Szafranski
Fix Version/s: 2.2.0

> OnHeapColumnVector reallocation may not copy existing data
> -
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>Assignee: Michal Szafranski
> Fix For: 2.2.0
>
>
> OnHeapColumnVector reallocation copies data to the new storage only up to 
> 'elementsAppended'. This variable is only updated when using the 
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
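
In other words, anything written through put() past the append watermark can 
be silently dropped on reallocation. An illustrative toy vector showing the 
pattern and the fix (this is not Spark's actual OnHeapColumnVector code):

{code}
// Toy illustration of the bug pattern, not the real OnHeapColumnVector.
class GrowableIntVector(initialCapacity: Int) {
  private var data = new Array[Int](initialCapacity)
  private var elementsAppended = 0

  // putX-style write: does NOT advance elementsAppended
  def put(rowId: Int, value: Int): Unit = data(rowId) = value

  // appendX-style write: advances elementsAppended
  def append(value: Int): Unit = { data(elementsAppended) = value; elementsAppended += 1 }

  def reserve(newCapacity: Int): Unit = {
    require(newCapacity >= data.length, "only growing is supported in this sketch")
    val newData = new Array[Int](newCapacity)
    // Buggy variant: Array.copy(data, 0, newData, 0, elementsAppended)
    // drops anything written via put() beyond elementsAppended.
    Array.copy(data, 0, newData, 0, data.length)   // copy everything instead
    data = newData
  }
}
{code}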



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-04-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985447#comment-15985447
 ] 

Thomas Graves commented on SPARK-20178:
---

Another thing we should tie in here is handling preempted containers better. 
This kind of matches my point above about improving the logic for deciding 
which node is actually bad when you get fetch failures, but it is a bit of a 
special case. If a container gets preempted on the YARN side, we need to 
detect that properly and not count it as a normal fetch failure. Right now 
that seems pretty difficult with the way we handle stage failures, but I guess 
you would just line that up and not count it as a normal stage failure.

> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures.  There are four JIRAs currently related to this.  
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753.
> I will put my initial thoughts in a follow-on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20479) Performance degradation for large number of hash-aggregated columns

2017-04-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20479:

Affects Version/s: (was: 2.3.0)
   2.2.0

> Performance degradation for large number of hash-aggregated columns
> ---
>
> Key: SPARK-20479
> URL: https://issues.apache.org/jira/browse/SPARK-20479
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> In a comment on SPARK-20184, [~maropu] showed that performance degrades 
> when the number of aggregated columns gets large with whole-stage codegen.
> {code}
> ./bin/spark-shell --master local[1] --conf spark.driver.memory=2g --conf 
> spark.sql.shuffle.partitions=1 -v
> def timer[R](f: => {}): Unit = {
>   val count = 9
>   val iters = (0 until count).map { i =>
> val t0 = System.nanoTime()
> f
> val t1 = System.nanoTime()
> val elapsed = t1 - t0 + 0.0
> println(s"#$i: ${elapsed / 10.0}")
> elapsed
>   }
>   println("Elapsed time: " + ((iters.sum / count) / 10.0) + "s")
> }
> val numCols = 80
> val t = s"(SELECT id AS key1, id AS key2, ${((0 until numCols).map(i => s"id 
> AS c$i")).mkString(", ")} FROM range(0, 10, 1, 1))"
> val sqlStr = s"SELECT key1, key2, ${((0 until numCols).map(i => 
> s"SUM(c$i)")).mkString(", ")} FROM $t GROUP BY key1, key2 LIMIT 100"
> // Elapsed time: 2.308440490553s
> sql("SET spark.sql.codegen.wholeStage=true")
> timer { sql(sqlStr).collect }
> // Elapsed time: 0.527486733s
> sql("SET spark.sql.codegen.wholeStage=false")
> timer { sql(sqlStr).collect }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20476:
---

Assignee: Xiao Li

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Assignee: Xiao Li
>
> I encountered this problem when I wanted to create a table as select 
> get_json_object from xxx.
> This first statement is wrong:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
> at org.apache.spark.sql.Dataset.(Dataset.scala:179)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Assigned] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20476:


Assignee: (was: Apache Spark)

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> I encountered this problem when I wanted to create a table as select 
> get_json_object from xxx.
> This first statement is wrong:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
> at org.apache.spark.sql.Dataset.(Dataset.scala:179)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Assigned] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20476:


Assignee: Apache Spark

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Assignee: Apache Spark
>
> I encountered this problem when I wanted to create a table as select 
> get_json_object from xxx.
> This first statement is wrong:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
> at org.apache.spark.sql.Dataset.(Dataset.scala:179)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985399#comment-15985399
 ] 

Apache Spark commented on SPARK-20476:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17776

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> I encountered this problem when I wanted to create a table as select 
> get_json_object from xxx.
> This first statement is wrong:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
> at org.apache.spark.sql.Dataset.(Dataset.scala:179)
> at 

[jira] [Created] (SPARK-20479) Performance degradation for large number of hash-aggregated columns

2017-04-26 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20479:


 Summary: Performance degradation for large number of 
hash-aggregated columns
 Key: SPARK-20479
 URL: https://issues.apache.org/jira/browse/SPARK-20479
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


In a comment on SPARK-20184, [~maropu] showed that performance degrades when 
the number of aggregated columns gets large with whole-stage codegen.

{code}
./bin/spark-shell --master local[1] --conf spark.driver.memory=2g --conf 
spark.sql.shuffle.partitions=1 -v

def timer[R](f: => {}): Unit = {
  val count = 9
  val iters = (0 until count).map { i =>
val t0 = System.nanoTime()
f
val t1 = System.nanoTime()
val elapsed = t1 - t0 + 0.0
println(s"#$i: ${elapsed / 10.0}")
elapsed
  }
  println("Elapsed time: " + ((iters.sum / count) / 10.0) + "s")
}

val numCols = 80
val t = s"(SELECT id AS key1, id AS key2, ${((0 until numCols).map(i => s"id AS 
c$i")).mkString(", ")} FROM range(0, 10, 1, 1))"
val sqlStr = s"SELECT key1, key2, ${((0 until numCols).map(i => 
s"SUM(c$i)")).mkString(", ")} FROM $t GROUP BY key1, key2 LIMIT 100"

// Elapsed time: 2.308440490553s
sql("SET spark.sql.codegen.wholeStage=true")
timer { sql(sqlStr).collect }

// Elapsed time: 0.527486733s
sql("SET spark.sql.codegen.wholeStage=false")
timer { sql(sqlStr).collect }
{code}
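
To see how large the generated hash-aggregate method actually gets, the 
whole-stage-codegen source can be dumped from the shell (a sketch, reusing the 
sqlStr defined in the snippet above; the exact output depends on the Spark 
version):

{code}
// Dump the generated Java source for the query so the size of the
// hash-aggregate method can be inspected directly.
import org.apache.spark.sql.execution.debug._

sql("SET spark.sql.codegen.wholeStage=true")
sql(sqlStr).debugCodegen()
{code}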




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-26 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985319#comment-15985319
 ] 

Kazuaki Ishizaki commented on SPARK-20184:
--

When the number of aggregated columns gets large, I saw very complicated generated 
Java code for HashAggregation. I will create another JIRA to simplify the generated code.

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with 
> whole-stage codegen on, in contrast with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> Whole-stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> Whole-stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis I think this is related to the huge Java method (a 
> method of thousands of lines) generated by codegen.
> And if I configure -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).
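
As a stop-gap, the JIT flag mentioned above can be passed to the executors 
through Spark's standard configuration; driver-side flags have to go on the 
spark-submit/spark-shell command line instead, since the driver JVM is already 
running when SparkConf is read. A sketch of the workaround (not a fix):

{code}
// Disable the "don't JIT-compile huge methods" heuristic on executors so the
// giant generated method still gets compiled.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("codegen-huge-method-workaround")
  .config("spark.executor.extraJavaOptions", "-XX:-DontCompileHugeMethods")
  .getOrCreate()
{code}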



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20208:


Assignee: Maciej Szymkiewicz  (was: Apache Spark)

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20208:


Assignee: Apache Spark  (was: Maciej Szymkiewicz)

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985309#comment-15985309
 ] 

Apache Spark commented on SPARK-20208:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17775

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20473.
-
   Resolution: Fixed
 Assignee: Michal Szafranski
Fix Version/s: 2.2.0

> ColumnVector.Array is missing accessors for some types
> --
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>Assignee: Michal Szafranski
> Fix For: 2.2.0
>
>
> ColumnVector implementations originally did not support some Catalyst types 
> (float, short, and boolean). Now that they do, those types should also be 
> added to ColumnVector.Array.
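
A sketch of the shape of the change (hypothetical names, not the actual Spark 
classes): the array view only needs to forward the extra typed accessors to 
the backing vector with the row offset applied.

{code}
// Hypothetical shapes, for illustration only.
trait ColumnData {
  def getFloat(rowId: Int): Float
  def getShort(rowId: Int): Short
  def getBoolean(rowId: Int): Boolean
}

// An array cell is a (vector, offset, length) window over the child vector;
// each accessor simply offsets into the backing data.
class ArrayView(data: ColumnData, offset: Int, length: Int) {
  def getFloat(i: Int): Float     = data.getFloat(offset + i)
  def getShort(i: Int): Short     = data.getShort(offset + i)
  def getBoolean(i: Int): Boolean = data.getBoolean(offset + i)
}
{code}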



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20015) Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

2017-04-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20015:
-
Summary: Document R Structured Streaming (experimental) in R vignettes and 
R & SS programming guide, R example  (was: Document R Structured Streaming 
(experimental) in R vignettes and R & SS programming guide)

> Document R Structured Streaming (experimental) in R vignettes and R & SS 
> programming guide, R example
> -
>
> Key: SPARK-20015
> URL: https://issues.apache.org/jira/browse/SPARK-20015
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20477) Document R bisecting k-means in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985226#comment-15985226
 ] 

Felix Cheung commented on SPARK-20477:
--

[~wangmiao1981] Would you like to add this?

> Document R bisecting k-means in R programming guide
> ---
>
> Key: SPARK-20477
> URL: https://issues.apache.org/jira/browse/SPARK-20477
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20478) Document LinearSVC in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985227#comment-15985227
 ] 

Felix Cheung commented on SPARK-20478:
--

[~wangmiao1981] Would you like to add this?

> Document LinearSVC in R programming guide
> -
>
> Key: SPARK-20478
> URL: https://issues.apache.org/jira/browse/SPARK-20478
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20477) Document R bisecting k-means in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20477:
-
Issue Type: Documentation  (was: Bug)

> Document R bisecting k-means in R programming guide
> ---
>
> Key: SPARK-20477
> URL: https://issues.apache.org/jira/browse/SPARK-20477
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reopened SPARK-20208:
--

Actually, would you mind updating the R programming guide too?

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20478) Document LinearSVC in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-20478:


 Summary: Document LinearSVC in R programming guide
 Key: SPARK-20478
 URL: https://issues.apache.org/jira/browse/SPARK-20478
 Project: Spark
  Issue Type: Documentation
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20477) Document R bisecting k-means in R programming guide

2017-04-26 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-20477:


 Summary: Document R bisecting k-means in R programming guide
 Key: SPARK-20477
 URL: https://issues.apache.org/jira/browse/SPARK-20477
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985175#comment-15985175
 ] 

Xiao Li commented on SPARK-20476:
-

You can work around it with:
{noformat}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.') as col1
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{noformat}

I will fix it later.

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> I encountered this problem when I wanted to create a table as select 
> get_json_object from xxx.
> This first statement is wrong:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> 

[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

2017-04-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20470:

Component/s: SQL

> Invalid json converting RDD row with Array of struct to json
> 
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.3
>Reporter: Philip Adetiloye
>
> Trying to convert an RDD in PySpark containing an array of structs doesn't 
> generate the right JSON. It looks trivial, but I can't get good JSON out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> {
>   "start": 1.9796095151877942,
>   "y": 968.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.1360580208073983,
>   "y": 892.0,
>   "width": 0.1564485056196041
> },
> {
>   "start": 2.2925065264270024,
>   "y": 814.0,
>   "width": 0.15644850561960366
> },
> {
>   "start": 2.448955032046606,
>   "y": 690.0,
>   "width": 0.1564485056196041
> }]
> }
> {code}
> The DataFrame schema looks good:
> {code}
>  root
>   |-- feature: string (nullable = true)
>   |-- histogram: array (nullable = true)
>   ||-- element: struct (containsNull = true)
>   |||-- start: double (nullable = true)
>   |||-- width: double (nullable = true)
>   |||-- y: double (nullable = true)
> {code}
> I need to convert each row to JSON now and save it to HBase:
> {code}
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
> {code}
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
> [
>   1.9796095151877942,
>   968.0,
>   0.1564485056196041
> ],
> [
>   2.1360580208073983,
>   892.0,
>   0.1564485056196041
> ],
> [
>   2.2925065264270024,
>   814.0,
>   0.15644850561960366
> ],
> [
>   2.448955032046606,
>   690.0,
>   0.1564485056196041
> ]
> }
> {code}
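
In PySpark, Row.asDict() is not recursive by default, so nested Rows stay as 
tuples and json.dumps turns them into plain lists; asDict(recursive=True) or 
Spark's built-in JSON conversion avoids that. A Scala sketch of the built-in 
route, which preserves the nested struct fields (it assumes a DataFrame `df` 
with the schema shown above):

{code}
// Let Spark serialise each row, including nested structs, instead of
// hand-rolling json.dumps over row tuples; histogram entries stay as objects.
val firstRowAsJson = df.toJSON.take(1)
firstRowAsJson.foreach(println)
{code}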



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20475) Whether use "broadcast join" depends on hive configuration

2017-04-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985100#comment-15985100
 ] 

Xiao Li commented on SPARK-20475:
-

cc [~ZenWzh]

> Whether use "broadcast join" depends on hive configuration
> --
>
> Key: SPARK-20475
> URL: https://issues.apache.org/jira/browse/SPARK-20475
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Lijia Liu
>
> Currently, broadcast join in Spark only works when:
>   1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 
> 0 (the default is 10485760).
>   2. The size of one of the hive tables is less than 
> "spark.sql.autoBroadcastJoinThreshold". To get the size information of the 
> hive table from the hive metastore, "hive.stats.autogather" should be set to 
> true in hive, or the command "ANALYZE TABLE  COMPUTE STATISTICS 
> noscan" has to have been run.
> But Hive calculates the size of the file or directory corresponding to the 
> hive table to decide whether to use a map-side join, and does not depend on 
> the hive metastore.
> This leads to two problems:
>   1. Spark will not use "broadcast join" when the hive parameter 
> "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE 
>  COMPUTE STATISTICS noscan" has not been run, because the size 
> information of the hive table has not been saved in the hive metastore. So 
> how Spark behaves depends on the configuration of Hive.
>   2. For some reason, we set "hive.stats.autogather" to false in our Hive. 
> For the same SQL, Hive is 4 times faster than Spark because Hive used a 
> "map side join" but Spark did not use a "broadcast join".
> Is it possible for Spark to use the same mechanism as Hive to look up the 
> size of a hive table?
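
Until the statistics handling changes, two workarounds on the Spark side are 
to populate the table statistics explicitly or to bypass the threshold check 
with an explicit broadcast hint (table and DataFrame names below are 
placeholders):

{code}
// Option 1: populate table statistics so autoBroadcastJoinThreshold can see
// the table size from the metastore (placeholder table name).
spark.sql("ANALYZE TABLE small_dim_table COMPUTE STATISTICS noscan")

// Option 2: force the broadcast regardless of metastore statistics
// (placeholder DataFrames and join key).
import org.apache.spark.sql.functions.broadcast
val joined = largeFactDf.join(broadcast(smallDimDf), Seq("join_key"))
{code}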



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985096#comment-15985096
 ] 

Xiao Li commented on SPARK-20476:
-

This sounds like a bug to me. Let me double-check it.

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> I encountered this problem when I wanted to create a table as select 
> get_json_object from xxx.
> This first statement is wrong:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985040#comment-15985040
 ] 

Steve Loughran commented on SPARK-7481:
---

(This is a fairly long comment, but it tries to summarise the entire state of 
interaction with object stores, especially S3A on Hadoop 2.8+. Azure is simpler; 
GCS is Google's problem; Swift is not used very much.)

If you look at object stores & Spark (or indeed, any code which uses a 
filesystem as the source and destination of work), the problems can 
generally be grouped into a few categories.

h3. Foundational: talking to the object stores

Classpath & execution: can you wire the JARs up? A longstanding issue in ASF 
Spark releases (SPARK-5348, SPARK-12557). This was exacerbated by the movement 
of s3n:// to the hadoop-aws package (FWIW, I hadn't noticed that move; I'd have 
blocked it if I'd been paying attention). This includes transitive problems 
(SPARK-11413).

Credential propagation. Spark's env var propagation is pretty cute here; 
SPARK-19739 picks up {{AWS_SESSION_TOKEN}} too. Diagnosing failures is a 
real pain.
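
For completeness, a minimal sketch of wiring session credentials into S3A by 
hand when env var propagation doesn't reach far enough; the {{fs.s3a.*}} keys 
and the temporary-credentials provider are the stock Hadoop 2.8 ones, everything 
else is illustrative:

{code}
// Sketch only: hand the session credentials from the environment to S3A explicitly.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
hc.set("fs.s3a.session.token", sys.env("AWS_SESSION_TOKEN"))
// Session tokens need the temporary-credentials provider on Hadoop 2.8+.
hc.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
{code}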


h3. Observable Inconsistencies leading to Data loss

Generally where the metaphor "it's just a filesystem" fails. These are bad 
because they often "just work", especially in dev & test with small datasets, 
and when they go wrong, they can fail by generating bad results *and nobody 
notices*.

* Expectations of consistent listing of "directories". S3Guard (HADOOP-13345) 
deals with this, as can Netflix's S3mper and AWS's premium Dynamo-backed S3 
storage.
* Expectations on the transacted nature of directory renames, the core atomic 
commit operation against full filesystems.
* Expectations that when things are deleted they go away. This does become 
visible sometimes, usually in checks for a destination not existing 
(SPARK-19013)
* Expectations that write-in-progress data is visible/flushed, that {{close()}} 
is low cost. SPARK-19111.

Committing pretty much combines all of these, see below for more details.

h3. Aggressively bad performance

That's the mismatch between what the object store offers, what the apps expect, 
and the metaphor work in the Hadoop FileSystem implementations, which, in 
trying to hide the conceptual mismatch, can actually amplify the problem. 
Example: Directory tree scanning at the start of a query. The mock directory 
structure allows callers to do treewalks, when really a full list of all 
children can be done as a direct O(1) call. SPARK-17159 covers some of this for 
scanning directories in Spark Streaming, but there's a hidden tree walk in 
every call to {{FileSystem.globStatus()}} (HADOOP-13371). Given how S3Guard 
transforms this treewalk, and you need it for consistency, that's probably the 
best solution for now. Although I have a PoC which does a full List **/* 
followed by a filter, that's not viable when you have a wide deep tree and do 
need to prune aggressively.
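
A sketch of the treewalk-vs-flat-listing contrast against the Hadoop 
{{FileSystem}} API (the bucket and layout are hypothetical, and the second half 
is just the "List **/* then filter" PoC idea, not production code):

{code}
import org.apache.hadoop.fs.{LocatedFileStatus, Path}
import scala.collection.mutable.ArrayBuffer

val root = new Path("s3a://bucket/data")   // hypothetical layout: data/<year>/<month>/part-*
val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Glob expansion walks the mock directory tree, one listing per wildcard level.
val viaGlob = fs.globStatus(new Path(root, "*/*/part-*"))

// A recursive listFiles() can be served as a single flat object-store listing,
// with the pruning done client-side.
val it = fs.listFiles(root, true)
val viaList = ArrayBuffer.empty[LocatedFileStatus]
while (it.hasNext) {
  val status = it.next()
  if (status.getPath.getName.startsWith("part-")) viaList += status
}
{code}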

Checkpointing to object stores is similar: it's generally not dangerous to do 
the write+rename; it just adds the copy overhead, consistency issues 
notwithstanding.


h3. Suboptimal code. 

There are opportunities for speedup, but if it's not on the critical path, it's 
not worth the hassle. That said, as every call to {{getFileStatus()}} can take 
hundreds of millis, things get onto the critical path quite fast. Example: 
checking that a file exists before calling {{fs.delete(path)}} (the delete is 
always a no-op if the destination path isn't there), and the equivalent on 
mkdirs: {{if (!fs.exists(dir)) fs.mkdirs(dir)}}. Hadoop 3.0 will help steer 
people onto the path of righteousness there by deprecating a couple of methods 
which encourage inefficiencies (isFile/isDir).
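
A quick sketch of that pattern and the cheaper equivalent (hypothetical path; 
plain Hadoop {{FileSystem}} calls):

{code}
import org.apache.hadoop.fs.Path

val out = new Path("s3a://bucket/output")   // hypothetical destination
val fs = out.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Redundant: each exists() is an extra getFileStatus() round trip, hundreds of millis on S3A.
if (fs.exists(out)) fs.delete(out, true)
if (!fs.exists(out)) fs.mkdirs(out)

// Cheaper: delete() simply returns false when the path is absent,
// and mkdirs() behaves like `mkdir -p`, succeeding when the directory already exists.
fs.delete(out, true)
fs.mkdirs(out)
{code}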


h3. The commit problem 

The full commit problem combines all of these: you need a consistent listing of 
source data, your deleted destination path mustn't appear in listings, the 
commit of each task must promote that task's work to the pending output of the 
job, and an abort must leave no trace of it. The final job commit must place 
data into the final destination; again, a job abort must not make any output 
visible. There's some ambiguity about what happens if a task or job commit 
fails; generally the safest answer is "abort everything". Furthermore, nobody 
has any idea what to do if an {{abort()}} raises exceptions. Oh, and all of this 
must be fast. Spark is no better or worse here than the core MapReduce 
committers or Hive's.

Spark generally uses the Hadoop {{FileOutputFormat}} via the 
{{HadoopMapReduceCommitProtocol}}, directly or indirectly (e.g. 
{{ParquetOutputFormat}}), extracting its committer and casting it to 
{{FileOutputCommitter}}, primarily to get a working directory. This committer 
assumes the destination is a consistent FS and uses renames when promoting task 
and job output, assuming rename is so fast it doesn't even bother to log an 
"about to rename" message. Hence the recurrent Stack Overflow 

[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-20476:
--
Description: 
I encounter this problem when I want to create a table as "select 
get_json_object(...) from xxx".
This fails:
{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
It is ok.
{code}
create table spark_json_object as
select *
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
It is ok
{code}
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}

{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
org.apache.hadoop.hive.serde2.SerDeException 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)

[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-20476:
--
Description: 
I encounter this problem when I want to create a table as "select 
get_json_object(...) from xxx".
This fails:
{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
It is ok.
{code}
create table spark_json_object as
select *
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
It is ok
{code}
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
org.apache.hadoop.hive.serde2.SerDeException 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at 

[jira] [Comment Edited] (SPARK-17403) Fatal Error: Scan cached strings

2017-04-26 Thread Paul Lysak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982595#comment-15982595
 ] 

Paul Lysak edited comment on SPARK-17403 at 4/26/17 3:23 PM:
-

Looks like we have the same issue with Spark 2.1 on YARN (Amazon EMR release 
emr-5.4.0).
A workaround that solves the issue for us (at the cost of some performance) is 
to use df.persist(StorageLevel.DISK_ONLY) instead of df.cache().
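
For reference, the workaround is just a different storage level (a sketch; df 
stands for whatever DataFrame was being cached):

{code}
import org.apache.spark.storage.StorageLevel

// instead of df.cache():
df.persist(StorageLevel.DISK_ONLY)
df.count()   // materialize it, as before
{code}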

Depending on the node types, memory settings, storage level, and some other 
factors I couldn't clearly identify, it may appear as:

{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 158 in stage 504.0 failed 4 times, most recent failure: 
Lost task 158.3 in stage 504.0 (TID 427365, ip-10-35-162-171.ec2.internal, 
executor 83): java.lang.NegativeArraySizeException
at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
at org.apache.spark.unsafe.types.UTF8String.clone(UTF8String.java:826)
at 
org.apache.spark.sql.execution.columnar.StringColumnStats.gatherStats(ColumnStats.scala:217)
at 
org.apache.spark.sql.execution.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:55)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.org$apache$spark$sql$execution$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:122)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
{noformat}

or as 

{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 27 in stage 61.0 failed 4 times, most recent failure: Lost 
task 27.3 in stage 61.0 (TID 36167, ip-10-35-162-149.ec2.internal, executor 1): 
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_38$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:107)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
{noformat}

or as

{noformat}
2017-04-24 19:02:45,951 ERROR 
org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in 
thread Thread[Executor task launch worker-3,5,main]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_37$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:217)
at 

[jira] [Commented] (SPARK-17403) Fatal Error: Scan cached strings

2017-04-26 Thread Paul Lysak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984987#comment-15984987
 ] 

Paul Lysak commented on SPARK-17403:


Hope that helps - finally managed to reproduce it without using production data:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val leftRows = spark.sparkContext.parallelize(numSlices = 3000, seq = for (k <- 1 to 1000; j <- 1 to 1000) yield Row(k, j))
  .flatMap(r => (1 to 1000).map(l => Row.fromSeq(r.toSeq :+ l)))

val leftDf = spark.createDataFrame(leftRows, StructType(Seq(
  StructField("k", IntegerType),
  StructField("j", IntegerType),
  StructField("l", IntegerType)
)))
  .withColumn("combinedKey", expr("k*100 + j*1000 + l"))
  .withColumn("fixedCol", lit("sampleVal"))
  .withColumn("combKeyStr", format_number(col("combinedKey"), 0))
  .withColumn("k100", expr("k*100"))
  .withColumn("j100", expr("j*100"))
  .withColumn("l100", expr("l*100"))
  .withColumn("k_200", expr("k+200"))
  .withColumn("j_200", expr("j+200"))
  .withColumn("l_200", expr("l+200"))
  .withColumn("strCol1_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol1_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol1_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  .withColumn("strCol2_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol2_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol2_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  .withColumn("strCol3_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol3_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol3_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  // if the further columns are commented out, the error disappears

leftDf.cache()
println("= leftDf count:" + leftDf.count())
leftDf.show(10)

val rightRows = spark.sparkContext.parallelize((1 to 800).map(i => Row(i, "k_" + i, "sampleVal")))
val rightDf = spark.createDataFrame(rightRows, StructType(Seq(
  StructField("k", IntegerType),
  StructField("kStr", StringType),
  StructField("sampleCol", StringType)
)))

rightDf.cache()
println("= rightDf count:" + rightDf.count())
rightDf.show(10)

val joinedDf = leftDf.join(broadcast(rightDf), usingColumns = Seq("k"), joinType = "left")

joinedDf.cache()
println("= joinedDf count:" + joinedDf.count())
joinedDf.show(10)
{code}

ApplicationMaster fails with such exception:
{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 949 in stage 8.0 failed 4 times, most recent failure: Lost 
task 949.3 in stage 8.0 (TID 4922, ip-10-35-162-219.ec2.internal, executor 
139): java.lang.NegativeArraySizeException
at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
at org.apache.spark.unsafe.types.UTF8String.clone(UTF8String.java:826)
at 
org.apache.spark.sql.execution.columnar.StringColumnStats.gatherStats(ColumnStats.scala:216)
at 
org.apache.spark.sql.execution.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:55)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.org$apache$spark$sql$execution$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:122)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at 

[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-20476:
--
Description: 
{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}



{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
org.apache.hadoop.hive.serde2.SerDeException 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 

[jira] [Created] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-04-26 Thread cen yuhai (JIRA)
cen yuhai created SPARK-20476:
-

 Summary: Exception between "create table as" and "get_json_object"
 Key: SPARK-20476
 URL: https://issues.apache.org/jira/browse/SPARK-20476
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: cen yuhai


{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'')
from dw.dw_prd_restaurant where dt='2017-04-24' limit 10;
{code}

{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
org.apache.hadoop.hive.serde2.SerDeException 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 

[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984975#comment-15984975
 ] 

Sebastian Arzt commented on SPARK-18371:


Screenshots:

[before|https://issues.apache.org/jira/secure/attachment/12865156/01.png]
[after|https://issues.apache.org/jira/secure/attachment/12865158/02.png]

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Arzt updated SPARK-18371:
---
Comment: was deleted

(was: After fix)

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Arzt updated SPARK-18371:
---
Comment: was deleted

(was: Before fix)

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Arzt updated SPARK-18371:
---
Attachment: 01.png

Before fix

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Arzt updated SPARK-18371:
---
Attachment: 02.png

After fix

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18371:


Assignee: Apache Spark

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
>Assignee: Apache Spark
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18371:


Assignee: (was: Apache Spark)

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984962#comment-15984962
 ] 

Apache Spark commented on SPARK-18371:
--

User 'arzt' has created a pull request for this issue:
https://github.com/apache/spark/pull/17774

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API

2017-04-26 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-20391:


Assignee: Saisai Shao

> Properly rename the memory related fields in ExecutorSummary REST API
> -
>
> Key: SPARK-20391
> URL: https://issues.apache.org/jira/browse/SPARK-20391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Currently in Spark we could get executor summary through REST API 
> {{/api/v1/applications//executors}}. The format of executor summary 
> is:
> {code}
> class ExecutorSummary private[spark](
> val id: String,
> val hostPort: String,
> val isActive: Boolean,
> val rddBlocks: Int,
> val memoryUsed: Long,
> val diskUsed: Long,
> val totalCores: Int,
> val maxTasks: Int,
> val activeTasks: Int,
> val failedTasks: Int,
> val completedTasks: Int,
> val totalTasks: Int,
> val totalDuration: Long,
> val totalGCTime: Long,
> val totalInputBytes: Long,
> val totalShuffleRead: Long,
> val totalShuffleWrite: Long,
> val isBlacklisted: Boolean,
> val maxMemory: Long,
> val executorLogs: Map[String, String],
> val onHeapMemoryUsed: Option[Long],
> val offHeapMemoryUsed: Option[Long],
> val maxOnHeapMemory: Option[Long],
> val maxOffHeapMemory: Option[Long])
> {code}
> Here are 6 memory related fields: {{memoryUsed}}, {{maxMemory}}, 
> {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}}, 
> {{maxOffHeapMemory}}.
> All 6 of these fields reflect the *storage* memory usage in Spark, but from 
> their names a user can't tell whether they refer to *storage* memory or to 
> total memory (storage memory + execution memory). This is misleading.
> So I think we should properly rename these fields to reflect their real 
> meanings, or else document it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API

2017-04-26 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20391.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17700
[https://github.com/apache/spark/pull/17700]

> Properly rename the memory related fields in ExecutorSummary REST API
> -
>
> Key: SPARK-20391
> URL: https://issues.apache.org/jira/browse/SPARK-20391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Currently in Spark we could get executor summary through REST API 
> {{/api/v1/applications//executors}}. The format of executor summary 
> is:
> {code}
> class ExecutorSummary private[spark](
> val id: String,
> val hostPort: String,
> val isActive: Boolean,
> val rddBlocks: Int,
> val memoryUsed: Long,
> val diskUsed: Long,
> val totalCores: Int,
> val maxTasks: Int,
> val activeTasks: Int,
> val failedTasks: Int,
> val completedTasks: Int,
> val totalTasks: Int,
> val totalDuration: Long,
> val totalGCTime: Long,
> val totalInputBytes: Long,
> val totalShuffleRead: Long,
> val totalShuffleWrite: Long,
> val isBlacklisted: Boolean,
> val maxMemory: Long,
> val executorLogs: Map[String, String],
> val onHeapMemoryUsed: Option[Long],
> val offHeapMemoryUsed: Option[Long],
> val maxOnHeapMemory: Option[Long],
> val maxOffHeapMemory: Option[Long])
> {code}
> Here are 6 memory related fields: {{memoryUsed}}, {{maxMemory}}, 
> {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}}, 
> {{maxOffHeapMemory}}.
> All 6 of these fields reflect the *storage* memory usage in Spark, but from 
> their names a user can't tell whether they refer to *storage* memory or to 
> total memory (storage memory + execution memory). This is misleading.
> So I think we should properly rename these fields to reflect their real 
> meanings, or else document it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19812) YARN shuffle service fails to relocate recovery DB across NFS directories

2017-04-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-19812.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.0

> YARN shuffle service fails to relocate recovery DB across NFS directories
> -
>
> Key: SPARK-19812
> URL: https://issues.apache.org/jira/browse/SPARK-19812
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 2.2.0, 2.3.0
>
>
> The yarn shuffle service tries to switch from the yarn local directories to 
> the real recovery directory but can fail to move the existing recovery db's.  
> It fails due to Files.move not doing directories that have contents.
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move 
> recovery file sparkShuffleRecovery.ldb to the path 
> /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
> at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
> at 
> sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
> at java.nio.file.Files.move(Files.java:1395)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This used to use f.renameTo and we switched it in the PR due to review 
> comments, and it looks like we didn't do a final real test. The tests use 
> files rather than directories, so they didn't catch this. We need to fix the 
> tests as well.
> history: 
> https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440
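
A minimal sketch of the failing call described above, with paths mirroring the 
log (illustrative only):

{code}
import java.nio.file.{Files, Paths}

val src = Paths.get("/yarn-local/sparkShuffleRecovery.ldb")
val dst = Paths.get("/mapred/yarn-nodemanager/nm-aux-services/spark_shuffle/sparkShuffleRecovery.ldb")

// Files.move can rename a file or an empty directory, but a non-empty directory
// that cannot be moved without relocating its entries (local disk -> NFS here)
// makes it throw DirectoryNotEmptyException, as in the stack trace above.
Files.move(src, dst)
{code}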



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2017-04-26 Thread Sebastian Arzt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776
 ] 

Sebastian Arzt edited comment on SPARK-18371 at 4/26/17 1:24 PM:
-

I deep dived into it and found a simple solution. The problem is that 
[maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94]
 returns {{None}} for an unintended case. {{None}} should only be returned if 
there is no lag as indicated by this 
[condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114].
 However, this condition is also true if all backpressureRates are 
[rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107]
 to zero. I propose a solution where rounding is omitted altogether. This has the 
nice side-effect that backpressure is more fine-grained and not only an 
integral multiple of the 
[batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117]
 in seconds. I will open a pull request for it soon.
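
To make the failure mode concrete, here is a simplified illustration of the 
arithmetic only (made-up numbers, not the actual DirectKafkaInputDStream code):

{code}
val rate = 1.0f                                  // records/sec suggested by the rate estimator
val lags = Seq(1000L, 1000L, 1000L, 1000L)       // real lag on every partition
val totalLag = lags.sum.toFloat

// Per-partition share, rounded to whole records: every value collapses to 0.
val perPartition = lags.map(lag => Math.round(lag / totalLag * rate))

// The "sum > 0" guard then treats this as "no lag", so no rate limit is applied
// at all and a giant batch gets through.
println(perPartition)          // List(0, 0, 0, 0)
println(perPartition.sum > 0)  // false
{code}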


was (Author: seb.arzt):
I deep dived into it and found a simple solution. The problem is that 
[maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94]
 returns {{None}} for an unintended case. {{None}} should only be returned if 
there is no lag as indicated by this 
[condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114].
 However, this condition is also true if all backpressureRates are 
[rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107]
 to zero. I propose a solution, where rounding is omitted at all. I has the 
nice side-effect that the backpressure is more fine-grained and not only an 
integral multiple of the 
[batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117]
 in seconds. I will open a pull request for it soon.

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> The expectation is that it should, within some time, reduce the number of 
> records per batch to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984771#comment-15984771
 ] 

Steve Loughran commented on SPARK-7481:
---

I think we ended up going in circles on that PR. Sean has actually been very 
tolerant of me; however, progress has been hampered by my full-time focus on 
other things. I've only had time to work on the Spark PR intermittently, and 
that's been hard for everyone: for me in the rebase/retest cycle, and for the 
one reviewer in having to catch up each time.

Now, anyone who does manage to get that classpath right will discover that S3A 
absolutely flies with Spark: in partitioning (the list-file improvements), in 
data input (set the fadvise option for ORC and Parquet), and in output (enable 
the fast output path and play with the pool options). It delivers that 
performance partly because this patch sets things up for the integration tests 
downstream of it, so I and others can be confident that these things actually 
work, at speed and at scale. Indeed, much of the S3A performance work was 
actually based on Hive and Spark workloads: the data formats and their seek 
patterns, directory layouts, and file generation. All that's left is the little 
problem of getting the classpath
right. Oh, and the committer.
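
As a concrete illustration of those knobs, here is a small configuration sketch. 
The property names are the Hadoop 2.8-era S3A options that "fadvise" and "fast 
output" refer to, and the values are only examples; check the s3a documentation 
for your Hadoop version:

{code}
import org.apache.spark.sql.SparkSession

// Illustrative values only; tune to your workload and Hadoop version.
val spark = SparkSession.builder()
  .appName("s3a-tuning-sketch")
  // Random-access reads suit columnar formats such as ORC and Parquet.
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  // Incremental multipart upload ("fast output") instead of buffering the whole file.
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  // Pool sizes for upload threads and HTTP connections.
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  .config("spark.hadoop.fs.s3a.connection.maximum", "128")
  .getOrCreate()
{code}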


For now, for people's enjoyment, here are some videos from Spark Summit East on 
the topic:

* [Spark and object stores|https://youtu.be/8F2Jqw5_OnI]. 
* [Robust and Scalable etl over Cloud Storage With 
Spark|https://spark-summit.org/east-2017/events/robust-and-scalable-etl-over-cloud-storage-with-spark/]



> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>
> To keep the s3n classpath right and to add s3a, swift & azure support, the 
> Spark dependencies in a Hadoop 2.6+ profile need to pull in the relevant object 
> store packages (hadoop-aws, hadoop-openstack, hadoop-azure).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20473:


Assignee: Apache Spark

> ColumnVector.Array is missing accessors for some types
> --
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>Assignee: Apache Spark
>
> ColumnVector implementations originally did not support some Catalyst types 
> (float, short, and boolean). Now that they do, those types should also be 
> added to ColumnVector.Array.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20474) OnHeapColumnVector reallocation may not copy existing data

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20474:


Assignee: (was: Apache Spark)

> OnHeapColumnVector reallocation may not copy existing data
> -
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>
> OnHeapColumnVector reallocation copies data to the new storage only up to 
> 'elementsAppended'. This variable is only updated when using the 
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20473:


Assignee: (was: Apache Spark)

> ColumnVector.Array is missing accessors for some types
> --
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>
> ColumnVector implementations originally did not support some Catalyst types 
> (float, short, and boolean). Now that they do, those types should also be 
> added to ColumnVector.Array.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20474) OnHeapColumnVector reallocation may not copy existing data

2017-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20474:


Assignee: Apache Spark

> OnHeapColumnVector reallocation may not copy existing data
> -
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>Assignee: Apache Spark
>
> OnHeapColumnVector reallocation copies data to the new storage only up to 
> 'elementsAppended'. This variable is only updated when using the 
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20474) OnHeapColumnVector reallocation may not copy existing data

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984577#comment-15984577
 ] 

Apache Spark commented on SPARK-20474:
--

User 'michal-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/17773

> OnHeapColumnVector reallocation may not copy existing data
> -
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michal Szafranski
>
> OnHeapColumnVector reallocation copies data to the new storage only up to 
> 'elementsAppended'. This variable is only updated when using the 
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
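
To make the failure mode concrete, here is a small toy model (invented for 
illustration; it is not the real OnHeapColumnVector, which the pull request above 
patches) of what goes wrong when reallocation copies only the appended prefix 
while values were written through the put-style API:

{code}
// Toy model only: a growable int vector with the hazard described above.
class ToyIntVector(initialCapacity: Int) {
  private var data = new Array[Int](initialCapacity)
  private var elementsAppended = 0

  def appendInt(v: Int): Unit = {
    reserve(elementsAppended + 1)
    data(elementsAppended) = v
    elementsAppended += 1                      // only the append path bumps this counter
  }

  def putInt(rowId: Int, v: Int): Unit = {
    reserve(rowId + 1)
    data(rowId) = v                            // the put path does NOT bump elementsAppended
  }

  def getInt(rowId: Int): Int = data(rowId)

  def reserve(requiredCapacity: Int): Unit = {
    if (requiredCapacity > data.length) {
      val newData = new Array[Int](requiredCapacity * 2)
      // The described bug: only the appended prefix survives reallocation.
      System.arraycopy(data, 0, newData, 0, elementsAppended)
      data = newData
    }
  }
}

object ReallocationSketch extends App {
  val v = new ToyIntVector(4)
  (0 until 4).foreach(i => v.putInt(i, i + 1)) // written via put: elementsAppended stays 0
  v.reserve(8)                                 // reallocation copies zero elements
  println((0 until 4).map(v.getInt))           // Vector(0, 0, 0, 0): the put values were lost
}
{code}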



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


