[jira] [Assigned] (SPARK-20425) Support an extended display mode to print a column data per line
[ https://issues.apache.org/jira/browse/SPARK-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-20425:
-------------------------------

    Assignee: Takeshi Yamamuro

> Support an extended display mode to print a column data per line
> ----------------------------------------------------------------
>
>                 Key: SPARK-20425
>                 URL: https://issues.apache.org/jira/browse/SPARK-20425
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Minor
>             Fix For: 2.3.0
>
>
> On master, when printing a Dataset with many columns, the output is
> hard to read, for example:
> {code}
> scala> val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() AS c$i"): _*)
> scala> df.show(3, 0)
> +--+--+--+---+--+--+---+--+--+--+--+---+--+--+--+---+---+---+--+--+---+--+---+--+---+---+---++---+--+---++--+--+---+---+---+--+--+---+--+--+---+---+---+--+++---+---+---+---+---+---++---+---+---+---+--+--+---+---+--+---+--+--+-+---+---+--+---+--+---+---+---+--+---+--+---+---+---+---+---+---+---+---+--+---+---+--+--+--+---+--+---+--+---+---+---+
> |c0|c1|c2|c3 |c4|c5|c6 |c7 |c8|c9|c10 |c11 |c12 |c13 |c14 |c15 |c16|c17|c18 |c19 |c20|c21 |c22|c23 |c24|c25|c26|c27 |c28|c29 |c30|c31 |c32 |c33 |c34|c35 |c36|c37 |c38 |c39 |c40 |c41 |c42|c43 |c44|c45 |c46 |c47 |c48|c49|c50|c51 |c52|c53|c54 |c55 |c56|c57|c58|c59 |c60 |c61|c62|c63 |c64|c65 |c66 |c67 |c68|c69|c70 |c71 |c72 |c73|c74|c75 |c76 |c77|c78 |c79 |c80|c81|c82 |c83|c84|c85|c86 |c87 |c88|c89|c90 |c91 |c92 |c93|c94 |c95|c96 |c97|c98 |c99|
>
[jira] [Resolved] (SPARK-20425) Support an extended display mode to print a column data per line
[ https://issues.apache.org/jira/browse/SPARK-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-20425.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0
[jira] [Issue Comment Deleted] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests
[ https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Grover updated SPARK-20435:
--------------------------------
    Comment: was deleted

(was: Thanks Marcelo! )

> More thorough redaction of sensitive information from logs/UI, more unit tests
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-20435
>                 URL: https://issues.apache.org/jira/browse/SPARK-20435
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Mark Grover
>            Assignee: Mark Grover
>             Fix For: 2.2.0
>
>
> SPARK-18535 and SPARK-19720 redacted sensitive information (e.g. the
> hadoop credential provider password, AWS access/secret keys) from event logs
> + YARN logs + UI and from the console output, respectively.
> While some unit tests were added along with these changes, they only asserted
> that when a sensitive key was found, redaction took place for that key. They
> didn't assert globally that when running a full-fledged Spark app (whether on
> YARN or locally), sensitive information was not present in any of the
> logs or UI. Such a test would also prevent regressions in the
> future if someone unknowingly adds extra logging that publishes sensitive
> information to disk or UI.
> Consequently, it was found that in some Java configurations, sensitive
> information was still being leaked in the event logs under the
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the previous redaction logic only checked whether the key matched the
> secret regex pattern and, if so, redacted its value. That worked for most cases.
> However, in the above case, the key (sun.java.command) doesn't tell much, so
> the value needs to be searched. So the check needs to be expanded to match
> against values as well.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
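The key-versus-value point in SPARK-20435 above can be sketched in plain Python. This is only an illustration, not Spark's implementation: the pattern, the replacement text and the function name redact are assumptions made for the example.

```python
import re

# Assumed pattern and replacement text, chosen for illustration only.
SECRET_PATTERN = re.compile(r"(?i)secret|password")
REDACTED = "*********(redacted)"


def redact(pairs, pattern=SECRET_PATTERN):
    """Redact a value when either its key or the value itself matches the pattern."""
    result = []
    for key, value in pairs:
        if pattern.search(key) or pattern.search(value):
            result.append((key, REDACTED))
        else:
            result.append((key, value))
    return result


env = [
    ("spark.app.name", "demo"),
    # Caught by a key-only check.
    ("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "secret_password"),
    # The key is harmless; only a check on the value catches the leak.
    ("sun.java.command",
     "org.apache.spark.deploy.SparkSubmit --conf "
     "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password"),
]
redacted = redact(env)
```

Redacting the whole value is the simplest choice here; a finer-grained version could blank out only the matching part of the command line.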
[jira] [Closed] (SPARK-20475) Whether use "broadcast join" depends on hive configuration
[ https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lijia Liu closed SPARK-20475.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.3

> Whether use "broadcast join" depends on hive configuration
> ----------------------------------------------------------
>
>                 Key: SPARK-20475
>                 URL: https://issues.apache.org/jira/browse/SPARK-20475
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Lijia Liu
>             Fix For: 2.0.3
>
>
> Currently, broadcast join in Spark only works when:
>   1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than
> 0 (the default is 10485760).
>   2. The size of one of the hive tables is less than
> "spark.sql.autoBroadcastJoinThreshold". To get the size information of the
> hive table from the hive metastore, "hive.stats.autogather" should be set to
> true in hive or the command "ANALYZE TABLE COMPUTE STATISTICS
> noscan" should have been run.
> Hive, however, calculates the size of the file or directory corresponding to
> the hive table to determine whether to use the map side join, and does not
> depend on the hive metastore.
> This leads to two problems:
>   1. Spark will not use "broadcast join" when the hive parameter
> "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE
> COMPUTE STATISTICS noscan" has not been run, because the
> information of the hive table has not been saved in the hive metastore. The mode of
> work in Spark depends on the configuration of Hive.
>   2. For some reason, we set "hive.stats.autogather" to false in our Hive.
> For the same SQL, Hive is 4 times faster than Spark because Hive used "map
> side join" but Spark did not use "broadcast join".
> Is it possible to use the same mechanism as Hive's to look up the size of a
> hive table in Spark?
[jira] [Commented] (SPARK-20475) Whether use "broadcast join" depends on hive configuration
[ https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985995#comment-15985995 ]

Lijia Liu commented on SPARK-20475:
-----------------------------------

[~ZenWzh] This issue is a duplicate of https://issues.apache.org/jira/browse/SPARK-15365
So, setting spark.sql.statistics.fallBackToHdfs=true solves this problem.
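The decision discussed in SPARK-20475 can be sketched in plain Python. This is a simplified model rather than Spark's planner code: the function names are hypothetical, a local directory walk stands in for an HDFS listing, and an unknown size is modelled as None.

```python
import os
import tempfile

AUTO_BROADCAST_JOIN_THRESHOLD = 10485760  # default quoted in the issue, in bytes


def directory_size(path):
    """Sum the sizes of all files under path (stand-in for a file system listing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def estimated_table_size(metastore_size, location, fall_back_to_fs):
    """Prefer metastore statistics; optionally fall back to the files on disk."""
    if metastore_size is not None:
        return metastore_size
    if fall_back_to_fs:
        return directory_size(location)
    return None  # size unknown


def can_broadcast(size, threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """A table is a broadcast candidate only if its size is known and below the threshold."""
    return threshold > 0 and size is not None and size < threshold


# A 1 KiB "table" with no statistics in the metastore.
with tempfile.TemporaryDirectory() as table_dir:
    with open(os.path.join(table_dir, "part-00000"), "wb") as handle:
        handle.write(b"x" * 1024)
    without_fallback = can_broadcast(estimated_table_size(None, table_dir, False))
    with_fallback = can_broadcast(estimated_table_size(None, table_dir, True))
```

With no statistics and no fallback the small table is not a broadcast candidate; with the fallback enabled it is, which is the effect the comment above attributes to spark.sql.statistics.fallBackToHdfs=true.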
[jira] [Resolved] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-20470.
----------------------------------
    Resolution: Invalid

This is because {{recursive}} should be set to {{True}} in {{asDict}}.

With given data stored in {{tmp.json}},

{code}
import json
from pyspark.sql.types import Row

df = spark.read.json("tmp.json", wholeFile=True)
df.printSchema()
# recursive=True also converts the nested Row objects inside the array to dicts
rdd = df.rdd.map(lambda row: json.dumps(row.asDict(recursive=True), indent=2))
print(rdd.collect()[0])
{code}

prints as below:

{code}
root
 |-- feature: string (nullable = true)
 |-- histogram: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- start: double (nullable = true)
 |    |    |-- width: double (nullable = true)
 |    |    |-- y: double (nullable = true)

{
  "feature": "feature_id_001",
  "histogram": [
    {
      "y": 968.0,
      "start": 1.9796095151877942,
      "width": 0.1564485056196041
    },
    {
      "y": 892.0,
      "start": 2.1360580208073983,
      "width": 0.1564485056196041
    },
    {
      "y": 814.0,
      "start": 2.2925065264270024,
      "width": 0.15644850561960366
    },
    {
      "y": 690.0,
      "start": 2.448955032046606,
      "width": 0.1564485056196041
    }
  ]
}
{code}

> Invalid json converting RDD row with Array of struct to json
> ------------------------------------------------------------
>
>                 Key: SPARK-20470
>                 URL: https://issues.apache.org/jira/browse/SPARK-20470
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.6.3
>            Reporter: Philip Adetiloye
>
> Converting an RDD in pyspark that contains an Array of struct doesn't
> generate the right json. It looks trivial, but I can't get a good json out.
> I read the json below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
>     {
>       "start": 1.9796095151877942,
>       "y": 968.0,
>       "width": 0.1564485056196041
>     },
>     {
>       "start": 2.1360580208073983,
>       "y": 892.0,
>       "width": 0.1564485056196041
>     },
>     {
>       "start": 2.2925065264270024,
>       "y": 814.0,
>       "width": 0.15644850561960366
>     },
>     {
>       "start": 2.448955032046606,
>       "y": 690.0,
>       "width": 0.1564485056196041
>     }]
> }
> {code}
> Df schema looks good
> {code}
> root
>  |-- feature: string (nullable = true)
>  |-- histogram: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- start: double (nullable = true)
>  |    |    |-- width: double (nullable = true)
>  |    |    |-- y: double (nullable = true)
> {code}
> Need to convert each row to json now and save to HBase
> {code}
> rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict(
> {code}
> Output JSON (Wrong)
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
>     [
>       1.9796095151877942,
>       968.0,
>       0.1564485056196041
>     ],
>     [
>       2.1360580208073983,
>       892.0,
>       0.1564485056196041
>     ],
>     [
>       2.2925065264270024,
>       814.0,
>       0.15644850561960366
>     ],
>     [
>       2.448955032046606,
>       690.0,
>       0.1564485056196041
>     ]
> }
> {code}
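Why the nested structs in SPARK-20470 come out as JSON arrays can be reproduced without Spark. The sketch below uses collections.namedtuple as a stand-in for pyspark's Row (both are tuple subclasses, and the json module serializes any tuple as an array); the helper as_dict is hypothetical and only mimics the shallow and recursive behaviour described in the resolution above.

```python
import json
from collections import namedtuple

# Stand-ins for Row objects; assumption: like Row, a namedtuple is a tuple subclass.
Bin = namedtuple("Bin", ["start", "y", "width"])
Record = namedtuple("Record", ["feature", "histogram"])


def as_dict(value, recursive=False):
    """Convert a namedtuple to a dict; with recursive=True also convert nested ones."""
    def convert(obj):
        if isinstance(obj, tuple) and hasattr(obj, "_fields"):
            return {name: convert(getattr(obj, name)) for name in obj._fields}
        if isinstance(obj, list):
            return [convert(item) for item in obj]
        return obj

    if recursive:
        return convert(value)
    # Shallow conversion: nested namedtuples are left untouched.
    return dict(zip(value._fields, value))


record = Record(
    "feature_id_001",
    [Bin(1.9796095151877942, 968.0, 0.1564485056196041)],
)

# Shallow: the nested Bin is still a tuple, so json.dumps writes it as an array.
shallow = json.loads(json.dumps(as_dict(record)))
# Recursive: the nested Bin becomes a dict, so json.dumps writes an object.
deep = json.loads(json.dumps(as_dict(record, recursive=True)))
```

The shallow result has the same shape as the "Output JSON (Wrong)" in the report, and the recursive result has the shape shown in the resolution.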
[jira] [Comment Edited] (SPARK-20475) Whether use "broadcast join" depends on hive configuration
[ https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985838#comment-15985838 ]

Zhenhua Wang edited comment on SPARK-20475 at 4/27/17 3:53 AM:
---------------------------------------------------------------

Spark also collects table size whether or not the "analyze" command is run. Did you check the threshold in Hive for map side join? Is it the same as in Spark?
Besides, if you want to use broadcast join in Spark, there is a broadcast hint.

was (Author: zenwzh):
Spark also collects table size whether or not the "analyze" command is run. Did check the threshold in Hive for map side join? Is it the same as in Spark?
Besides, if you want to use broadcast join in Spark, there is a broadcast hint.
[jira] [Assigned] (SPARK-20487) `HiveTableScan` node is quite verbose in explained plan
[ https://issues.apache.org/jira/browse/SPARK-20487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20487:
------------------------------------

    Assignee:     (was: Apache Spark)

> `HiveTableScan` node is quite verbose in explained plan
> -------------------------------------------------------
>
>                 Key: SPARK-20487
>                 URL: https://issues.apache.org/jira/browse/SPARK-20487
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Tejas Patil
>            Priority: Trivial
>
> For hive tables, `explain()` prints a lot of information. This makes it hard
> to read the plan (esp. for large sql strings with numerous tables).
> eg.
> {noformat}
> scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true)
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter ('name = foo)
>    +- 'UnresolvedRelation `my_table`
> == Analyzed Logical Plan ==
> user_id: bigint, name: string, ds: string
> Project [user_id#13L, name#14, ds#15]
> +- Filter (name#14 = foo)
>    +- SubqueryAlias my_table
>       +- CatalogRelation CatalogTable(
>       Database: default
>       Table: my_table
>       Owner: tejasp
>       Created: Fri Apr 14 17:05:50 PDT 2017
>       Last Access: Wed Dec 31 16:00:00 PST 1969
>       Type: MANAGED
>       Provider: hive
>       Properties: [serialization.format=1]
>       Statistics: 9223372036854775807 bytes
>       Location: file:/tmp/warehouse/my_table
>       Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>       InputFormat: org.apache.hadoop.mapred.TextInputFormat
>       OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>       Partition Provider: Catalog
>       Partition Columns: [`ds`]
>       Schema: root
>       -- user_id: long (nullable = true)
>       -- name: string (nullable = true)
>       -- ds: string (nullable = true)
>       ), [user_id#13L, name#14], [ds#15]
> == Optimized Logical Plan ==
> Filter (isnotnull(name#14) && (name#14 = foo))
> +- CatalogRelation CatalogTable(
>    Database: default
>    Table: my_table
>    Owner: tejasp
>    Created: Fri Apr 14 17:05:50 PDT 2017
>    Last Access: Wed Dec 31 16:00:00 PST 1969
>    Type: MANAGED
>    Provider: hive
>    Properties: [serialization.format=1]
>    Statistics: 9223372036854775807 bytes
>    Location: file:/tmp/warehouse/my_table
>    Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>    InputFormat: org.apache.hadoop.mapred.TextInputFormat
>    OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>    Partition Provider: Catalog
>    Partition Columns: [`ds`]
>    Schema: root
>    -- user_id: long (nullable = true)
>    -- name: string (nullable = true)
>    -- ds: string (nullable = true)
>    ), [user_id#13L, name#14], [ds#15]
> == Physical Plan ==
> *Filter (isnotnull(name#14) && (name#14 = foo))
> +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation CatalogTable(
>    Database: default
>    Table: my_table
>    Owner: tejasp
>    Created: Fri Apr 14 17:05:50 PDT 2017
>    Last Access: Wed Dec 31 16:00:00 PST 1969
>    Type: MANAGED
>    Provider: hive
>    Properties: [serialization.format=1]
>    Statistics: 9223372036854775807 bytes
>    Location: file:/tmp/warehouse/my_table
>    Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>    InputFormat: org.apache.hadoop.mapred.TextInputFormat
>    OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>    Partition Provider: Catalog
>    Partition Columns: [`ds`]
>    Schema: root
>    -- user_id: long (nullable = true)
>    -- name: string (nullable = true)
>    -- ds: string (nullable = true)
>    ), [user_id#13L, name#14], [ds#15]
> {noformat}
[jira] [Assigned] (SPARK-20487) `HiveTableScan` node is quite verbose in explained plan
[ https://issues.apache.org/jira/browse/SPARK-20487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20487:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-20487) `HiveTableScan` node is quite verbose in explained plan
[ https://issues.apache.org/jira/browse/SPARK-20487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985907#comment-15985907 ]

Apache Spark commented on SPARK-20487:
--------------------------------------

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17780
[jira] [Created] (SPARK-20487) `HiveTableScan` node is quite verbose in explained plan
Tejas Patil created SPARK-20487:
-----------------------------------

             Summary: `HiveTableScan` node is quite verbose in explained plan
                 Key: SPARK-20487
                 URL: https://issues.apache.org/jira/browse/SPARK-20487
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Tejas Patil
            Priority: Trivial
[jira] [Updated] (SPARK-20403) It is wrong to the instructions of some functions,such as boolean,tinyint,smallint,int,bigint,float,double,decimal,date,timestamp,binary,string
[ https://issues.apache.org/jira/browse/SPARK-20403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liuxian updated SPARK-20403:
----------------------------
    Priority: Minor  (was: Trivial)

> It is wrong to the instructions of some functions,such as boolean,tinyint,smallint,int,bigint,float,double,decimal,date,timestamp,binary,string
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20403
>                 URL: https://issues.apache.org/jira/browse/SPARK-20403
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, SQL
>    Affects Versions: 2.1.0
>            Reporter: liuxian
>            Priority: Minor
>
> spark-sql>desc function boolean;
> Function: boolean
> Class: org.apache.spark.sql.catalyst.expressions.Cast
> Usage: boolean(expr AS type) - Casts the value `expr` to the target data type `type`.
> spark-sql>desc function int;
> Function: int
> Class: org.apache.spark.sql.catalyst.expressions.Cast
> Usage: int(expr AS type) - Casts the value `expr` to the target data type `type`.
[jira] [Assigned] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
[ https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai reassigned SPARK-20483:
-------------------------------

    Assignee: Davis Shepherd

> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20483
>                 URL: https://issues.apache.org/jira/browse/SPARK-20483
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 2.0.2, 2.1.0, 2.1.1
>            Reporter: Davis Shepherd
>            Assignee: Davis Shepherd
>            Priority: Minor
>
> If, for example, {{spark.cores.max = 10}} and {{spark.executor.cores = 4}}, 2
> executors will get launched, thus {{totalCoresAcquired = 8}}. No future Mesos
> offer will get tasks launched because
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores}}
> will always evaluate to false. However, in
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to
> determine if we should decline the offer "for a configurable amount of time
> to avoid starving other frameworks", and this will also always evaluate to false
> in the above scenario. This leaves the framework in a state of limbo where it
> will never launch any new executors, but only declines offers for the Mesos
> default of 5 seconds, thus starving other frameworks of offers.
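The arithmetic in SPARK-20483 can be checked with a few lines of plain Python. This is a toy model of the two conditions quoted in the description, not the Mesos scheduler backend itself; the function name simulate and the launch loop are assumptions made for the example, and executor_cores is assumed to be positive.

```python
def simulate(max_cores, executor_cores):
    """Return (total_cores_acquired, can_launch_more, declines_for_long_period)."""
    total_cores_acquired = 0
    # Launch executors while one more still fits under the cap.
    while executor_cores + total_cores_acquired <= max_cores:
        total_cores_acquired += executor_cores
    # Condition quoted in the issue for launching tasks on a new offer.
    can_launch_more = executor_cores + total_cores_acquired <= max_cores
    # Condition quoted in the issue for declining offers for a longer period.
    declines_for_long_period = total_cores_acquired >= max_cores
    return total_cores_acquired, can_launch_more, declines_for_long_period


# spark.cores.max = 10, spark.executor.cores = 4: the case from the report.
stuck = simulate(10, 4)
# spark.cores.max = 8, spark.executor.cores = 4: max cores is a multiple.
aligned = simulate(8, 4)
```

In the reported case both conditions are false at 8 acquired cores, so the framework neither launches executors nor backs off. When max cores is a multiple of executor cores, the second condition becomes true and the longer decline applies.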
[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
[ https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-20483: Shepherd: DB Tsai > Mesos Coarse mode may starve other Mesos frameworks if max cores is not a > multiple of executor cores > > > Key: SPARK-20483 > URL: https://issues.apache.org/jira/browse/SPARK-20483 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.2, 2.1.0, 2.1.1 >Reporter: Davis Shepherd >Priority: Minor > > if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 > executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos > offers will not get tasks launched because > {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= > maxCores}} will always evaluate to false. However, in > {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to > determine if we should decline the offer "for a configurable amount of time > to avoid starving other frameworks", and this will always evaluate to false > in the above scenario. This leaves the framework in a state of limbo where it > will never launch any new executors, but only decline offers for the Mesos > default of 5 seconds, thus starving other frameworks of offers. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
[ https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-20483: Affects Version/s: 2.1.1 2.0.2 > Mesos Coarse mode may starve other Mesos frameworks if max cores is not a > multiple of executor cores > > > Key: SPARK-20483 > URL: https://issues.apache.org/jira/browse/SPARK-20483 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.2, 2.1.0, 2.1.1 >Reporter: Davis Shepherd >Priority: Minor > > if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 > executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos > offers will not get tasks launched because > {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= > maxCores}} will always evaluate to false. However, in > {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to > determine if we should decline the offer "for a configurable amount of time > to avoid starving other frameworks", and this will always evaluate to false > in the above scenario. This leaves the framework in a state of limbo where it > will never launch any new executors, but only decline offers for the Mesos > default of 5 seconds, thus starving other frameworks of offers. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985888#comment-15985888 ] Daniel Li commented on SPARK-20468: --- Thanks for the guidance, [~srowen]. I've read the Contributing guide now; my apologies for not reading it through sooner. I've closed the PR and created a few new Jira issues to propose documentation additions and internal code clarity improvements: SPARK-20484, SPARK-20485, and SPARK-20486. How do you want to proceed? > Refactor the ALS code > - > > Key: SPARK-20468 > URL: https://issues.apache.org/jira/browse/SPARK-20468 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Priority: Minor > Labels: documentation, readability, refactoring > > The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is > quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all > in one file. Here are some things I think could improve the clarity and > maintainability of the code: > * The file can be split into more manageable parts. In particular, {{ALS}}, > {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files > for better readability. > * Certain parts can be encapsulated or moved to clarify the intent. For > example: > ** The {{ALS.train}} method is currently defined in the {{ALS}} companion > object, and it seems to take 12 individual parameters that are all members of > the {{ALS}} class. This method can be made an instance method. > ** The code that creates in-blocks and out-blocks in the body of > {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods > in the {{ALS}} companion object, can be moved into a separate case class that > holds the blocks. This has the added benefit of allowing us to write > specific Scaladoc to explain the logic behind these block objects, as their > usage is certainly nontrivial yet is fundamental to the implementation. 
> ** The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be > hidden within {{UncompressedInBlock}} to clarify the scope of their usage. > ** Various builder classes could be encapsulated in the companion objects of > the classes they build. > * The code can be formatted more clearly. For example: > ** Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be > formatted more clearly and have comments added explaining the reasoning > behind key parts. That these methods form the core of the ALS logic makes > this doubly important for maintainability. > ** Parts of the code that use {{while}} loops with manually incremented > counters can be rewritten as {{for}} loops. > ** Where non-idiomatic Scala code is used that doesn't improve performance > much, clearer code can be substituted. (This in particular should be done > very carefully if at all, as it's apparent the original author spent much > time and pains optimizing the code to significantly improve its runtime > profile.) > * The documentation (both Scaladocs and inline comments) can be clarified > where needed and expanded where incomplete. This is especially important for > parts of the code that are written imperatively for performance, as these > parts don't benefit from the intuitive self-documentation of Scala's > higher-level language abstractions. Specifically, I'd like to add > documentation fully explaining the key functionality of the in-block and > out-block objects, their purpose, how they relate to the overall ALS > algorithm, and how they are calculated in such a way that new maintainers can > ramp up much more quickly. > The above is not a complete enumeration of improvements but a high-level > survey. All of these improvements will, I believe, add up to make the code > easier to understand, extend, and maintain. This issue will track the > progress of this refactoring so that going forward, authors will have an > easier time maintaining this part of the project. 
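Editorial note: the {{while}}-to-{{for}} suggestion in the description above can be illustrated with a toy example. This sketch is not taken from ALS.scala; {{LoopStyleSketch}}, {{sumWhile}}, and {{sumFor}} are hypothetical names. The two methods compute the same result. The {{for}} form desugars to a {{foreach}} call on the index range, so whether it performs as well as the {{while}} form would need to be measured, which is the caution the ticket itself raises.

```scala
// Toy comparison of the two loop styles mentioned in the ticket.
// Assumption: illustrative only, not code from the ALS implementation.
object LoopStyleSketch {

  // Imperative style: while loop with a manually incremented counter.
  def sumWhile(xs: Array[Double]): Double = {
    var i = 0
    var acc = 0.0
    while (i < xs.length) {
      acc += xs(i)
      i += 1
    }
    acc
  }

  // Same computation written as a for loop over the array's index range.
  def sumFor(xs: Array[Double]): Double = {
    var acc = 0.0
    for (i <- xs.indices) {
      acc += xs(i)
    }
    acc
  }
}
```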
[jira] [Updated] (SPARK-20486) Encapsulate ALS in-block and out-block data structures and methods into a separate class
[ https://issues.apache.org/jira/browse/SPARK-20486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Li updated SPARK-20486:
--
    Description:
The in-block and out-block data structures in the ALS code are currently calculated within the {{ALS.train}} method itself. I propose to move this code, along with its helper functions, into a separate class to encapsulate the creation of the blocks. This has the added benefit of allowing us to add comprehensive Scaladoc to this new class to explain in detail how this core part of the algorithm works.

Proposal:
{code}
private[recommendation] final case class RatingBlocks[ID](
  userIn: RDD[(Int, InBlock[ID])],
  userOut: RDD[(Int, OutBlock)],
  itemIn: RDD[(Int, InBlock[ID])],
  itemOut: RDD[(Int, OutBlock)]
)

private[recommendation] object RatingBlocks {
  def create[ID: ClassTag: Ordering](
      ratings: RDD[Rating[ID]],
      numUserBlocks: Int,
      numItemBlocks: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): RatingBlocks[ID] = {

    val userPart = new ALSPartitioner(numUserBlocks)
    val itemPart = new ALSPartitioner(numItemBlocks)

    val blockRatings =
      partitionRatings(ratings, userPart, itemPart)
        .persist(storageLevel)

    val (userInBlocks, userOutBlocks) =
      makeBlocks("user", blockRatings, userPart, itemPart, storageLevel)

    userOutBlocks.count()   // materialize `blockRatings` and user blocks

    val swappedBlockRatings = blockRatings.map {
      case ((userBlockId, itemBlockId), RatingBlock(userIds, itemIds, localRatings)) =>
        ((itemBlockId, userBlockId), RatingBlock(itemIds, userIds, localRatings))
    }

    val (itemInBlocks, itemOutBlocks) =
      makeBlocks("item", swappedBlockRatings, itemPart, userPart, storageLevel)

    itemOutBlocks.count()   // materialize item blocks
    blockRatings.unpersist()

    new RatingBlocks(userInBlocks, userOutBlocks, itemInBlocks, itemOutBlocks)
  }

  private[this] def partitionRatings[ID: ClassTag](...) = {
    // existing code goes here verbatim
  }

  private[this] def makeBlocks[ID: ClassTag](...) = {
    // existing code goes here verbatim
  }
}
{code}

  was:
The in-block and out-block data structures in the ALS code are currently calculated within the {{ALS.train}} method itself. I propose to move this code, along with its helper functions, into a separate class to encapsulate the creation of the blocks. This has the added benefit of allowing us to add comprehensive Scaladoc to this new class to explain in detail how this core part of the algorithm works.

Proposal:
{code}
private[recommendation] final case class RatingBlocks[ID](
  userIn: RDD[(Int, InBlock[ID])],
  userOut: RDD[(Int, OutBlock)],
  itemIn: RDD[(Int, InBlock[ID])],
  itemOut: RDD[(Int, OutBlock)]
)

private[recommendation] object RatingBlocks {
  def create[ID: ClassTag: Ordering](
      ratings: RDD[Rating[ID]],
      numUserBlocks: Int,
      numItemBlocks: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): RatingBlocks[ID] = {
    // In-block and out-block code currently in `ALS.train` goes here
  }

  private[this] def partitionRatings[ID: ClassTag](...) = { ... }

  private[this] def makeBlocks[ID: ClassTag](...) = { ... }
}
{code}

> Encapsulate ALS in-block and out-block data structures and methods into a separate class
>
>                 Key: SPARK-20486
>                 URL: https://issues.apache.org/jira/browse/SPARK-20486
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Daniel Li
>            Priority: Trivial
>
> The in-block and out-block data structures in the ALS code are currently calculated within the {{ALS.train}} method itself. I propose to move this code, along with its helper functions, into a separate class to encapsulate the creation of the blocks. This has the added benefit of allowing us to add comprehensive Scaladoc to this new class to explain in detail how this core part of the algorithm works.
> Proposal:
> {code}
> private[recommendation] final case class RatingBlocks[ID](
>   userIn: RDD[(Int, InBlock[ID])],
>   userOut: RDD[(Int, OutBlock)],
>   itemIn: RDD[(Int, InBlock[ID])],
>   itemOut: RDD[(Int, OutBlock)]
> )
> private[recommendation] object RatingBlocks {
>   def create[ID: ClassTag: Ordering](
>       ratings: RDD[Rating[ID]],
>       numUserBlocks: Int,
>       numItemBlocks: Int,
>       storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): RatingBlocks[ID] = {
>     val userPart = new ALSPartitioner(numUserBlocks)
>     val itemPart = new ALSPartitioner(numItemBlocks)
>     val blockRatings =
>       partitionRatings(ratings, userPart, itemPart)
>         .persist(storageLevel)
>     val
[jira] [Created] (SPARK-20486) Encapsulate ALS in-block and out-block data structures and methods into a separate class
Daniel Li created SPARK-20486:
-
             Summary: Encapsulate ALS in-block and out-block data structures and methods into a separate class
                 Key: SPARK-20486
                 URL: https://issues.apache.org/jira/browse/SPARK-20486
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 2.1.0
            Reporter: Daniel Li
            Priority: Trivial

The in-block and out-block data structures in the ALS code are currently calculated within the {{ALS.train}} method itself. I propose to move this code, along with its helper functions, into a separate class to encapsulate the creation of the blocks. This has the added benefit of allowing us to add comprehensive Scaladoc to this new class to explain in detail how this core part of the algorithm works.

Proposal:
{code}
private[recommendation] final case class RatingBlocks[ID](
  userIn: RDD[(Int, InBlock[ID])],
  userOut: RDD[(Int, OutBlock)],
  itemIn: RDD[(Int, InBlock[ID])],
  itemOut: RDD[(Int, OutBlock)]
)

private[recommendation] object RatingBlocks {
  def create[ID: ClassTag: Ordering](
      ratings: RDD[Rating[ID]],
      numUserBlocks: Int,
      numItemBlocks: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): RatingBlocks[ID] = {
    // In-block and out-block code currently in `ALS.train` goes here
  }

  private[this] def partitionRatings[ID: ClassTag](...) = { ... }

  private[this] def makeBlocks[ID: ClassTag](...) = { ... }
}
{code}
[jira] [Created] (SPARK-20485) Split ALS.scala into multiple files
Daniel Li created SPARK-20485: - Summary: Split ALS.scala into multiple files Key: SPARK-20485 URL: https://issues.apache.org/jira/browse/SPARK-20485 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Daniel Li Priority: Trivial The file {{/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala}} currently contains four top-level classes that span 1,500+ lines. To make the file easier to navigate, and to bring it in line with the [Scala style guide|http://docs.scala-lang.org/style/files.html], I propose we split it into four files holding their respective classes: {{ALS.scala}}, {{ALSParams.scala}}, {{ALSModel.scala}}, and {{ALSModelParams.scala}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20484) Add documentation to ALS code
Daniel Li created SPARK-20484: - Summary: Add documentation to ALS code Key: SPARK-20484 URL: https://issues.apache.org/jira/browse/SPARK-20484 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Daniel Li Priority: Trivial The documentation (both Scaladocs and inline comments) for the ALS code (in package {{org.apache.spark.ml.recommendation}}) can be clarified where needed and expanded where incomplete. This is especially important for parts of the code that are written imperatively for performance, as these parts don't benefit from the intuitive self-documentation of Scala's higher-level language abstractions. Specifically, I'd like to add documentation fully explaining the key functionality of the in-block and out-block objects, their purpose, how they relate to the overall ALS algorithm, and how they are calculated in such a way that new maintainers can ramp up much more quickly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20475) Whether use "broadcast join" depends on hive configuration
[ https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985838#comment-15985838 ]

Zhenhua Wang commented on SPARK-20475:
--

Spark also collects the table size whether or not the "analyze" command has been run. Did you check the threshold in Hive for map-side join? Is it the same as in Spark? Besides, if you want to use a broadcast join, Spark has a broadcast hint.

> Whether use "broadcast join" depends on hive configuration
>
>                 Key: SPARK-20475
>                 URL: https://issues.apache.org/jira/browse/SPARK-20475
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Lijia Liu
>
> Currently, broadcast join in Spark only works when:
>   1. The value of "spark.sql.autoBroadcastJoinThreshold" is greater than 0 (default is 10485760).
>   2. The size of one of the Hive tables is less than "spark.sql.autoBroadcastJoinThreshold". To get the size information of the Hive table from the Hive metastore, "hive.stats.autogather" should be set to true in Hive, or the command "ANALYZE TABLE COMPUTE STATISTICS noscan" must have been run.
> Hive, by contrast, calculates the size of the file or directory corresponding to the Hive table to decide whether to use a map-side join, and does not depend on the Hive metastore.
> This leads to two problems:
>   1. Spark will not use "broadcast join" when the Hive parameter "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE COMPUTE STATISTICS noscan" has not been run, because the size information of the Hive table has not been saved in the Hive metastore. Spark's behavior therefore depends on the configuration of Hive.
>   2. For some reason, we set "hive.stats.autogather" to false in our Hive. For the same SQL, Hive is 4 times faster than Spark because Hive used "map side join" but Spark did not use "broadcast join".
> Is it possible to use the same mechanism as Hive's to look up the size of a Hive table in Spark?
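Editorial note: the decision rule the reporter describes can be written down as a small stand-alone function. This is a sketch of the behavior as stated in the ticket, not Spark's planner code, and the comment above disputes part of the premise (it says Spark collects the table size even without "analyze"). {{BroadcastDecisionSketch}} and {{wouldBroadcast}} are hypothetical names. The ticket says "less than" without stating whether the boundary is inclusive; the sketch treats it as inclusive, which is an assumption.

```scala
// Model of the rule described in SPARK-20475: a table is broadcast only when
// the threshold is positive and a known table size fits under it.
// Assumption: illustrative only, not Spark's implementation.
object BroadcastDecisionSketch {

  // Default value of spark.sql.autoBroadcastJoinThreshold quoted in the ticket.
  val DefaultThreshold: Long = 10485760L

  // sizeInBytes is None when no size statistics are available for the table.
  def wouldBroadcast(sizeInBytes: Option[Long], threshold: Long = DefaultThreshold): Boolean =
    threshold > 0 && sizeInBytes.exists(size => size >= 0 && size <= threshold)

  def main(args: Array[String]): Unit = {
    println(wouldBroadcast(Some(1024L)))                 // small table with known size
    println(wouldBroadcast(None))                        // statistics missing
    println(wouldBroadcast(Some(1024L), threshold = -1)) // auto-broadcast disabled
  }
}
```

Under this model a small table without statistics is never broadcast, which is the situation the reporter runs into when "hive.stats.autogather" is false.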
[jira] [Updated] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests
[ https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Grover updated SPARK-20435:

Thanks Marcelo!

> More thorough redaction of sensitive information from logs/UI, more unit tests
>
>                 Key: SPARK-20435
>                 URL: https://issues.apache.org/jira/browse/SPARK-20435
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Mark Grover
>            Assignee: Mark Grover
>             Fix For: 2.2.0
>
> SPARK-18535 and SPARK-19720 added redaction of sensitive information (e.g. Hadoop credential provider password, AWS access/secret keys) from event logs + YARN logs + UI and from the console output, respectively.
> Some unit tests were added along with these changes, but they only asserted that, when a sensitive key was found, redaction took place for that key. They didn't assert globally that, when running a full-fledged Spark app (whether on YARN or locally), sensitive information was not present in any of the logs or the UI. Such a test would also prevent regressions in the future if someone unknowingly adds extra logging that writes sensitive information to disk or the UI.
> Consequently, it was found that in some Java configurations, sensitive information was still being leaked in the event logs under the {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the redaction logic previously checked only whether the key matched the secret regex pattern and, if so, redacted its value. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell much, so the value needs to be searched. So the check needs to be expanded to match against values as well.
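Editorial note: the change described in the last paragraph above (matching the pattern against values as well as keys) can be sketched as follows. This is not the code from the actual fix; {{RedactionSketch}}, {{SecretPattern}}, {{Replacement}}, and {{redact}} are hypothetical names, and the regex and replacement text are assumptions chosen for the example.

```scala
import scala.util.matching.Regex

// Sketch of key-or-value redaction as described in SPARK-20435.
// Assumption: illustrative only; pattern and replacement text are made up.
object RedactionSketch {

  val Replacement: String = "*********(redacted)"

  // Hypothetical pattern for "sensitive" names, case-insensitive.
  val SecretPattern: Regex = "(?i)secret|password".r

  // Redact the value when either the key or the value matches the pattern.
  def redact(pattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (key, value) =>
      val sensitive =
        pattern.findFirstIn(key).isDefined || pattern.findFirstIn(value).isDefined
      if (sensitive) (key, Replacement) else (key, value)
    }

  def main(args: Array[String]): Unit = {
    val env = Seq(
      "spark.app.name" -> "demo",
      "sun.java.command" ->
        "org.apache.spark.deploy.SparkSubmit --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password"
    )
    redact(SecretPattern, env).foreach { case (k, v) => println(s"$k=$v") }
  }
}
```

A key-only check would leave the {{sun.java.command}} entry untouched, because the key itself contains nothing that matches; checking the value is what catches it in this sketch.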
Re: [jira] [Resolved] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests
Thanks Marcelo! On Apr 26, 2017 5:07 PM, "Marcelo Vanzin (JIRA)"wrote: > > [ https://issues.apache.org/jira/browse/SPARK-20435?page= > com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] > > Marcelo Vanzin resolved SPARK-20435. > >Resolution: Fixed > Assignee: Mark Grover > Fix Version/s: 2.2.0 > > > More thorough redaction of sensitive information from logs/UI, more unit > tests > > > -- > > > > Key: SPARK-20435 > > URL: https://issues.apache.org/jira/browse/SPARK-20435 > > Project: Spark > > Issue Type: Bug > > Components: Spark Core > >Affects Versions: 2.2.0 > >Reporter: Mark Grover > >Assignee: Mark Grover > > Fix For: 2.2.0 > > > > > > SPARK-18535 and SPARK-19720 were works to redact sensitive information > (e.g. hadoop credential provider password, AWS access/secret keys) from > event logs + YARN logs + UI and from the console output, respectively. > > While some unit tests were added along with these changes - they > asserted when a sensitive key was found, that redaction took place for that > key. They didn't assert globally that when running a full-fledged Spark app > (whether or YARN or locally), that sensitive information was not present in > any of the logs or UI. Such a test would also prevent regressions from > happening in the future if someone unknowingly adds extra logging that > publishes out sensitive information to disk or UI. > > Consequently, it was found that in some Java configurations, sensitive > information was still being leaked in the event logs under the {{ > SparkListenerEnvironmentUpdate}} event, like so: > > {code} > > "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf > spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ... > > {code} > > "secret_password" should have been redacted. > > Moreover, previously redaction logic was only checking if the key > matched the secret regex pattern, it'd redact it's value. That worked for > most cases. 
However, in the above case, the key (sun.java.command) doesn't > tell much, so the value needs to be searched. So the check needs to be > expanded to match against values as well. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.15#6346) > > - > To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org > For additional commands, e-mail: issues-h...@spark.apache.org > >
[jira] [Resolved] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests
[ https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-20435.
       Resolution: Fixed
         Assignee: Mark Grover
    Fix Version/s: 2.2.0

> More thorough redaction of sensitive information from logs/UI, more unit tests
>
>                 Key: SPARK-20435
>                 URL: https://issues.apache.org/jira/browse/SPARK-20435
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Mark Grover
>            Assignee: Mark Grover
>             Fix For: 2.2.0
>
> SPARK-18535 and SPARK-19720 added redaction of sensitive information (e.g. Hadoop credential provider password, AWS access/secret keys) from event logs + YARN logs + UI and from the console output, respectively.
> Some unit tests were added along with these changes, but they only asserted that, when a sensitive key was found, redaction took place for that key. They didn't assert globally that, when running a full-fledged Spark app (whether on YARN or locally), sensitive information was not present in any of the logs or the UI. Such a test would also prevent regressions in the future if someone unknowingly adds extra logging that writes sensitive information to disk or the UI.
> Consequently, it was found that in some Java configurations, sensitive information was still being leaked in the event logs under the {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the redaction logic previously checked only whether the key matched the secret regex pattern and, if so, redacted its value. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell much, so the value needs to be searched. So the check needs to be expanded to match against values as well.
[jira] [Commented] (SPARK-20482) Resolving Casts is too strict on having time zone set
[ https://issues.apache.org/jira/browse/SPARK-20482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985720#comment-15985720 ] Apache Spark commented on SPARK-20482: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/1 > Resolving Casts is too strict on having time zone set > - > > Key: SPARK-20482 > URL: https://issues.apache.org/jira/browse/SPARK-20482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kris Mok >Assignee: Kris Mok > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
[ https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-20483: --- Description: if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos offers will not get tasks launched because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores}} will always evaluate to false. However, in {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to determine if we should decline the offer "for a configurable amount of time to avoid starving other frameworks", and this will always evaluate to false in the above scenario. This leaves the framework in a state of limbo where it will never launch any new executors, but only decline offers for the Mesos default of 5 seconds, thus starving other frameworks of offers. (was: if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos offers will not get tasks launched because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores}} will always evaluate to false. However, in {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to determine if we should decline the offer "for a configurable amount of time to avoid starving other frameworks", and this will always evaluate to false in the above scenario. This leaves the framework in a state of limbo where it will never launch any new executors, but only decline offers for the Mesos default of 5 seconds, thus starving other frameworks of offers. 
Relates to: SPARK-12554, SPARK-19702) > Mesos Coarse mode may starve other Mesos frameworks if max cores is not a > multiple of executor cores > > > Key: SPARK-20483 > URL: https://issues.apache.org/jira/browse/SPARK-20483 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Davis Shepherd >Priority: Minor > > if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 > executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos > offers will not get tasks launched because > {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= > maxCores}} will always evaluate to false. However, in > {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to > determine if we should decline the offer "for a configurable amount of time > to avoid starving other frameworks", and this will always evaluate to false > in the above scenario. This leaves the framework in a state of limbo where it > will never launch any new executors, but only decline offers for the Mesos > default of 5 seconds, thus starving other frameworks of offers. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
[ https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-20483: --- Description: if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos offers will not get tasks launched because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores}} will always evaluate to false. However, in {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to determine if we should decline the offer "for a configurable amount of time to avoid starving other frameworks", and this will always evaluate to false in the above scenario. This leaves the framework in a state of limbo where it will never launch any new executors, but only decline offers for the Mesos default of 5 seconds, thus starving other frameworks of offers. Relates to: SPARK-12554, SPARK-19702 was: if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos offers will not get tasks launched because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores}} will always evaluate to false. However, in {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to determine if we should decline the offer "for a configurable amount of time to avoid starving other frameworks", and this will always evaluate to false in the above scenario. This leaves the framework in a state of limbo where it will never launch any new executors, but only decline offers for the Mesos default of 5 seconds, thus starving other frameworks of offers. 
Relates to: SPARK-12554 > Mesos Coarse mode may starve other Mesos frameworks if max cores is not a > multiple of executor cores > > > Key: SPARK-20483 > URL: https://issues.apache.org/jira/browse/SPARK-20483 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Davis Shepherd >Priority: Minor > > if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 > executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos > offers will not get tasks launched because > {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= > maxCores}} will always evaluate to false. However, in > {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to > determine if we should decline the offer "for a configurable amount of time > to avoid starving other frameworks", and this will always evaluate to false > in the above scenario. This leaves the framework in a state of limbo where it > will never launch any new executors, but only decline offers for the Mesos > default of 5 seconds, thus starving other frameworks of offers. > Relates to: SPARK-12554, SPARK-19702
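The arithmetic in the SPARK-20483 description can be replayed with a minimal sketch in plain Scala. This is only a simplified model of the two checks quoted in the report, not the actual Mesos scheduler backend code; the object name `MesosOfferSketch` and its method names are hypothetical and chosen for illustration.

```scala
// Hypothetical, simplified model of the two checks quoted in SPARK-20483.
// It does not use Spark or Mesos; all names here are illustrative only.
object MesosOfferSketch {
  // Launch check: another executor fits only if its cores stay within maxCores.
  def canLaunchExecutor(totalCoresAcquired: Int, executorCores: Int, maxCores: Int): Boolean =
    executorCores + totalCoresAcquired <= maxCores

  // Long-decline check: true only once the acquired cores have reached maxCores.
  def declinesForLongPeriod(totalCoresAcquired: Int, maxCores: Int): Boolean =
    totalCoresAcquired >= maxCores

  // Number of executors launched before the launch check starts failing.
  def executorsLaunched(executorCores: Int, maxCores: Int): Int = {
    require(executorCores > 0, "executorCores must be positive")
    Iterator.iterate(0)(_ + executorCores)
      .takeWhile(acquired => canLaunchExecutor(acquired, executorCores, maxCores))
      .size
  }

  def main(args: Array[String]): Unit = {
    val maxCores = 10      // spark.cores.max in the report
    val executorCores = 4  // spark.executor.cores in the report
    val launched = executorsLaunched(executorCores, maxCores)
    val acquired = launched * executorCores
    println(s"executors launched: $launched, totalCoresAcquired: $acquired")
    println(s"can launch another executor: ${canLaunchExecutor(acquired, executorCores, maxCores)}")
    println(s"declines for the long period: ${declinesForLongPeriod(acquired, maxCores)}")
  }
}
```

With the values from the report, both checks are false once 8 cores are acquired, which is the limbo state described: no new executor can be launched, yet the longer decline period is never used. When max cores is a multiple of executor cores (for example 5 and 10), the second check becomes true once the limit is reached.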
[jira] [Commented] (SPARK-20481) Wrong mapping for BooleanType in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-20481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985716#comment-15985716 ] Xiao Li commented on SPARK-20481: - We are unable to change this default data type mapping because doing so could cause external behavior changes. In Spark 2.2, you can change it manually by using the option `createTableColumnTypes`. For details, see the following PR: https://github.com/apache/spark/pull/16209 > Wrong mapping for BooleanType in Spark SQL > -- > > Key: SPARK-20481 > URL: https://issues.apache.org/jira/browse/SPARK-20481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Talat UYARER >Priority: Minor > > BooleanType is mapped to the BIT type on the SQL side. > https://github.com/apache/spark/blob/a26e3ed5e414d0a350cfe65dd511b154868b9f1d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L159 > Is there any specific reason it should be mapping to Boolean?
[jira] [Commented] (SPARK-20481) Wrong mapping for BooleanType in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-20481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985695#comment-15985695 ] Talat UYARER commented on SPARK-20481: -- I am using Redshift. > Wrong mapping for BooleanType in Spark SQL > -- > > Key: SPARK-20481 > URL: https://issues.apache.org/jira/browse/SPARK-20481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Talat UYARER >Priority: Minor > > BooleanType is mapped to the BIT type on the SQL side. > https://github.com/apache/spark/blob/a26e3ed5e414d0a350cfe65dd511b154868b9f1d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L159 > Is there any specific reason it should be mapping to Boolean?
[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
[ https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-20483: --- Description: if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos offers will not get tasks launched because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores}} will always evaluate to false. However, in {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to determine if we should decline the offer "for a configurable amount of time to avoid starving other frameworks", and this will always evaluate to false in the above scenario. This leaves the framework in a state of limbo where it will never launch any new executors, but only decline offers for the Mesos default of 5 seconds, thus starving other frameworks of offers. Relates to: SPARK-12554 was: if `spark.cores.max = 10` for example and `spark.executor.cores = 4`, 2 executors will get lauched thus `totalCoresAcquired = 8`. All future Mesos offers will not get tasks launched because `sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores` will always evaluate to false. However, in `handleMatchedOffers` we check if `totalCoresAcquired >= maxCores` to determine if we should decline the offer "for a configurable amount of time to avoid starving other frameworks", and this will always evaluate to false in the above scenario. This leaves the framework in a state of limbo where it will never launch any new executors, but only decline offers for the Mesos default of 5 seconds, thus starving other frameworks of offers. 
Relates to: SPARK-12554 > Mesos Coarse mode may starve other Mesos frameworks if max cores is not a > multiple of executor cores > > > Key: SPARK-20483 > URL: https://issues.apache.org/jira/browse/SPARK-20483 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Davis Shepherd >Priority: Minor > > if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 > executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos > offers will not get tasks launched because > {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= > maxCores}} will always evaluate to false. However, in > {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to > determine if we should decline the offer "for a configurable amount of time > to avoid starving other frameworks", and this will always evaluate to false > in the above scenario. This leaves the framework in a state of limbo where it > will never launch any new executors, but only decline offers for the Mesos > default of 5 seconds, thus starving other frameworks of offers. > Relates to: SPARK-12554
[jira] [Created] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
Davis Shepherd created SPARK-20483: -- Summary: Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores Key: SPARK-20483 URL: https://issues.apache.org/jira/browse/SPARK-20483 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.1.0 Reporter: Davis Shepherd Priority: Minor If `spark.cores.max = 10` for example and `spark.executor.cores = 4`, 2 executors will get launched, thus `totalCoresAcquired = 8`. All future Mesos offers will not get tasks launched because `sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores` will always evaluate to false. However, in `handleMatchedOffers` we check if `totalCoresAcquired >= maxCores` to determine if we should decline the offer "for a configurable amount of time to avoid starving other frameworks", and this will always evaluate to false in the above scenario. This leaves the framework in a state of limbo where it will never launch any new executors, but only decline offers for the Mesos default of 5 seconds, thus starving other frameworks of offers. Relates to: SPARK-12554
[jira] [Commented] (SPARK-20481) Wrong mapping for BooleanType in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-20481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985690#comment-15985690 ] Xiao Li commented on SPARK-20481: - Which DB are you using? > Wrong mapping for BooleanType in Spark SQL > -- > > Key: SPARK-20481 > URL: https://issues.apache.org/jira/browse/SPARK-20481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Talat UYARER >Priority: Minor > > BooleanType is mapped to the BIT type on the SQL side. > https://github.com/apache/spark/blob/a26e3ed5e414d0a350cfe65dd511b154868b9f1d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L159 > Is there any specific reason it should be mapping to Boolean?
[jira] [Assigned] (SPARK-20482) Resolving Casts is too strict on having time zone set
[ https://issues.apache.org/jira/browse/SPARK-20482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-20482: - Assignee: Kris Mok > Resolving Casts is too strict on having time zone set > - > > Key: SPARK-20482 > URL: https://issues.apache.org/jira/browse/SPARK-20482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kris Mok >Assignee: Kris Mok
[jira] [Created] (SPARK-20482) Resolving Casts is too strict on having time zone set
Kris Mok created SPARK-20482: Summary: Resolving Casts is too strict on having time zone set Key: SPARK-20482 URL: https://issues.apache.org/jira/browse/SPARK-20482 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kris Mok
[jira] [Updated] (SPARK-20454) Improvement of ShortestPaths in Spark GraphX
[ https://issues.apache.org/jira/browse/SPARK-20454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Dai updated SPARK-20454: --- Labels: (was: patch) > Improvement of ShortestPaths in Spark GraphX > > > Key: SPARK-20454 > URL: https://issues.apache.org/jira/browse/SPARK-20454 > Project: Spark > Issue Type: Improvement > Components: GraphX, MLlib >Affects Versions: 2.1.0 >Reporter: Ji Dai > > The output of ShortestPaths is not enough. ShortestPaths in Graph/lib is > currently a simple version that can only return the distance to the source > vertex. However, the shortest path with the intermediate nodes on the path is > needed, and if two or more paths hold the same shortest distance from source > to destination, all of these paths need to be returned. In this way, > ShortestPaths will be more functional and useful. > I think I have resolved the concern above with an improved version of > ShortestPaths, which is also based on the "pregel" function in GraphOps. > Can I get my code reviewed and merged?
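To make the request in SPARK-20454 concrete, here is a minimal sketch of what "return every shortest path, including intermediate nodes" could mean on a small graph. It is plain Scala using breadth-first search over an adjacency map, not GraphX or Pregel, and it is not the reporter's patch. It assumes distances are hop counts on a directed graph; `AllShortestPathsSketch` and its method name are hypothetical.

```scala
import scala.collection.mutable

// Hypothetical illustration only: plain-Scala BFS, no GraphX involved.
object AllShortestPathsSketch {
  type VertexId = Long

  // Returns every shortest (fewest-hop) path from source to target,
  // each as the full list of vertices visited, or Nil if unreachable.
  def allShortestPaths(
      edges: Map[VertexId, Seq[VertexId]],
      source: VertexId,
      target: VertexId): List[List[VertexId]] = {
    val dist = mutable.Map[VertexId, Int](source -> 0)
    val preds = mutable.Map[VertexId, List[VertexId]]()
    val queue = mutable.Queue[VertexId](source)

    while (queue.nonEmpty) {
      val u = queue.dequeue()
      // distinct guards against parallel edges producing duplicate paths
      for (v <- edges.getOrElse(u, Seq.empty).distinct) {
        if (!dist.contains(v)) {
          // First time v is reached: record its distance and predecessor.
          dist(v) = dist(u) + 1
          preds(v) = List(u)
          queue.enqueue(v)
        } else if (dist(v) == dist(u) + 1) {
          // Another predecessor at the same shortest distance: keep the tie.
          preds(v) = u :: preds(v)
        }
      }
    }

    // Rebuild paths by walking the predecessor lists back to the source.
    def pathsTo(v: VertexId): List[List[VertexId]] =
      if (v == source) List(List(source))
      else preds.getOrElse(v, Nil).flatMap(p => pathsTo(p).map(_ :+ v))

    pathsTo(target)
  }

  def main(args: Array[String]): Unit = {
    // Two equally short routes from 1 to 5: via 2 and via 3.
    val graph = Map(1L -> Seq(2L, 3L), 2L -> Seq(4L), 3L -> Seq(4L), 4L -> Seq(5L))
    allShortestPaths(graph, 1L, 5L).foreach(p => println(p.mkString(" -> ")))
  }
}
```

The design choice is to store all predecessors that lie on some shortest path rather than a single parent, which is what allows tied paths to be enumerated afterwards.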
[jira] [Updated] (SPARK-20454) Improvement of ShortestPaths in Spark GraphX
[ https://issues.apache.org/jira/browse/SPARK-20454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Dai updated SPARK-20454: --- Flags: Important (was: Patch,Important) > Improvement of ShortestPaths in Spark GraphX > > > Key: SPARK-20454 > URL: https://issues.apache.org/jira/browse/SPARK-20454 > Project: Spark > Issue Type: Improvement > Components: GraphX, MLlib >Affects Versions: 2.1.0 >Reporter: Ji Dai > > The output of ShortestPaths is not enough. ShortestPaths in Graph/lib is > currently a simple version that can only return the distance to the source > vertex. However, the shortest path with the intermediate nodes on the path is > needed, and if two or more paths hold the same shortest distance from source > to destination, all of these paths need to be returned. In this way, > ShortestPaths will be more functional and useful. > I think I have resolved the concern above with an improved version of > ShortestPaths, which is also based on the "pregel" function in GraphOps. > Can I get my code reviewed and merged?
[jira] [Updated] (SPARK-20454) Improvement of ShortestPaths in Spark GraphX
[ https://issues.apache.org/jira/browse/SPARK-20454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Dai updated SPARK-20454: --- Target Version/s: (was: 2.1.0) > Improvement of ShortestPaths in Spark GraphX > > > Key: SPARK-20454 > URL: https://issues.apache.org/jira/browse/SPARK-20454 > Project: Spark > Issue Type: Improvement > Components: GraphX, MLlib >Affects Versions: 2.1.0 >Reporter: Ji Dai > Labels: patch > > The output of ShortestPaths is not enough. ShortestPaths in Graph/lib is > currently a simple version that can only return the distance to the source > vertex. However, the shortest path with the intermediate nodes on the path is > needed, and if two or more paths hold the same shortest distance from source > to destination, all of these paths need to be returned. In this way, > ShortestPaths will be more functional and useful. > I think I have resolved the concern above with an improved version of > ShortestPaths, which is also based on the "pregel" function in GraphOps. > Can I get my code reviewed and merged?
[jira] [Closed] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves closed SPARK-20480. - Resolution: Duplicate > FileFormatWriter hides FetchFailedException from scheduler > -- > > Key: SPARK-20480 > URL: https://issues.apache.org/jira/browse/SPARK-20480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a large job that was getting failures and noticed they were > listed as "SparkException: Task failed while writing rows", but when I looked > further they were really caused by FetchFailure exceptions. This is a > problem because the scheduler handles fetch failures differently than normal > exceptions. This can affect things like blacklisting. > {noformat} > 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID > 102902) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect > to host1.com:7337 > at >
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > ... 8 more > Caused by: java.io.IOException: Failed to connect to host1.com:7337 > at >
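The stack trace above shows the fetch failure only as a nested "Caused by" under the write exception. The following sketch, in plain Scala, illustrates why code that inspects only the top-level exception type would miss it, and how walking the cause chain exposes it. It is not Spark's scheduler logic and not the SPARK-19276 fix; `FetchFailedLike` and `WriteFailedLike` are hypothetical stand-ins for the exception types named in the report.

```scala
import scala.annotation.tailrec

// Hypothetical illustration only: stand-in exception types, no Spark classes.
object CauseChainSketch {
  class FetchFailedLike(msg: String) extends RuntimeException(msg)
  class WriteFailedLike(msg: String, cause: Throwable) extends RuntimeException(msg, cause)

  // Walk the getCause chain and return the first fetch failure, if any.
  // Assumes the cause chain is finite (no cyclic causes).
  @tailrec
  def findFetchFailure(t: Throwable): Option[FetchFailedLike] = t match {
    case null               => None
    case f: FetchFailedLike => Some(f)
    case other              => findFetchFailure(other.getCause)
  }

  def main(args: Array[String]): Unit = {
    val fetch = new FetchFailedLike("Failed to connect to host1.com:7337")
    val wrapped = new WriteFailedLike("Task failed while writing rows", fetch)

    // Looking only at the outermost type hides the fetch failure.
    println(s"top-level is a fetch failure: ${wrapped.isInstanceOf[FetchFailedLike]}")
    // Walking the chain recovers it.
    println(s"fetch failure found in chain: ${findFetchFailure(wrapped).map(_.getMessage)}")
  }
}
```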
[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985590#comment-15985590 ] Thomas Graves commented on SPARK-20480: --- ah it looks like it should, hadn't seen that jira, I'll give that fix a try. > FileFormatWriter hides FetchFailedException from scheduler > -- > > Key: SPARK-20480 > URL: https://issues.apache.org/jira/browse/SPARK-20480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a large job where it was getting faiures, noticed they were > listed as "SparkException: Task failed while writing rows", but when I looked > further they were really caused by FetchFailure exceptions. This is a > problem because the scheduler handles Fetch Failures differently then normal > exception. This can affect things like blacklisting. > {noformat} > 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID > 102902) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect > to 
host1.com:7337 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > ... 8 more > Caused by: java.io.IOException: Failed to
[jira] [Created] (SPARK-20481) Wrong mapping for BooleanType in Spark SQL
Talat UYARER created SPARK-20481: Summary: Wrong mapping for BooleanType in Spark SQL Key: SPARK-20481 URL: https://issues.apache.org/jira/browse/SPARK-20481 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Talat UYARER Priority: Minor BooleanType is mapped to the BIT type on the SQL side. https://github.com/apache/spark/blob/a26e3ed5e414d0a350cfe65dd511b154868b9f1d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L159 Is there any specific reason it should be mapping to Boolean?
[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985575#comment-15985575 ] Mridul Muralidharan commented on SPARK-20480: - Shouldn't the fix for SPARK-19276 by [~imranr] handle this? > FileFormatWriter hides FetchFailedException from scheduler > -- > > Key: SPARK-20480 > URL: https://issues.apache.org/jira/browse/SPARK-20480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a large job that was getting failures and noticed they were > listed as "SparkException: Task failed while writing rows", but when I looked > further they were really caused by FetchFailure exceptions. This is a > problem because the scheduler handles fetch failures differently than normal > exceptions. This can affect things like blacklisting. > {noformat} > 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID > 102902) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect > to
host1.com:7337 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > ... 8 more > Caused by: java.io.IOException: Failed to connect to
[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985571#comment-15985571 ] Thomas Graves commented on SPARK-20480: --- Note that with blacklisting on, this caused the job to fail: Job aborted due to stage failure: Aborting TaskSet 4.0 because task 21 (partition 21) cannot run anywhere due to node and executor blacklist. Blacklisting behavior can be configured via spark.blacklist.*. > FileFormatWriter hides FetchFailedException from scheduler > -- > > Key: SPARK-20480 > URL: https://issues.apache.org/jira/browse/SPARK-20480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a large job that was getting failures and noticed they were > listed as "SparkException: Task failed while writing rows", but when I looked > further they were really caused by FetchFailure exceptions. This is a > problem because the scheduler handles fetch failures differently than normal > exceptions. This can affect things like blacklisting.
> {noformat} > 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID > 102902) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect > to host1.com:7337 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at >
[jira] [Commented] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985567#comment-15985567 ] Thomas Graves commented on SPARK-20480: --- The exception in the task set manager log looks like: 17/04/26 20:09:21 INFO TaskSetManager: Lost task 3516.0 in stage 4.0 (TID 103691) on gsbl521n33.blue.ygrid.yahoo.com, executor 4516: org.apache.spark.SparkException (Task failed while writing rows) [duplicate 22] > FileFormatWriter hides FetchFailedException from scheduler > -- > > Key: SPARK-20480 > URL: https://issues.apache.org/jira/browse/SPARK-20480 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: Thomas Graves > Priority: Critical > > I was running a large job that was getting failures. I noticed they were > listed as "SparkException: Task failed while writing rows", but when I looked > further they were really caused by FetchFailure exceptions. This is a > problem because the scheduler handles fetch failures differently than normal > exceptions. This can affect things like blacklisting.
[jira] [Updated] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
[ https://issues.apache.org/jira/browse/SPARK-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-20480:
--
Description:
I was running a large job that was getting failures. I noticed they were listed as "SparkException: Task failed while writing rows", but when I looked further they were really caused by FetchFailure exceptions. This is a problem because the scheduler handles fetch failures differently than normal exceptions. This can affect things like blacklisting.

{noformat}
17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 102902)
org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to host1.com:7337
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
    ... 8 more
Caused by: java.io.IOException: Failed to connect to host1.com:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
    at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
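The wrapping described in this report can be illustrated with a minimal sketch. The Scala below walks an exception's cause chain to find a fetch failure buried under a generic wrapper, as in the trace above. It assumes a stand-in FetchFailedException class (hypothetical, defined here so the snippet runs without Spark on the classpath) and a hypothetical helper named findFetchFailure. It is not Spark's scheduler or FileFormatWriter code, and not necessarily how this issue was addressed.

```scala
import scala.annotation.tailrec

// Hypothetical stand-in for org.apache.spark.shuffle.FetchFailedException,
// defined here only so the sketch runs without Spark on the classpath.
class FetchFailedException(message: String, cause: Throwable)
  extends Exception(message, cause)

object CauseChain {
  // Returns the first FetchFailedException found in the cause chain, if any.
  @tailrec
  def findFetchFailure(t: Throwable): Option[FetchFailedException] = t match {
    case null                    => None
    case f: FetchFailedException => Some(f)
    case other                   => findFetchFailure(other.getCause)
  }

  def main(args: Array[String]): Unit = {
    // Rebuild the shape of the reported trace: IOException, wrapped by a
    // fetch failure, wrapped again by a generic "writing rows" exception.
    val io      = new java.io.IOException("Failed to connect to host1.com:7337")
    val fetch   = new FetchFailedException("Failed to connect to host1.com:7337", io)
    val wrapped = new RuntimeException("Task failed while writing rows", fetch)

    println(findFetchFailure(wrapped).map(_.getMessage))
    println(findFetchFailure(io))
  }
}
```

Only the outermost exception type is visible to a caller that does not inspect the chain, which is why the failure shows up as a plain SparkException in the task set manager log.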
[jira] [Created] (SPARK-20480) FileFormatWriter hides FetchFailedException from scheduler
Thomas Graves created SPARK-20480: - Summary: FileFormatWriter hides FetchFailedException from scheduler Key: SPARK-20480 URL: https://issues.apache.org/jira/browse/SPARK-20480 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Thomas Graves Priority: Critical I was running a large job that was getting failures. I noticed they were listed as "SparkException: Task failed while writing rows", but when I looked further they were really caused by FetchFailure exceptions. This is a problem because the scheduler handles fetch failures differently than normal exceptions. This can affect things like blacklisting. {noformat} 17/04/26 20:08:59 ERROR Executor: Exception in task 2727.0 in stage 4.0 (TID 102902) org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to gsbl546n07.blue.ygrid.yahoo.com/10.213.43.94:7337 at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more Caused by: java.io.IOException: Failed to connect to gsbl546n07.blue.ygrid.yahoo.com/10.213.43.94:7337 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) at
[jira] [Resolved] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
[ https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-12868.
    Resolution: Fixed
    Assignee: Weiqing Yang
    Fix Version/s: 2.2.0

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Trystan Leftwich
> Assignee: Weiqing Yang
> Fix For: 2.2.0
>
>
> When trying to add a JAR with an HDFS URI, e.g.
> {code:sql}
> ADD JAR hdfs:///tmp/foo.jar
> {code}
> via the Spark SQL JDBC interface, it fails with:
> {code:sql}
> java.net.MalformedURLException: unknown protocol: hdfs
>     at java.net.URL.<init>(URL.java:593)
>     at java.net.URL.<init>(URL.java:483)
>     at java.net.URL.<init>(URL.java:432)
>     at java.net.URI.toURL(URI.java:1089)
>     at org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
>     at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
>     at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>     at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>     at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
>     at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
>     at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>     at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>     at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
>     at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
>     at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>     at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
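The top frames of the SPARK-12868 trace (java.net.URI.toURL calling the java.net.URL constructor) can be reproduced outside Spark. The sketch below is a minimal illustration, assuming a plain JVM where no URL stream handler for the hdfs scheme has been registered; the object name HdfsUrlDemo and the helper toUrlResult are hypothetical names chosen for this example, and the snippet says nothing about how the issue was fixed in Spark.

```scala
import java.net.{MalformedURLException, URI, URL}
import scala.util.{Failure, Success, Try}

object HdfsUrlDemo {
  // Converts a URI string to a java.net.URL, capturing any failure.
  // java.net.URL only accepts schemes that have a registered stream handler.
  def toUrlResult(uri: String): Try[URL] = Try(new URI(uri).toURL)

  def main(args: Array[String]): Unit = {
    for (uri <- Seq("file:///tmp/foo.jar", "hdfs:///tmp/foo.jar")) {
      toUrlResult(uri) match {
        case Success(url)                      => println(s"accepted: $url")
        case Failure(e: MalformedURLException) => println(s"rejected: ${e.getMessage}")
        case Failure(e)                        => println(s"other failure: $e")
      }
    }
  }
}
```

The URI itself parses without complaint; the conversion to a URL is the step that rejects the scheme, which matches the order of frames in the reported trace.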
[jira] [Resolved] (SPARK-20474) OnHeapColumnVector reallocation may not copy existing data
[ https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20474. - Resolution: Fixed Assignee: Michal Szafranski Fix Version/s: 2.2.0 > OnHeapColumnVector reallocation may not copy existing data > - > > Key: SPARK-20474 > URL: https://issues.apache.org/jira/browse/SPARK-20474 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.1.0 > Reporter: Michal Szafranski > Assignee: Michal Szafranski > Fix For: 2.2.0 > > > OnHeapColumnVector reallocation copies data to the new storage only up to > 'elementsAppended'. This variable is only updated when using the > ColumnVector.appendX API, while ColumnVector.putX is more commonly used, > so values written through putX may not be copied.
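The failure mode described in SPARK-20474 can be shown with a toy model. The class below is not Spark's OnHeapColumnVector; ToyIntVector, its methods, and the copyWholeBuffer switch are hypothetical names invented for this sketch. It only illustrates why sizing the copy by an append counter loses values written by position. Copying the whole old buffer is shown as one plausible remedy, with no claim that this is what the actual fix does.

```scala
// Toy model of a growable int column; not Spark code.
class ToyIntVector(initialCapacity: Int, copyWholeBuffer: Boolean) {
  private var data = new Array[Int](initialCapacity)
  private var elementsAppended = 0

  // Positional write: does NOT update elementsAppended.
  def putInt(rowId: Int, value: Int): Unit = {
    data(rowId) = value
  }

  // Append-style write: the only path that updates elementsAppended.
  def appendInt(value: Int): Unit = {
    if (elementsAppended == data.length) reserve(math.max(1, data.length * 2))
    data(elementsAppended) = value
    elementsAppended += 1
  }

  def getInt(rowId: Int): Int = data(rowId)

  // Grows the backing array. With copyWholeBuffer = false, only the first
  // elementsAppended values are copied, which models the reported behavior.
  def reserve(newCapacity: Int): Unit = {
    if (newCapacity > data.length) {
      val grown  = new Array[Int](newCapacity)
      val toCopy = if (copyWholeBuffer) data.length else elementsAppended
      System.arraycopy(data, 0, grown, 0, toCopy)
      data = grown
    }
  }
}

object ToyIntVectorDemo {
  def main(args: Array[String]): Unit = {
    for (copyAll <- Seq(false, true)) {
      val v = new ToyIntVector(2, copyWholeBuffer = copyAll)
      v.putInt(0, 7)
      v.putInt(1, 9)
      v.reserve(4) // triggers reallocation
      println(s"copyWholeBuffer=$copyAll -> ${v.getInt(0)}, ${v.getInt(1)}")
    }
  }
}
```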
[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures
[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985447#comment-15985447 ] Thomas Graves commented on SPARK-20178: --- Another thing we should tie in here is handling preempted containers better. This matches my point above, "Improve logic around deciding which node is actually bad when you get a fetch failure", but it is a bit of a special case. If a container gets preempted on the YARN side, we need to properly detect that and not count it as a normal fetch failure. Right now that seems pretty difficult with the way we handle stage failures, but I guess you would just identify that case and not count it as a normal stage failure. > Improve Scheduler fetch failures > > > Key: SPARK-20178 > URL: https://issues.apache.org/jira/browse/SPARK-20178 > Project: Spark > Issue Type: Epic > Components: Scheduler > Affects Versions: 2.1.0 > Reporter: Thomas Graves > > We have been having a lot of discussions around improving the handling of > fetch failures. There are currently 4 JIRAs related to this. > We should try to get a list of things we want to improve and come up with one > cohesive design. > SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753 > I will put my initial thoughts in a follow-on comment.
[jira] [Updated] (SPARK-20479) Performance degradation for large number of hash-aggregated columns
[ https://issues.apache.org/jira/browse/SPARK-20479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-20479:
    Affects Version/s: (was: 2.3.0) 2.2.0

> Performance degradation for large number of hash-aggregated columns
> ---
>
> Key: SPARK-20479
> URL: https://issues.apache.org/jira/browse/SPARK-20479
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Kazuaki Ishizaki
>
> In a comment on SPARK-20184, [~maropu] showed that performance degrades
> when the number of aggregated columns gets large with whole-stage codegen.
> {code}
> ./bin/spark-shell --master local[1] --conf spark.driver.memory=2g --conf spark.sql.shuffle.partitions=1 -v
>
> // Runs f several times and prints per-run and average wall-clock time.
> def timer[R](f: => {}): Unit = {
>   val count = 9
>   val iters = (0 until count).map { i =>
>     val t0 = System.nanoTime()
>     f
>     val t1 = System.nanoTime()
>     val elapsed = t1 - t0 + 0.0
>     // System.nanoTime() is in nanoseconds; divide by 1e9 to print seconds
>     println(s"#$i: ${elapsed / 1e9}")
>     elapsed
>   }
>   println("Elapsed time: " + ((iters.sum / count) / 1e9) + "s")
> }
>
> val numCols = 80
> val t = s"(SELECT id AS key1, id AS key2, ${((0 until numCols).map(i => s"id AS c$i")).mkString(", ")} FROM range(0, 10, 1, 1))"
> val sqlStr = s"SELECT key1, key2, ${((0 until numCols).map(i => s"SUM(c$i)")).mkString(", ")} FROM $t GROUP BY key1, key2 LIMIT 100"
>
> // Elapsed time: 2.308440490553s
> sql("SET spark.sql.codegen.wholeStage=true")
> timer { sql(sqlStr).collect }
>
> // Elapsed time: 0.527486733s
> sql("SET spark.sql.codegen.wholeStage=false")
> timer { sql(sqlStr).collect }
> {code}
[jira] [Assigned] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-20476:
---
    Assignee: Xiao Li

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: cen yuhai
> Assignee: Xiao Li
>
> I hit this problem when I run a "create table ... as select
> get_json_object(...) from xxx" statement.
> This statement fails:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> This works:
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> This also works:
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> The failing statement logs:
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
>     at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
>     at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
>     at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
>     at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
>     at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
>     at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
>     at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
>     at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
>     at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
>     at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
>     at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
>     at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
>     at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>     at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
>     at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
>     at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
>     at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
>     at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
>     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>     at
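The message "columns has 2 elements while columns.types has 1 elements" in SPARK-20476 would be consistent with a single column name that contains a comma being stored in a comma-separated property. The ticket text above does not state the root cause, so the sketch below is only one possible reading. The column name used for the unaliased expression is a guess, the property format is an assumption, and ColumnsProperty with its join and split helpers is a hypothetical name invented for this illustration.

```scala
object ColumnsProperty {
  // Assumed storage format: all column names joined into one
  // comma-separated string.
  def join(names: Seq[String]): String = names.mkString(",")

  // Naive reader: splits the property on every comma.
  def split(prop: String): Seq[String] = prop.split(",").toSeq

  def main(args: Array[String]): Unit = {
    // Hypothetical auto-generated name for the unaliased expression.
    val generated = Seq("get_json_object(deliver_geojson, $.)")
    // Hypothetical name if the expression were given an explicit alias.
    val aliased = Seq("geojson_value")

    println(s"declared ${generated.size}, read back ${split(join(generated)).size}")
    println(s"declared ${aliased.size}, read back ${split(join(aliased)).size}")
  }
}
```

Under this reading, one declared column is read back as two names while there is still only one type, and giving the expression an alias without a comma would avoid the mismatch.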
[jira] [Assigned] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20476: Assignee: (was: Apache Spark) > Exception between "create table as" and "get_json_object" > - > > Key: SPARK-20476 > URL: https://issues.apache.org/jira/browse/SPARK-20476 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: cen yuhai > > I hit this problem when I run a "create table ... as select > get_json_object(...) from xxx" statement. > This statement fails: > {code} > create table spark_json_object as > select get_json_object(deliver_geojson,'$.') > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > This works: > {code} > create table spark_json_object as > select * > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > This also works: > {code} > select get_json_object(deliver_geojson,'$.') > from dw.dw_prd_order where dt='2017-04-24' limit 10; > {code} > The failing statement logs: > {code} > 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: > org.apache.hadoop.hive.serde2.SerDeException > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements > while columns.types has 1 elements! > org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements > while columns.types has 1 elements!
[jira] [Assigned] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20476:
Assignee: Apache Spark

> Exception between "create table as" and "get_json_object"
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: cen yuhai
> Assignee: Apache Spark
>
> I encounter this problem when I want to create a table as select , get_json_object from xxx;
> This statement fails:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> This one works:
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> This one also works:
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> The failing statement logs:
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
>     at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
>     at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
>     at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
>     at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
>     at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
>     at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
>     at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
>     at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
>     at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
>     at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
>     at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
>     at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
>     at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>     at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
>     at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
>     at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
>     at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
>     at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
>     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>     at
[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985399#comment-15985399 ]

Apache Spark commented on SPARK-20476:
--
User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17776

> Exception between "create table as" and "get_json_object"
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: cen yuhai
[jira] [Created] (SPARK-20479) Performance degradation for large number of hash-aggregated columns
Kazuaki Ishizaki created SPARK-20479:

Summary: Performance degradation for large number of hash-aggregated columns
Key: SPARK-20479
URL: https://issues.apache.org/jira/browse/SPARK-20479
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki

In a comment on SPARK-20184, [~maropu] showed that performance degrades when the number of aggregated columns gets large with whole-stage codegen enabled.

{code}
./bin/spark-shell --master local[1] --conf spark.driver.memory=2g --conf spark.sql.shuffle.partitions=1 -v

def timer[R](f: => {}): Unit = {
  val count = 9
  val iters = (0 until count).map { i =>
    val t0 = System.nanoTime()
    f
    val t1 = System.nanoTime()
    val elapsed = t1 - t0 + 0.0
    println(s"#$i: ${elapsed / 10.0}")
    elapsed
  }
  println("Elapsed time: " + ((iters.sum / count) / 10.0) + "s")
}

val numCols = 80
val t = s"(SELECT id AS key1, id AS key2, ${((0 until numCols).map(i => s"id AS c$i")).mkString(", ")} FROM range(0, 10, 1, 1))"
val sqlStr = s"SELECT key1, key2, ${((0 until numCols).map(i => s"SUM(c$i)")).mkString(", ")} FROM $t GROUP BY key1, key2 LIMIT 100"

// Elapsed time: 2.308440490553s
sql("SET spark.sql.codegen.wholeStage=true")
timer { sql(sqlStr).collect }

// Elapsed time: 0.527486733s
sql("SET spark.sql.codegen.wholeStage=false")
timer { sql(sqlStr).collect }
{code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985319#comment-15985319 ]

Kazuaki Ishizaki commented on SPARK-20184:
--
When the number of aggregated columns gets large, I see complicated generated Java code for HashAggregation. I will create another JIRA to simplify the generated code.

> performance regression for complex/long sql when enable whole stage codegen
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.0, 2.1.0
> Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with whole-stage codegen on, compared with codegen off.
> SELECT
>   sum(COUNTER_57)
>   ,sum(COUNTER_71)
>   ,sum(COUNTER_3)
>   ,sum(COUNTER_70)
>   ,sum(COUNTER_66)
>   ,sum(COUNTER_75)
>   ,sum(COUNTER_69)
>   ,sum(COUNTER_55)
>   ,sum(COUNTER_63)
>   ,sum(COUNTER_68)
>   ,sum(COUNTER_56)
>   ,sum(COUNTER_37)
>   ,sum(COUNTER_51)
>   ,sum(COUNTER_42)
>   ,sum(COUNTER_43)
>   ,sum(COUNTER_1)
>   ,sum(COUNTER_76)
>   ,sum(COUNTER_54)
>   ,sum(COUNTER_44)
>   ,sum(COUNTER_46)
>   ,DIM_1
>   ,DIM_2
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> whole stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> whole stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis, I think this is related to the huge Java method (a Java method of thousands of lines) generated by codegen.
> If I set -XX:-DontCompileHugeMethods, the performance gets much better (about 7s).
[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20208: Assignee: Maciej Szymkiewicz (was: Apache Spark) > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0
[jira] [Assigned] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20208: Assignee: Apache Spark (was: Maciej Szymkiewicz) > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Apache Spark > Fix For: 2.2.0
[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985309#comment-15985309 ] Apache Spark commented on SPARK-20208: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17775 > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0
[jira] [Resolved] (SPARK-20473) ColumnVector.Array is missing accessors for some types
[ https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20473. - Resolution: Fixed Assignee: Michal Szafranski Fix Version/s: 2.2.0 > ColumnVector.Array is missing accessors for some types > -- > > Key: SPARK-20473 > URL: https://issues.apache.org/jira/browse/SPARK-20473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michal Szafranski >Assignee: Michal Szafranski > Fix For: 2.2.0 > > > ColumnVector implementations originally did not support some Catalyst types > (float, short, and boolean). Now that they do, those types should be also > added to the ColumnVector.Array.
[jira] [Updated] (SPARK-20015) Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example
[ https://issues.apache.org/jira/browse/SPARK-20015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20015: - Summary: Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example (was: Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide) > Document R Structured Streaming (experimental) in R vignettes and R & SS > programming guide, R example > - > > Key: SPARK-20015 > URL: https://issues.apache.org/jira/browse/SPARK-20015 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Felix Cheung
[jira] [Commented] (SPARK-20477) Document R bisecting k-means in R programming guide
[ https://issues.apache.org/jira/browse/SPARK-20477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985226#comment-15985226 ] Felix Cheung commented on SPARK-20477: -- [~wangmiao1981] Would you like to add this? > Document R bisecting k-means in R programming guide > --- > > Key: SPARK-20477 > URL: https://issues.apache.org/jira/browse/SPARK-20477 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung
[jira] [Commented] (SPARK-20478) Document LinearSVC in R programming guide
[ https://issues.apache.org/jira/browse/SPARK-20478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985227#comment-15985227 ] Felix Cheung commented on SPARK-20478: -- [~wangmiao1981] Would you like to add this? > Document LinearSVC in R programming guide > - > > Key: SPARK-20478 > URL: https://issues.apache.org/jira/browse/SPARK-20478 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung
[jira] [Updated] (SPARK-20477) Document R bisecting k-means in R programming guide
[ https://issues.apache.org/jira/browse/SPARK-20477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20477: - Issue Type: Documentation (was: Bug) > Document R bisecting k-means in R programming guide > --- > > Key: SPARK-20477 > URL: https://issues.apache.org/jira/browse/SPARK-20477 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung
[jira] [Reopened] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reopened SPARK-20208: -- Actually, would you mind updating the R programming guide too? > Document R fpGrowth support in vignettes, programming guide and code example > > > Key: SPARK-20208 > URL: https://issues.apache.org/jira/browse/SPARK-20208 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0
[jira] [Created] (SPARK-20478) Document LinearSVC in R programming guide
Felix Cheung created SPARK-20478: Summary: Document LinearSVC in R programming guide Key: SPARK-20478 URL: https://issues.apache.org/jira/browse/SPARK-20478 Project: Spark Issue Type: Documentation Components: SparkR Affects Versions: 2.2.0 Reporter: Felix Cheung
[jira] [Created] (SPARK-20477) Document R bisecting k-means in R programming guide
Felix Cheung created SPARK-20477: Summary: Document R bisecting k-means in R programming guide Key: SPARK-20477 URL: https://issues.apache.org/jira/browse/SPARK-20477 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Felix Cheung
[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985175#comment-15985175 ]

Xiao Li commented on SPARK-20476:
-
You can work around it by giving the column an alias:
{noformat}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.') as col1
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{noformat}
Will fix it later.

> Exception between "create table as" and "get_json_object"
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: cen yuhai
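One plausible reading of why the alias helps, sketched in plain Python rather than Hive or Spark code: the error text "columns has 2 elements while columns.types has 1 elements!" is what a count check would report if the column names were stored as one comma-separated string and the unaliased expression produced a name that itself contains a comma. The storage format, the function `parse_serde_columns`, and the auto-generated name below are all assumptions made for illustration; none of them is taken from the thread.

```python
def parse_serde_columns(columns, column_types):
    """Pair column names with types, mimicking the count check in the error message."""
    names = columns.split(",")        # assumption: names are stored comma-separated
    types = column_types.split(":")   # assumption: types are stored colon-separated
    if len(names) != len(types):
        raise ValueError(
            "columns has %d elements while columns.types has %d elements!"
            % (len(names), len(types))
        )
    return list(zip(names, types))


# Hypothetical auto-generated name for the unaliased expression; note the comma.
unaliased = "get_json_object(deliver_geojson, $.)"

try:
    parse_serde_columns(unaliased, "string")
    error = None
except ValueError as exc:
    error = str(exc)

# With an alias the name has no comma, so names and types line up.
aliased = parse_serde_columns("col1", "string")
```

Under these assumptions the unaliased name splits into two pieces against a single type, while `col1` passes, which matches the behaviour reported in the issue.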
[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json
[ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-20470:
Component/s: SQL

> Invalid json converting RDD row with Array of struct to json
>
> Key: SPARK-20470
> URL: https://issues.apache.org/jira/browse/SPARK-20470
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.6.3
> Reporter: Philip Adetiloye
>
> Converting a PySpark RDD row that contains an array of structs does not produce the right JSON. It looks trivial, but I can't get good JSON out.
> I read the JSON below into a dataframe:
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
>     {
>       "start": 1.9796095151877942,
>       "y": 968.0,
>       "width": 0.1564485056196041
>     },
>     {
>       "start": 2.1360580208073983,
>       "y": 892.0,
>       "width": 0.1564485056196041
>     },
>     {
>       "start": 2.2925065264270024,
>       "y": 814.0,
>       "width": 0.15644850561960366
>     },
>     {
>       "start": 2.448955032046606,
>       "y": 690.0,
>       "width": 0.1564485056196041
>     }]
> }
> {code}
> The dataframe schema looks good:
> {code}
> root
>  |-- feature: string (nullable = true)
>  |-- histogram: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- start: double (nullable = true)
>  |    |    |-- width: double (nullable = true)
>  |    |    |-- y: double (nullable = true)
> {code}
> Now I need to convert each row to JSON and save it to HBase:
> {code}
> rdd1 = rdd.map(lambda row: Row(x=json.dumps(row.asDict())))
> {code}
> Output JSON (wrong):
> {code}
> {
>   "feature": "feature_id_001",
>   "histogram": [
>     [
>       1.9796095151877942,
>       968.0,
>       0.1564485056196041
>     ],
>     [
>       2.1360580208073983,
>       892.0,
>       0.1564485056196041
>     ],
>     [
>       2.2925065264270024,
>       814.0,
>       0.15644850561960366
>     ],
>     [
>       2.448955032046606,
>       690.0,
>       0.1564485056196041
>     ]
> }
> {code}
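The output above is consistent with how Python's `json` module treats tuples: any tuple subclass, including a named one, is written as a JSON array, so field names are lost unless nested values are converted to dicts first. The following is a minimal standard-library sketch of that effect, using `namedtuple` as a stand-in for a nested PySpark `Row`; `Bin` and `to_plain` are hypothetical names introduced here, not part of the report.

```python
import json
from collections import namedtuple

# Stand-in for a nested struct; like a PySpark Row, it is a tuple subclass.
Bin = namedtuple("Bin", ["start", "y", "width"])


def to_plain(value):
    """Recursively turn named tuples into dicts so json.dumps keeps field names."""
    if isinstance(value, tuple) and hasattr(value, "_asdict"):
        return {key: to_plain(item) for key, item in value._asdict().items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    return value


row = {
    "feature": "feature_id_001",
    "histogram": [Bin(1.9796095151877942, 968.0, 0.1564485056196041)],
}

shallow = json.dumps(row)           # nested tuple is written as a JSON array
deep = json.dumps(to_plain(row))    # nested struct keeps its keys
```

In PySpark itself, `Row.asDict` accepts a `recursive` argument, and `row.asDict(recursive=True)` converts nested rows in the same spirit; whether that fully resolves the reporter's case on 1.6.3 is not confirmed in this thread.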
[jira] [Commented] (SPARK-20475) Whether use "broadcast join" depends on hive configuration
[ https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985100#comment-15985100 ]

Xiao Li commented on SPARK-20475:
-
cc [~ZenWzh]

> Whether use "broadcast join" depends on hive configuration
>
> Key: SPARK-20475
> URL: https://issues.apache.org/jira/browse/SPARK-20475
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Lijia Liu
>
> Currently, broadcast join in Spark only works when:
> 1. The value of "spark.sql.autoBroadcastJoinThreshold" is bigger than 0 (the default is 10485760).
> 2. The size of one of the Hive tables is less than "spark.sql.autoBroadcastJoinThreshold". To get the size of the Hive table from the Hive metastore, "hive.stats.autogather" must be set to true in Hive, or the command "ANALYZE TABLE COMPUTE STATISTICS noscan" must have been run.
> Hive, in contrast, calculates the size of the file or directory corresponding to the Hive table to decide whether to use a map-side join, and does not depend on the Hive metastore.
> This leads to two problems:
> 1. Spark will not use "broadcast join" when the Hive parameter "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE COMPUTE STATISTICS noscan" has not been run, because the size of the Hive table has not been saved in the Hive metastore. How Spark behaves therefore depends on the configuration of Hive.
> 2. For some reason, we set "hive.stats.autogather" to false in our Hive. For the same SQL, Hive is 4 times faster than Spark because Hive used "map side join" but Spark did not use "broadcast join".
> Is it possible for Spark to use the same mechanism as Hive to look up the size of a Hive table?
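The request can be pictured as a small decision rule: use the metastore size when it exists, otherwise fall back to summing the files under the table's directory, and broadcast only if the threshold is positive and the size is below it. The sketch below is plain Python over a local temporary directory, not Spark or Hive code; `directory_size`, `can_broadcast`, and the use of the local filesystem in place of HDFS are assumptions for illustration, and the comparison follows the wording of the report rather than Spark's source.

```python
import os
import tempfile

# Default of spark.sql.autoBroadcastJoinThreshold as quoted in the report.
DEFAULT_THRESHOLD = 10485760


def directory_size(path):
    """Sum the sizes of all regular files under path (local stand-in for a table directory)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def can_broadcast(metastore_size, table_dir, threshold=DEFAULT_THRESHOLD):
    """Hypothetical rule: prefer metastore statistics, fall back to file sizes."""
    if threshold <= 0:
        return False  # a non-positive threshold disables broadcast joins
    if metastore_size is not None:
        size = metastore_size
    else:
        size = directory_size(table_dir)
    return size < threshold


with tempfile.TemporaryDirectory() as table_dir:
    with open(os.path.join(table_dir, "part-00000"), "wb") as handle:
        handle.write(b"x" * 1024)
    no_stats = can_broadcast(None, table_dir)                 # falls back to file sizes
    disabled = can_broadcast(None, table_dir, threshold=-1)   # broadcast turned off
    too_big = can_broadcast(20 * 1024 * 1024, table_dir)      # metastore size wins
```

The fallback costs a directory listing per table, which is cheap locally but may be noticeable on a large distributed filesystem; that trade-off is not discussed in the report.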
[jira] [Commented] (SPARK-20476) Exception between "create table as" and "get_json_object"
[ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985096#comment-15985096 ]

Xiao Li commented on SPARK-20476:
-
This sounds like a bug to me. Let me double-check it.

> Exception between "create table as" and "get_json_object"
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: cen yuhai
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
    [ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985040#comment-15985040 ]

Steve Loughran commented on SPARK-7481:
---------------------------------------

(This is a fairly long comment, but it tries to summarise the entire state of interaction with object stores, esp. S3A on Hadoop 2.8+. Azure is simpler; GCS is Google's problem; Swift is not used very much.)

If you look at object stores & Spark (or indeed, any code which uses a filesystem as the source and dest of work), there are problems which can generally be grouped into various categories.

h3. Foundational: talking to the object stores

Classpath & execution: can you wire the JARs up? This is a longstanding issue in ASF Spark releases (SPARK-5348, SPARK-12557). It was exacerbated by the movement of S3n:// to the hadoop-aws package (FWIW, I hadn't noticed that move; I'd have blocked it if I'd been paying attention). This includes transitive problems (SPARK-11413).

Credential propagation. Spark's env var propagation is pretty cute here; SPARK-19739 picks up {{AWS_SESSION_TOKEN}} too. Diagnostics on failure is a real pain.

h3. Observable Inconsistencies leading to Data loss

Generally where the metaphor "it's just a filesystem" fails. These are bad because they often "just work", especially in dev & test with small datasets, and when they go wrong, they can fail by generating bad results *and nobody notices*.

* Expectations of consistent listing of "directories". S3Guard deals with this (HADOOP-13345), as can Netflix's S3mper and AWS's premium Dynamo-backed S3 storage.
* Expectations on the transacted nature of directory renames, the core atomic commit operations against full filesystems.
* Expectations that when things are deleted they go away. This does become visible sometimes, usually in checks for a destination not existing (SPARK-19013).
* Expectations that write-in-progress data is visible/flushed, and that {{close()}} is low cost (SPARK-19111).

Committing pretty much combines all of these; see below for more details.

h3. Aggressively bad performance

This is the mismatch between what the object store offers, what the apps expect, and the metaphor work in the Hadoop FileSystem implementations, which, in trying to hide the conceptual mismatch, can actually amplify the problem. Example: directory tree scanning at the start of a query. The mock directory structure allows callers to do treewalks, when really a full list of all children can be done as a direct O(1) call. SPARK-17159 covers some of this for scanning directories in Spark Streaming, but there's a hidden tree walk in every call to {{FileSystem.globStatus()}} (HADOOP-13371). Given how S3Guard transforms this treewalk, and you need it for consistency, that's probably the best solution for now. Although I have a PoC which does a full List **/* followed by a filter, that's not viable when you have a wide deep tree and do need to prune aggressively.

Checkpointing to object stores is similar: it's generally not dangerous to do the write+rename, it just adds the copy overhead, consistency issues notwithstanding.

h3. Suboptimal code

There are opportunities for speedup, but if the code is not on the critical path, it's not worth the hassle. That said, as every call to {{getFileStatus()}} can take hundreds of millis, such calls get onto the critical path quite fast. Examples: checks for a file existing before calling {{fs.delete(path)}} (this is always a no-op if the dest path isn't there), and the equivalent on mkdirs: {{if (!fs.exists(dir)) fs.mkdirs(dir)}}. Hadoop 3.0 will help steer people on the path of righteousness there by deprecating a couple of methods which encourage inefficiencies (isFile/isDir).

h3. The commit problem

The full commit problem combines all of these: you need a consistent list of source data; your deleted destination path mustn't appear in listings; the commit of each task must promote a task's work to the pending output of the job; an abort must leave no trace of it. The final job commit must place data into the final destination; again, a job abort must not make any output visible. There's some ambiguity about what happens if task and job commits fail; generally the safest is "abort everything". Furthermore, nobody has any idea what to do if an {{abort()}} raises exceptions. Oh, and all of this must be fast.

Spark is no better or worse than the core MapReduce committers here, or that of Hive. Spark generally uses the Hadoop {{FileOutputFormat}} via the {{HadoopMapReduceCommitProtocol}}, directly or indirectly (e.g. {{ParquetOutputFormat}}), extracting its committer and casting it to {{FileOutputCommitter}}, primarily to get a working directory. This committer assumes the destination is a consistent FS and uses renames when promoting task and job output, assuming that a rename is so fast that it doesn't even bother to log a message "about to rename". Hence the recurrent Stack Overflow
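
The redundant-probe pattern described under "Suboptimal code" in the comment above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory {{CountingStore}} standing in for a remote object store: {{CountingStore}}, {{remoteCalls}} and the helper names are illustrative only and are not part of Hadoop or Spark. Each method call models one remote round trip, so the counter shows what the extra {{exists}} check costs.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an object store client (NOT a Hadoop API).
// Every method increments remoteCalls to model one remote round trip.
final class CountingStore {
  private val paths = mutable.Set[String]()
  var remoteCalls: Int = 0

  def exists(path: String): Boolean = { remoteCalls += 1; paths.contains(path) }
  // Like a filesystem delete: a no-op returning false if the path is absent.
  def delete(path: String): Boolean = { remoteCalls += 1; paths.remove(path) }
  // Idempotent: succeeds whether or not the directory already exists.
  def mkdirs(path: String): Boolean = { remoteCalls += 1; paths.add(path); true }
}

object RedundantProbes {
  // Probe-then-act: pays for exists() and then for delete().
  def deleteWithProbe(store: CountingStore, path: String): Unit =
    if (store.exists(path)) store.delete(path)

  // Direct call: delete() already tolerates a missing path.
  def deleteDirect(store: CountingStore, path: String): Unit = {
    store.delete(path)
    ()
  }

  // Probe-then-act: pays for exists() and then for mkdirs().
  def mkdirsWithProbe(store: CountingStore, dir: String): Unit =
    if (!store.exists(dir)) store.mkdirs(dir)

  // Direct call: mkdirs() is already idempotent.
  def mkdirsDirect(store: CountingStore, dir: String): Unit = {
    store.mkdirs(dir)
    ()
  }

  def main(args: Array[String]): Unit = {
    val probed = new CountingStore
    mkdirsWithProbe(probed, "out")
    deleteWithProbe(probed, "out")
    println(s"with probes: ${probed.remoteCalls} round trips")

    val direct = new CountingStore
    mkdirsDirect(direct, "out")
    deleteDirect(direct, "out")
    println(s"direct calls: ${direct.remoteCalls} round trips")
  }
}
```

With a real store, where each round trip can take hundreds of milliseconds as the comment notes, the probing variants double the latency of these two operations without changing the outcome.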
[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"
    [ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

cen yuhai updated SPARK-20476:
------------------------------
    Description:
I encounter this problem when I create a table as select with get_json_object.

The following statement fails:
{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
The following is ok:
{code}
create table spark_json_object as
select *
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
The following is also ok:
{code}
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
    at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
    at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
    at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
    at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
    at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
[jira] [Comment Edited] (SPARK-17403) Fatal Error: Scan cached strings
    [ https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982595#comment-15982595 ]

Paul Lysak edited comment on SPARK-17403 at 4/26/17 3:23 PM:
-------------------------------------------------------------

Looks like we have the same issue with Spark 2.1 on YARN (Amazon EMR release emr-5.4.0).
The workaround that solves the issue for us (at the cost of some performance) is to use df.persist(StorageLevel.DISK_ONLY) instead of df.cache().

Depending on the node types, memory settings, storage level and some other factors I couldn't clearly identify, it may appear as
{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 158 in stage 504.0 failed 4 times, most recent failure: Lost task 158.3 in stage 504.0 (TID 427365, ip-10-35-162-171.ec2.internal, executor 83): java.lang.NegativeArraySizeException
    at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
    at org.apache.spark.unsafe.types.UTF8String.clone(UTF8String.java:826)
    at org.apache.spark.sql.execution.columnar.StringColumnStats.gatherStats(ColumnStats.scala:217)
    at org.apache.spark.sql.execution.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:55)
    at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.org$apache$spark$sql$execution$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:97)
    at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
    at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:97)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:122)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
{noformat}
or as
{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 27 in stage 61.0 failed 4 times, most recent failure: Lost task 27.3 in stage 61.0 (TID 36167, ip-10-35-162-149.ec2.internal, executor 1): java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_38$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:107)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
{noformat}
or as
{noformat}
2017-04-24 19:02:45,951 ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_37$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:217)
    at
[jira] [Commented] (SPARK-17403) Fatal Error: Scan cached strings
    [ https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984987#comment-15984987 ]

Paul Lysak commented on SPARK-17403:
------------------------------------

Hope that helps - finally managed to reproduce it without using production data:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Left side: 1000 x 1000 (k, j) pairs, each expanded with l = 1..1000.
val leftRows = spark.sparkContext.parallelize(numSlices = 3000, seq = for (k <- 1 to 1000; j <- 1 to 1000) yield Row(k, j))
  .flatMap(r => (1 to 1000).map(l => Row.fromSeq(r.toSeq :+ l)))

val leftDf = spark.createDataFrame(leftRows, StructType(Seq(
    StructField("k", IntegerType),
    StructField("j", IntegerType),
    StructField("l", IntegerType)
  )))
  .withColumn("combinedKey", expr("k*100 + j*1000 + l"))
  .withColumn("fixedCol", lit("sampleVal"))
  .withColumn("combKeyStr", format_number(col("combinedKey"), 0))
  .withColumn("k100", expr("k*100"))
  .withColumn("j100", expr("j*100"))
  .withColumn("l100", expr("l*100"))
  .withColumn("k_200", expr("k+200"))
  .withColumn("j_200", expr("j+200"))
  .withColumn("l_200", expr("l+200"))
  .withColumn("strCol1_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol1_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol1_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  .withColumn("strCol2_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol2_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol2_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  .withColumn("strCol3_1", concat(lit("value of sample column number one with which column k will be concatenated:" * 5), format_number(col("k"), 0)))
  .withColumn("strCol3_2", concat(lit("value of sample column two one with which column j will be concatenated:" * 5), format_number(col("j"), 0)))
  .withColumn("strCol3_3", concat(lit("value of sample column three one with which column r will be concatenated:" * 5), format_number(col("l"), 0)))
  // if further columns are commented out, the error disappears

leftDf.cache()
println("= leftDf count:" + leftDf.count())
leftDf.show(10)

// Right side: small lookup table, broadcast in the join below.
val rightRows = spark.sparkContext.parallelize((1 to 800).map(i => Row(i, "k_" + i, "sampleVal")))
val rightDf = spark.createDataFrame(rightRows, StructType(Seq(
    StructField("k", IntegerType),
    StructField("kStr", StringType),
    StructField("sampleCol", StringType)
  )))

rightDf.cache()
println("= rightDf count:" + rightDf.count())
rightDf.show(10)

val joinedDf = leftDf.join(broadcast(rightDf), usingColumns = Seq("k"), joinType = "left")

joinedDf.cache()
println("= joinedDf count:" + joinedDf.count())
joinedDf.show(10)
{code}

The ApplicationMaster fails with the following exception:
{noformat}
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 949 in stage 8.0 failed 4 times, most recent failure: Lost task 949.3 in stage 8.0 (TID 4922, ip-10-35-162-219.ec2.internal, executor 139): java.lang.NegativeArraySizeException
    at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
    at org.apache.spark.unsafe.types.UTF8String.clone(UTF8String.java:826)
    at org.apache.spark.sql.execution.columnar.StringColumnStats.gatherStats(ColumnStats.scala:216)
    at org.apache.spark.sql.execution.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:55)
    at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.org$apache$spark$sql$execution$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:97)
    at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
    at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:97)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:122)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at
[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"
    [ https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

cen yuhai updated SPARK-20476:
------------------------------
    Description:
{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'$.')
from dw.dw_prd_order where dt='2017-04-24' limit 10;
{code}
{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
    at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
    at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
    at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
    at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
    at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at
[jira] [Created] (SPARK-20476) Exception between "create table as" and "get_json_object"
cen yuhai created SPARK-20476:
---------------------------------

Summary: Exception between "create table as" and "get_json_object"
Key: SPARK-20476
URL: https://issues.apache.org/jira/browse/SPARK-20476
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: cen yuhai

{code}
create table spark_json_object as
select get_json_object(deliver_geojson,'')
from dw.dw_prd_restaurant where dt='2017-04-24' limit 10;
{code}
{code}
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
    at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
    at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
    at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
    at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
    at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:179)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:593)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at
[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984975#comment-15984975 ] Sebastian Arzt commented on SPARK-18371: Screenshots: [before|https://issues.apache.org/jira/secure/attachment/12865156/01.png] [after|https://issues.apache.org/jira/secure/attachment/12865158/02.png] > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Arzt updated SPARK-18371: --- Comment: was deleted (was: After fix) > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Arzt updated SPARK-18371: --- Comment: was deleted (was: Before fix) > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Arzt updated SPARK-18371: --- Attachment: 01.png Before fix > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Arzt updated SPARK-18371: --- Attachment: 02.png After fix > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: 01.png, 02.png, GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18371: Assignee: Apache Spark > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced >Assignee: Apache Spark > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18371: Assignee: (was: Apache Spark) > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984962#comment-15984962 ] Apache Spark commented on SPARK-18371: -- User 'arzt' has created a pull request for this issue: https://github.com/apache/spark/pull/17774 > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API
[ https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Imran Rashid reassigned SPARK-20391:
------------------------------------

    Assignee: Saisai Shao

> Properly rename the memory related fields in ExecutorSummary REST API
> ---------------------------------------------------------------------
>
>                 Key: SPARK-20391
>                 URL: https://issues.apache.org/jira/browse/SPARK-20391
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>            Priority: Blocker
>             Fix For: 2.2.0
>
>
> Currently in Spark we can get the executor summary through the REST API
> {{/api/v1/applications//executors}}. The format of the executor summary is:
> {code}
> class ExecutorSummary private[spark](
>     val id: String,
>     val hostPort: String,
>     val isActive: Boolean,
>     val rddBlocks: Int,
>     val memoryUsed: Long,
>     val diskUsed: Long,
>     val totalCores: Int,
>     val maxTasks: Int,
>     val activeTasks: Int,
>     val failedTasks: Int,
>     val completedTasks: Int,
>     val totalTasks: Int,
>     val totalDuration: Long,
>     val totalGCTime: Long,
>     val totalInputBytes: Long,
>     val totalShuffleRead: Long,
>     val totalShuffleWrite: Long,
>     val isBlacklisted: Boolean,
>     val maxMemory: Long,
>     val executorLogs: Map[String, String],
>     val onHeapMemoryUsed: Option[Long],
>     val offHeapMemoryUsed: Option[Long],
>     val maxOnHeapMemory: Option[Long],
>     val maxOffHeapMemory: Option[Long])
> {code}
> There are 6 memory-related fields: {{memoryUsed}}, {{maxMemory}},
> {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}},
> {{maxOffHeapMemory}}.
> All 6 of these fields reflect the *storage* memory usage in Spark, but from
> their names a user cannot tell whether they refer to *storage* memory or to
> the total memory (storage memory + execution memory). This is misleading.
> So I think we should rename these fields to reflect their real meanings.
> Or we should document it.
[jira] [Resolved] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API
[ https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Imran Rashid resolved SPARK-20391.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 17700
[https://github.com/apache/spark/pull/17700]

> Properly rename the memory related fields in ExecutorSummary REST API
> ---------------------------------------------------------------------
>
>                 Key: SPARK-20391
>                 URL: https://issues.apache.org/jira/browse/SPARK-20391
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Saisai Shao
>            Priority: Blocker
>             Fix For: 2.2.0
>
>
> Currently in Spark we can get the executor summary through the REST API
> {{/api/v1/applications//executors}}. The format of the executor summary is:
> {code}
> class ExecutorSummary private[spark](
>     val id: String,
>     val hostPort: String,
>     val isActive: Boolean,
>     val rddBlocks: Int,
>     val memoryUsed: Long,
>     val diskUsed: Long,
>     val totalCores: Int,
>     val maxTasks: Int,
>     val activeTasks: Int,
>     val failedTasks: Int,
>     val completedTasks: Int,
>     val totalTasks: Int,
>     val totalDuration: Long,
>     val totalGCTime: Long,
>     val totalInputBytes: Long,
>     val totalShuffleRead: Long,
>     val totalShuffleWrite: Long,
>     val isBlacklisted: Boolean,
>     val maxMemory: Long,
>     val executorLogs: Map[String, String],
>     val onHeapMemoryUsed: Option[Long],
>     val offHeapMemoryUsed: Option[Long],
>     val maxOnHeapMemory: Option[Long],
>     val maxOffHeapMemory: Option[Long])
> {code}
> There are 6 memory-related fields: {{memoryUsed}}, {{maxMemory}},
> {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}},
> {{maxOffHeapMemory}}.
> All 6 of these fields reflect the *storage* memory usage in Spark, but from
> their names a user cannot tell whether they refer to *storage* memory or to
> the total memory (storage memory + execution memory). This is misleading.
> So I think we should rename these fields to reflect their real meanings.
> Or we should document it.
[jira] [Resolved] (SPARK-19812) YARN shuffle service fails to relocate recovery DB across NFS directories
[ https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-19812.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.3.0
                   2.2.0

> YARN shuffle service fails to relocate recovery DB across NFS directories
> -------------------------------------------------------------------------
>
>                 Key: SPARK-19812
>                 URL: https://issues.apache.org/jira/browse/SPARK-19812
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.0.1
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>             Fix For: 2.2.0, 2.3.0
>
>
> The YARN shuffle service tries to switch from the YARN local directories to
> the real recovery directory but can fail to move the existing recovery DBs.
> It fails because Files.move cannot move a directory that has contents when
> the move cannot be done as a simple rename, as happens here across NFS
> directories.
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move
> recovery file sparkShuffleRecovery.ldb to the path
> /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
>         at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
>         at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
>         at java.nio.file.Files.move(Files.java:1395)
>         at org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
>         at org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
>         at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This used to use f.renameTo; we switched it in the PR because of review
> comments, and it looks like we didn't do a final real test. The tests use
> files rather than directories, so they didn't catch it. We need to fix the
> tests as well.
> history:
> https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440
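The failure mode reported in SPARK-19812 can be illustrated with a small Java sketch. It is not the change that was merged for this issue; the class name RecoveryDbMove and its helper methods are hypothetical. It tries Files.move first and, if the JDK reports DirectoryNotEmptyException (which Files.move may throw when a non-empty directory cannot simply be renamed), falls back to copying the tree and deleting the source.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of Spark.
public class RecoveryDbMove {

  /** Tries a plain move; falls back to copy-then-delete for non-empty directories. */
  public static void moveDirectory(Path source, Path target) throws IOException {
    try {
      Files.move(source, target);
    } catch (DirectoryNotEmptyException e) {
      copyTree(source, target);
      deleteTree(source);
    }
  }

  /** Recursively copies every directory and file under source to target. */
  static void copyTree(final Path source, final Path target) throws IOException {
    Files.walkFileTree(source, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
          throws IOException {
        Files.createDirectories(target.resolve(source.relativize(dir).toString()));
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.copy(file, target.resolve(source.relativize(file).toString()),
            StandardCopyOption.REPLACE_EXISTING);
        return FileVisitResult.CONTINUE;
      }
    });
  }

  /** Deletes files first, then each directory once it has been emptied. */
  static void deleteTree(Path root) throws IOException {
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException exc)
          throws IOException {
        if (exc != null) {
          throw exc;
        }
        Files.delete(dir);
        return FileVisitResult.CONTINUE;
      }
    });
  }
}
```

A test for such a helper would need to use a directory with contents, which is the case the report says the original tests missed.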
[jira] [Comment Edited] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776 ] Sebastian Arzt edited comment on SPARK-18371 at 4/26/17 1:24 PM: - I deep dived into it and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution, where rounding is omitted at all. This has the nice side-effect that backpressure is more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon. was (Author: seb.arzt): I deep dived into it and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. 
However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution, where rounding is omitted at all. I has the nice side-effect that the backpressure is more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon. > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776 ] Sebastian Arzt edited comment on SPARK-18371 at 4/26/17 1:23 PM: - I deep dived into it and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution, where rounding is omitted at all. I has the nice side-effect that the backpressure is more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon. was (Author: seb.arzt): I deep dived into it and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. 
However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution, where rounding is omitted at all. I has the nice side-effect that the backpressure in more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon. > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984776#comment-15984776 ]

Sebastian Arzt commented on SPARK-18371:
----------------------------------------

I dug into this and found a simple solution. The problem is that [maxRateLimitPerPartition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L94] returns {{None}} for an unintended case. {{None}} should only be returned if there is no lag, as indicated by this [condition|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]. However, this condition is also true if all backpressureRates are [rounded|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] to zero. I propose a solution in which the rounding is omitted altogether. This has the nice side effect that the backpressure is more fine-grained and not only an integral multiple of the [batchDuration|https://github.com/apache/spark/blob/branch-2.0/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L117] in seconds. I will open a pull request for it soon.

> Spark Streaming backpressure bug - generates a batch with large number of
> records
> -------------------------------------------------------------------------
>
>                 Key: SPARK-18371
>                 URL: https://issues.apache.org/jira/browse/SPARK-18371
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.0.0
>            Reporter: mapreduced
>         Attachments: GiantBatch2.png, GiantBatch3.png,
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it
> generates a GIANT batch of records if the processing time + scheduled delay
> is (much) larger than batchDuration. This creates a backlog of records like
> no other and results in batches queueing for hours until it chews through
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily
> reproducible.
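The rounding problem described in Sebastian Arzt's comment on SPARK-18371 can be sketched in a few lines of Java. This is a simplified illustration, not the code in DirectKafkaInputDStream: the class BackpressureRounding, its method names, and the use of an empty Optional to stand for {{None}} (no limit) are assumptions made for the example. With three partitions of equal lag and an estimated rate of 1 record per second, each per-partition share is about 0.33, which rounds to zero, so the rounded variant reports "no limit" even though there is lag.

```java
import java.util.Optional;

// Hypothetical illustration, not Spark code.
public class BackpressureRounding {

  /** Per-partition rates are rounded to whole numbers before the "any rate > 0" check. */
  public static Optional<long[]> maxMessagesRounded(
      long[] lagPerPartition, int rate, double secsPerBatch) {
    long totalLag = 0;
    for (long lag : lagPerPartition) {
      totalLag += lag;
    }
    if (totalLag == 0) {
      return Optional.empty(); // no lag: no limit needed
    }
    long[] limits = new long[lagPerPartition.length];
    long sum = 0;
    for (int i = 0; i < limits.length; i++) {
      limits[i] = Math.round(lagPerPartition[i] / (float) totalLag * rate);
      sum += limits[i];
    }
    if (sum <= 0) {
      return Optional.empty(); // all shares rounded to zero: wrongly treated as "no limit"
    }
    long[] messages = new long[limits.length];
    for (int i = 0; i < limits.length; i++) {
      messages[i] = (long) (secsPerBatch * limits[i]);
    }
    return Optional.of(messages);
  }

  /** Same computation with fractional rates kept until the final conversion. */
  public static Optional<long[]> maxMessagesUnrounded(
      long[] lagPerPartition, int rate, double secsPerBatch) {
    long totalLag = 0;
    for (long lag : lagPerPartition) {
      totalLag += lag;
    }
    if (totalLag == 0 || rate <= 0) {
      return Optional.empty();
    }
    long[] messages = new long[lagPerPartition.length];
    for (int i = 0; i < messages.length; i++) {
      double limit = lagPerPartition[i] / (double) totalLag * rate;
      messages[i] = (long) (secsPerBatch * limit);
    }
    return Optional.of(messages);
  }
}
```

For lags of 100, 100, 100, a rate of 1 and a 10 second batch, the rounded variant returns an empty Optional, while the unrounded variant limits each partition to 3 records per batch.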
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984771#comment-15984771 ]

Steve Loughran commented on SPARK-7481:
---------------------------------------

I think we ended up going in circles on that PR. Sean has actually been very tolerant of me; however, it's been hampered by my full-time focus on other things. I've only had time to work on the Spark PR intermittently, and that's been hard for all: me in the rebase/retest, the one reviewer in having to catch up again.

Now, anyone who does manage to get that classpath right will discover that S3A absolutely flies with Spark, in partitioning (list file improvements), data input (set fadvise=true for ORC and Parquet), and for output (set fast.output=true, play with the pool options). It delivers that performance because this patch sets things up for the integration tests downstream of it, so I and others can be confident that things actually work, at speed, at scale. Indeed, much of the S3A performance work was actually based on Hive and Spark workloads: the data formats & their seek patterns, directory layouts, file generation. All that's left is the little problem of getting the classpath right. Oh, and the committer.

For now, for people's enjoyment, here are some videos from Spark Summit East on the topic:
* [Spark and object stores|https://youtu.be/8F2Jqw5_OnI].
* [Robust and Scalable etl over Cloud Storage With Spark|https://spark-summit.org/east-2017/events/robust-and-scalable-etl-over-cloud-storage-with-spark/]

> Add spark-hadoop-cloud module to pull in object store support
> -------------------------------------------------------------
>
>                 Key: SPARK-7481
>                 URL: https://issues.apache.org/jira/browse/SPARK-7481
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.1.0
>            Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies
> of spark in a 2.6+ profile need to add the relevant object store packages
> (hadoop-aws, hadoop-openstack, hadoop-azure)
[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types
[ https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20473:

    Assignee: Apache Spark

> ColumnVector.Array is missing accessors for some types
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Michal Szafranski
> Assignee: Apache Spark
>
> ColumnVector implementations originally did not support some Catalyst types
> (float, short, and boolean). Now that they do, those types should be also
> added to the ColumnVector.Array.
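The request in SPARK-20473 can be pictured with a small toy model. The sketch below is not Spark's code: ToyColumn, ToyArray and example() are hypothetical names, and the sketch simply assumes that an array view is a (column, offset, length) triple whose typed getters forward to the backing column. Under that assumption, adding float, short and boolean support to the array view means adding one forwarding accessor per type.

```java
public class ArrayAccessorSketch {
    /** Hypothetical stand-in for a column that already supports float, short and boolean. */
    static final class ToyColumn {
        private final float[] floats;
        private final short[] shorts;
        private final boolean[] booleans;

        ToyColumn(float[] floats, short[] shorts, boolean[] booleans) {
            this.floats = floats;
            this.shorts = shorts;
            this.booleans = booleans;
        }

        float getFloat(int rowId) { return floats[rowId]; }
        short getShort(int rowId) { return shorts[rowId]; }
        boolean getBoolean(int rowId) { return booleans[rowId]; }
    }

    /** Hypothetical array view: a window of `length` elements starting at `offset`. */
    static final class ToyArray {
        private final ToyColumn data;
        private final int offset;
        private final int length;

        ToyArray(ToyColumn data, int offset, int length) {
            this.data = data;
            this.offset = offset;
            this.length = length;
        }

        int numElements() { return length; }

        // One forwarding accessor per type; each translates the ordinal
        // inside the view into a row id of the backing column.
        float getFloat(int ordinal) { return data.getFloat(offset + checked(ordinal)); }
        short getShort(int ordinal) { return data.getShort(offset + checked(ordinal)); }
        boolean getBoolean(int ordinal) { return data.getBoolean(offset + checked(ordinal)); }

        private int checked(int ordinal) {
            if (ordinal < 0 || ordinal >= length) {
                throw new IndexOutOfBoundsException("ordinal " + ordinal);
            }
            return ordinal;
        }
    }

    /** Builds a view over rows 1..2 of a four-row toy column. */
    static ToyArray example() {
        ToyColumn column = new ToyColumn(
            new float[] {0.5f, 1.5f, 2.5f, 3.5f},
            new short[] {10, 20, 30, 40},
            new boolean[] {false, true, true, false});
        return new ToyArray(column, 1, 2);
    }

    public static void main(String[] args) {
        ToyArray array = example();
        // Prints "1.5 30 true"
        System.out.println(array.getFloat(0) + " " + array.getShort(1) + " " + array.getBoolean(0));
    }
}
```

The bounds check in checked() is an addition of this sketch; whether the real accessors validate ordinals is not stated in the issue.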
[jira] [Assigned] (SPARK-20474) OnHeapColumnVector realocation may not copy existing data
[ https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20474:

    Assignee: (was: Apache Spark)

> OnHeapColumnVector realocation may not copy existing data
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Michal Szafranski
>
> OnHeapColumnVector reallocation copies to the new storage data up to
> 'elementsAppended'. This variable is only updated when using the
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
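The failure mode described in SPARK-20474 can be reproduced with a toy model. The sketch below is not Spark's OnHeapColumnVector: ToyIntVector, the copyWholeBuffer flag and fillWithPutThenGrow() are hypothetical names for illustration. It assumes only what the issue states, namely that reallocation copies 'elementsAppended' values and that the put-style API never updates that counter. Copying the whole old buffer instead is shown as one possible remedy; it is not claimed to be what the pull request does.

```java
import java.util.Arrays;

public class ReallocationSketch {
    /** Hypothetical on-heap int column; a toy model, not Spark's class. */
    static final class ToyIntVector {
        private int[] data;
        private int elementsAppended = 0;
        private final boolean copyWholeBuffer;

        ToyIntVector(int capacity, boolean copyWholeBuffer) {
            this.data = new int[capacity];
            this.copyWholeBuffer = copyWholeBuffer;
        }

        /** append-style write: advances elementsAppended. */
        void appendInt(int value) {
            if (elementsAppended == data.length) {
                reserve(Math.max(1, data.length * 2));
            }
            data[elementsAppended++] = value;
        }

        /** put-style write: stores at a row id and leaves elementsAppended untouched. */
        void putInt(int rowId, int value) { data[rowId] = value; }

        int getInt(int rowId) { return data[rowId]; }

        /** Grows the storage, copying either the whole old buffer or only the appended prefix. */
        void reserve(int newCapacity) {
            if (newCapacity <= data.length) {
                return;
            }
            int[] newData = new int[newCapacity];
            int toCopy = copyWholeBuffer ? data.length : elementsAppended;
            System.arraycopy(data, 0, newData, 0, toCopy);
            data = newData;
        }
    }

    /** Writes 1..4 with putInt, grows the vector, then reads the first four rows back. */
    static int[] fillWithPutThenGrow(boolean copyWholeBuffer) {
        ToyIntVector vector = new ToyIntVector(4, copyWholeBuffer);
        for (int i = 0; i < 4; i++) {
            vector.putInt(i, i + 1);
        }
        vector.reserve(8);
        int[] out = new int[4];
        for (int i = 0; i < 4; i++) {
            out[i] = vector.getInt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Copying only the appended prefix loses the put-written rows: [0, 0, 0, 0]
        System.out.println(Arrays.toString(fillWithPutThenGrow(false)));
        // Copying the whole old buffer preserves them: [1, 2, 3, 4]
        System.out.println(Arrays.toString(fillWithPutThenGrow(true)));
    }
}
```

In the toy model the data loss is silent: the rows read back as zeros and no exception is thrown, which matches the issue title's "may not copy existing data".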
[jira] [Assigned] (SPARK-20473) ColumnVector.Array is missing accessors for some types
[ https://issues.apache.org/jira/browse/SPARK-20473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20473:

    Assignee: (was: Apache Spark)

> ColumnVector.Array is missing accessors for some types
>
> Key: SPARK-20473
> URL: https://issues.apache.org/jira/browse/SPARK-20473
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Michal Szafranski
>
> ColumnVector implementations originally did not support some Catalyst types
> (float, short, and boolean). Now that they do, those types should be also
> added to the ColumnVector.Array.
[jira] [Assigned] (SPARK-20474) OnHeapColumnVector realocation may not copy existing data
[ https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20474:

    Assignee: Apache Spark

> OnHeapColumnVector realocation may not copy existing data
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Michal Szafranski
> Assignee: Apache Spark
>
> OnHeapColumnVector reallocation copies to the new storage data up to
> 'elementsAppended'. This variable is only updated when using the
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
[jira] [Commented] (SPARK-20474) OnHeapColumnVector realocation may not copy existing data
[ https://issues.apache.org/jira/browse/SPARK-20474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984577#comment-15984577 ]

Apache Spark commented on SPARK-20474:

User 'michal-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/17773

> OnHeapColumnVector realocation may not copy existing data
>
> Key: SPARK-20474
> URL: https://issues.apache.org/jira/browse/SPARK-20474
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Michal Szafranski
>
> OnHeapColumnVector reallocation copies to the new storage data up to
> 'elementsAppended'. This variable is only updated when using the
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.