[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898855#comment-15898855 ] Nick Pentreath commented on SPARK-14409: [~josephkb] the proposed input schema above encompasses that - the {{labelCol}} is the true relevance score (rating, confidence, etc.), while the {{predictionCol}} is the predicted relevance (rating, confidence, etc.). Note we can name these columns something more specific ({{labelCol}} and {{predictionCol}} are really re-used from the other evaluators). This also allows "weighted" forms of ranking metrics later (some metrics can incorporate the true relevance score into the computation, which serves as a form of weighting of the metric - the metrics we currently have don't do that). So for now the true relevance can serve as a filter - for example, when computing a ranking metric for recommendation, we *don't* want to include negative ratings in the "ground truth set of relevant documents" - hence the {{goodThreshold}} param above (I would rather call it something like {{relevanceThreshold}} myself). *Note* that there are 2 formats I detail in my comment above - the first is the actual schema of the {{DataFrame}} used as input to the {{RankingEvaluator}} - this must therefore be the schema of the DF output by {{model.transform}} (whether that is ALS for recommendation, a logistic regression for ad prediction, or whatever). The second format mentioned simply illustrates the *intermediate internal transformation* that the evaluator will do in the {{evaluate}} method. You can see a rough draft of it in Danilo's PR [here|https://github.com/apache/spark/pull/16618/files#diff-0345c4cb1878d3bb0d84297202fdc95fR93].
> Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
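To make the "true relevance as a filter" idea above concrete, here is a minimal pure-Scala sketch of precision@k with a relevance threshold, mirroring the kind of intermediate computation the evaluator would perform in {{evaluate}}. The names ({{relevanceThreshold}}, {{precisionAtK}}) follow the proposal and are illustrative only, not an existing Spark API.

```scala
// Hypothetical sketch of precision@k with a relevance threshold.
// relevanceThreshold is the proposed param name, not an existing Spark API.
def precisionAtK(
    predicted: Seq[Int],             // item ids, ranked by predicted relevance
    trueRelevance: Map[Int, Double], // item id -> true relevance (rating, confidence, ...)
    relevanceThreshold: Double,
    k: Int): Double = {
  // Ground-truth set: only items whose true relevance clears the threshold,
  // so e.g. negatively-rated items are excluded from the relevant set.
  val relevant: Set[Int] = trueRelevance.collect {
    case (id, r) if r >= relevanceThreshold => id
  }.toSet
  if (relevant.isEmpty) 0.0
  // Set[Int] is also Int => Boolean, so it can be used as the count predicate.
  else predicted.take(k).count(relevant).toDouble / k
}
```

A weighted variant could additionally multiply each hit by its true relevance score, which is the "weighting" extension mentioned above.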
[jira] [Assigned] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.
[ https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18549: Assignee: Apache Spark
> Failed to Uncache a View that References a Dropped Table.
> Key: SPARK-18549
> URL: https://issues.apache.org/jira/browse/SPARK-18549
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Xiao Li
> Assignee: Apache Spark
> Priority: Critical
>
> {code}
> spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1")
> spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2")
> sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2")
> // Cache is empty at the beginning
> assert(spark.sharedState.cacheManager.isEmpty)
> sql("CACHE TABLE testView")
> assert(spark.catalog.isCached("testView"))
> // Cache is not empty
> assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
> {code}
> // drop a table referenced by a cached view
> sql("DROP TABLE jt1")
> // So far everything is fine
> // Failed to uncache the view
> val e = intercept[AnalysisException] {
>   sql("UNCACHE TABLE testView")
> }.getMessage
> assert(e.contains("Table or view not found: `default`.`jt1`"))
> // We are unable to drop it from the cache
> assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
[jira] [Assigned] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.
[ https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18549: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.
[ https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898834#comment-15898834 ] Apache Spark commented on SPARK-18549: User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17097
[jira] [Assigned] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.
[ https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18549: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.
[ https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18549: Assignee: Apache Spark
[jira] [Assigned] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name
[ https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19832: Assignee: Song Jun
> DynamicPartitionWriteTask should escape the partition name
> Key: SPARK-19832
> URL: https://issues.apache.org/jira/browse/SPARK-19832
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Song Jun
> Assignee: Song Jun
> Fix For: 2.2.0
>
> Currently in DynamicPartitionWriteTask, when we get the partitionPath of a
> partition, we escape only the partition value, not the partition name.
> This causes problems for some special partition names, for example:
> 1) If the partition name contains '%' etc., two partition paths are created
> in the filesystem: an escaped path like '/path/a%25b=1' and an unescaped
> path like '/path/a%b=1'. The inserted data is stored in the unescaped path,
> while SHOW PARTITIONS returns 'a%25b=1', i.e. the escaped partition name,
> so the two are not consistent. The data should be stored in the escaped
> path in the filesystem, which is also how Hive 2.0.0 behaves.
> 2) If the partition name contains ':', new Path("/path", "a:b") throws an
> exception, because a colon in the relative path is illegal:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: a:b
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:88)
>   ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 50 more
> {code}
[jira] [Resolved] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name
[ https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19832. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17173 [https://github.com/apache/spark/pull/17173]
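The fix described above amounts to running the same escaping over the partition name that is already run over the value. The sketch below is illustrative only (it is not Spark's actual ExternalCatalogUtils code, and the set of escaped characters is an assumption); it shows how Hive-style %XX escaping of both halves yields a single, legal path segment for names like 'a%b' or 'a:b'.

```scala
// Illustrative Hive-style path escaping (hypothetical helper, not Spark's code).
// The character set below is an assumption for the example.
val charsToEscape: Set[Char] = "%:/#\\".toSet

def escapePathName(s: String): String =
  s.flatMap { c =>
    // Replace each special character with '%' followed by its hex code.
    if (charsToEscape.contains(c)) f"%%${c.toInt}%02X" else c.toString
  }

// Escape BOTH the partition name and the value, so 'a%b' -> 'a%25b' and
// 'a:b' -> 'a%3Ab', producing one consistent, legal path segment.
def partitionPath(name: String, value: String): String =
  s"${escapePathName(name)}=${escapePathName(value)}"
```

With only the value escaped, the name 'a%b' would surface both as '/path/a%b=1' (on disk) and 'a%25b=1' (in SHOW PARTITIONS), which is exactly the inconsistency reported above.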
[jira] [Commented] (SPARK-19847) port hive read to FileFormat API
[ https://issues.apache.org/jira/browse/SPARK-19847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898824#comment-15898824 ] Apache Spark commented on SPARK-19847: User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17187
> port hive read to FileFormat API
> Key: SPARK-19847
> URL: https://issues.apache.org/jira/browse/SPARK-19847
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
[jira] [Assigned] (SPARK-19847) port hive read to FileFormat API
[ https://issues.apache.org/jira/browse/SPARK-19847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19847: Assignee: Apache Spark (was: Wenchen Fan)
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898823#comment-15898823 ] Joseph K. Bradley commented on SPARK-14409: Thanks [~nick.pentre...@gmail.com]! I like this general approach. A few initial thoughts on the schema for the evaluator:
* Some evaluators will take rating or confidence values as well. Will those be appended as extra columns?
* If a recommendation model like ALSModel returns top-K recommendations for each user, that will not fit the RankingEvaluator input. Do you plan to have RankingEvaluator or CrossValidator handle efficient calculation of top-K recommendations?
* Relatedly, I'll comment on the schema in [https://github.com/apache/spark/pull/17090] directly in that PR in case we want to make changes in a quick follow-up.
[jira] [Assigned] (SPARK-19847) port hive read to FileFormat API
[ https://issues.apache.org/jira/browse/SPARK-19847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19847: Assignee: Wenchen Fan (was: Apache Spark)
[jira] [Created] (SPARK-19847) port hive read to FileFormat API
Wenchen Fan created SPARK-19847: Summary: port hive read to FileFormat API Key: SPARK-19847 URL: https://issues.apache.org/jira/browse/SPARK-19847 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Resolved] (SPARK-19818) rbind should check for name consistency of input data frames
[ https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19818. Resolution: Fixed Assignee: Wayne Zhang Fix Version/s: 2.2.0 Target Version/s: 2.2.0
> rbind should check for name consistency of input data frames
> Key: SPARK-19818
> URL: https://issues.apache.org/jira/browse/SPARK-19818
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.1.0
> Reporter: Wayne Zhang
> Assignee: Wayne Zhang
> Priority: Minor
> Labels: releasenotes
> Fix For: 2.2.0
>
> The current implementation accepts data frames with different schemas. See
> the issue below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}
[jira] [Updated] (SPARK-19818) rbind should check for name consistency of input data frames
[ https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19818: Labels: releasenotes (was: )
[jira] [Updated] (SPARK-19818) rbind should check for name consistency of input data frames
[ https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19818: Summary: rbind should check for name consistency of input data frames (was: SparkR union should check for name consistency of input data frames)
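The check the issue asks for boils down to comparing the column names of every input against the first before unioning by position. The sketch below is a hypothetical helper in Scala (the actual fix lives in SparkR's rbind); the function name and message wording are illustrative.

```scala
// Hypothetical name-consistency check, illustrating the rbind fix above.
// Assumes at least one schema is supplied.
def checkColumnNames(schemas: Seq[Seq[String]]): Unit = {
  val first = schemas.head
  schemas.zipWithIndex.drop(1).foreach { case (cols, i) =>
    // Positional union silently mismatches columns, so names (and order)
    // must agree exactly across all inputs.
    require(cols == first,
      s"Names of input data frames are different: frame $i has " +
      s"(${cols.mkString(", ")}) but expected (${first.mkString(", ")})")
  }
}
```

Without this check, `union(df, df[, c(2, 1)])` matches `name` against `age` by position, producing the scrambled row 4 shown in the example above.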
[jira] [Assigned] (SPARK-19350) Cardinality estimation of Limit and Sample
[ https://issues.apache.org/jira/browse/SPARK-19350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-19350: Assignee: Zhenhua Wang
> Cardinality estimation of Limit and Sample
> Key: SPARK-19350
> URL: https://issues.apache.org/jira/browse/SPARK-19350
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Zhenhua Wang
> Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
> Currently, LocalLimit/GlobalLimit/Sample propagate the same row count and
> column stats from the child, which is incorrect.
> We can get the correct rowCount in Statistics for Limit/Sample whether CBO
> is enabled or not. And column stats should not be propagated, because we
> don't know the distribution of columns after Limit or Sample.
[jira] [Resolved] (SPARK-19350) Cardinality estimation of Limit and Sample
[ https://issues.apache.org/jira/browse/SPARK-19350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19350. Resolution: Fixed Fix Version/s: 2.2.0
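The corrected estimation described above can be sketched in a few lines: Limit caps the child's row count at the limit, Sample scales it by the fraction, and neither propagates column stats. This is a hedged toy model (the `Stats` case class and function names are invented for illustration, not Spark's `Statistics` API).

```scala
// Toy model of the corrected cardinality estimation (not Spark's actual code).
case class Stats(rowCount: Option[BigInt], columnStats: Map[String, String])

def limitStats(child: Stats, limit: Long): Stats =
  Stats(
    // Limit never returns more rows than `limit`; if the child's count is
    // unknown, `limit` itself is a valid upper bound.
    Some(child.rowCount.map(_.min(BigInt(limit))).getOrElse(BigInt(limit))),
    Map.empty) // column distribution after Limit is unknown: drop column stats

def sampleStats(child: Stats, fraction: Double): Stats =
  Stats(
    // Sample keeps roughly `fraction` of the child's rows.
    child.rowCount.map(rc => BigInt((rc.toDouble * fraction).toLong)),
    Map.empty) // likewise, don't propagate column stats
```

Propagating the child's row count unchanged, as before the fix, would make a `LIMIT 10` over a billion-row table look like a billion-row plan to the optimizer.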
[jira] [Assigned] (SPARK-19846) Add a flag to disable constraint propagation
[ https://issues.apache.org/jira/browse/SPARK-19846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19846: Assignee: (was: Apache Spark)
> Add a flag to disable constraint propagation
> Key: SPARK-19846
> URL: https://issues.apache.org/jira/browse/SPARK-19846
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Liang-Chi Hsieh
>
> Constraint propagation can be computationally expensive and can block the
> driver execution for a long time. For example, the benchmark below takes
> about 30 minutes. Compared with other attempts to modify how constraint
> propagation works, this is a much simpler option: add a flag to disable
> constraint propagation.
> {code}
> import org.apache.spark.ml.{Pipeline, PipelineStage}
> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
> spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)
> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
> val indexers = df.columns.tail.map(c => new StringIndexer()
>   .setInputCol(c)
>   .setOutputCol(s"${c}_indexed")
>   .setHandleInvalid("skip"))
> val encoders = indexers.map(indexer => new OneHotEncoder()
>   .setInputCol(indexer.getOutputCol)
>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>   .setDropLast(true))
> val stages: Array[PipelineStage] = indexers ++ encoders
> val pipeline = new Pipeline().setStages(stages)
> val startTime = System.nanoTime
> pipeline.fit(df).transform(df).show
> val runningTime = System.nanoTime - startTime
> {code}
[jira] [Assigned] (SPARK-19846) Add a flag to disable constraint propagation
[ https://issues.apache.org/jira/browse/SPARK-19846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19846: Assignee: Apache Spark
[jira] [Commented] (SPARK-19846) Add a flag to disable constraint propagation
[ https://issues.apache.org/jira/browse/SPARK-19846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898774#comment-15898774 ] Apache Spark commented on SPARK-19846: User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/17186
[jira] [Created] (SPARK-19846) Add a flag to disable constraint propagation
Liang-Chi Hsieh created SPARK-19846: Summary: Add a flag to disable constraint propagation Key: SPARK-19846 URL: https://issues.apache.org/jira/browse/SPARK-19846 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Liang-Chi Hsieh
[jira] [Commented] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898741#comment-15898741 ] Apache Spark commented on SPARK-19602: -- User 'skambha' has created a pull request for this issue: https://github.com/apache/spark/pull/17185
> Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.pdf
>
> 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2, Oracle, the notion is of ..
> 2) Another scenario where the fully qualified name is useful:
> // current database is db1
> select t1.i1 from t1, db2.t1
> If the i1 column exists in both db1.t1 and db2.t1, this throws an error during column resolution in the analyzer, because the reference is ambiguous.
> Now suppose the user intended to retrieve i1 from db1.t1, but only db2.t1 actually has an i1 column. The query would then silently succeed against the wrong table instead of throwing an error.
> One way to avoid the confusion is to explicitly use the fully qualified name db1.t1.i1, e.g.: select db1.t1.i1 from t1, db2.t1
> Workaround: these issues can be worked around by using an alias.
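The two scenarios come down to how a multi-part name is matched against the known attributes. A minimal toy resolver (hypothetical types and names, not Spark's analyzer code) illustrates both cases:

```scala
// Toy column resolution: a reference of 1, 2, or 3 parts is matched
// against the attributes visible in the FROM clause.
case class Attribute(db: Option[String], table: String, column: String)

def resolve(parts: Seq[String], attrs: Seq[Attribute]): Either[String, Attribute] = {
  val candidates = parts match {
    case Seq(col)         => attrs.filter(_.column == col)
    case Seq(t, col)      => attrs.filter(a => a.table == t && a.column == col)
    case Seq(db, t, col)  => attrs.filter(a => a.db.contains(db) && a.table == t && a.column == col)
    case _                => Seq.empty
  }
  candidates match {
    case Seq(one) => Right(one)                                        // unique match
    case Seq()    => Left(s"cannot resolve ${parts.mkString(".")}")    // no match
    case _        => Left(s"ambiguous reference ${parts.mkString(".")}") // several matches
  }
}
```

With attributes for db1.t1.i1 and db2.t1.i1 in scope, the two-part name t1.i1 is ambiguous, while the three-part name db1.t1.i1 resolves uniquely, which is exactly the behavior the JIRA asks for.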
[jira] [Updated] (SPARK-19831) Sending heartbeats from the worker to the master may be blocked by other RPC messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hustfxj updated SPARK-19831:
Description: Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it: 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*; 2. The worker had better not handle the heartbeat in the *receive* method, because any other RPC message may block *receive* and the heartbeat would then not go out. It had better send the heartbeat to the master from an asynchronous timer thread.
was: Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it: 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*; 2. The worker had better handle the heartbeat in the *receive* method; but any other RPC message may block *receive*, and the heartbeat would then not go out. It had better send the heartbeat to the master from an asynchronous timer thread.
> Sending heartbeats from the worker to the master may be blocked by other RPC messages
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: hustfxj
> Priority: Minor
>
> Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it:
> 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*;
> 2. The worker had better not handle the heartbeat in the *receive* method, because any other RPC message may block *receive* and the heartbeat would then not go out. It had better send the heartbeat to the master from an asynchronous timer thread.
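Suggestion 2, driving heartbeats from a dedicated timer instead of the RPC receive loop, can be sketched as follows (a hypothetical `HeartbeatSender`, not the actual Worker code):

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch: heartbeats fire from their own scheduler thread, so a slow
// message handler (e.g. application cleanup) cannot delay them.
class HeartbeatSender(sendToMaster: () => Unit, intervalMs: Long) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit =
    scheduler.scheduleAtFixedRate(
      new Runnable { def run(): Unit = sendToMaster() },
      0L, intervalMs, TimeUnit.MILLISECONDS)

  def stop(): Unit = scheduler.shutdownNow()
}
```

Even if the endpoint's receive loop spends seconds inside an *ApplicationFinished* handler, the timer thread keeps sending heartbeats on schedule.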
[jira] [Updated] (SPARK-19831) Sending heartbeats from the worker to the master may be blocked by other RPC messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hustfxj updated SPARK-19831:
Description: Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it: 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*; 2. The worker had better not handle the heartbeat in the *receive* method, because any other RPC message may block *receive* and the heartbeat would then not go out in a timely manner. It had better send the heartbeat to the master from an asynchronous timer thread.
was: Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it: 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*; 2. The worker had better not handle the heartbeat in the *receive* method, because any other RPC message may block *receive* and the heartbeat would then not go out. It had better send the heartbeat to the master from an asynchronous timer thread.
> Sending heartbeats from the worker to the master may be blocked by other RPC messages
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: hustfxj
> Priority: Minor
>
> Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it:
> 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*;
> 2. The worker had better not handle the heartbeat in the *receive* method, because any other RPC message may block *receive* and the heartbeat would then not go out in a timely manner. It had better send the heartbeat to the master from an asynchronous timer thread.
[jira] [Updated] (SPARK-19831) Sending heartbeats from the worker to the master may be blocked by other RPC messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hustfxj updated SPARK-19831:
Description: Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it: 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*; 2. The worker had better handle the heartbeat in the *receive* method; but any other RPC message may block *receive*, and the heartbeat would then not go out. It had better send the heartbeat to the master from an asynchronous timer thread.
was: Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it: 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*; 2. The worker had better not send the heartbeat from the *receive* method, because any other RPC message may block *receive* and the heartbeat would then not go out. It had better send the heartbeat to the master from an asynchronous timer thread.
> Sending heartbeats from the worker to the master may be blocked by other RPC messages
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: hustfxj
> Priority: Minor
>
> Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it:
> 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*;
> 2. The worker had better handle the heartbeat in the *receive* method; but any other RPC message may block *receive*, and the heartbeat would then not go out. It had better send the heartbeat to the master from an asynchronous timer thread.
[jira] [Updated] (SPARK-19831) Sending heartbeats from the worker to the master may be blocked by other RPC messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hustfxj updated SPARK-19831:
Description: Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it: 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*; 2. The worker had better not send the heartbeat from the *receive* method, because any other RPC message may block *receive* and the heartbeat would then not go out. It had better send the heartbeat to the master from an asynchronous timer thread.
was: Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it: 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*; 2. The worker had better not send the heartbeat over the RPC channel, because any other RPC message may block the channel. It had better send the heartbeat to the master from an asynchronous timer thread.
> Sending heartbeats from the worker to the master may be blocked by other RPC messages
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: hustfxj
> Priority: Minor
>
> Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it:
> 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*;
> 2. The worker had better not send the heartbeat from the *receive* method, because any other RPC message may block *receive* and the heartbeat would then not go out. It had better send the heartbeat to the master from an asynchronous timer thread.
[jira] [Comment Edited] (SPARK-19831) Sending heartbeats from the worker to the master may be blocked by other RPC messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898704#comment-15898704 ] hustfxj edited comment on SPARK-19831 at 3/7/17 4:15 AM: - [~zsxwing] So far, the only slow code I have found is the handler for the *ApplicationFinished* message, so I also think that code should run in a separate thread. I will submit a PR that moves it into a separate thread.
was (Author: hustfxj): [~zsxwing] So far, the only slow code I have found is the handler for the *ApplicationFinished* message, so I also think that code should run in a separate thread.
> Sending heartbeats from the worker to the master may be blocked by other RPC messages
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: hustfxj
> Priority: Minor
>
> Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it:
> 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*;
> 2. The worker had better not send the heartbeat over the RPC channel, because any other RPC message may block the channel. It had better send the heartbeat to the master from an asynchronous timer thread.
[jira] [Commented] (SPARK-19829) Driver logs should support rolling like executor logs
[ https://issues.apache.org/jira/browse/SPARK-19829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898713#comment-15898713 ] hustfxj commented on SPARK-19829: - [~sowen] Yes, this is handled by systems like YARN, but my Spark cluster is standalone. A standalone cluster should roll the driver log by default, rather than relying on a user-defined log4j configuration.
> Driver logs should support rolling like executor logs
>
> Key: SPARK-19829
> URL: https://issues.apache.org/jira/browse/SPARK-19829
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: hustfxj
> Priority: Minor
>
> We should roll the driver log, or it may grow very large.
> {code:title=DriverRunner.scala|borderStyle=solid}
> // modified runDriver
> private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
>   builder.directory(baseDir)
>   def initialize(process: Process): Unit = {
>     // Old code: redirect stdout and stderr straight to files
>     // val stdout = new File(baseDir, "stdout")
>     // CommandUtils.redirectStream(process.getInputStream, stdout)
>     // val stderr = new File(baseDir, "stderr")
>     // val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
>     // val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
>     // Files.append(header, stderr, StandardCharsets.UTF_8)
>     // CommandUtils.redirectStream(process.getErrorStream, stderr)
>
>     // New code: redirect stdout and stderr through FileAppender, which supports rolling
>     val stdout = new File(baseDir, "stdout")
>     stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
>     val stderr = new File(baseDir, "stderr")
>     val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
>     val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
>     Files.append(header, stderr, StandardCharsets.UTF_8)
>     stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
>   }
>   runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
> }
> {code}
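The mechanism the patch relies on, a size-bounded rolling appender, can be illustrated with a stripped-down toy (my own sketch, not Spark's actual RollingFileAppender):

```scala
import java.io.{File, FileOutputStream}

// Toy size-based log rolling: when the active file would exceed maxBytes,
// rotate it to base.1 (shifting older files up), then start a fresh file.
class RollingWriter(base: File, maxBytes: Long, maxFiles: Int) {
  private var out = new FileOutputStream(base, true)

  def write(line: String): Unit = {
    val bytes = (line + "\n").getBytes("UTF-8")
    if (base.length() + bytes.length > maxBytes) roll()
    out.write(bytes)
    out.flush()
  }

  private def roll(): Unit = {
    out.close()
    // shift base.1 -> base.2, ..., keeping at most maxFiles rolled files
    for (i <- (maxFiles - 1) to 1 by -1) {
      val from = new File(base.getPath + "." + i)
      if (from.exists()) from.renameTo(new File(base.getPath + "." + (i + 1)))
    }
    base.renameTo(new File(base.getPath + ".1"))
    out = new FileOutputStream(base, true)
  }
}
```

With this scheme the driver's stdout/stderr stay bounded to roughly maxBytes * (maxFiles + 1) on disk, instead of growing without limit.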
[jira] [Commented] (SPARK-19831) Sending heartbeats from the worker to the master may be blocked by other RPC messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898704#comment-15898704 ] hustfxj commented on SPARK-19831: - [~zsxwing] So far, the only slow code I have found is the handler for the *ApplicationFinished* message, so I also think that code should run in a separate thread.
> Sending heartbeats from the worker to the master may be blocked by other RPC messages
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: hustfxj
> Priority: Minor
>
> Cleaning up an application can take a long time on the worker, and it can block the worker's heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. So I think this is a bug in Spark. Two suggestions that may solve it:
> 1. Application cleanup had better run in a separate asynchronous thread such as 'cleanupThreadExecutor', so it won't block other RPC messages like *SendHeartbeat*;
> 2. The worker had better not send the heartbeat over the RPC channel, because any other RPC message may block the channel. It had better send the heartbeat to the master from an asynchronous timer thread.
[jira] [Assigned] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs
[ https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19843: Assignee: Apache Spark
> UTF8String => (int / long) conversion expensive for invalid inputs
>
> Key: SPARK-19843
> URL: https://issues.apache.org/jira/browse/SPARK-19843
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Tejas Patil
> Assignee: Apache Spark
>
> For invalid inputs, converting a UTF8String to int or long returns null. This comes at a cost: the conversion method (e.g. [0]) throws an exception internally. Exception handling is expensive, since it converts the UTF8String into a Java string and populates the stack trace (a native call). While migrating workloads from Hive to Spark, I see that in aggregate this hurts query performance compared with Hive. The exception only indicates that the conversion failed; it is not propagated to users, so it would be good to avoid it.
> A couple of options:
> - Return Integer / Long (instead of primitive types), set to NULL when the conversion fails. This boxes every value and is very bad for performance, so a clear no.
> - Hive has a pre-check [1] for this. It is not a perfect safety net, but it helps catch typical bad inputs, e.g. the empty string or "null".
> [0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
[jira] [Commented] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs
[ https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898632#comment-15898632 ] Apache Spark commented on SPARK-19843: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/17184
> UTF8String => (int / long) conversion expensive for invalid inputs
>
> Key: SPARK-19843
> URL: https://issues.apache.org/jira/browse/SPARK-19843
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Tejas Patil
>
> For invalid inputs, converting a UTF8String to int or long returns null. This comes at a cost: the conversion method (e.g. [0]) throws an exception internally. Exception handling is expensive, since it converts the UTF8String into a Java string and populates the stack trace (a native call). While migrating workloads from Hive to Spark, I see that in aggregate this hurts query performance compared with Hive. The exception only indicates that the conversion failed; it is not propagated to users, so it would be good to avoid it.
> A couple of options:
> - Return Integer / Long (instead of primitive types), set to NULL when the conversion fails. This boxes every value and is very bad for performance, so a clear no.
> - Hive has a pre-check [1] for this. It is not a perfect safety net, but it helps catch typical bad inputs, e.g. the empty string or "null".
> [0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
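The pre-check idea referenced as [1], cheaply rejecting obviously invalid strings before running an exception-throwing parse, looks roughly like this (an illustrative sketch, not the UTF8String or LazyUtils code):

```scala
// Cheap validity pre-check so that obviously bad inputs (empty string,
// "null", stray characters) never trigger exception handling at all.
def looksLikeLong(s: String): Boolean = {
  if (s == null || s.isEmpty) return false
  val start = if (s.charAt(0) == '-' || s.charAt(0) == '+') 1 else 0
  start < s.length && s.drop(start).forall(_.isDigit)
}

def toLongSafe(s: String): Option[Long] =
  if (looksLikeLong(s))
    // Overflow can still only be detected by parsing, so that rare case
    // still pays for an exception; the common garbage inputs do not.
    try Some(s.toLong) catch { case _: NumberFormatException => None }
  else None
```

Typical garbage ("", "null", "12a") short-circuits at the pre-check; only genuine overflows still reach the try/catch, which matches the "not a perfect safety net but helpful" characterization above.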
[jira] [Assigned] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs
[ https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19843: Assignee: (was: Apache Spark)
> UTF8String => (int / long) conversion expensive for invalid inputs
>
> Key: SPARK-19843
> URL: https://issues.apache.org/jira/browse/SPARK-19843
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Tejas Patil
>
> For invalid inputs, converting a UTF8String to int or long returns null. This comes at a cost: the conversion method (e.g. [0]) throws an exception internally. Exception handling is expensive, since it converts the UTF8String into a Java string and populates the stack trace (a native call). While migrating workloads from Hive to Spark, I see that in aggregate this hurts query performance compared with Hive. The exception only indicates that the conversion failed; it is not propagated to users, so it would be good to avoid it.
> A couple of options:
> - Return Integer / Long (instead of primitive types), set to NULL when the conversion fails. This boxes every value and is very bad for performance, so a clear no.
> - Hive has a pre-check [1] for this. It is not a perfect safety net, but it helps catch typical bad inputs, e.g. the empty string or "null".
> [0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
[jira] [Comment Edited] (SPARK-19845) Failed to uncache a datasource table after the table location is altered
[ https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898626#comment-15898626 ] Song Jun edited comment on SPARK-19845 at 3/7/17 2:33 AM: -- Yes, that is what https://issues.apache.org/jira/browse/SPARK-19784 is doing. Once the location has changed, it becomes more complex to uncache the table and recache the other tables that reference it. I will dig into it more.
was (Author: windpiger): Yes, that is what https://issues.apache.org/jira/browse/SPARK-19784 is doing. Once the location has changed, it becomes more complex to uncache the table and recache the other tables that reference it.
> Failed to uncache a datasource table after the table location is altered
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Song Jun
>
> Currently, if we first cache a datasource table, then alter the table location, and then drop the table, uncaching the table fails in DropTableCommand: because the location has changed, sameResult for two InMemoryFileIndex instances with different locations returns false, so the table's key cannot be found in the cache.
[jira] [Updated] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DjvuLee updated SPARK-18085: Yes, you're right. I just want to impact as little as possible.
> Better History Server scalability for many / large applications
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, Web UI
> Affects Versions: 2.0.0
> Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
> It's a known fact that the History Server currently has some annoying issues when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues.
> I'll be attaching a document shortly describing the issues and suggesting a path to how to solve them.
[jira] [Commented] (SPARK-19845) Failed to uncache a datasource table after the table location is altered
[ https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898626#comment-15898626 ] Song Jun commented on SPARK-19845: -- Yes, that is what https://issues.apache.org/jira/browse/SPARK-19784 is doing. Once the location has changed, it becomes more complex to uncache the table and recache the other tables that reference it.
> Failed to uncache a datasource table after the table location is altered
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Song Jun
>
> Currently, if we first cache a datasource table, then alter the table location, and then drop the table, uncaching the table fails in DropTableCommand: because the location has changed, sameResult for two InMemoryFileIndex instances with different locations returns false, so the table's key cannot be found in the cache.
[jira] [Commented] (SPARK-19845) Failed to uncache a datasource table after the table location is altered
[ https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898591#comment-15898591 ] Wenchen Fan commented on SPARK-19845: - In this case, I think we have to uncache the table when we alter its location.
> Failed to uncache a datasource table after the table location is altered
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Song Jun
>
> Currently, if we first cache a datasource table, then alter the table location, and then drop the table, uncaching the table fails in DropTableCommand: because the location has changed, sameResult for two InMemoryFileIndex instances with different locations returns false, so the table's key cannot be found in the cache.
[jira] [Created] (SPARK-19845) failed to uncache datasource table after the table location altered
Song Jun created SPARK-19845: Summary: failed to uncache datasource table after the table location altered Key: SPARK-19845 URL: https://issues.apache.org/jira/browse/SPARK-19845 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Currently, if we first cache a datasource table, then alter the table location, and then drop the table, uncaching the table will fail in DropTableCommand, because the location has changed and sameResult for two InMemoryFileIndex instances with different locations returns false, so we can't find the table key in the cache.
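The failure mode is easy to picture with a toy model in Python (purely illustrative -- the dict-as-cache and the key layout are assumptions for the sketch, not Spark's actual CacheManager): the cached entry is keyed by something that includes the table's location, so once the location is altered the uncache lookup misses.

```python
# Illustrative model only: the cached-plan lookup behaves like a map keyed
# partly by the table's file location.
cache = {}

def cache_table(name, location, data):
    cache[(name, location)] = data

def uncache_table(name, location):
    # Lookup uses the *current* location; if the location was altered after
    # caching, the original key can no longer be found.
    return cache.pop((name, location), None)

cache_table("t", "/data/old", ["row1"])
# ALTER TABLE t SET LOCATION '/data/new' happens here...
assert uncache_table("t", "/data/new") is None   # key miss: uncache fails
assert uncache_table("t", "/data/old") == ["row1"]
```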
[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ioana Delaney updated SPARK-19842: -- Description: *Informational Referential Integrity Constraints Support in Spark* This work proposes support for _informational primary key_ and _foreign key (referential integrity) constraints_ in Spark. The main purpose is to open up an area of query optimization techniques that rely on referential integrity constraint semantics. An _informational_ or _statistical constraint_ is a constraint such as a _unique_, _primary key_, _foreign key_, or _check constraint_ that can be used by Spark to improve query performance. Informational constraints are not enforced by the Spark SQL engine; rather, they are used by Catalyst to optimize query processing. They provide semantic information that allows Catalyst to rewrite queries to eliminate joins, push down aggregates, remove unnecessary Distinct operations, and perform a number of other optimizations. Informational constraints are primarily targeted at applications that load and analyze data that originated from a data warehouse. For such applications, the conditions for a given constraint are known to be true, so the constraint does not need to be enforced during data load operations. The attached document covers constraint definition, metastore storage, constraint validation, and maintenance. The document shows many examples of query performance improvements that utilize referential integrity constraints and can be implemented in Spark. Link to the google doc: [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit] > Informational Referential Integrity Constraints Support in Spark > > > Key: SPARK-19842 > URL: https://issues.apache.org/jira/browse/SPARK-19842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney > Attachments: InformationalRIConstraints.doc
[jira] [Created] (SPARK-19844) UDF in when control function is executed before the when clause is evaluated.
Franklyn Dsouza created SPARK-19844: --- Summary: UDF in when control function is executed before the when clause is evaluated. Key: SPARK-19844 URL: https://issues.apache.org/jira/browse/SPARK-19844 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.1.0, 2.0.1 Reporter: Franklyn Dsouza Sometimes we try to filter out the argument to a UDF using {code}when(clause, udf).otherwise(default){code}, but we've noticed that sometimes the UDF is run on data that shouldn't have matched the clause. Here's some code to reproduce the issue: {code} from pyspark.sql import functions as F from pyspark.sql import types df = spark.createDataFrame([{'r': None}], schema=types.StructType([types.StructField('r', types.StringType())])) simple_udf = F.udf(lambda ref: ref.strip("/"), types.StringType()) df.withColumn('test', F.when(F.col("r").isNotNull(), simple_udf(F.col("r"))) .otherwise(F.lit(None)) ).collect() {code} This raises an exception because the UDF is run on null data: AttributeError: 'NoneType' object has no attribute 'strip'. So it looks like the UDF is evaluated before the clause in the when is evaluated. Oddly enough, when I change {code}F.col("r").isNotNull(){code} to {code}df["r"] != None{code} it works. Might be related to https://issues.apache.org/jira/browse/SPARK-13773 and https://issues.apache.org/jira/browse/SPARK-15282
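Until the evaluation-order issue is fixed, the usual workaround is to make the UDF body itself null-safe instead of relying on {{when}} to guard it. Stripped of the PySpark wrapping, the idea in plain Python (the helper name is made up for the sketch):

```python
# Null-safe variant of the lambda from the reproduction above: it no longer
# assumes `when` filtered out None before the UDF ran.
def strip_slashes(ref):
    return ref.strip("/") if ref is not None else None

assert strip_slashes(None) is None       # no AttributeError on null rows
assert strip_slashes("/a/b/") == "a/b"
```

Wrapped with {{F.udf(strip_slashes, types.StringType())}}, this returns null for null input regardless of when the predicate is evaluated.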
[jira] [Resolved] (SPARK-19719) Structured Streaming write to Kafka
[ https://issues.apache.org/jira/browse/SPARK-19719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-19719. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17043 [https://github.com/apache/spark/pull/17043] > Structured Streaming write to Kafka > --- > > Key: SPARK-19719 > URL: https://issues.apache.org/jira/browse/SPARK-19719 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tyson Condie > Fix For: 2.2.0 > > > This issue deals with writing to Apache Kafka for both streaming and batch > queries. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs
Tejas Patil created SPARK-19843: --- Summary: UTF8String => (int / long) conversion expensive for invalid inputs Key: SPARK-19843 URL: https://issues.apache.org/jira/browse/SPARK-19843 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Tejas Patil In case of invalid inputs, converting a UTF8String to int or long returns null. This comes at a cost wherein the method for conversion (e.g [0]) would throw an exception. Exception handling is expensive as it will convert the UTF8String into a java string, populate the stack trace (which is a native call). While migrating workloads from Hive -> Spark, I see that this at an aggregate level affects the performance of queries in comparison with hive. The exception is just indicating that the conversion failed.. its not propagated to users so it would be good to avoid. Couple of options: - Return Integer / Long (instead of primitive types) which can be set to NULL if the conversion fails. This is boxing and super bad for perf so a big no. - Hive has a pre-check [1] for this which is not a perfect safety net but helpful to capture typical bad inputs eg. empty string, "null". [0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950 [1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
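The pre-check idea from Hive's LazyUtils can be sketched in Python (illustrative only; Spark's actual conversion lives in UTF8String's Java code, and the function name here is made up): reject typical bad inputs cheaply up front so that invalid rows never pay the cost of raising and handling an exception.

```python
def to_int_or_none(s):
    # Cheap pre-check in the spirit of Hive's LazyUtils: catch typical bad
    # inputs (empty string, "null", stray characters) without paying for an
    # exception on every invalid row.
    if not s:
        return None
    body = s[1:] if s[0] in "+-" else s
    if not body or not body.isdigit():
        return None
    return int(s)

assert to_int_or_none("123") == 123
assert to_int_or_none("-42") == -42
assert to_int_or_none("") is None
assert to_int_or_none("null") is None
```

As noted above for Hive, such a pre-check is not a perfect safety net (it does not handle every edge case, e.g. overflow in the fixed-width Java types), but it captures the common bad inputs without boxing the result.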
[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ioana Delaney updated SPARK-19842: -- Target Version/s: (was: 2.2.0)
[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ioana Delaney updated SPARK-19842: -- Attachment: InformationalRIConstraints.doc
[jira] [Created] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
Ioana Delaney created SPARK-19842: - Summary: Informational Referential Integrity Constraints Support in Spark Key: SPARK-19842 URL: https://issues.apache.org/jira/browse/SPARK-19842 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Ioana Delaney *Informational Referential Integrity Constraints Support in Spark* This work proposes support for _informational primary key_ and _foreign key (referential integrity) constraints_ in Spark. The main purpose is to open up an area of query optimization techniques that rely on referential integrity constraints semantics. An _informational_ or _statistical constraint_ is a constraint such as a _unique_, _primary key_, _foreign key_, or _check constraint_, that can be used by Spark to improve query performance. Informational constraints are not enforced by the Spark SQL engine; rather, they are used by Catalyst to optimize the query processing. They provide semantics information that allows Catalyst to rewrite queries to eliminate joins, push down aggregates, remove unnecessary Distinct operations, and perform a number of other optimizations. Informational constraints are primarily targeted to applications that load and analyze data that originated from a data warehouse. For such applications, the conditions for a given constraint are known to be true, so the constraint does not need to be enforced during data load operations. The attached document covers constraint definition, metastore storage, constraint validation, and maintenance. The document shows many examples of query performance improvements that utilize referential integrity constraints and can be implemented in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
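To make the join-elimination idea concrete, here is a small demonstration using Python's sqlite3 purely as a stand-in for Catalyst (the tables and data are invented): when every fact row's foreign key is known to match exactly one dimension row, an inner join that reads only fact columns returns exactly the fact table, so an optimizer that trusts the informational constraint may drop the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE dim (pk INTEGER PRIMARY KEY)")
c.execute("CREATE TABLE fact (fk INTEGER, amount INTEGER)")
c.executemany("INSERT INTO dim VALUES (?)", [(1,), (2,)])
c.executemany("INSERT INTO fact VALUES (?, ?)", [(1, 10), (2, 20), (1, 30)])

# With a trusted FK (every fact.fk matches exactly one dim.pk), the join is
# a no-op for queries that touch only fact's columns:
joined = c.execute(
    "SELECT f.amount FROM fact f JOIN dim d ON f.fk = d.pk ORDER BY f.amount"
).fetchall()
direct = c.execute("SELECT amount FROM fact ORDER BY amount").fetchall()
assert joined == direct == [(10,), (20,), (30,)]
```

An informational constraint supplies exactly this "known to match" guarantee without the engine enforcing it at load time.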
[jira] [Assigned] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys
[ https://issues.apache.org/jira/browse/SPARK-19841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19841: Assignee: Shixiong Zhu (was: Apache Spark) > StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys > > > Key: SPARK-19841 > URL: https://issues.apache.org/jira/browse/SPARK-19841 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now it just uses the rows to filter but a column position in > keyExpressions may be different than the position in row. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys
[ https://issues.apache.org/jira/browse/SPARK-19841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19841: Assignee: Apache Spark (was: Shixiong Zhu)
[jira] [Commented] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys
[ https://issues.apache.org/jira/browse/SPARK-19841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898446#comment-15898446 ] Apache Spark commented on SPARK-19841: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17183
[jira] [Created] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys
Shixiong Zhu created SPARK-19841: Summary: StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys Key: SPARK-19841 URL: https://issues.apache.org/jira/browse/SPARK-19841 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now it just uses the rows to filter, but a column's position in keyExpressions may be different from its position in the row.
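The gist of the fix, sketched in plain Python with invented data: build the key for each row from the key expressions' positions, and evaluate the watermark predicate against that key layout rather than against the raw row.

```python
# Rows are tuples; key expressions reference columns by position in the row.
rows = [("a", 5, 100), ("b", 3, 200)]   # (id, value, event_time)
key_positions = [2, 0]                   # key = (event_time, id); note the
                                         # order differs from the row layout

def watermark_ok(key, watermark=150):
    # Predicate defined over the *key* layout: key[0] is event_time here.
    # Applying it to the raw row would wrongly test row[0], the id column.
    return key[0] > watermark

keys = [tuple(row[p] for p in key_positions) for row in rows]
kept = [row for row, key in zip(rows, keys) if watermark_ok(key)]
assert kept == [("b", 3, 200)]
```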
[jira] [Commented] (SPARK-19840) Disallow creating permanent functions with invalid class names
[ https://issues.apache.org/jira/browse/SPARK-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898402#comment-15898402 ] Apache Spark commented on SPARK-19840: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/17182 > Disallow creating permanent functions with invalid class names > -- > > Key: SPARK-19840 > URL: https://issues.apache.org/jira/browse/SPARK-19840 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun > > Currently, Spark raises exceptions on creating invalid **temporary** > functions, but doesn't for **permanent** functions. This issue aims to > disallow creating permanent functions with invalid class names. > **BEFORE** > {code} > scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS > 'org.invalid'").show > java.lang.ClassNotFoundException: org.invalid at > ... > scala> sql("CREATE FUNCTION function_with_invalid_classname AS > 'org.invalid'").show > ++ > || > ++ > ++ > {code} > **AFTER** > {code} > scala> sql("CREATE FUNCTION function_with_invalid_classname AS > 'org.invalid'").show > java.lang.ClassNotFoundException: org.invalid > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19840) Disallow creating permanent functions with invalid class names
[ https://issues.apache.org/jira/browse/SPARK-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19840: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-19840) Disallow creating permanent functions with invalid class names
[ https://issues.apache.org/jira/browse/SPARK-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19840: Assignee: Apache Spark
[jira] [Created] (SPARK-19840) Disallow creating permanent functions with invalid class names
Dongjoon Hyun created SPARK-19840: - Summary: Disallow creating permanent functions with invalid class names Key: SPARK-19840 URL: https://issues.apache.org/jira/browse/SPARK-19840 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Dongjoon Hyun Currently, Spark raises exceptions on creating invalid **temporary** functions, but doesn't for **permanent** functions. This issue aims to disallow creating permanent functions with invalid class names. **BEFORE** {code} scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid at ... scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show ++ || ++ ++ {code} **AFTER** {code} scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
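The same fail-fast pattern can be illustrated in Python, with importlib standing in for the JVM classloader (the {{create_function}} helper is hypothetical): resolve the class at registration time instead of deferring the lookup to first use, so an invalid class name fails at CREATE FUNCTION rather than silently succeeding.

```python
import importlib

def create_function(class_name):
    # Fail fast: resolve the class before registering the function, instead
    # of deferring the ClassNotFound-style failure to first invocation.
    module, _, cls = class_name.rpartition(".")
    try:
        getattr(importlib.import_module(module), cls)
    except (ImportError, AttributeError, ValueError):
        raise ValueError("Cannot load class " + class_name)
    return "registered " + class_name

assert create_function("os.path.join") == "registered os.path.join"
raised = False
try:
    create_function("org.invalid")   # analogous to AS 'org.invalid' above
except ValueError:
    raised = True
assert raised
```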
[jira] [Comment Edited] (SPARK-15522) DataFrame Column Names That are Numbers aren't referenced correctly in SQL
[ https://issues.apache.org/jira/browse/SPARK-15522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897638#comment-15897638 ] Hyukjin Kwon edited comment on SPARK-15522 at 3/6/17 11:26 PM: --- We can use backticks for it as below: {code} scala> Seq(Some(1), None).toDF("82").createOrReplaceTempView("piv_qhID") scala> spark.sql("select * from piv_qhID where '82' is NULL limit 20").show() +---+ | 82| +---+ +---+ scala> spark.sql("select * from piv_qhID where `82` is NULL limit 20").show() +----+ |  82| +----+ |null| +----+ {code} It seems {{'82'}} or {{"82"}} was being treated as a constant. I am resolving this JIRA. Please reopen this if I misunderstood. was (Author: hyukjin.kwon): We can use backticks for it as below: {code} scala> spark.range(10).toDF("82").createOrReplaceTempView("piv_qhID") scala> spark.sql("select * from piv_qhID where `82` is NULL limit 20").show() +---+ | 82| +---+ +---+ {code} It seems {{'82'}} or {{"82"}} was being treated as a constant. I am resolving this JIRA. Please reopen this if I misunderstood. 
> DataFrame Column Names That are Numbers aren't referenced correctly in SQL > -- > > Key: SPARK-15522 > URL: https://issues.apache.org/jira/browse/SPARK-15522 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl > > The following code is run: > val pre_piv_df_a = sqlContext.sql(""" > SELECT > CASE WHEN Gender = 'M' Then 1 ELSE 0 END AS has_male, > CASE WHEN Gender = 'F' Then 1 ELSE 0 END AS has_female, > CAST(StartAge AS Double) AS StartAge_dbl, > CAST(EndAge AS Double) AS EndAge_dbl, > * > FROM alldem_union_curr > """) > .withColumn("JavaStartTimestamp", create_ts($"StartTimestamp")) > .drop("StartTimestamp").withColumnRenamed("JavaStartTimestamp", > "StartTimestamp") > .drop("StartAge").drop("EndAge") > .withColumnRenamed("StartAge_dbl", > "StartAge").withColumnRenamed("EndAge_dbl", "EndAge") > val pre_piv_df_b = pre_piv_df_a > .withColumn("media_month_cc", media_month_cc($"MediaMonth")) > .withColumn("media_week_cc", media_week_sts_cc($"StartTimestamp")) > .withColumn("media_day_cc", media_day_sts_cc($"StartTimestamp")) > .withColumn("week_day", week_day($"StartTimestamp")) > .withColumn("week_end", week_end($"StartTimestamp")) > .join(sqlContext.table("cad_nets"), $"Network" === $"nielsen_network", > "inner") > .withColumnRenamed("cad_network", "norm_net_code_a") > .withColumn("norm_net_code", reCodeNets($"norm_net_code_a")) > pre_piv_df_b.registerTempTable("pre_piv_df") > val piv_qhID_df = pre_piv_df_b.groupBy("Network", "Audience", "StartDate", > "rating_category_cd") > .pivot("qaID").agg("rating" -> "mean") > The pivot creates a lot of columns (96) with names that are like > ‘01’,’02’,…,’96’ as a result of pivoting a table that has quarter hour IDs. > In the below SQL the highlighted section causes problems. If I rename the > columns to ‘col01’,’col02’,…,’col96’ I can run the SQL correctly and get the > expected results. > select * from piv_qhID where 82 is NULL limit 20 > And I am getting no rows even though there are nulls. 
> On the other hand the query: > select * from piv_qhID where 82 is NOT NULL limit 20 > Returns all rows (even those with nulls) > Renaming the columns fixes this, but it would be nice if the columns were > referenced correctly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
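The identifier-versus-literal distinction discussed above is standard SQL behavior, not something Spark-specific. As a minimal sketch (not Spark; Python's stdlib {{sqlite3}}, which also accepts backtick-quoted identifiers), the table and column names below mirror the report but are purely illustrative:

```python
import sqlite3

# '82' is a string literal; a backtick-quoted `82` names the column "82".
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE piv_qhID ("82" INTEGER)')
conn.executemany('INSERT INTO piv_qhID VALUES (?)', [(1,), (None,)])

# The literal '82' is never NULL, so this filter matches no rows at all.
as_literal = conn.execute(
    "SELECT * FROM piv_qhID WHERE '82' IS NULL").fetchall()

# The quoted identifier refers to the column, so the NULL row is found.
as_identifier = conn.execute(
    "SELECT * FROM piv_qhID WHERE `82` IS NULL").fetchall()
```

This is exactly the behavior Hyukjin demonstrates in the Spark shell above: the unquoted/string form silently compares a constant, while backticks resolve the numeric name as a column.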
[jira] [Commented] (SPARK-14423) Handle jar conflict issue when uploading to distributed cache
[ https://issues.apache.org/jira/browse/SPARK-14423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898256#comment-15898256 ] Junping Du commented on SPARK-14423: Thanks [~jerryshao] for reporting this issue. I think YARN should fix this problem as well. If the same jars are added to the distributed cache, it should detect this and fail fast with an informative message; YARN-5306 has already been filed to track this issue. > Handle jar conflict issue when uploading to distributed cache > - > > Key: SPARK-14423 > URL: https://issues.apache.org/jira/browse/SPARK-14423 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.0.0 > > > Currently with the introduction of assembly-free deployment of Spark, by > default yarn#client will upload all the jars in assembly to HDFS staging > folder. If the jars in assembly and specified with \--jars have the same > name, this will introduce an exception while downloading these jars and make the > application fail to run. > Here is the exception when running example with {{run-example}}: > {noformat} > 16/04/06 10:29:48 INFO Client: Application report for > application_1459907402325_0004 (state: FAILED) > 16/04/06 10:29:48 INFO Client: >client token: N/A >diagnostics: Application application_1459907402325_0004 failed 2 times > due to AM Container for appattempt_1459907402325_0004_02 exited with > exitCode: -1000 > For more detailed output, check application tracking > page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, > click on links to logs of each attempt. 
> Diagnostics: Resource > hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar > changed on src filesystem (expected 1459909780508, was 1459909782590 > java.io.IOException: Resource > hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar > changed on src filesystem (expected 1459909780508, was 1459909782590 > at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253) > at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61) > at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359) > at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The problem is that this jar {{avro-mapred-1.7.7-hadoop2.jar}} both existed > in assembly and example folder. > We should handle this situation, since now spark example is failed to run > under yarn mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
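The fail-fast check suggested in the comment above amounts to detecting when two jars with the same file name would land in a single flat staging directory. A hedged sketch (function names are illustrative, not the actual YARN or Spark API):

```python
from collections import Counter
from os.path import basename

def find_conflicting_jars(jar_paths):
    """Return basenames that appear more than once; these would collide
    when uploaded into one flat HDFS staging folder."""
    counts = Counter(basename(p) for p in jar_paths)
    return sorted(name for name, n in counts.items() if n > 1)

def upload_or_fail(jar_paths):
    """Fail fast with an indicating message, in the spirit of YARN-5306,
    instead of letting localization fail later with a timestamp mismatch."""
    conflicts = find_conflicting_jars(jar_paths)
    if conflicts:
        raise ValueError(
            "duplicate jars in distributed cache: " + ", ".join(conflicts))
    return jar_paths

# The reported failure: the same avro jar is in both the assembly and examples.
conflicts = find_conflicting_jars([
    "/assembly/avro-mapred-1.7.7-hadoop2.jar",
    "/examples/avro-mapred-1.7.7-hadoop2.jar",
    "/examples/spark-examples.jar",
])
```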
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898159#comment-15898159 ] Sean Owen commented on SPARK-19767: --- Hm, have you installed all the ruby-based dependencies? I don't know exactly what pulls in liquid. sudo gem install jekyll jekyll-redirect-from pygments.rb > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > lead to API pages that do not have class ConsumerStrategies) . The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka10{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19709) CSV datasource fails to read empty file
[ https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19709: --- Assignee: Wojciech Szymanski > CSV datasource fails to read empty file > --- > > Key: SPARK-19709 > URL: https://issues.apache.org/jira/browse/SPARK-19709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Wojciech Szymanski >Priority: Minor > Fix For: 2.2.0 > > > I just ran {{touch a}} and then ran the code below: > {code} > scala> spark.read.csv("a") > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike. > {code} > It seems we should produce an empty dataframe, consistent with > `spark.read.json("a")`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19709) CSV datasource fails to read empty file
[ https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19709. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17068 [https://github.com/apache/spark/pull/17068] > CSV datasource fails to read empty file > --- > > Key: SPARK-19709 > URL: https://issues.apache.org/jira/browse/SPARK-19709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.2.0 > > > I just ran {{touch a}} and then ran the code below: > {code} > scala> spark.read.csv("a") > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike. > {code} > It seems we should produce an empty dataframe, consistent with > `spark.read.json("a")`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
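The desired behavior from the issue above, an empty input yielding an empty result instead of a {{NoSuchElementException}} from calling {{next}} on an empty iterator, can be sketched with Python's stdlib csv module (not the actual Spark fix, just the same guard pattern):

```python
import csv
import io

def read_csv_rows(text):
    """Return (header, rows). An empty input yields an empty schema and no
    rows, instead of raising when the first line is requested."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)  # guarded: None instead of StopIteration
    if header is None:
        return [], []
    return header, list(reader)

empty_result = read_csv_rows("")          # like reading a touched, empty file
normal_result = read_csv_rows("a,b\n1,2\n")
```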
[jira] [Commented] (SPARK-19094) Plumb through logging/error messages from the JVM to Jupyter PySpark
[ https://issues.apache.org/jira/browse/SPARK-19094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898114#comment-15898114 ] Kyle Kelley commented on SPARK-19094: - Super interested in this, as it's been confusing for our users. I've thought about making an alternate endpoint for a kernel to get logs out of, but it would be much better to re-route these logs so that the Python kernel can handle them directly. > Plumb through logging/error messages from the JVM to Jupyter PySpark > > > Key: SPARK-19094 > URL: https://issues.apache.org/jira/browse/SPARK-19094 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: holdenk >Priority: Trivial > > Jupyter/IPython notebooks work by overriding sys.stdout & sys.stderr; as > such, the error messages that show up in Jupyter/IPython are often missing > the related logs - which are often more useful than the exception itself. > This could make it easier for Python developers getting started with Spark on > their local laptops to debug their applications, since otherwise they need to > remember to keep going to the terminal where they launched the notebook from. > One counterpoint to this is that Spark's logging is fairly verbose, but since > we provide the ability for the user to tune the log messages from within the > notebook that should be OK. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
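The re-routing idea in the comment above boils down to: capture the JVM's output stream and re-emit it through Python's own {{sys.stdout}}, which the notebook kernel has already overridden. A minimal sketch using a subprocess as a stand-in for the JVM (the helper name is hypothetical, not a Spark or Jupyter API):

```python
import subprocess
import sys

def run_and_forward(cmd):
    """Run `cmd`, forwarding its stderr lines through Python's own stdout.
    Because Jupyter replaces sys.stdout, lines forwarded this way appear in
    the notebook cell output rather than the launching terminal."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    forwarded = []
    for line in proc.stderr.splitlines():
        forwarded.append(line)
        sys.stdout.write(line + "\n")  # lands wherever the kernel routed stdout
    return forwarded

# stand-in for a JVM child writing a log message to stderr
lines = run_and_forward(
    [sys.executable, "-c", "import sys; print('log msg', file=sys.stderr)"])
```

The real change would need to stream rather than buffer, and to respect the log level the user set from within the notebook, but the routing principle is the same.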
[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16845: Fix Version/s: 2.0.3 1.6.4 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: hejie >Assignee: Liwei Lin > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > Attachments: error.txt.zip > > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19824) Standalone master JSON not showing cores for running applications
[ https://issues.apache.org/jira/browse/SPARK-19824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19824: Assignee: Apache Spark > Standalone master JSON not showing cores for running applications > - > > Key: SPARK-19824 > URL: https://issues.apache.org/jira/browse/SPARK-19824 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.0 >Reporter: Dan >Assignee: Apache Spark >Priority: Minor > > The JSON API of the standalone master ("/json") does not show the number of > cores for a running application, which is available on the UI. > "activeapps" : [ { > "starttime" : 1488702337788, > "id" : "app-20170305102537-19717", > "name" : "POPAI_Aggregated", > "user" : "ibiuser", > "memoryperslave" : 16384, > "submitdate" : "Sun Mar 05 10:25:37 IST 2017", > "state" : "RUNNING", > "duration" : 1141934 > } ], -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19824) Standalone master JSON not showing cores for running applications
[ https://issues.apache.org/jira/browse/SPARK-19824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898111#comment-15898111 ] Apache Spark commented on SPARK-19824: -- User 'yongtang' has created a pull request for this issue: https://github.com/apache/spark/pull/17181 > Standalone master JSON not showing cores for running applications > - > > Key: SPARK-19824 > URL: https://issues.apache.org/jira/browse/SPARK-19824 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.0 >Reporter: Dan >Priority: Minor > > The JSON API of the standalone master ("/json") does not show the number of > cores for a running application, which is available on the UI. > "activeapps" : [ { > "starttime" : 1488702337788, > "id" : "app-20170305102537-19717", > "name" : "POPAI_Aggregated", > "user" : "ibiuser", > "memoryperslave" : 16384, > "submitdate" : "Sun Mar 05 10:25:37 IST 2017", > "state" : "RUNNING", > "duration" : 1141934 > } ], -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19824) Standalone master JSON not showing cores for running applications
[ https://issues.apache.org/jira/browse/SPARK-19824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19824: Assignee: (was: Apache Spark) > Standalone master JSON not showing cores for running applications > - > > Key: SPARK-19824 > URL: https://issues.apache.org/jira/browse/SPARK-19824 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.0 >Reporter: Dan >Priority: Minor > > The JSON API of the standalone master ("/json") does not show the number of > cores for a running application, which is available on the UI. > "activeapps" : [ { > "starttime" : 1488702337788, > "id" : "app-20170305102537-19717", > "name" : "POPAI_Aggregated", > "user" : "ibiuser", > "memoryperslave" : 16384, > "submitdate" : "Sun Mar 05 10:25:37 IST 2017", > "state" : "RUNNING", > "duration" : 1141934 > } ], -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
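The fix for the missing field above amounts to serializing the granted core count alongside the fields the master's "/json" endpoint already emits. A hedged sketch; the {{"cores"}} key and the {{app_to_json}} helper are assumptions for illustration, the actual field name is decided in the pull request:

```python
import json

def app_to_json(app):
    """Sketch of the master's JSON rendering for one running application.
    The "cores" key is hypothetical; it models the value shown on the UI
    but previously omitted from the /json payload."""
    return {
        "starttime": app["starttime"],
        "id": app["id"],
        "name": app["name"],
        "cores": app["cores_granted"],  # the previously missing field
        "memoryperslave": app["memoryperslave"],
        "state": app["state"],
    }

entry = app_to_json({
    "starttime": 1488702337788,
    "id": "app-20170305102537-19717",
    "name": "POPAI_Aggregated",
    "cores_granted": 8,
    "memoryperslave": 16384,
    "state": "RUNNING",
})
serialized = json.dumps(entry)
```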
[jira] [Resolved] (SPARK-19382) Test sparse vectors in LinearSVCSuite
[ https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19382. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16784 [https://github.com/apache/spark/pull/16784] > Test sparse vectors in LinearSVCSuite > - > > Key: SPARK-19382 > URL: https://issues.apache.org/jira/browse/SPARK-19382 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Joseph K. Bradley >Assignee: Miao Wang >Priority: Minor > Fix For: 2.2.0 > > > Currently, LinearSVCSuite does not test sparse vectors. We should. I > recommend that generateSVMInput be modified to create a mix of dense and > sparse vectors, rather than adding an additional test. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19382) Test sparse vectors in LinearSVCSuite
[ https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-19382: - Assignee: Miao Wang > Test sparse vectors in LinearSVCSuite > - > > Key: SPARK-19382 > URL: https://issues.apache.org/jira/browse/SPARK-19382 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Joseph K. Bradley >Assignee: Miao Wang >Priority: Minor > > Currently, LinearSVCSuite does not test sparse vectors. We should. I > recommend that generateSVMInput be modified to create a mix of dense and > sparse vectors, rather than adding an additional test. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
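The suggestion above, modifying generateSVMInput to emit a mix of dense and sparse vectors rather than adding a separate test, can be sketched in plain Python (lists as dense vectors, index-to-value dicts as sparse ones; this is an illustration of the generator shape, not the actual Scala test helper):

```python
import random

def generate_svm_input(n, dim, seed=42):
    """Emit n points, alternating dense vectors (plain lists) and sparse
    vectors ({index: value} dicts), so both code paths get exercised."""
    rng = random.Random(seed)
    points = []
    for i in range(n):
        values = [rng.uniform(-1, 1) for _ in range(dim)]
        if i % 2 == 0:
            points.append(values)  # dense representation
        else:
            # keep roughly a third of the entries for a sparse representation
            points.append({j: v for j, v in enumerate(values) if j % 3 == 0})
    return points

pts = generate_svm_input(4, 6)
```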
[jira] [Commented] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF
[ https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898052#comment-15898052 ] Dongjoon Hyun commented on SPARK-18832: --- Could you check the permission of the following? Is it accessible for both `hive` and `spark` on your server and on the server running Spark Thrift Server? {code} '/root/spark_files/experiments-1.2.jar'; {code} If there is a permission problem, it also ends up with `org.apache.spark.sql.AnalysisException: Undefined function` plus additional logs. > Spark SQL: Thriftserver unable to run a registered Hive UDTF > > > Key: SPARK-18832 > URL: https://issues.apache.org/jira/browse/SPARK-18832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: HDP: 2.5 > Spark: 2.0.0 >Reporter: Lokesh Yadav > Attachments: SampleUDTF.java > > > Spark Thriftserver is unable to run a HiveUDTF. > It throws the error that it is unable to find the function although the > function registration succeeds and the function does show up in the list > output by {{show functions}}. > I am using a Hive UDTF, registering it using a jar placed on my local > machine. Calling it using the following commands: > //Registering the functions, this command succeeds. > {{CREATE FUNCTION SampleUDTF AS > 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR > '/root/spark_files/experiments-1.2.jar';}} > //Thriftserver is able to look up the function, on this command: > {{DESCRIBE FUNCTION SampleUDTF;}} > {quote} > {noformat} > Output: > +---+--+ > | function_desc | > +---+--+ > | Function: default.SampleUDTF | > | Class: com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF | > | Usage: N/A. | > +---+--+ > {noformat} > {quote} > // Calling the function: > {{SELECT SampleUDTF('Paris');}} > bq. Output of the above command: Error: > org.apache.spark.sql.AnalysisException: Undefined function: 'SampleUDTF'. 
> This function is neither a registered temporary function nor a permanent > function registered in the database 'default'.; line 1 pos 7 (state=,code=0) > I have also tried with using a non-local (on hdfs) jar, but I get the same > error. > My environment: HDP 2.5 with spark 2.0.0 > I have attached the class file for the UDTF I am using in testing this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
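The permission check Dongjoon asks for can be sketched as follows; a simplification assuming world-readability is what matters (in practice group permissions for the `hive` and `spark` users may also suffice), with a temp file standing in for the jar path from the report:

```python
import os
import tempfile

def jar_readable_by_all(path):
    """Rough stand-in for the suggested check: the jar must exist and carry
    the "others" read bit so unrelated service users can load it."""
    if not os.path.exists(path):
        return False
    mode = os.stat(path).st_mode
    return bool(mode & 0o004)  # the "others" read permission bit

# demo on a temp file standing in for the reported jar path
with tempfile.NamedTemporaryFile(suffix=".jar", delete=False) as f:
    jar = f.name
os.chmod(jar, 0o600)
locked_down = jar_readable_by_all(jar)   # only the owner can read it
os.chmod(jar, 0o644)
world_readable = jar_readable_by_all(jar)
os.remove(jar)
```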
[jira] [Assigned] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert
[ https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-19211: --- Assignee: Jiang Xingbo > Explicitly prevent Insert into View or Create View As Insert > > > Key: SPARK-19211 > URL: https://issues.apache.org/jira/browse/SPARK-19211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo > Fix For: 2.2.0 > > > Currently we don't explicitly forbid the following behaviors: > 1. The statement CREATE VIEW AS INSERT INTO throws the following exception > from SQLBuilder: > `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable > MetastoreRelation default, tbl, false, false`; > 2. The statement INSERT INTO view VALUES throws the following exception from > checkAnalysis: > `Error in query: Inserting into an RDD-based table is not allowed.;;` > We should check for these behaviors earlier and explicitly prevent them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert
[ https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19211. - Resolution: Fixed Fix Version/s: 2.2.0 > Explicitly prevent Insert into View or Create View As Insert > > > Key: SPARK-19211 > URL: https://issues.apache.org/jira/browse/SPARK-19211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo > Fix For: 2.2.0 > > > Currently we don't explicitly forbid the following behaviors: > 1. The statement CREATE VIEW AS INSERT INTO throws the following exception > from SQLBuilder: > `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable > MetastoreRelation default, tbl, false, false`; > 2. The statement INSERT INTO view VALUES throws the following exception from > checkAnalysis: > `Error in query: Inserting into an RDD-based table is not allowed.;;` > We should check for these behaviors earlier and explicitly prevent them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898005#comment-15898005 ] Reynold Xin commented on SPARK-17495: - We should probably create subtickets next time for this ... > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.2.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing this leads to different > results if one tries the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
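Why two hash implementations break bucketed-query compatibility: partition assignment is {{hash(key) mod numBuckets}}, so two engines hashing the same key differently will place the same row in different buckets. A hedged sketch using stdlib hashes as stand-ins (MD5 and CRC32 here are NOT the actual Murmur3 or Hive hash functions; they merely illustrate that any two unrelated hashes disagree on bucket placement):

```python
import hashlib
import zlib

def bucket(key, num_buckets, hash_fn):
    # partition assignment: same rule both engines use, different hash_fn
    return hash_fn(key) % num_buckets

def md5_hash(key):   # stand-in for Spark's Murmur3 (not in the stdlib)
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def crc_hash(key):   # stand-in for Hive's hash
    return zlib.crc32(key.encode())

keys = ["user_%d" % i for i in range(100)]
disagreements = sum(
    bucket(k, 8, md5_hash) != bucket(k, 8, crc_hash) for k in keys)
```

With 8 buckets, roughly seven out of eight keys land in different buckets under the two hashes, which is why implementing Hive's exact hash in Spark is the only way to make bucketed tables interchangeable.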
[jira] [Commented] (SPARK-19790) OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure
[ https://issues.apache.org/jira/browse/SPARK-19790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898003#comment-15898003 ] Steve Loughran commented on SPARK-19790: Thinking some more & looking at code snippets: # FileOutputFormat with algorithm 2 can recover from a failed task commit, somehow. That code is too complex to make sense of now that it mixes two algorithms with corecursion and stuff to run in client and server. # The s3guard committer will have the task upload its files in parallel, but will not complete the multipart commit; the information to do this will be persisted to HDFS for execution by the job committer. # I plan a little Spark extension which will do the same for files with absolute destinations, this time passing the data back to the job committer. Which means a failed task can be recovered from. All pending writes for that task will need to be found (scan the FS) and aborted. There is still an issue about the ordering of PUT vs save of upload data; some GC of pending commits to a dest dir would be the way to avoid running up bills. This is one of those coordination problems where someone with TLA+ algorithm specification skills would be good, along with the foundation specs for filesystems and object stores. Someone needs to find a CS student looking for a project. > OutputCommitCoordinator should not allow another task to commit after an > ExecutorFailure > > > Key: SPARK-19790 > URL: https://issues.apache.org/jira/browse/SPARK-19790 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid > > The OutputCommitCoordinator resets the allowed committer when the task fails. > > https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143 > However, if a task fails because of an ExecutorFailure, we actually have no > idea what the status is of the task. 
The task may actually still be running, > and perhaps successfully commit its output. By allowing another task to > commit its output, there is a chance that multiple tasks commit, which can > result in corrupt output. This would be particularly problematic when > commit() is an expensive operation, eg. moving files on S3. > For other task failures, we can allow other tasks to commit. But with an > ExecutorFailure, its not clear what the right thing to do is. The only safe > thing to do may be to fail the job. > This is related to SPARK-19631, and was discovered during discussion on that > PR https://github.com/apache/spark/pull/16959#discussion_r103549134 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
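The safe behavior argued for in this issue can be modeled compactly: one authorized committer per partition; an ordinary task failure frees the slot, but an executor failure leaves the first attempt's outcome unknown, so the stage must fail rather than re-authorize. A minimal sketch (not the actual Scala OutputCommitCoordinator, just the proposed state machine):

```python
class OutputCommitCoordinator:
    """Toy model: one committer per partition; on executor loss the first
    attempt may still commit, so the only safe choice is failing the stage."""

    def __init__(self):
        self.committer = {}        # partition -> authorized task attempt
        self.failed_stage = False

    def can_commit(self, partition, attempt):
        if self.failed_stage:
            return False
        current = self.committer.setdefault(partition, attempt)
        return current == attempt

    def task_failed(self, partition, attempt, executor_lost=False):
        if self.committer.get(partition) != attempt:
            return
        if executor_lost:
            # outcome unknown: the task may still commit, so do not let a
            # second attempt commit; fail the stage instead
            self.failed_stage = True
        else:
            # known failure: safe to let another attempt commit
            del self.committer[partition]

c = OutputCommitCoordinator()
first = c.can_commit(0, "t1")                       # authorized
second_denied = c.can_commit(0, "t2")               # denied while t1 holds it
c.task_failed(0, "t1")                              # ordinary failure
second_after_fail = c.can_commit(0, "t2")           # slot freed, authorized
```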
[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897997#comment-15897997 ] Imran Rashid commented on SPARK-19796: -- I'm opposed to (b) as well. It feels wrong to only do a one-off just for JOB_DESCRIPTION, but maybe it's a large enough savings that it's worth doing. I was thinking of something larger, along the lines of SPARK-19108. Another option would be to add new APIs, e.g., jobs would take `driverProperties` and `executorProperties`, but maybe that is overkill. > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Assignee: Imran Rashid >Priority: Blocker > Fix For: 2.2.0 > > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
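The root cause in the issue above is a wire-format limit: Java's {{DataOutputStream.writeUTF}} prefixes the encoded string with an unsigned 16-bit length, so any property value whose UTF-8 encoding exceeds 65535 bytes cannot be written that way. A sketch of that length-prefix constraint in Python (modeling only the 2-byte prefix, not Java's full modified UTF-8 encoding):

```python
import struct

def write_utf_like(s):
    """Model of writeUTF's framing: a big-endian unsigned 16-bit byte count
    followed by the encoded bytes. Anything over 65535 bytes cannot fit."""
    data = s.encode("utf-8")
    if len(data) > 0xFFFF:
        raise ValueError("encoded string too long: %d bytes" % len(data))
    return struct.pack(">H", len(data)) + data

short_ok = write_utf_like("spark.job.description")
try:
    write_utf_like("x" * 70000)  # stands in for a very long INSERT statement
    overflowed = False
except ValueError:
    overflowed = True
```

This is why a multi-megabyte SQL statement stored in "spark.job.description" blows up during task serialization, and why the fixes under discussion either truncate the description or stop shipping it to executors per task.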
[jira] [Commented] (SPARK-19839) Fix memory leak in BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897975#comment-15897975 ] Zhan Zhang commented on SPARK-19839: When BytesToBytesMap spills, its longArray should be released. Otherwise, it may not be released until the task completes. This array may take a significant amount of memory, which then cannot be used by later operators, such as UnsafeShuffleExternalSorter, resulting in more frequent spills in the sorter. This patch releases the array, since the destructive iterator will not use it anymore. > Fix memory leak in BytesToBytesMap > -- > > Key: SPARK-19839 > URL: https://issues.apache.org/jira/browse/SPARK-19839 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Zhan Zhang > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19839) Fix memory leak in BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897970#comment-15897970 ] Apache Spark commented on SPARK-19839: -- User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/17180 > Fix memory leak in BytesToBytesMap > -- > > Key: SPARK-19839 > URL: https://issues.apache.org/jira/browse/SPARK-19839 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Zhan Zhang > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19839) Fix memory leak in BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19839: Assignee: Apache Spark > Fix memory leak in BytesToBytesMap > -- > > Key: SPARK-19839 > URL: https://issues.apache.org/jira/browse/SPARK-19839 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19839) Fix memory leak in BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19839: Assignee: (was: Apache Spark) > Fix memory leak in BytesToBytesMap > -- > > Key: SPARK-19839 > URL: https://issues.apache.org/jira/browse/SPARK-19839 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Zhan Zhang > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
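The fix described for SPARK-19839 is an eager-release pattern: when the map spills its contents to disk, the pointer array should be dropped immediately so its memory can be reused by downstream operators (such as the shuffle sorter), instead of being held until the task finishes. A minimal sketch (a toy Python model, not the actual BytesToBytesMap):

```python
class SpillableMap:
    """Toy model of the fix: drop the pointer array at spill time rather
    than holding it for the remainder of the task."""

    LONG_BYTES = 8

    def __init__(self, capacity):
        self.long_array = [0] * capacity  # stand-in for the pointer array
        self.spilled_bytes = 0

    def spill(self):
        self.spilled_bytes += len(self.long_array) * self.LONG_BYTES
        # freed eagerly: destructive iteration after a spill never reads it,
        # so keeping it alive would only starve later operators of memory
        self.long_array = None
        return self.spilled_bytes

m = SpillableMap(1024)
released = m.spill()
```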
[jira] [Assigned] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-19796: Assignee: Imran Rashid > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Assignee: Imran Rashid >Priority: Blocker > Fix For: 2.2.0 > > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19837) Fetch failure throws a SparkException in SparkHiveWriter
[ https://issues.apache.org/jira/browse/SPARK-19837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897961#comment-15897961 ] Sital Kedia commented on SPARK-19837: - `SparkHiveDynamicPartitionWriterContainer` has been refactored in latest master. Not sure if the issue still exists. Will close this JIRA and reopen if we still see issue with latest. > Fetch failure throws a SparkException in SparkHiveWriter > > > Key: SPARK-19837 > URL: https://issues.apache.org/jira/browse/SPARK-19837 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Sital Kedia > > Currently Fetchfailure in SparkHiveWriter fails the job with following > exception > {code} > 0_0): org.apache.spark.SparkException: Task failed while writing rows. > at > org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:385) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Connection reset by > peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:343) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsu
[jira] [Closed] (SPARK-19837) Fetch failure throws a SparkException in SparkHiveWriter
[ https://issues.apache.org/jira/browse/SPARK-19837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia closed SPARK-19837. --- Resolution: Fixed > Fetch failure throws a SparkException in SparkHiveWriter > > > Key: SPARK-19837 > URL: https://issues.apache.org/jira/browse/SPARK-19837 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Sital Kedia > > Currently Fetchfailure in SparkHiveWriter fails the job with following > exception > {code} > 0_0): org.apache.spark.SparkException: Task failed while writing rows. > at > org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:385) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Connection reset by > peer > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) 
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > 
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:343) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897965#comment-15897965 ] Nick Afshartous commented on SPARK-19767: - I believe all the dependencies are installed and got this error about a missing tag {{'include_example'}} {code} SKIP_API=1 jekyll build Configuration file: none Source: /home/nafshartous/Projects/spark Destination: /home/nafshartous/Projects/spark/_site Incremental build: disabled. Enable with --incremental Generating... Build Warning: Layout 'global' requested in docs/api.md does not exist. Build Warning: Layout 'global' requested in docs/building-spark.md does not exist. Build Warning: Layout 'global' requested in docs/cluster-overview.md does not exist. Build Warning: Layout 'global' requested in docs/configuration.md does not exist. Build Warning: Layout 'global' requested in docs/contributing-to-spark.md does not exist. Build Warning: Layout 'global' requested in docs/ec2-scripts.md does not exist. Liquid Exception: Liquid syntax error (line 581): Unknown tag 'include_example' in docs/graphx-programming-guide.md jekyll 3.4.0 | Error: Liquid syntax error (line 581): Unknown tag 'include_example' {code} > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > lead to API pages that do not have class ConsumerStrategies) . 
The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka10{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-19796. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17140 [https://github.com/apache/spark/pull/17140] > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > Fix For: 2.2.0 > > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19839) Fix memory leak in BytesToBytesMap
Zhan Zhang created SPARK-19839: -- Summary: Fix memory leak in BytesToBytesMap Key: SPARK-19839 URL: https://issues.apache.org/jira/browse/SPARK-19839 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17495: Assignee: Apache Spark (was: Tejas Patil) > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tejas Patil >Assignee: Apache Spark >Priority: Minor > Fix For: 2.2.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the hash used by Hive. For queries which use bucketing, this leads to different > results if one tries the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17495: Assignee: Tejas Patil (was: Apache Spark) > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.2.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the hash used by Hive. For queries which use bucketing, this leads to different > results if one tries the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reopened SPARK-17495: - Re-opening. This is not done yet: there are time-related datatypes that still need to be handled, and the new hash still needs to be put to use in the codebase. > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.2.0 > > > Spark internally uses Murmur3Hash for partitioning. This is different from > the hash used by Hive. For queries which use bucketing, this leads to different > results if one tries the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
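To illustrate why bucketed query results diverge across the two engines, here is a hedged, self-contained sketch. It simplifies Hive's int hashing to the identity (roughly what Hive does for int columns) and uses Murmur3's 32-bit finalization mix as a stand-in for Spark's full internal Murmur3Hash; the class and method names are illustrative only, not Spark or Hive APIs:

```java
public class BucketingMismatchDemo {
    static final int NUM_BUCKETS = 8;

    // Hive-style: an int column is bucketed by the value itself
    // (a simplification of Hive's hash for primitive ints).
    static int hiveStyleBucket(int v) {
        return (v & Integer.MAX_VALUE) % NUM_BUCKETS;
    }

    // Murmur3's 32-bit finalization mix, standing in for Spark's Murmur3Hash.
    static int murmurStyleBucket(int v) {
        int h = v;
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return (h & Integer.MAX_VALUE) % NUM_BUCKETS;
    }

    public static void main(String[] args) {
        // The same values land in different buckets under the two schemes,
        // so a bucketed query produces different groupings on each engine.
        int disagreements = 0;
        for (int v = 0; v < 100; v++)
            if (hiveStyleBucket(v) != murmurStyleBucket(v)) disagreements++;
        System.out.println(disagreements + " of 100 values land in different buckets");
    }
}
```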
[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)
[ https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897868#comment-15897868 ] Amit Sela commented on SPARK-19067: --- Sweet! I will make time tomorrow to go over the PR thoroughly (We're on ~opposite timezones ;) ). I also see a note about the State Store API, which is something I'm really looking forward to. Any news there? > mapGroupsWithState - arbitrary stateful operations with Structured Streaming > (similar to DStream.mapWithState) > -- > > Key: SPARK-19067 > URL: https://issues.apache.org/jira/browse/SPARK-19067 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Tathagata Das >Priority: Critical > > Right now the only way to do stateful operations is with Aggregator or > UDAF. However, this does not give users control of emission or expiration of > state, making it hard to implement things like sessionization. We should add > a more general construct (probably similar to {{DStream.mapWithState}}) to > structured streaming. Here is the design. 
> *Requirements* > - Users should be able to specify a function that can do the following > - Access the input row corresponding to a key > - Access the previous state corresponding to a key > - Optionally, update or remove the state > - Output any number of new rows (or none at all) > *Proposed API* > {code} > // New methods on KeyValueGroupedDataset > class KeyValueGroupedDataset[K, V] { > // Scala friendly > def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => U) > def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => Iterator[U]) > // Java friendly > def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U]) > def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U]) > } > // --- New Java-friendly function classes --- > public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable { > R call(K key, Iterator<V> values, State<S> state) throws Exception; > } > public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable { > Iterator<R> call(K key, Iterator<V> values, State<S> state) throws Exception; > } > // -- Wrapper class for state data -- > trait KeyedState[S] { > def exists(): Boolean > def get(): S // throws Exception if state does not exist > def getOption(): Option[S] > def update(newState: S): Unit > def remove(): Unit // exists() will be false after this > } > {code} > Key Semantics of the State class > - The state can be null. > - If state.remove() is called, then state.exists() will return false, and > getOption will return None. > - After state.update(newState) is called, state.exists() will > return true, and getOption will return Some(...). > - None of the operations are thread-safe. This is to avoid memory barriers. 
> *Usage* > {code} > val stateFunc = (word: String, words: Iterator[String], runningCount: > KeyedState[Long]) => { > val newCount = words.size + runningCount.getOption.getOrElse(0L) > runningCount.update(newCount) > (word, newCount) > } > dataset // type is Dataset[String] > .groupByKey[String](w => w) // generates KeyValueGroupedDataset[String, String] > .mapGroupsWithState[Long, (String, Long)](stateFunc) // returns Dataset[(String, Long)] > {code} > *Future Directions* > - Timeout-based state expiration (for state that has not received data for a while) > - General expression-based expiration -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
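To make the KeyedState semantics above concrete, here is a minimal in-memory stand-in (hypothetical: the real state would be backed by Spark's State Store, and the class names here are ours) together with the running-count stateFunc from the design rewritten against it:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Minimal in-memory stand-in for the proposed KeyedState wrapper (hypothetical;
// the real implementation would live in Spark's State Store, not a field).
class InMemoryKeyedState<S> {
    private S value;
    private boolean present;

    boolean exists() { return present; }
    S get() {                            // throws if state does not exist, per the design
        if (!present) throw new NoSuchElementException("state does not exist");
        return value;
    }
    void update(S newState) { value = newState; present = true; }
    void remove() { value = null; present = false; }   // exists() is false afterwards
}

public class KeyedStateDemo {
    // The running-count stateFunc from the design sketch, against the stand-in:
    // add this batch's word count to the stored count, store it back, return it.
    static long stateFunc(String word, Iterator<String> words,
                          InMemoryKeyedState<Long> runningCount) {
        long count = runningCount.exists() ? runningCount.get() : 0L;
        while (words.hasNext()) { words.next(); count++; }
        runningCount.update(count);
        return count;
    }

    public static void main(String[] args) {
        InMemoryKeyedState<Long> state = new InMemoryKeyedState<>();
        System.out.println(stateFunc("spark", java.util.List.of("spark", "spark").iterator(), state)); // 2
        System.out.println(stateFunc("spark", java.util.List.of("spark").iterator(), state));          // 3
        state.remove();
        System.out.println(state.exists()); // false
    }
}
```

Note how the second call sees the state left by the first: that carry-over across invocations for the same key is exactly what Aggregator/UDAF do not expose.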
[jira] [Commented] (SPARK-19790) OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure
[ https://issues.apache.org/jira/browse/SPARK-19790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897863#comment-15897863 ] Steve Loughran commented on SPARK-19790: The only time a task output committer should be making observable state changes is during the actual commit operation. If it is doing things before that commit operation, that's a bug, in that it doesn't live up to the name "committer". The Hadoop output committer has two stages here: the FileOutputFormat work and then the rename of files; together they are not a transaction, but on a real filesystem they are fast. The now-deleted DirectOutputCommitter was doing things as it went along, but that's why it got pulled. That leaves the Hadoop output committer committing work on object stores which implement rename() as a copy, hence slow and with a large enough failure window. HADOOP-13786 is going to make that window very small indeed, at least for job completion. One thing to look at here is the {{org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter}} protocol, where the committer can be asked whether or not it supports recovery, as well as {{isCommitJobRepeatable}} to probe for a job commit being repeatable even if it fails partway through. The committer gets to implement its policy there. > OutputCommitCoordinator should not allow another task to commit after an > ExecutorFailure > > > Key: SPARK-19790 > URL: https://issues.apache.org/jira/browse/SPARK-19790 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid > > The OutputCommitCoordinator resets the allowed committer when the task fails. > > https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143 > However, if a task fails because of an ExecutorFailure, we actually have no > idea what the status is of the task. 
The task may actually still be running, > and perhaps successfully commit its output. By allowing another task to > commit its output, there is a chance that multiple tasks commit, which can > result in corrupt output. This would be particularly problematic when > commit() is an expensive operation, eg. moving files on S3. > For other task failures, we can allow other tasks to commit. But with an > ExecutorFailure, its not clear what the right thing to do is. The only safe > thing to do may be to fail the job. > This is related to SPARK-19631, and was discovered during discussion on that > PR https://github.com/apache/spark/pull/16959#discussion_r103549134 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
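The "only safe thing may be to fail the job" policy can be sketched as a toy coordinator (illustrative only, not Spark's OutputCommitCoordinator): it authorizes at most one attempt per partition and, after an ExecutorFailure, refuses further authorizations rather than resetting, because the previously authorized attempt may still be running and may still commit:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch of the policy under discussion (not Spark's implementation).
public class ToyCommitCoordinator {
    private final Map<Integer, Integer> authorized = new HashMap<>(); // partition -> attempt
    private final Set<Integer> frozen = new HashSet<>();              // partitions hit by executor failure

    public synchronized boolean canCommit(int partition, int attempt) {
        if (frozen.contains(partition)) return false;                 // status unknown: fail safe
        Integer holder = authorized.putIfAbsent(partition, attempt);
        return holder == null || holder == attempt;                   // first asker wins; only it may retry
    }

    // On ExecutorFailure we cannot know whether the authorized attempt is still
    // running, so do NOT reset the authorization; freeze the partition instead.
    public synchronized void onExecutorFailure(int partition) {
        frozen.add(partition);
    }

    public static void main(String[] args) {
        ToyCommitCoordinator coord = new ToyCommitCoordinator();
        System.out.println(coord.canCommit(0, 1)); // true: first attempt gets authorized
        System.out.println(coord.canCommit(0, 2)); // false: partition already has a committer
        coord.onExecutorFailure(1);
        System.out.println(coord.canCommit(1, 1)); // false: never reauthorize after executor failure
    }
}
```

Freezing the partition here stands in for failing the job; the point is only that resetting the authorization, as the current code does, reopens the double-commit window.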
[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)
[ https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897859#comment-15897859 ] Apache Spark commented on SPARK-19067: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/17179 > mapGroupsWithState - arbitrary stateful operations with Structured Streaming > (similar to DStream.mapWithState) > -- > > Key: SPARK-19067 > URL: https://issues.apache.org/jira/browse/SPARK-19067 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Tathagata Das >Priority: Critical > > Right now the only way to do stateful operations is with Aggregator or > UDAF. However, this does not give users control of emission or expiration of > state, making it hard to implement things like sessionization. We should add > a more general construct (probably similar to {{DStream.mapWithState}}) to > structured streaming. Here is the design. 
> *Requirements* > - Users should be able to specify a function that can do the following > - Access the input row corresponding to a key > - Access the previous state corresponding to a key > - Optionally, update or remove the state > - Output any number of new rows (or none at all) > *Proposed API* > {code} > // New methods on KeyValueGroupedDataset > class KeyValueGroupedDataset[K, V] { > // Scala friendly > def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => U) > def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => Iterator[U]) > // Java friendly > def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U]) > def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U]) > } > // --- New Java-friendly function classes --- > public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable { > R call(K key, Iterator<V> values, State<S> state) throws Exception; > } > public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable { > Iterator<R> call(K key, Iterator<V> values, State<S> state) throws Exception; > } > // -- Wrapper class for state data -- > trait KeyedState[S] { > def exists(): Boolean > def get(): S // throws Exception if state does not exist > def getOption(): Option[S] > def update(newState: S): Unit > def remove(): Unit // exists() will be false after this > } > {code} > Key Semantics of the State class > - The state can be null. > - If state.remove() is called, then state.exists() will return false, and > getOption will return None. > - After state.update(newState) is called, state.exists() will > return true, and getOption will return Some(...). > - None of the operations are thread-safe. This is to avoid memory barriers. 
> *Usage* > {code} > val stateFunc = (word: String, words: Iterator[String], runningCount: > KeyedState[Long]) => { > val newCount = words.size + runningCount.getOption.getOrElse(0L) > runningCount.update(newCount) > (word, newCount) > } > dataset // type is Dataset[String] > .groupByKey[String](w => w) // generates KeyValueGroupedDataset[String, String] > .mapGroupsWithState[Long, (String, Long)](stateFunc) // returns Dataset[(String, Long)] > {code} > *Future Directions* > - Timeout-based state expiration (for state that has not received data for a while) > - General expression-based expiration -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF
[ https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897799#comment-15897799 ] Dongjoon Hyun edited comment on SPARK-18832 at 3/6/17 7:11 PM: --- Hi, [~roadster11x]. Thank you for the sample file. I tried the following locally with your Sample code on Apache Spark 2.0.0. (I removed the package name line from the code just for simplicity.) *HIVE* {code} $ hive hive> CREATE FUNCTION hello AS 'SampleUDTF' USING JAR '/Users/dhyun/UDF/a.jar'; hive> exit; $ hive Logging initialized using configuration in jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties hive> select hello('a'); Added [/Users/dhyun/UDF/a.jar] to class path Added resources: [/Users/dhyun/UDF/a.jar] OK ***a*** ###a### Time taken: 1.347 seconds, Fetched: 1 row(s) {code} *SPARK THRIFTSERVER* {code} $ SPARK_HOME=$PWD sbin/start-thriftserver.sh starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out $ bin/beeline -u jdbc:hive2://localhost:1/default Connecting to jdbc:hive2://localhost:1/default ... Connected to: Spark SQL (version 2.0.0) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.2.1.spark2 by Apache Hive 0: jdbc:hive2://localhost:1/default> select hello('a'); +--+--+--+ | first | second | +--+--+--+ | ***a*** | ###a### | +--+--+--+ 1 row selected (2.031 seconds) 0: jdbc:hive2://localhost:1/default> describe function hello; +--+--+ | function_desc | +--+--+ | Function: default.hello | | Class: SampleUDTF| | Usage: N/A. | +--+--+ 3 rows selected (0.041 seconds) 0: jdbc:hive2://localhost:1/default> {code} was (Author: dongjoon): Hi, [~roadster11x]. Thank you for the sample file. 
I tried the following locally with your Sample code on Apache Spark 2.0.0. (I removed the package name line from the code just for simplicity.) *HIVE* {code} $ hive hive> CREATE FUNCTION hello AS 'SampleUDTF' USING JAR '/Users/dhyun/UDF/a.jar'; hive> exit; $ hive Logging initialized using configuration in jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties hive> select hello('a'); Added [/Users/dhyun/UDF/a.jar] to class path Added resources: [/Users/dhyun/UDF/a.jar] OK ***a*** ###a### Time taken: 1.347 seconds, Fetched: 1 row(s) {code} *SPARK THRIFTSERVER* {code} $ SPARK_HOME=$PWD sbin/start-thriftserver.sh starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out $ bin/beeline -u jdbc:hive2://localhost:1/default Connecting to jdbc:hive2://localhost:1/default ... Connected to: Spark SQL (version 2.0.0) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.2.1.spark2 by Apache Hive 0: jdbc:hive2://localhost:1/default> select hello('a'); +--+--+--+ | first | second | +--+--+--+ | ***a*** | ###a### | +--+--+--+ 1 row selected (2.031 seconds) 0: jdbc:hive2://localhost:1/default> describe function hello; +--+--+ | function_desc | +--+--+ | Function: default.hello | | Class: SampleUDTF| | Usage: N/A. | +--+--+ 3 rows selected (0.041 seconds) 0: jdbc:hive2://localhost:1/default> {code} I'm wondering if your Hive work with your function. 
> Spark SQL: Thriftserver unable to run a registered Hive UDTF > > > Key: SPARK-18832 > URL: https://issues.apache.org/jira/browse/SPARK-18832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: HDP: 2.5 > Spark: 2.0.0 >Reporter: Lokesh Yadav > Attachments: SampleUDTF.java > > > Spark Thriftserver is unable to run a Hive UDTF. > It throws an error that it is unable to find the function, although the > function registration succeeds and the function does show up in the list > output by {{show functions}}. > I am using a Hive UDTF, registering it using a jar placed on my local > machine. Calling it using the following commands: > //Registering the function; this command succeeds. > {{CREATE FUNCTION SampleUDTF AS > 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR
[jira] [Commented] (SPARK-18568) vertex attributes in the edge triplet not getting updated in super steps for Pregel API
[ https://issues.apache.org/jira/browse/SPARK-18568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897852#comment-15897852 ] Laurent Philippart commented on SPARK-18568: Does it also occur when the vertex attribute or the message contains an array (despite using only array copy)? > vertex attributes in the edge triplet not getting updated in super steps for > Pregel API > --- > > Key: SPARK-18568 > URL: https://issues.apache.org/jira/browse/SPARK-18568 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.2 >Reporter: Rohit > > When running the Pregel API with vertex attributes that are complex objects, the > vertex attributes are not getting updated in the triplet view. For example, if > the attributes of vertex "a" change in the first superstep, the triplet > src attributes seen by the sendMsg program in that superstep reflect the > latest attributes of "a"; but if the attributes change again in vprog in the > second superstep, the edge triplets having "a" as src or destination are not > updated with this new state of the vertex. If I re-create the graph using g = Graph(g.vertices, > g.edges) in the while loop before the next superstep, then it does get > updated, but this fix is not good performance-wise. A detailed description of > the bug along with the code to recreate it is in the attached URL. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)
[ https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897851#comment-15897851 ] Tathagata Das commented on SPARK-19067: --- Yes, they will be resettable. Just see the timeout subtask that I added just now. > mapGroupsWithState - arbitrary stateful operations with Structured Streaming > (similar to DStream.mapWithState) > -- > > Key: SPARK-19067 > URL: https://issues.apache.org/jira/browse/SPARK-19067 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Tathagata Das >Priority: Critical > > Right now the only way to do stateful operations is with Aggregator or > UDAF. However, this does not give users control of emission or expiration of > state, making it hard to implement things like sessionization. We should add > a more general construct (probably similar to {{DStream.mapWithState}}) to > structured streaming. Here is the design. > *Requirements* > - Users should be able to specify a function that can do the following > - Access the input rows corresponding to a key > - Access the previous state corresponding to a key > - Optionally, update or remove the state > - Output any number of new rows (or none at all) > *Proposed API* > {code} > // New methods on KeyValueGroupedDataset > class KeyValueGroupedDataset[K, V] { > // Scala friendly > def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U) > def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U]) > // Java friendly > def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U]) > def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U]) > } > // --- New Java-friendly function classes --- > public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable { > R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception; > } > public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable { > Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception; > } > // -- Wrapper class for state data -- > trait KeyedState[S] { > def exists(): Boolean > def get(): S // throws an exception if state does not exist > def getOption(): Option[S] > def update(newState: S): Unit > def remove(): Unit // exists() will be false after this > } > {code} > Key semantics of the KeyedState class > - The state can be null. > - If state.remove() is called, then state.exists() will return false, and > getOption will return None. > - After state.update(newState) is called, state.exists() will > return true, and getOption will return Some(...). > - None of the operations are thread-safe. This is to avoid memory barriers. > *Usage* > {code} > val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => { > val newCount = words.size + runningCount.getOption.getOrElse(0L) > runningCount.update(newCount) > (word, newCount) > } > dataset // type is Dataset[String] > .groupByKey[String](w => w) // generates KeyValueGroupedDataset[String, String] > .mapGroupsWithState[Long, (String, Long)](stateFunc) // returns Dataset[(String, Long)] > {code} > *Future Directions* > - Timeout based state expiration (for state that has not received data for a while) > - General expression based expiration -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
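[Editor's illustration] The state-wrapper semantics listed in the proposal above (exists/get/getOption/update/remove) can be sketched without any Spark dependency. The {{KeyedState}} class below is a hypothetical, single-key, plain-Java stand-in for the proposed trait, not the actual Spark implementation:

```java
import java.util.Optional;

// Hypothetical, simplified stand-in for the proposed KeyedState[S] wrapper,
// mirroring the semantics listed above (no Spark dependency, single key).
class KeyedState<S> {
    private S state;
    private boolean defined;

    boolean exists() { return defined; }

    // Throws if state does not exist, like get() in the proposal.
    S get() {
        if (!defined) throw new IllegalStateException("state does not exist");
        return state;
    }

    // Java's Optional plays the role of Scala's Option here.
    Optional<S> getOption() { return defined ? Optional.of(state) : Optional.empty(); }

    void update(S newState) { state = newState; defined = true; }

    void remove() { state = null; defined = false; } // exists() is false after this
}

public class Main {
    public static void main(String[] args) {
        // Analogue of the word-count stateFunc in the usage example:
        // a batch of 3 words arrives, with no previous state for this key.
        KeyedState<Long> runningCount = new KeyedState<>();
        long newCount = 3 + runningCount.getOption().orElse(0L);
        runningCount.update(newCount);
        System.out.println(runningCount.get());    // 3
        runningCount.remove();
        System.out.println(runningCount.exists()); // false
    }
}
```

Note how `remove()` followed by `exists()` returning false matches the "Key semantics" bullets above.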
[jira] [Created] (SPARK-19838) Adding Processing Time based timeout
Tathagata Das created SPARK-19838: - Summary: Adding Processing Time based timeout Key: SPARK-19838 URL: https://issues.apache.org/jira/browse/SPARK-19838 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Tathagata Das Assignee: Tathagata Das When a key does not get any new data in mapGroupsWithState, the mapping function is never called on it. So we need a timeout feature that calls the function again in such cases, so that the user can decide whether to continue waiting or clean up (remove state, save stuff externally, etc.). Timeouts can be based on either processing time or event time. This JIRA is for processing time, but it defines the high-level API design for both. The usage would look like this. {code} def stateFunction(key: K, value: Iterator[V], state: KeyedState[S]): U = { ... state.setTimeoutDuration(1) ... } dataset // type is Dataset[T] .groupByKey[K](keyingFunc) // generates KeyValueGroupedDataset[K, T] .mapGroupsWithState[S, U]( func = stateFunction, timeout = KeyedStateTimeout.withProcessingTime) // returns Dataset[U] {code} Note the following design aspects. - The timeout type is provided as a parameter to mapGroupsWithState and is global to all the keys. This is so that the planner knows it at planning time and can optimize the execution based on whether extra info (e.g. timeout durations or timestamps) needs to be saved in the state or not. - The exact timeout duration is provided inside the function call, so that it can be customized on a per-key basis. - When the timeout occurs for a key, the function is called with no values, and {{KeyedState.isTimingOut()}} set to {{true}}. - The timeout is reset for a key every time the function is called on the key, that is, when the key has new data or the key has timed out. So the user has to set the timeout duration every time the function is called, otherwise no timeout will be set.
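[Editor's illustration] The reset behaviour in the last bullet is easy to get wrong, so here is a minimal sketch of it, assuming a simplified single-key, simulated-clock model in plain Java (not the actual Spark implementation):

```java
// Hypothetical single-key, simulated-clock sketch of the timeout-reset rule:
// the timeout only stays armed if the function sets a duration on every call.
public class Main {
    static Long timeoutDeadline = null; // absolute deadline in ms; null = no timeout armed
    static long now = 0;                // simulated processing-time clock in ms

    // Stand-in for the user's stateFunction; called on new data or on timeout.
    // Any previously armed timeout is cleared on every call, as described above.
    static void stateFunction(boolean setTimeoutDuration) {
        timeoutDeadline = setTimeoutDuration ? now + 1000 : null;
    }

    static boolean timedOut() {
        return timeoutDeadline != null && now >= timeoutDeadline;
    }

    public static void main(String[] args) {
        stateFunction(true);            // call 1 arms a 1000 ms timeout
        now += 1500;                    // no new data arrives within the duration
        System.out.println(timedOut()); // true: function would be called with no values

        stateFunction(false);           // the timeout call forgets to re-set the duration...
        now += 10_000;
        System.out.println(timedOut()); // ...so the key never times out again: false
    }
}
```

The second call illustrates the caveat in the design note: a function that does not call setTimeoutDuration on a timeout invocation silently disarms the timeout for that key.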
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF
[ https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897841#comment-15897841 ] Dongjoon Hyun commented on SPARK-18832: --- Could you upload your `jar` file? It could be a problem of jar packaging. > Spark SQL: Thriftserver unable to run a registered Hive UDTF > > > Key: SPARK-18832 > URL: https://issues.apache.org/jira/browse/SPARK-18832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: HDP: 2.5 > Spark: 2.0.0 >Reporter: Lokesh Yadav > Attachments: SampleUDTF.java > > > Spark Thriftserver is unable to run a Hive UDTF. > It throws an error that it is unable to find the function, although the > function registration succeeds and the function does show up in the list > output by {{show functions}}. > I am using a Hive UDTF, registering it using a jar placed on my local > machine, and calling it using the following commands: > //Registering the function, this command succeeds. > {{CREATE FUNCTION SampleUDTF AS > 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR > '/root/spark_files/experiments-1.2.jar';}} > //Thriftserver is able to look up the function, on this command: > {{DESCRIBE FUNCTION SampleUDTF;}} > {quote} > {noformat} > Output: > +---+--+ > | function_desc | > +---+--+ > | Function: default.SampleUDTF | > | Class: com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF | > | Usage: N/A. | > +---+--+ > {noformat} > {quote} > // Calling the function: > {{SELECT SampleUDTF('Paris');}} > bq. Output of the above command: Error: > org.apache.spark.sql.AnalysisException: Undefined function: 'SampleUDTF'. > This function is neither a registered temporary function nor a permanent > function registered in the database 'default'.; line 1 pos 7 (state=,code=0) > I have also tried using a non-local (on hdfs) jar, but I get the same > error. 
> My environment: HDP 2.5 with Spark 2.0.0 > I have attached the class file for the UDTF I am using in testing this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF
[ https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897799#comment-15897799 ] Dongjoon Hyun edited comment on SPARK-18832 at 3/6/17 6:55 PM: --- Hi, [~roadster11x]. Thank you for the sample file. I tried the following locally with your Sample code on Apache Spark 2.0.0. (I removed the package name line from the code just for simplicity.) *HIVE* {code} $ hive hive> CREATE FUNCTION hello AS 'SampleUDTF' USING JAR '/Users/dhyun/UDF/a.jar'; hive> exit; $ hive Logging initialized using configuration in jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties hive> select hello('a'); Added [/Users/dhyun/UDF/a.jar] to class path Added resources: [/Users/dhyun/UDF/a.jar] OK ***a*** ###a### Time taken: 1.347 seconds, Fetched: 1 row(s) {code} *SPARK THRIFTSERVER* {code} $ SPARK_HOME=$PWD sbin/start-thriftserver.sh starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out $ bin/beeline -u jdbc:hive2://localhost:1/default Connecting to jdbc:hive2://localhost:1/default ... Connected to: Spark SQL (version 2.0.0) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.2.1.spark2 by Apache Hive 0: jdbc:hive2://localhost:1/default> select hello('a'); +--+--+--+ | first | second | +--+--+--+ | ***a*** | ###a### | +--+--+--+ 1 row selected (2.031 seconds) 0: jdbc:hive2://localhost:1/default> describe function hello; +--+--+ | function_desc | +--+--+ | Function: default.hello | | Class: SampleUDTF| | Usage: N/A. | +--+--+ 3 rows selected (0.041 seconds) 0: jdbc:hive2://localhost:1/default> {code} I'm wondering if your Hive work with your function. was (Author: dongjoon): Hi, [~roadster11x]. Thank you for the sample file. 
I tried the following locally with your Sample code on Apache Spark 2.0.0. (I removed the package name line from the code just for simplicity.) *HIVE* {code} $ hive hive> CREATE FUNCTION hello AS 'SampleUDTF' USING JAR '/Users/dhyun/UDF/a.jar'; hive> exit; $ hive Logging initialized using configuration in jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties hive> select hello('a'); Added [/Users/dhyun/UDF/a.jar] to class path Added resources: [/Users/dhyun/UDF/a.jar] OK ***a*** ###a### Time taken: 1.347 seconds, Fetched: 1 row(s) {code} *SPARK THRIFTSERVER* {code} $ SPARK_HOME=$PWD sbin/start-thriftserver.sh starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out $ bin/beeline -u jdbc:hive2://localhost:1/default Connecting to jdbc:hive2://localhost:1/default ... Connected to: Spark SQL (version 2.0.0) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.2.1.spark2 by Apache Hive 0: jdbc:hive2://localhost:1/default> select hello('a'); +--+--+--+ | first | second | +--+--+--+ | ***a*** | ###a### | +--+--+--+ 1 row selected (2.031 seconds) 0: jdbc:hive2://localhost:1/default> describe function hello; +--+--+ | function_desc | +--+--+ | Function: default.hello | | Class: SampleUDTF| | Usage: N/A. | +--+--+ 3 rows selected (0.041 seconds) 0: jdbc:hive2://localhost:1/default> {code} I'm wondering if Hive work with your function. > Spark SQL: Thriftserver unable to run a registered Hive UDTF > > > Key: SPARK-18832 > URL: https://issues.apache.org/jira/browse/SPARK-18832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: HDP: 2.5 > Spark: 2.0.0 >Reporter: Lokesh Yadav > Attachments: SampleUDTF.java > > > Spark Thriftserver is unable to run a HiveUDTF. 
> It throws an error that it is unable to find the function, although the > function registration succeeds and the function does show up in the list > output by {{show functions}}. > I am using a Hive UDTF, registering it using a jar placed on my local > machine, and calling it using the following commands: > //Registering the function, this command succeeds. > {{CREATE FUNCTION SampleUDTF AS > 'com.fuzzylog
[jira] [Assigned] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19257: --- Assignee: Song Jun > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Song Jun > Fix For: 2.2.0 > > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
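[Editor's illustration] The safety argument above can be shown with the JDK alone: a raw location string is not guaranteed to even parse as a URI, while a typed java.net.URI produced by the filesystem API is percent-encoded up front. The path `/warehouse/my db/t1` below is a made-up example, not from the Spark code base:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Paths;

public class Main {
    public static void main(String[] args) {
        // A raw location string is not necessarily a valid URI:
        // the embedded space makes parsing fail, which a String-typed
        // field cannot catch until the `new Path(new URI(...))` call site.
        try {
            new URI("/warehouse/my db/t1");
            System.out.println("parsed");
        } catch (URISyntaxException e) {
            System.out.println("invalid");
        }

        // A typed java.net.URI built by the filesystem API is percent-encoded
        // at construction time and decodes back to the original path unambiguously.
        URI uri = Paths.get("/warehouse/my db/t1").toUri();
        System.out.println(uri.getPath());
    }
}
```

Storing the field as java.net.URI moves this failure from every read site to the single place the value is constructed.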
[jira] [Resolved] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19257. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17149 [https://github.com/apache/spark/pull/17149] > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.2.0 > > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19304) Kinesis checkpoint recovery is 10x slow
[ https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-19304. - Resolution: Fixed Fix Version/s: 2.2.0 Target Version/s: 2.2.0 Resolved by: https://github.com/apache/spark/pull/16842 > Kinesis checkpoint recovery is 10x slow > --- > > Key: SPARK-19304 > URL: https://issues.apache.org/jira/browse/SPARK-19304 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: using s3 for checkpoints using 1 executor, with 19g mem > & 3 cores per executor >Reporter: Gaurav Shah >Assignee: Gaurav Shah > Labels: kinesis > Fix For: 2.2.0 > > > The application runs fine initially, running batches of 1 hour, and the processing > time is less than 30 minutes on average. Now suppose the > application crashes and we try to restart from the checkpoint. The processing > now takes forever and does not move forward. We tried the same > thing at a batch interval of 1 minute: the processing runs fine and takes 1.2 > minutes for a batch to finish, but when we recover from the checkpoint it takes about > 15 minutes for each batch. After the recovery, the batches again process at > normal speed. > I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown. > Stackoverflow post with more details: > http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19304) Kinesis checkpoint recovery is 10x slow
[ https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-19304: --- Assignee: Gaurav Shah > Kinesis checkpoint recovery is 10x slow > --- > > Key: SPARK-19304 > URL: https://issues.apache.org/jira/browse/SPARK-19304 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: using s3 for checkpoints using 1 executor, with 19g mem > & 3 cores per executor >Reporter: Gaurav Shah >Assignee: Gaurav Shah > Labels: kinesis > > The application runs fine initially, running batches of 1 hour, and the processing > time is less than 30 minutes on average. Now suppose the > application crashes and we try to restart from the checkpoint. The processing > now takes forever and does not move forward. We tried the same > thing at a batch interval of 1 minute: the processing runs fine and takes 1.2 > minutes for a batch to finish, but when we recover from the checkpoint it takes about > 15 minutes for each batch. After the recovery, the batches again process at > normal speed. > I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown. > Stackoverflow post with more details: > http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org