[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718790#comment-15718790 ]

Reynold Xin commented on SPARK-8007:

spark_partition_id() is available in PySpark starting with 1.6. It's in pyspark.sql.functions.spark_partition_id.

> Support resolving virtual columns in DataFrames
> ---
>
> Key: SPARK-8007
> URL: https://issues.apache.org/jira/browse/SPARK-8007
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Joseph Batchik
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to
> SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
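The skew check from the issue description could look roughly like this in PySpark, using the function named in the comment above. This is a minimal sketch, not code from the ticket: the helper names `partition_counts` and `skew_ratio`, the `pid` alias, and the "largest partition divided by mean partition size" metric are all assumptions made for illustration.

```python
def partition_counts(df):
    """Return {partition_id: row_count} for a PySpark DataFrame.

    Hypothetical helper. The import is deferred so that skew_ratio()
    below can be used without Spark installed.
    """
    from pyspark.sql.functions import spark_partition_id  # PySpark 1.6+

    rows = df.groupBy(spark_partition_id().alias("pid")).count().collect()
    return {row["pid"]: row["count"] for row in rows}


def skew_ratio(counts):
    """Largest partition size divided by the mean partition size.

    A value near 1.0 means rows are spread evenly; larger values mean
    one partition holds a disproportionate share. This metric is an
    illustrative choice, not something defined in the ticket.
    """
    total = sum(counts.values())
    if total == 0:
        # No partitions, or only empty ones: report no skew.
        return 0.0
    mean = total / float(len(counts))
    return max(counts.values()) / mean
```

With a live DataFrame `df`, `skew_ratio(partition_counts(df))` would give a single number to watch before and after a `repartition`.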
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718615#comment-15718615 ]

Ruslan Dautkhanov commented on SPARK-8007:

Is spark__partition__id available in PySpark too? Can't find a way to run the same code in PySpark.
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639187#comment-14639187 ]

Joseph Batchik commented on SPARK-8007:

You will be able to solve this issue by doing:
{code:java}
df.groupBy(expr("spark__partition__id()"))
{code}
with SPARK-8668. This will make all of these virtual columns just function calls, so no changes to the analyzer will be needed.
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635597#comment-14635597 ]

Michael Armbrust commented on SPARK-8007:

I'm going to propose that we don't change the analyzer, but instead just use functions for all the cases that were specified. This is nice because we can never be ambiguous with a user column.
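The ambiguity argument in the comment above can be illustrated with a toy model. This is only a sketch: `resolve_name`, `resolve_call`, and the two registries are hypothetical names and do not mirror how Catalyst's analyzer is implemented. It shows that a name-based virtual column shares a namespace with user columns, while a function call is looked up in a separate registry.

```python
# Toy model only: hypothetical names, not Spark's analyzer.
VIRTUAL_COLUMNS = {"SPARK__PARTITION__ID"}
FUNCTIONS = {"spark__partition__id"}


def resolve_name(name, user_columns):
    """Name-based resolution: virtual and user columns share one namespace."""
    is_user = name in user_columns
    is_virtual = name in VIRTUAL_COLUMNS
    if is_user and is_virtual:
        # A user table that happens to have this column makes the
        # reference ambiguous.
        raise ValueError("ambiguous reference: %r" % name)
    if is_user:
        return ("user_column", name)
    if is_virtual:
        return ("virtual_column", name)
    raise KeyError(name)


def resolve_call(func_name):
    """Function-based resolution: calls never collide with column names."""
    if func_name in FUNCTIONS:
        return ("function", func_name)
    raise KeyError(func_name)
```

In this model, `resolve_name("SPARK__PARTITION__ID", ["id", "SPARK__PARTITION__ID"])` raises, whereas `resolve_call("spark__partition__id")` resolves regardless of what columns the user's data contains.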
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630897#comment-14630897 ]

Reynold Xin commented on SPARK-8007:

[~jd] Take a look at this: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L146
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631761#comment-14631761 ]

Reynold Xin commented on SPARK-8007:

Thanks - please submit a pull request once you have it working.
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631673#comment-14631673 ]

Joseph Batchik commented on SPARK-8007:

Reynold, thanks for pointing that out. I updated the commit to use what you suggested. This should also make it easy to add other virtual columns as described in the parent ticket. All that should need to be done is updating the resolver in the logical plan and the new virtual column rule.
https://github.com/JDrit/spark/commit/7b46e7de6f98df98480fa34c85248aa2d90bc635#diff-d74f782d414a74eee09a4b6b9994be87R34
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631835#comment-14631835 ]

Apache Spark commented on SPARK-8007:

User 'JDrit' has created a pull request for this issue:
https://github.com/apache/spark/pull/7478
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630820#comment-14630820 ]

Joseph Batchik commented on SPARK-8007:

[~rxin] Reynold, I started adding virtual columns to the DataFrames and SQL queries for SPARK-8003 and SPARK-8007. My initial code is here: https://github.com/JDrit/spark/commit/e34d3a7eabbc9c41c2dd85b128b2bb5713039e40.

The one issue I ran into, though, was that the catalyst package cannot access org.apache.spark.sql.execution.expressions, where SparkPartitionID resides. For prototyping purposes I copied SparkPartitionID to the catalyst package, but I am wondering what the best way to deal with that dependency would be.

Can you let me know what you think about my changes and what else needs to be done?