[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2016-12-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718790#comment-15718790
 ] 

Reynold Xin commented on SPARK-8007:


spark_partition_id() is available in PySpark starting with 1.6. It's in 
pyspark.sql.functions.spark_partition_id.
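
A minimal sketch of the skew check from the issue description, written for PySpark. The helper names rows_per_partition and skew_ratio and the "partition_id" alias are made up for illustration; only spark_partition_id() comes from PySpark itself. The Spark import is done inside the function so the pure-Python helper can be used without a Spark installation.

```python
def rows_per_partition(df):
    """Return {partition_id: row_count} for a PySpark DataFrame."""
    # Imported lazily so skew_ratio below works without Spark installed.
    from pyspark.sql.functions import spark_partition_id

    rows = (
        df.groupBy(spark_partition_id().alias("partition_id"))
        .count()
        .collect()
    )
    return {row["partition_id"]: row["count"] for row in rows}


def skew_ratio(counts):
    """Largest partition size divided by the mean partition size.

    A value near 1.0 suggests evenly sized partitions; larger values
    suggest that some partitions hold far more rows than others.
    """
    sizes = list(counts.values())
    if not sizes:
        return 0.0
    return max(sizes) / (sum(sizes) / len(sizes))
```

With a live DataFrame this would be called as skew_ratio(rows_per_partition(df)); what counts as "too skewed" depends on the workload.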


> Support resolving virtual columns in DataFrames
> ---
>
> Key: SPARK-8007
> URL: https://issues.apache.org/jira/browse/SPARK-8007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Joseph Batchik
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
> SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2016-12-03 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718615#comment-15718615
 ] 

Ruslan Dautkhanov commented on SPARK-8007:
--

Is spark__partition__id available in PySpark too? Can't find a way to run the 
same code in PySpark.




[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-23 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639187#comment-14639187
 ] 

Joseph Batchik commented on SPARK-8007:
---

You will be able to solve this issue by doing:

{code:java}
df.groupBy(expr("spark__partition__id()"))
{code}

with SPARK-8668.

This will turn all of these virtual columns into plain function calls, so no 
changes to the analyzer will be needed.




[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-21 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635597#comment-14635597
 ] 

Michael Armbrust commented on SPARK-8007:
-

I'm going to propose that we don't change the analyzer, but instead just use 
functions for all the cases that were specified.  This is nice because we can 
never be ambiguous with a user column.
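
To illustrate the ambiguity argument, here is a toy sketch in plain Python. It is not Spark's analyzer; VIRTUAL_COLUMNS, resolve_by_name and resolve_function are hypothetical names made up for this example. The point is only that a reserved column name shares a namespace with user columns, while a function call does not.

```python
# Hypothetical set of reserved virtual column names.
VIRTUAL_COLUMNS = {"SPARK__PARTITION__ID"}


def resolve_by_name(name, user_columns):
    """Name-based lookup: a reserved name can collide with a user column."""
    matches = []
    if name in user_columns:
        matches.append(("user", name))
    if name in VIRTUAL_COLUMNS:
        matches.append(("virtual", name))
    if len(matches) > 1:
        raise ValueError(f"ambiguous reference: {name}")
    if not matches:
        raise KeyError(name)
    return matches[0]


def resolve_function(name, functions):
    """Function lookup: a separate namespace, so columns cannot collide."""
    if name not in functions:
        raise KeyError(name)
    return ("function", name)
```

If a user table happens to contain a column called SPARK__PARTITION__ID, the name-based lookup has to either raise or silently pick one side, whereas the function lookup is unaffected by the table's columns.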





[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630897#comment-14630897
 ] 

Reynold Xin commented on SPARK-8007:


[~jd] Take a look at this: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L146






[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631761#comment-14631761
 ] 

Reynold Xin commented on SPARK-8007:


Thanks - please submit a pull request once you have it working. 




[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-17 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631673#comment-14631673
 ] 

Joseph Batchik commented on SPARK-8007:
---

Reynold, thanks for pointing that out. I updated the commit to use what you 
suggested. This should also make it easy to add the other virtual columns 
described in the parent ticket. All that should be needed is an update to the 
resolver in the logical plan and to the new virtual column rule.

https://github.com/JDrit/spark/commit/7b46e7de6f98df98480fa34c85248aa2d90bc635#diff-d74f782d414a74eee09a4b6b9994be87R34




[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631835#comment-14631835
 ] 

Apache Spark commented on SPARK-8007:
-

User 'JDrit' has created a pull request for this issue:
https://github.com/apache/spark/pull/7478




[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-16 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630820#comment-14630820
 ] 

Joseph Batchik commented on SPARK-8007:
---

[~rxin] Reynold, I started adding virtual columns to DataFrames and SQL 
queries for SPARK-8003 and SPARK-8007. My initial code is here: 
https://github.com/JDrit/spark/commit/e34d3a7eabbc9c41c2dd85b128b2bb5713039e40.

The one issue I ran into, though, was that the catalyst package cannot access 
org.apache.spark.sql.execution.expressions, where SparkPartitionID resides. For 
prototyping purposes I copied SparkPartitionID to the catalyst package, but I am 
wondering what the best way to deal with that dependency would be.

Can you let me know what you think about my changes and what else needs to be 
done?
