[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294846#comment-14294846
 ] 

Apache Spark commented on SPARK-5097:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4241

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-27 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293905#comment-14293905
 ] 

Reynold Xin commented on SPARK-5097:


I've debating that myself for a while. The main question is whether we want to 
make SchemaRDD/DataFrame column-mutable. I think it can make certain uses more 
concise. However, it can also be confusing when the following happens ...
{code}
val df1 = df.map(...)
df[newCol] = df[oldCol] + 1
df1.map(...)
{code}

or

{code}
val df1 = df.as(a)
df[newCol] = df[oldCol] + 1
df1.join(df) ...
{code}

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-27 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294193#comment-14294193
 ] 

Sandy Ryza commented on SPARK-5097:
---

Ah, yeah, I hadn't considered that aspect.  I definitely agree that making them 
column-mutable would be very ugly.  I'd push for add_column over addColumn, but 
maybe the ship with those naming conventions on it has already sailed.

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294454#comment-14294454
 ] 

Apache Spark commented on SPARK-5097:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4235

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-27 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293849#comment-14293849
 ] 

Sandy Ryza commented on SPARK-5097:
---

Would it be possible to keep the Python versions of setColumn and addColumn 
consistent with Pandas?  The naming sticks out to me as unpythonic, and it 
seems like the migration path we'd like to make appealing is Pandas - Spark, 
not Spark Scala - Spark Python.


 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14288947#comment-14288947
 ] 

Apache Spark commented on SPARK-5097:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4173

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-19 Thread Hamel Ajay Kothari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282978#comment-14282978
 ] 

Hamel Ajay Kothari commented on SPARK-5097:
---

Thanks for the response [~rxin], one more question: how are we planning on 
allowing the breadth of things that we enabled by expressions with this new 
API. For example, if I want to do a join where {{rdd1.colA == rdd2.colB}} but I 
want to cast rdd2.colB to String first, how would I do that? 

In the expressions API I could do {{new EqualTo(colAExpression, 
Cast(colBExpression, DataType.StringType))}} where colAExpression and 
colBExpression are resolved NamedExpressions. How would this look in the new 
API?

I'm happy to take these questions elsewhere if there is a better place to ask. 
Thanks for your help!

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279078#comment-14279078
 ] 

Reynold Xin commented on SPARK-5097:


[~hkothari] that is correct. It will be trivially doable to select columns at 
runtime.

For the 2nd one, not yet. That's a very good point. You can always do an extra 
projection. We will try to add it, if not in the 1st iteration, then in the 2nd 
iteration.

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-15 Thread Hamel Ajay Kothari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278819#comment-14278819
 ] 

Hamel Ajay Kothari commented on SPARK-5097:
---

Am I correct in interpreting that this would allow us to trivially select 
columns at runtime since we'd just use {{SchemaRDD(stringColumnName)}}? In the 
world of catalyst selecting columns known only at runtime was a real pain 
because the only defined way to do it in the docs was to use quasiquotes or use 
{{SchemaRDD.baseLogicalPlan.resolve()}}. The first couldn't be defined at 
runtime (as far as I know) and the second required you to depend on expressions.

Also, is there any way to control the name of the resulting columns from 
groupby+aggregate (or similar methods that add columns) in this plan?

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-13 Thread Mohit Jaggi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275960#comment-14275960
 ] 

Mohit Jaggi commented on SPARK-5097:


minor comment: mutate existing can do 
df(x) = df(x) 

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-12 Thread Mohit Jaggi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273994#comment-14273994
 ] 

Mohit Jaggi commented on SPARK-5097:


Hi,
This is Mohit Jaggi, author of https://github.com/AyasdiOpenSource/bigdf 
Matei had suggested integrating bigdf with SchemaRDD and I was planning on 
doing that soon.
I would love to contribute to this item. Most of the constructs mentioned in 
the design document already exist in bigdf. 

Mohit.

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org