[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73625791
  
Types need to exist, but names don't. They can just be random column names 
like _1, _2, _3. 

In Scala, if you import sqlContext.implicits._, then any RDD[Product] 
(which includes RDD of case classes and RDD of tuples) can be implicitly turned 
into a DataFrame.

In Python, I think we can add an explicit method that turns an RDD of tuples 
into a DataFrame, if that doesn't exist yet.
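
For example, here's a rough sketch of the shape I have in mind (the method 
name rddToDataFrame is made up for illustration; the final name and 
signature are open questions):

    rdd = sc.parallelize([(1, "A1", True), (2, "B2", False)])

    # Types come from the tuples themselves; with no names supplied the
    # columns would default to _1, _2, _3, mirroring the Scala conversion.
    df = sqlCtx.rddToDataFrame(rdd)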





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73631311
  
I just talked to @davies offline. He is going to submit a PR that adds 
createDataFrame with named columns. I think we can roll this into that one and 
close this PR. It would be great, @dwmclary, if you could take a look once 
that is submitted.
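
For reference, a sketch of how the call could look once that lands (the 
signature is hypothetical until the PR is actually submitted):

    # createDataFrame would take a plain RDD of tuples plus column names
    rdd = sc.parallelize([(1, "A1", True), (2, "B2", False)])
    df = sqlCtx.createDataFrame(rdd, ["a", "b", "c"])
    df.registerTempTable("t")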





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73621476
  
@dwmclary thanks for submitting this. I think this is similar to the 
toDataFrame method that supports renaming, isn't it?






[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73621532
  
In particular, I'm talking about 
https://github.com/apache/spark/blob/68b25cf695e0fce9e465288d5a053e540a3fccb4/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L105





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73623236
  
Reynold,

  It is similar, but I think the distinction here is that toDataFrame
appears to require that old names (and a schema) already exist.  Or, at least
that's what DataFrameImpl.scala suggests:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala,
line 93.

  I think there's a benefit to having a quick way to get a DataFrame from a
plain RDD.  If we don't want to do @davies' applyNames idea, then maybe we
can change the behavior of toDataFrame.

Cheers,
Dan








[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73626452
  
Ah, yes, I see that now.

Python doesn't seem to have a toDataFrame, so maybe the logical thing to do
here is to open a new PR with a Python implementation of toDataFrame --
it would take a little from my current PR and then call into the Scala method.

What do you think?
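
Something like this, maybe (just a sketch: _jdf is the py4j handle to the
wrapped Scala DataFrame, and _to_java_seq stands in for whatever
Python-list-to-Scala-Seq conversion we end up using):

    def toDataFrame(self, *cols):
        """Return a new DataFrame with columns renamed to `cols`."""
        # delegate the renaming to the Scala toDataFrame(colNames: String*)
        jdf = self._jdf.toDataFrame(_to_java_seq(self._sc, cols))
        return DataFrame(jdf, self.sql_ctx)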








[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73626542
  
Or, I guess I can just do it in this PR, if you don't mind it changing quite
a bit.










[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73627424
  
Adding toDataFrame to the Python DataFrame API is a great idea. You can do it 
in this PR if you want (make sure you update the title). 

Also - you might want to do it on top of 
https://github.com/apache/spark/pull/4479, otherwise it will conflict.





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73632890
  
Sounds like a plan -- I'll do it on top of #4479.

Thought: I've added a getReservedWords private method to SQLContext.scala.
I feel like leaving that there isn't a bad idea: other methods may need to
check reserved words in the future.








[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-08 Thread dwmclary
Github user dwmclary commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73325875
  
Updated to keep reserved words in the JVM.





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-06 Thread dwmclary
GitHub user dwmclary opened a pull request:

https://github.com/apache/spark/pull/4421

Spark-2789: Apply names to RDD to create DataFrame

This seemed like a reasonably useful function to add to Spark SQL.  However, 
unlike the [JIRA](https://issues.apache.org/jira/browse/SPARK-2789), this 
implementation does not parse type characters (e.g. brackets and braces).  
Instead, this method creates a DataFrame with column names that map onto the 
types already present in the RDD.  In general, this seems far more useful, as 
users likely want to quickly apply names to existing collections.
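
Example usage, taken from the doctest in this PR (applyNames takes a 
whitespace-separated string of column names and a plain RDD):

    rows = sc.parallelize(["1, A1, true", "2, B2, false"]) \
             .map(lambda x: x.split(",")) \
             .map(lambda x: [int(x[0]), x[1], bool(x[2])])
    df = sqlCtx.applyNames("a b c", rows)
    df.registerTempTable("df")
    sqlCtx.sql("select a from df").collect()  # [Row(a=1), Row(a=2)]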



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dwmclary/spark SPARK-2789

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4421


commit df8b01528519ebe0c480daedcc5099306e690a5e
Author: Dan McClary dan.mccl...@gmail.com
Date:   2015-02-05T18:56:14Z

basic apply names functionality

commit 15eb351e2a1c43191193bca768607cc56ce3aede
Author: Dan McClary dan.mccl...@gmail.com
Date:   2015-02-05T23:31:04Z

working for map type

commit aa38d7618a9cd069f73cf8673bfdef4ecc0fe339
Author: Dan McClary dan.mccl...@gmail.com
Date:   2015-02-06T02:43:30Z

added array and list types, struct types don't seem relevant

commit 29d8ffa58b6faa9f20b9c36b5afe649d523e2eb8
Author: Dan McClary dan.mccl...@gmail.com
Date:   2015-02-06T05:14:34Z

added applyNames to pyspark

commit 8c773b372c122c4b90f375933e83816ec99ace1d
Author: Dan McClary dan.mccl...@gmail.com
Date:   2015-02-06T07:41:24Z

added pyspark method and tests







[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4421#issuecomment-73201347
  
Can one of the admins verify this patch?





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-06 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4421#discussion_r24247234
  
--- Diff: python/pyspark/sql.py ---
@@ -1469,6 +1470,44 @@ def applySchema(self, rdd, schema):
         df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
         return DataFrame(df, self)
 
+    def applyNames(self, nameString, plainRdd):
+        """
+        Builds a DataFrame from an RDD based on column names.
+
+        Assumes RDD contains iterables of equal length.
+        >>> unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false", "3, C3, true", "4, D4, false"])
+        >>> input = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2])])
+        >>> df1 = sqlCtx.applyNames("a b c", input)
+        >>> df1.registerTempTable("df1")
+        >>> sqlCtx.sql("select a from df1").collect()
+        [Row(a=1), Row(a=2), Row(a=3), Row(a=4)]
+        >>> input2 = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2]), {"k": int(x[0]), "v": 2*int(x[0])}, x])
+        >>> df2 = sqlCtx.applyNames("a b c d e", input2)
+        >>> df2.registerTempTable("df2")
+        >>> sqlCtx.sql("select d['k']+d['v'] from df2").collect()
+        [Row(c0=3), Row(c0=6), Row(c0=9), Row(c0=12)]
+        >>> sqlCtx.sql("select b, e[1] from df2").collect()
+        [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
+        """
+        fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
+        reservedWords = set(map(string.lower, ["ABS", "ALL", "AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
--- End diff --

I can't really speak to this patch in general, since I don't know much 
about this part of Spark SQL, but to avoid duplication it probably makes sense 
to keep the list of reserved words in the JVM and fetch it into Python from 
there.
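
Roughly what I have in mind, as a sketch (it assumes a JVM-side accessor for
the list, which doesn't exist yet):

    def _reserved_words(self):
        # Fetch the list once from the Scala SQLContext through the py4j
        # handle and cache it; getReservedWords is hypothetical here.
        if not hasattr(self, "_reserved"):
            self._reserved = set(
                w.lower() for w in self._ssql_ctx.getReservedWords())
        return self._reserved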





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-06 Thread dwmclary
Github user dwmclary commented on a diff in the pull request:

https://github.com/apache/spark/pull/4421#discussion_r24253601
  
--- Diff: python/pyspark/sql.py ---
@@ -1469,6 +1470,44 @@ def applySchema(self, rdd, schema):
         df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
         return DataFrame(df, self)
 
+    def applyNames(self, nameString, plainRdd):
+        """
+        Builds a DataFrame from an RDD based on column names.
+
+        Assumes RDD contains iterables of equal length.
+        >>> unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false", "3, C3, true", "4, D4, false"])
+        >>> input = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2])])
+        >>> df1 = sqlCtx.applyNames("a b c", input)
+        >>> df1.registerTempTable("df1")
+        >>> sqlCtx.sql("select a from df1").collect()
+        [Row(a=1), Row(a=2), Row(a=3), Row(a=4)]
+        >>> input2 = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2]), {"k": int(x[0]), "v": 2*int(x[0])}, x])
+        >>> df2 = sqlCtx.applyNames("a b c d e", input2)
+        >>> df2.registerTempTable("df2")
+        >>> sqlCtx.sql("select d['k']+d['v'] from df2").collect()
+        [Row(c0=3), Row(c0=6), Row(c0=9), Row(c0=12)]
+        >>> sqlCtx.sql("select b, e[1] from df2").collect()
+        [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
+        """
+        fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
+        reservedWords = set(map(string.lower, ["ABS", "ALL", "AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
--- End diff --

Seems like a reasonable request to me.  I couldn't decide whether it was
better to pickle and ship a list of words or just to have the list
instantiated in both places.




