[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73625791 Types need to exist, but names don't. They can just be generated column names like _1, _2, _3. In Scala, if you import sqlContext.implicits._, then any RDD[Product] (which includes RDDs of case classes and RDDs of tuples) can be implicitly turned into a DataFrame. In Python, I think we can add an explicit method that turns an RDD of tuples into a DataFrame, if that doesn't exist yet. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
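The default-naming behavior described here (unnamed tuple columns become _1, _2, _3, …) can be illustrated without Spark. This plain-Python toy (the `rows_with_default_names` helper is illustrative only, not a Spark API) shows conceptually what the implicit conversion produces for an RDD of tuples:

```python
# Toy model (no Spark): map each tuple position to a generated
# column name _1, _2, ..., the way Spark names unnamed columns.
def rows_with_default_names(tuples):
    rows = []
    for t in tuples:
        # One dict per row, keyed by the generated column names.
        rows.append({"_%d" % (i + 1): v for i, v in enumerate(t)})
    return rows

print(rows_with_default_names([(1, "A1", True), (2, "B2", False)]))
```

In real Spark the conversion also captures the element types; this sketch only shows where the _N names come from.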
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73631311 I just talked to @davies offline. He is going to submit a PR that adds createDataFrame with named columns. I think we can roll this change into that one and close this PR. It would be great if @dwmclary could take a look once that is submitted.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73621476 @dwmclary thanks for submitting this. I think this is similar to the toDataFrame method that supports renaming, isn't it?
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73621532 In particular, I'm talking about https://github.com/apache/spark/blob/68b25cf695e0fce9e465288d5a053e540a3fccb4/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L105
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73623236 Reynold, It is similar, but I think the distinction here is that toDataFrame appears to require that old names (and a schema) already exist. Or, at least, that's what DataFrameImpl.scala suggests: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala, line 93. I think there's a benefit to having a quick way to get a DataFrame from a plain RDD. If we don't want to do @davies' applyNames idea, then maybe we can change the behavior of toDataFrame. Cheers, Dan
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73626452 Ah, yes, I see that now. Python doesn't seem to have a toDataFrame, so maybe the logical thing to do here is a new PR with a Python implementation of toDataFrame -- it would take a little from my current PR and then call into the Scala method. What do you think?
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73626542 Or, I guess I can just do it in this PR if you don't mind it changing a bunch.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73627424 Adding toDataFrame to the Python DataFrame API is a great idea. You can do it in this PR if you want (make sure you update the title). Also, you might want to do it on top of https://github.com/apache/spark/pull/4479; otherwise it will conflict.
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73632890 Sounds like a plan -- I'll do it on top of #4479. One thought: I've added a getReservedWords private method to SQLContext.scala. I feel like leaving it there isn't a bad idea: other methods may need to check reserved words in the future.
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73325875 Updated to keep reserved words in the JVM.
GitHub user dwmclary opened a pull request: https://github.com/apache/spark/pull/4421 Spark-2789: Apply names to RDD to create DataFrame

This seemed like a reasonably useful function to add to Spark SQL. However, unlike the [JIRA](https://issues.apache.org/jira/browse/SPARK-2789), this implementation does not parse type characters (e.g. brackets and braces). This method creates a DataFrame with column names that map to the existing types in the RDD. In general, this seems far more useful, as users likely wish to quickly apply names to existing collections.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dwmclary/spark SPARK-2789

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4421.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4421

commit df8b01528519ebe0c480daedcc5099306e690a5e Author: Dan McClary dan.mccl...@gmail.com Date: 2015-02-05T18:56:14Z basic apply names functionality
commit 15eb351e2a1c43191193bca768607cc56ce3aede Author: Dan McClary dan.mccl...@gmail.com Date: 2015-02-05T23:31:04Z working for map type
commit aa38d7618a9cd069f73cf8673bfdef4ecc0fe339 Author: Dan McClary dan.mccl...@gmail.com Date: 2015-02-06T02:43:30Z added array and list types, struct types don't seem relevant
commit 29d8ffa58b6faa9f20b9c36b5afe649d523e2eb8 Author: Dan McClary dan.mccl...@gmail.com Date: 2015-02-06T05:14:34Z added applyNames to pyspark
commit 8c773b372c122c4b90f375933e83816ec99ace1d Author: Dan McClary dan.mccl...@gmail.com Date: 2015-02-06T07:41:24Z added pyspark method and tests
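The idea of the patch -- pairing a name string with each record of a plain RDD -- can be sketched in ordinary Python without Spark. The `apply_names` helper below is a hypothetical stand-in for the PR's method, not its actual implementation (which builds a DataFrame rather than dicts):

```python
# Toy sketch (no Spark): zip a space-separated name string with
# each record, erroring if lengths disagree.
def apply_names(name_string, records):
    names = name_string.split()
    out = []
    for rec in records:
        if len(rec) != len(names):
            raise ValueError("record length does not match number of names")
        out.append(dict(zip(names, rec)))
    return out

print(apply_names("a b c", [[1, "A1", True], [2, "B2", False]]))
```

The length check mirrors the PR's assumption that the RDD contains iterables of equal length.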
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73201347 Can one of the admins verify this patch?
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4421#discussion_r24247234

--- Diff: python/pyspark/sql.py ---
@@ -1469,6 +1470,44 @@ def applySchema(self, rdd, schema):
         df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
         return DataFrame(df, self)
+    def applyNames(self, nameString, plainRdd):
+        """
+        Builds a DataFrame from an RDD based on column names.
+
+        Assumes RDD contains iterables of equal length.
+        >>> unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false", "3, C3, true", "4, D4, false"])
+        >>> input = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2])])
+        >>> df1 = sqlCtx.applyNames("a b c", input)
+        >>> df1.registerTempTable("df1")
+        >>> sqlCtx.sql("select a from df1").collect()
+        [Row(a=1), Row(a=2), Row(a=3), Row(a=4)]
+        >>> input2 = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2]), {"k": int(x[0]), "v": 2*int(x[0])}, x])
+        >>> df2 = sqlCtx.applyNames("a b c d e", input2)
+        >>> df2.registerTempTable("df2")
+        >>> sqlCtx.sql("select d['k']+d['v'] from df2").collect()
+        [Row(c0=3), Row(c0=6), Row(c0=9), Row(c0=12)]
+        >>> sqlCtx.sql("select b, e[1] from df2").collect()
+        [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
+        """
+        fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
+        reservedWords = set(map(string.lower, ["ABS", "ALL", "AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
--- End diff --

I can't really speak to this patch in general, since I don't know much about this part of Spark SQL, but to avoid duplication it probably makes sense to keep the list of reserved words in the JVM and fetch it into Python from there.
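The field-name parsing in this hunk splits on spaces while keeping quoted names (which may contain spaces) as single tokens, then checks each token against a lowercased reserved-word set. A self-contained sketch of that approach, with a deliberately abbreviated reserved list (the real patch carries the full SQL keyword set):

```python
import re

# Split a name string on spaces, keeping double- or single-quoted
# names intact as single tokens, then reject reserved words.
def parse_field_names(name_string, reserved=("as", "select", "from")):
    # The capturing group makes re.split return the quoted separators
    # too; empty/whitespace fragments are filtered out.
    tokens = [f for f in re.split(r'( |".*?"|\'.*?\')', name_string) if f.strip()]
    for t in tokens:
        if t.lower() in reserved:
            raise ValueError("%r is a reserved word" % t)
    return tokens

print(parse_field_names('a b "first name"'))
```

Because the alternation captures quoted runs before the bare-space branch can split them, `"first name"` survives as one column token.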
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/4421#discussion_r24253601 --- Diff: python/pyspark/sql.py --- (same hunk as in r24247234) --- End diff -- Seems like a reasonable request to me. I couldn't decide whether it was better to pickle and ship a list of words or to have it instantiated in both places.