Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-09 Thread Nicholas Chammas
I've opened an issue for a few doc fixes that the PySpark DataFrame API needs: SPARK-7505.

On Fri, May 8, 2015 at 3:10 PM Nicholas Chammas wrote:
> Ah, neat. So in the example I gave earlier, I’d do this to get columns
> from specific dataframes:

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
Ah, neat. So in the example I gave earlier, I’d do this to get columns from specific dataframes:

>>> df12.select(df1['a'], df2['other'])
DataFrame[a: bigint, other: string]
>>> df12.select(df1['a'], df2['other']).show()
a other
4 I dunno

This perhaps should be documented in an example.

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Reynold Xin
You can actually just use df1['a'] in a projection to differentiate. E.g. in Scala (similar things work in Python):

scala> val df1 = Seq((1, "one")).toDF("a", "b")
df1: org.apache.spark.sql.DataFrame = [a: int, b: string]
scala> val df2 = Seq((2, "two")).toDF("a", "b")
df2: org.apache.spark.sql.DataFrame = [a: int, b: string]

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
Oh, I didn't know about that. Thanks for the pointer, Rakesh.

I wonder why they did that, as opposed to taking the cue from SQL and prefixing column names with a specifiable dataframe alias. The suffix approach seems quite ugly.

Nick

On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani wrote:
> To add to the above discussion, Pandas allows suffixing and prefixing to
> solve this issue:
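The SQL-style alias prefixing Nick prefers can at least be emulated in plain pandas. A minimal sketch, assuming pandas is available; the frame contents are invented to mirror the thread's example, and this uses standard pandas calls (`add_prefix`, `join`), not a Spark API:

```python
import pandas as pd

# Hypothetical frames with identically named columns, echoing the thread's example
df1 = pd.DataFrame({"a": [4], "other": ["I know"]})
df2 = pd.DataFrame({"a": [1], "other": ["I dunno"]})

# Prefix each frame's columns with an "alias" before joining,
# analogous to SQL's df1.a / df2.a namespacing
joined = df1.add_prefix("df1.").join(df2.add_prefix("df2."))
print(list(joined.columns))  # ['df1.a', 'df1.other', 'df2.a', 'df2.other']
```

This gives unambiguous, alias-qualified column names up front, at the cost of renaming every column rather than only the colliding ones.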

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Rakesh Chalasani
To add to the above discussion, Pandas allows suffixing and prefixing to solve this issue:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html

Rakesh

On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas wrote:
> DataFrames, as far as I can tell, don’t have an equivalent to
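The pandas behavior Rakesh links to can be sketched as follows (sample data invented for illustration; `lsuffix`/`rsuffix` are real parameters of `pandas.DataFrame.join`):

```python
import pandas as pd

# Two frames whose column names collide
left = pd.DataFrame({"a": [1], "b": ["one"]})
right = pd.DataFrame({"a": [2], "b": ["two"]})

# pandas.DataFrame.join resolves the collision by suffixing the
# overlapping column names, rather than by table-style aliases
joined = left.join(right, lsuffix="_left", rsuffix="_right")
print(list(joined.columns))  # ['a_left', 'b_left', 'a_right', 'b_right']
```

Note that pandas raises an error if the columns overlap and no suffixes are given, so the disambiguation is opt-in per join.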

DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table aliases. This is essential when joining dataframes that have identically named columns.

>>> # PySpark 1.3.1
>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>>> df2 = sqlContext.jsonRDD(sc.pa