Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Koert Kuipers
if DataFrame aspires to be more than a vehicle for SQL then i think it would be a mistake to allow duplicate column names. it is very confusing. pandas indeed allows this and it has led to many bugs. R does not allow it for data.frame (it renames the duplicates). i would consider a csv with
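
For illustration, the R-style renaming mentioned above can be sketched in plain Scala. This is only a sketch: `makeUnique` is a hypothetical helper, not part of Spark, and the `.1`, `.2` suffix scheme is an assumption modelled on what R's data.frame does by default.

```scala
import scala.collection.mutable

// Hypothetical helper (not a Spark API): give every repeated column name a
// numeric suffix, in the spirit of R's data.frame ("a", "a.1", "a.2", ...).
def makeUnique(names: Seq[String]): Seq[String] = {
  val used = mutable.Set.empty[String]
  names.map { name =>
    var candidate = name
    var suffix = 0
    // Set.add returns false when the name is already taken, so keep
    // bumping the suffix until we find a free one.
    while (!used.add(candidate)) {
      suffix += 1
      candidate = s"$name.$suffix"
    }
    candidate
  }
}
```

With a DataFrame `df` in hand, something like `df.toDF(makeUnique(df.columns.toSeq): _*)` should apply the new names, assuming positional renaming is what the application wants.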

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Xiao Li
True. As long as we can ensure the correct messages are printed out, users can correct their apps easily. For example: Reference 'name' is ambiguous, could be: name#1, name#5.;

Thanks,
Xiao Li

2015-10-14 23:58 GMT-07:00 Reynold Xin:
> That could break a lot of applications.

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Reynold Xin
That could break a lot of applications. In particular, a lot of input data sources (csv, json) don't have a clean schema, and can have duplicate column names. For the case of join, maybe a better solution is to ask for the left/right prefix/suffix in the user code, similar to what Pandas does. On Wed,
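
A rough sketch of the suffix idea for joins, in plain Scala. `suffixOverlaps` is a hypothetical helper and the default suffixes `_x` / `_y` are borrowed from pandas as an assumption; nothing here is an existing Spark API. Unlike pandas, this sketch does not exempt join keys from suffixing.

```scala
// Hypothetical helper (not a Spark API): append a side-specific suffix to
// every column name that appears on both sides of a join.
def suffixOverlaps(
    left: Seq[String],
    right: Seq[String],
    leftSuffix: String = "_x",   // assumed defaults, mirroring pandas
    rightSuffix: String = "_y"
): (Seq[String], Seq[String]) = {
  val overlap = left.toSet.intersect(right.toSet)
  val newLeft = left.map(n => if (overlap(n)) n + leftSuffix else n)
  val newRight = right.map(n => if (overlap(n)) n + rightSuffix else n)
  (newLeft, newRight)
}
```

The returned name lists could then be applied to each side with `toDF(names: _*)` before joining, so the joined result never carries two columns with the same name.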

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Michael Armbrust
> In Hive, the ambiguous name can be resolved by using the table name as
> prefix, but it seems DataFrame doesn't support it (I mean the DataFrame API rather
> than SparkSQL)

You can do the same using pure DataFrames.

Seq((1,2)).toDF("a", "b").registerTempTable("y")
Seq((1,4)).toDF("a",
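
As a separate illustration (not a reconstruction of the truncated message above), one way to qualify ambiguous columns with the DataFrame API is to alias each side and refer to columns through the alias. The sketch assumes a Spark 2.x or later `SparkSession`; the data, the aliases `l` and `r`, and the output column names are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

// Local session just for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("alias-join-sketch")
  .getOrCreate()
import spark.implicits._

// Two DataFrames that both have columns "id" and "name".
val left = Seq((1, "alice")).toDF("id", "name").as("l")
val right = Seq((1, "paris")).toDF("id", "name").as("r")

// Qualify each column with its alias so "name" is no longer ambiguous,
// then rename on the way out so the result has unique column names.
val joined = left
  .join(right, $"l.id" === $"r.id")
  .select($"l.id".as("id"), $"l.name".as("left_name"), $"r.name".as("right_name"))
```

Selecting plain `$"name"` from the joined result instead would be expected to fail with an ambiguous-reference error like the one quoted earlier in the thread.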