if DataFrame aspires to be more than a vehicle for SQL then i think it
would be a mistake to allow duplicate column names. it is very confusing.
pandas does allow this and it has led to many bugs. R does not allow it
for data.frame (it renames the duplicates).
i would consider a csv with
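
As a rough illustration of the renaming approach attributed to R above, here is a minimal sketch in plain Scala (no Spark dependency). The `ColumnNames` object, the `dedupe` helper, and the `.1`, `.2` suffix scheme are assumptions made for this example; they are not part of any Spark API.

```scala
// Hypothetical helper, not part of Spark: makes column names unique by
// appending ".1", ".2", ... to repeated names (suffix scheme is an assumption).
object ColumnNames {
  def dedupe(names: Seq[String]): Seq[String] = {
    val used = scala.collection.mutable.Set.empty[String]
    names.map { name =>
      var candidate = name
      var n = 0
      // Keep bumping the numeric suffix until the candidate is unused.
      while (used.contains(candidate)) {
        n += 1
        candidate = s"$name.$n"
      }
      used += candidate
      candidate
    }
  }
}
```

With Spark, the resulting names could presumably be applied when loading data by passing them to `toDF(names: _*)`.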
True. As long as we can ensure the correct messages are printed, users
can correct their apps easily. For example: Reference 'name' is ambiguous,
could be: name#1, name#5.;
Thanks,
Xiao Li
2015-10-14 23:58 GMT-07:00 Reynold Xin :
> That could break a lot of applications.
That could break a lot of applications. In particular, a lot of input data
sources (csv, json) don't have clean schemas, and can have duplicate column
names.
For the case of join, maybe a better solution is to ask for a left/right
prefix/suffix in the user code, similar to what pandas does.
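
A minimal sketch of the suffix idea, in plain Scala with no Spark dependency. The `JoinNames` object and `withSuffixes` helper are hypothetical names for this example; the `_x`/`_y` defaults mirror the default suffixes of pandas `merge`. Only non-key columns that appear on both sides are renamed.

```scala
// Hypothetical helper, not part of Spark: computes output column names for
// both sides of a join, suffixing non-key names that clash.
object JoinNames {
  def withSuffixes(
      left: Seq[String],
      right: Seq[String],
      keys: Set[String],
      lsuffix: String = "_x",
      rsuffix: String = "_y"): (Seq[String], Seq[String]) = {
    // Names present on both sides, excluding the join keys.
    val clash = left.toSet.intersect(right.toSet) -- keys
    def rename(cols: Seq[String], suffix: String): Seq[String] =
      cols.map(c => if (clash(c)) c + suffix else c)
    (rename(left, lsuffix), rename(right, rsuffix))
  }
}
```

The renamed lists could then be applied to each side before the join (for example via `toDF(names: _*)`), so the joined result has no ambiguous references.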
On Wed,
>
> In Hive, the ambiguous name can be resolved by using the table name as a
> prefix, but it seems DataFrame doesn't support it (I mean the DataFrame API
> rather than SparkSQL)
You can do the same using pure DataFrames.
Seq((1,2)).toDF("a", "b").registerTempTable("y")
Seq((1,4)).toDF("a",