Running into an issue trying to perform a simple join of two DataFrames
created from two different parquet files on HDFS.

[main] INFO org.apache.spark.SparkContext - Running Spark version 1.4.1

Using HDFS from Hadoop 2.7.0



Here is a sample to illustrate.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public void testStrangeness(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("joinIssue");
    JavaSparkContext context = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(context);

    DataFrame people = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/people.parquet");
    DataFrame address = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/address.parquet");

    people.printSchema();
    address.printSchema();

    // the unconditioned (cartesian) join works...
    DataFrame cartJoin = address.join(people);
    cartJoin.printSchema();

    // ...but the equi-join on addrid = addressid fails
    DataFrame joined = address.join(people,
            address.col("addrid").equalTo(people.col("addressid")));

    joined.printSchema();
}




Contents of people
------------------------
first,last,addressid 
your,mom,1 
fred,flintstone,2


Contents of address
------------------------
addrid,city,state,zip
1,sometown,wi,4444
2,bedrock,il,1111
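
In case it helps with reproduction, the parquet files were written along
these lines (a rough sketch only; the Person bean class and the write path
are placeholders, not the actual ingest code):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;

// Person is a plain JavaBean with first/last/addressid getters and setters
JavaRDD<Person> peopleRdd = context.parallelize(Arrays.asList(
        new Person("your", "mom", 1),
        new Person("fred", "flintstone", 2)));
DataFrame peopleDf = sqlContext.createDataFrame(peopleRdd, Person.class);
peopleDf.write().parquet("hdfs://localhost:9000//datalake/sample/people.parquet");
// address.parquet was produced the same way from an Address bean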



people.printSchema(); 
results in...

root
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)



address.printSchema();
results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)



DataFrame cartJoin = address.join(people);
cartJoin.printSchema();

The cartesian join works fine; printSchema() results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)



This join...

DataFrame joined = address.join(people,
        address.col("addrid").equalTo(people.col("addressid")));

...results in the following exception:


Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "addrid" among (addrid, city, state, zip);
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
    at org.apache.spark.sql.DataFrame.col(DataFrame.scala:558)
    at dw.dataflow.DataflowParser.testStrangeness(DataflowParser.java:36)
    at dw.dataflow.DataflowParser.main(DataflowParser.java:119)
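
The odd part is that "addrid" appears in the very list the resolver says it
is choosing from. A quick sanity check right before the join shows the
column is present (sketch; columns() comes from the DataFrame API):

System.out.println(java.util.Arrays.toString(address.columns()));
// expected: [addrid, city, state, zip], matching the schema above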



I also tried changing the data so that people and address share a common
key attribute (addressid) and used

address.join(people, "addressid");

but got the same result.
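
One thing I have not tried yet: renaming the column at the DataFrame level
before joining. Would something like this (untested sketch, using
withColumnRenamed from the DataFrame API) be expected to behave any
differently?

DataFrame addressRenamed = address.withColumnRenamed("addrid", "addressid");
DataFrame joined = addressRenamed.join(people, "addressid");
joined.printSchema();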

Any ideas??

Thanks


