Running into an issue trying to perform a simple join of two DataFrames created from two different parquet files on HDFS.
[main] INFO org.apache.spark.SparkContext - Running Spark version 1.4.1

Using HDFS from Hadoop 2.7.0.

Here is a sample to illustrate.

public void testStrangeness(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("joinIssue");
    JavaSparkContext context = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(context);

    DataFrame people = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/people.parquet");
    DataFrame address = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/address.parquet");

    people.printSchema();
    address.printSchema();

    // yeah, works
    DataFrame cartJoin = address.join(people);
    cartJoin.printSchema();

    // boo, fails
    DataFrame joined = address.join(people,
            address.col("addrid").equalTo(people.col("addressid")));
    joined.printSchema();
}

Contents of people
------------------------
first,last,addressid
your,mom,1
fred,flintstone,2

Contents of address
------------------------
addrid,city,state,zip
1,sometown,wi,4444
2,bedrock,il,1111

people.printSchema(); results in...

root
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)

address.printSchema(); results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)

DataFrame cartJoin = address.join(people);
cartJoin.printSchema();

Cartesian join works fine, printSchema() results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)

This join...

DataFrame joined = address.join(people,
        address.col("addrid").equalTo(people.col("addressid")));

Results in the following exception.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "addrid" among (addrid, city, state, zip);
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
    at org.apache.spark.sql.DataFrame.col(DataFrame.scala:558)
    at dw.dataflow.DataflowParser.testStrangeness(DataflowParser.java:36)
    at dw.dataflow.DataflowParser.main(DataflowParser.java:119)

I also tried changing the data so that people and address share a common key attribute (addressid) and used

address.join(people, "addressid");

but got the same result. Any ideas?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Simple-join-of-two-Spark-DataFrame-failing-with-org-apache-spark-sql-AnalysisException-Cannot-resolv-tp24557.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
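
The name quoted in the exception looks identical to one of the columns in the list it was checked against, so one thing that may be worth ruling out is an invisible character in a column name, such as a byte-order mark or a trailing space. This is only a guess, not a confirmed cause. Below is a minimal plain-Java sketch of such a check, assuming the names would come from address.columns(). The class ColumnNameCheck and its methods are hypothetical helpers, not part of Spark, and the BOM-prefixed input in main is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Spark): makes invisible characters in
// column names visible so that look-alike names can be told apart.
public class ColumnNameCheck {

    // Returns the name with every character outside visible ASCII
    // (including spaces and control characters) replaced by a
    // backslash-u style hex escape.
    public static String escapeNonPrintable(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c > 0x20 && c < 0x7f) {
                sb.append(c);
            } else {
                sb.append("\\u").append(String.format("%04x", (int) c));
            }
        }
        return sb.toString();
    }

    // Returns the escaped form of every name that contains at least one
    // character outside visible ASCII; an empty list means all names are clean.
    public static List<String> suspiciousNames(String[] columns) {
        List<String> result = new ArrayList<>();
        for (String column : columns) {
            String escaped = escapeNonPrintable(column);
            if (!escaped.equals(column)) {
                result.add(escaped);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input: the first name carries a leading byte-order mark.
        // In the real job this array would be address.columns().
        String[] columns = {"\uFEFFaddrid", "city", "state", "zip"};
        for (String escaped : suspiciousNames(columns)) {
            System.out.println("suspicious column name: " + escaped);
        }
    }
}
```

If suspiciousNames returns an empty list for both DataFrames, hidden characters in the column names can be excluded and the cause lies elsewhere.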