i am glad to see this, i think we can into this as well (in 2.0.0-SNAPSHOT) but i couldn't reproduce it nicely.
my observation was that joins of 2 datasets that were derived from the same datasource gave this kind of trouble. i changed my datasource from val to def (so it got created twice) as a workaround. the error did not occur with datasets created in unit test with sc.parallelize. On Fri, May 27, 2016 at 1:26 PM, Ted Yu <[email protected]> wrote: > I tried master branch : > > scala> val testMapped = test.map(t => t.copy(id = t.id + 1)) > testMapped: org.apache.spark.sql.Dataset[Test] = [id: int] > > scala> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $" > t2.id").show > org.apache.spark.sql.AnalysisException: cannot resolve '`t1.id`' given > input columns: [id]; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68) > > > Suggest logging a JIRA if there is none logged. > > On Fri, May 27, 2016 at 10:19 AM, Tim Gautier <[email protected]> > wrote: > >> Oops, screwed up my example. This is what it should be: >> >> case class Test(id: Int) >> val test = Seq( >> Test(1), >> Test(2), >> Test(3) >> ).toDS >> test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show >> val testMapped = test.map(t => t.copy(id = t.id + 1)) >> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $" >> t2.id").show >> >> >> On Fri, May 27, 2016 at 11:16 AM Tim Gautier <[email protected]> >> wrote: >> >>> I figured it out the trigger. Turns out it wasn't because I loaded it >>> from the database, it was because the first thing I do after loading is to >>> lower case all the strings. After a Dataset has been mapped, the resulting >>> Dataset can't be self joined. Here's a test case that illustrates the issue: >>> >>> case class Test(id: Int) >>> val test = Seq( >>> Test(1), >>> Test(2), >>> Test(3) >>> ).toDS >>> test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show >>> // <-- works fine >>> val testMapped = test.map(_.id + 1) // add 1 to each >>> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $" >>> t2.id").show // <-- error >>> >>> >>> On Fri, May 27, 2016 at 10:44 AM Tim Gautier <[email protected]> >>> wrote: >>> >>>> I stand corrected. I just created a test table with a single int field >>>> to test with and the Dataset loaded from that works with no issues. I'll >>>> see if I can track down exactly what the difference might be. >>>> >>>> On Fri, May 27, 2016 at 10:29 AM Tim Gautier <[email protected]> >>>> wrote: >>>> >>>>> I'm using 1.6.1. >>>>> >>>>> I'm not sure what good fake data would do since it doesn't seem to >>>>> have anything to do with the data itself. It has to do with how the >>>>> Dataset >>>>> was created. Both datasets have exactly the same data in them, but the one >>>>> created from a sql query fails where the one created from a Seq works. The >>>>> case class is just a few Option[Int] and Option[String] fields, nothing >>>>> special. >>>>> >>>>> Obviously there's some sort of difference between the two datasets, >>>>> but Spark tells me they're exactly the same type with exactly the same >>>>> data, so I couldn't create a test case without actually accessing a sql >>>>> database. >>>>> >>>>> On Fri, May 27, 2016 at 10:15 AM Ted Yu <[email protected]> wrote: >>>>> >>>>>> Which release of Spark are you using ? >>>>>> >>>>>> Is it possible to come up with fake data that shows what you >>>>>> described ? >>>>>> >>>>>> Thanks >>>>>> >>>>>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Unfortunately I can't show exactly the data I'm using, but this is >>>>>>> what I'm seeing: >>>>>>> >>>>>>> I have a case class 'Product' that represents a table in our >>>>>>> database. I load that data via >>>>>>> sqlContext.read.format("jdbc").options(...). >>>>>>> load.as[Product] and register it in a temp table 'product'. >>>>>>> >>>>>>> For testing, I created a new Dataset that has only 3 records in it: >>>>>>> >>>>>>> val ts = sqlContext.sql("select * from product where >>>>>>> product_catalog_id in (1, 2, 3)").as[Product] >>>>>>> >>>>>>> I also created another one using the same case class and data, but >>>>>>> from a sequence instead. >>>>>>> >>>>>>> val ds: Dataset[Product] = Seq( >>>>>>> Product(Some(1), ...), >>>>>>> Product(Some(2), ...), >>>>>>> Product(Some(3), ...) >>>>>>> ).toDS >>>>>>> >>>>>>> The spark shell tells me these are exactly the same type at this >>>>>>> point, but they don't behave the same. >>>>>>> >>>>>>> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" === >>>>>>> $"ts2.product_catalog_id") >>>>>>> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" === >>>>>>> $"ds2.product_catalog_id") >>>>>>> >>>>>>> Again, spark tells me these self joins return exactly the same type, >>>>>>> but when I do a .show on them, only the one created from a Seq works. >>>>>>> The >>>>>>> one created by reading from the database throws this error: >>>>>>> >>>>>>> org.apache.spark.sql.AnalysisException: cannot resolve >>>>>>> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id, >>>>>>> ...]; >>>>>>> >>>>>>> Is this a bug? Is there anyway to make the Dataset loaded from a >>>>>>> table behave like the one created from a sequence? >>>>>>> >>>>>>> Thanks, >>>>>>> Tim >>>>>>> >>>>>> >>>>>> >
