I stand corrected. I just created a test table with a single int field, and the Dataset loaded from it works with no issues. I'll see if I can track down exactly what the difference is.
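For anyone who wants to try the same kind of check without a database, here is a minimal sketch against the Spark 1.6 API. It is not the exact test described above: the `Single` case class, the `single` temp table, and the `SelfJoinCheck` object are hypothetical names, and the temp table is registered from an in-memory Seq rather than loaded over JDBC, so it only exercises the "Dataset created from a sql query" path.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Dataset, SQLContext}

// Hypothetical one-column case class standing in for the single-int-field test table.
case class Single(id: Option[Int])

object SelfJoinCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("self-join-check").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Dataset built directly from a Seq.
    val fromSeq: Dataset[Single] =
      Seq(Single(Some(1)), Single(Some(2)), Single(Some(3))).toDS

    // Assumption: a temp table registered from the Seq stands in for the JDBC table.
    fromSeq.toDF.registerTempTable("single")

    // Dataset built from a sql query against the temp table.
    val fromSql: Dataset[Single] = sqlContext.sql("select * from single").as[Single]

    // Same aliased self join on both Datasets; compare how each one behaves on show().
    fromSql.as("a").joinWith(fromSql.as("b"), $"a.id" === $"b.id").show()
    fromSeq.as("a").joinWith(fromSeq.as("b"), $"a.id" === $"b.id").show()

    sc.stop()
  }
}
```

To get closer to the failing case, swap the temp table for one loaded with sqlContext.read.format("jdbc") and widen the case class toward the real table a few fields at a time.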
On Fri, May 27, 2016 at 10:29 AM Tim Gautier <[email protected]> wrote:

> I'm using 1.6.1.
>
> I'm not sure what good fake data would do since it doesn't seem to have
> anything to do with the data itself. It has to do with how the Dataset was
> created. Both datasets have exactly the same data in them, but the one
> created from a sql query fails where the one created from a Seq works. The
> case class is just a few Option[Int] and Option[String] fields, nothing
> special.
>
> Obviously there's some sort of difference between the two datasets, but
> Spark tells me they're exactly the same type with exactly the same data, so
> I couldn't create a test case without actually accessing a sql database.
>
> On Fri, May 27, 2016 at 10:15 AM Ted Yu <[email protected]> wrote:
>
>> Which release of Spark are you using ?
>>
>> Is it possible to come up with fake data that shows what you described ?
>>
>> Thanks
>>
>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier <[email protected]>
>> wrote:
>>
>>> Unfortunately I can't show exactly the data I'm using, but this is what
>>> I'm seeing:
>>>
>>> I have a case class 'Product' that represents a table in our database. I
>>> load that data via
>>> sqlContext.read.format("jdbc").options(...).load.as[Product]
>>> and register it in a temp table 'product'.
>>>
>>> For testing, I created a new Dataset that has only 3 records in it:
>>>
>>> val ts = sqlContext.sql("select * from product where product_catalog_id
>>> in (1, 2, 3)").as[Product]
>>>
>>> I also created another one using the same case class and data, but from
>>> a sequence instead.
>>>
>>> val ds: Dataset[Product] = Seq(
>>> Product(Some(1), ...),
>>> Product(Some(2), ...),
>>> Product(Some(3), ...)
>>> ).toDS
>>>
>>> The spark shell tells me these are exactly the same type at this point,
>>> but they don't behave the same.
>>>
>>> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
>>> $"ts2.product_catalog_id")
>>> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
>>> $"ds2.product_catalog_id")
>>>
>>> Again, spark tells me these self joins return exactly the same type, but
>>> when I do a .show on them, only the one created from a Seq works. The one
>>> created by reading from the database throws this error:
>>>
>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
>>> ...];
>>>
>>> Is this a bug? Is there anyway to make the Dataset loaded from a table
>>> behave like the one created from a sequence?
>>>
>>> Thanks,
>>> Tim
