Re: I'm pretty sure this is a Dataset bug

Tim Gautier Fri, 27 May 2016 09:30:04 -0700

I'm using 1.6.1.

I'm not sure what good fake data would do since it doesn't seem to have
anything to do with the data itself. It has to do with how the Dataset was
created. Both datasets have exactly the same data in them, but the one
created from a SQL query fails where the one created from a Seq works. The
case class is just a few Option[Int] and Option[String] fields, nothing
special.


Obviously there's some sort of difference between the two datasets, but
Spark tells me they're exactly the same type with exactly the same data, so
I couldn't create a test case without actually accessing a SQL database.
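
One possible way to attempt a database-free reproduction is to register the
Seq-based Dataset as a temp table and read it back through sqlContext.sql, so
that the second Dataset comes from a query rather than a local Seq. The
following is only a sketch for the 1.6 spark-shell: the two-field Product is a
hypothetical stand-in for the real case class, the values are made up, and it
is not confirmed that this triggers the same AnalysisException as the JDBC
source does.

```scala
// Sketch for spark-shell 1.6.x, where `sqlContext` is already defined.
import org.apache.spark.sql.Dataset
import sqlContext.implicits._

// Hypothetical stand-in: the real Product has more Option[Int] / Option[String] fields.
case class Product(product_catalog_id: Option[Int], name: Option[String])

// Dataset built directly from a local Seq (illustrative values only).
val ds: Dataset[Product] = Seq(
  Product(Some(1), Some("a")),
  Product(Some(2), Some("b")),
  Product(Some(3), Some("c"))
).toDS

// Assumption: a temp table backed by the Seq stands in for the JDBC-loaded table.
ds.toDF.registerTempTable("product")

// Dataset built from a SQL query over the temp table.
val ts: Dataset[Product] = sqlContext
  .sql("select * from product where product_catalog_id in (1, 2, 3)")
  .as[Product]

// The same aliased self joins as in the original report.
ts.as("ts1").joinWith(ts.as("ts2"),
  $"ts1.product_catalog_id" === $"ts2.product_catalog_id").show()
ds.as("ds1").joinWith(ds.as("ds2"),
  $"ds1.product_catalog_id" === $"ds2.product_catalog_id").show()
```

If both joins succeed here, the difference presumably lies in the JDBC source
itself rather than in going through sqlContext.sql.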

On Fri, May 27, 2016 at 10:15 AM Ted Yu <[email protected]> wrote:

> Which release of Spark are you using?
>
> Is it possible to come up with fake data that shows what you described?
>
> Thanks
>
> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier <[email protected]>
> wrote:
>
>> Unfortunately I can't show exactly the data I'm using, but this is what
>> I'm seeing:
>>
>> I have a case class 'Product' that represents a table in our database. I
>> load that data via 
>> sqlContext.read.format("jdbc").options(...).load.as[Product]
>> and register it in a temp table 'product'.
>>
>> For testing, I created a new Dataset that has only 3 records in it:
>>
>> val ts = sqlContext.sql("select * from product where product_catalog_id in (1, 2, 3)").as[Product]
>>
>> I also created another one using the same case class and data, but from a
>> sequence instead.
>>
>> val ds: Dataset[Product] = Seq(
>>       Product(Some(1), ...),
>>       Product(Some(2), ...),
>>       Product(Some(3), ...)
>>     ).toDS
>>
>> The Spark shell tells me these are exactly the same type at this point,
>> but they don't behave the same.
>>
>> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
>> $"ts2.product_catalog_id")
>> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
>> $"ds2.product_catalog_id")
>>
>> Again, Spark tells me these self joins return exactly the same type, but
>> when I call .show on them, only the one created from a Seq works. The one
>> created by reading from the database throws this error:
>>
>> org.apache.spark.sql.AnalysisException: cannot resolve
>> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
>> ...];
>>
>> Is this a bug? Is there any way to make the Dataset loaded from a table
>> behave like the one created from a sequence?
>>
>> Thanks,
>> Tim
>>
>
>
