Thanks Takeshi,

I did check it. I believe you are referring to this statement:

"This is likely because we cast this expression weirdly to be compatible
with Hive. Specifically I think this turns into, CAST(c_date AS STRING) >=
"2016-01-01", and we don't push down casts down into data sources.

The reason for casting this way is because users currently expect the
following to work c_date >= "2016". "

There are two issues here:

   1. The CAST expression is not pushed down.
   2. I still don't trust string comparison of dates; it may or may not
   work, and I recall it being an issue in Hive.

'2012-11-23' is not a DATE; it is a string. In general, from my
experience, one should not compare a DATE with a string: the results
depend on several factors, some related to the tool and some to the
session settings.
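
For example, with ISO-format strings the comparison happens to work, but
with other formats it silently goes wrong (a minimal Scala illustration;
the literal values are made up):

    // ISO 'yyyy-MM-dd' strings sort lexicographically in date order,
    // so this string comparison happens to agree with the date comparison:
    println("2016-01-02" >= "2016-01-01")  // true, and correct

    // 'dd/MM/yyyy' strings do not: '9' > '1' as characters, so this is
    // true as a string comparison even though 9 Nov precedes 10 Nov:
    println("9/11/2015" >= "10/11/2015")   // true, but wrong as dates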

To compare reliably, convert the string to a proper DATE first. For
instance, the following converts a date stored as a 'dd/MM/yyyy' string
into a DATE type in Hive:

TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(TransactionDate,'dd/MM/yyyy'),'yyyy-MM-dd'))
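
Applied in a filter, both sides end up in the same normalized form, so the
comparison no longer depends on raw string ordering (a sketch; the
transactions table and its 'dd/MM/yyyy' TransactionDate column are
hypothetical):

    sqlContext.sql(
      """SELECT *
        |FROM transactions
        |WHERE TO_DATE(FROM_UNIXTIME(
        |        UNIX_TIMESTAMP(TransactionDate, 'dd/MM/yyyy'), 'yyyy-MM-dd'))
        |      >= TO_DATE('2016-01-01')
      """.stripMargin)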

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 15 April 2016 at 06:58, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Hi, Mich
>
> Did you check the URL Josh referred to? The cast for string comparisons
> is needed to accept `c_date >= "2016"`.
>
> // maropu
>
>
> On Fri, Apr 15, 2016 at 10:30 AM, Hyukjin Kwon <gurwls...@gmail.com>
> wrote:
>
>> Hi,
>>
>>
>> String comparison itself is pushed down fine; the problem is how to deal
>> with the Cast.
>>
>>
>> It was pushed down before, but that was reverted (
>> https://github.com/apache/spark/pull/8049).
>>
>> Several fixes were attempted, e.g. https://github.com/apache/spark/pull/11005,
>> but none of them made it in.
>>
>>
>> To cut it short, the filter is not being pushed down because it is
>> unsafe to resolve the cast (e.g. long to integer).
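>>
>> A small illustration of why (plain Scala; the value is made up):
>>
>>     val big = 4294967297L    // 2^32 + 1
>>     println(big.toInt)       // prints 1: the high 32 bits are truncated
>>     // So CAST(col AS INT) = 1 matches `big`, but a pushed-down `col = 1`
>>     // filter would not: resolving the cast away changes the results.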
>>
>> As a workaround, the implementation of the Solr data source could be
>> changed to one based on CatalystScan, which receives all the filters.
>>
>> However, CatalystScan is not designed to be binary-compatible across
>> releases, although some consider it stable now, as mentioned here:
>> https://github.com/apache/spark/pull/10750#issuecomment-175400704.
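>>
>> A rough sketch of what that could look like (the class name and body are
>> illustrative only, not the actual spark-solr code):
>>
>>     import org.apache.spark.rdd.RDD
>>     import org.apache.spark.sql.{Row, SQLContext}
>>     import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
>>     import org.apache.spark.sql.sources.{BaseRelation, CatalystScan}
>>     import org.apache.spark.sql.types.StructType
>>
>>     class SolrCatalystRelation(override val sqlContext: SQLContext,
>>                                override val schema: StructType)
>>       extends BaseRelation with CatalystScan {
>>
>>       override def buildScan(requiredColumns: Seq[Attribute],
>>                              filters: Seq[Expression]): RDD[Row] = {
>>         // `filters` still contains the Cast(...) nodes, so the data
>>         // source itself can translate them into Solr range queries.
>>         // (Translation omitted; this sketch returns no rows.)
>>         sqlContext.sparkContext.emptyRDD[Row]
>>       }
>>     }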
>>
>>
>> Thanks!
>>
>>
>> 2016-04-15 3:30 GMT+09:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>
>>> Hi Josh,
>>>
>>> Can you please clarify whether date comparisons as two strings work at
>>> all?
>>>
>>> I was under the impression that with string comparison only the first
>>> characters are compared.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 April 2016 at 19:26, Josh Rosen <joshro...@databricks.com> wrote:
>>>
>>>> AFAIK this is not being pushed down because it involves an implicit
>>>> cast and we currently don't push casts into data sources or scans; see
>>>> https://github.com/databricks/spark-redshift/issues/155 for a
>>>> possibly-related discussion.
>>>>
>>>> On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Are you comparing strings here or timestamps?
>>>>>
>>>>> Filter ((cast(registration#37 as string) >= 2015-05-28) &&
>>>>> (cast(registration#37 as string) <= 2015-05-29))
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 14 April 2016 at 18:04, Kiran Chitturi <
>>>>> kiran.chitt...@lucidworks.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Timestamp range filter queries in SQL are not getting pushed down to
>>>>>> the PrunedFilteredScan instances. The filtering is happening at the Spark
>>>>>> layer.
>>>>>>
>>>>>> The physical plan for timestamp range queries does not show the
>>>>>> pushed filters, whereas range queries on other types work fine and
>>>>>> their physical plans do show the pushed filters.
>>>>>>
>>>>>> Please see below for code and examples.
>>>>>>
>>>>>> *Example:*
>>>>>>
>>>>>> *1.* Range filter queries on Timestamp types
>>>>>>
>>>>>>    *code: *
>>>>>>
>>>>>>> sqlContext.sql("SELECT * from events WHERE `registration` >=
>>>>>>> '2015-05-28' AND `registration` <= '2015-05-29' ")
>>>>>>
>>>>>>    *Full example*:
>>>>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>>>>>>     *plan*:
>>>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql
>>>>>>
>>>>>> *2. * Range filter queries on Long types
>>>>>>
>>>>>>     *code*:
>>>>>>
>>>>>>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
>>>>>>> `length` <= '1000'")
>>>>>>
>>>>>>     *Full example*:
>>>>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>>>>>>     *plan*:
>>>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql
>>>>>>
>>>>>> The SolrRelation class we use (
>>>>>> https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/SolrRelation.scala#L37)
>>>>>> extends PrunedFilteredScan.
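>>>>>>
>>>>>> For context, PrunedFilteredScan's contract (as declared in Spark's
>>>>>> org.apache.spark.sql.sources package) looks like this:
>>>>>>
>>>>>>     import org.apache.spark.rdd.RDD
>>>>>>     import org.apache.spark.sql.Row
>>>>>>     import org.apache.spark.sql.sources.Filter
>>>>>>
>>>>>>     trait PrunedFilteredScan {
>>>>>>       // Spark hands down only predicates it can express as
>>>>>>       // sources.Filter values (EqualTo, GreaterThanOrEqual, ...).
>>>>>>       def buildScan(requiredColumns: Array[String],
>>>>>>                     filters: Array[Filter]): RDD[Row]
>>>>>>     }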
>>>>>>
>>>>>> Since Solr supports date ranges, I would like for the timestamp
>>>>>> filters to be pushed down to the Solr query.
>>>>>>
>>>>>> Are there limitations on the types of filters that can be pushed down
>>>>>> for Timestamp types?
>>>>>> Is there something I should do in my code to fix this?
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> Kiran Chitturi
>>>>>>
>>>>>>
>>>>>
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
