Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

lucas.g...@gmail.com Fri, 20 Oct 2017 09:19:37 -0700

Right, that makes sense and I understood that.

The thing I'm wondering about (And i think the answer is 'no' at this
stage).


When the optimizer is running and pushing predicates down, does it take
into account indexing and other storage layer strategies in determining
which predicates are processed in memory and which predicates are pushed to
storage.

Thanks!

Gary Lucas


On 20 October 2017 at 07:32, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> here below Gary
>
> filtered_df = spark.hiveContext.sql("""
>     SELECT
>         *
>     FROM
>         df
>     WHERE
>         type = 'type'
>         AND action = 'action'
>         AND audited_changes LIKE '---\ncompany_id:\n- %'
> """)
> filtered_audits.registerTempTable("filtered_df")
>
>
> you are using hql to read data from your temporary table "df" and then
> creating a temporary table on the subset of that temptable "df".
>
> What is the  purpose of it?
>
> When you are within Spark itself data is read in. Granted the indexes on
> RDBMS help reading data through the JDBC connection but do not play any
> role later in running the sal in hql.
>
> Does that make sense?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 20 October 2017 at 00:04, lucas.g...@gmail.com <lucas.g...@gmail.com>
> wrote:
>
>> Ok, so when Spark is forming queries it's ignorant of the underlying
>> storage layer index.
>>
>> If there is an index on a table Spark doesn't take that into account when
>> doing the predicate push down in optimization. In that case why does spark
>> push 2 of my conditions (where fieldx = 'action') to the database but then
>> do the like in memory.  Is that just a function a straightforward LIKE's
>> are done in memory and simple equalities are pushed to the storage layer?
>>
>> remember your indexes are in RDBMS
>>
>>
>> Exactly what I'm asking about, when spark issues the query via the JDBC
>> reader, that query is / is not ignorant of the underlying indexes?  How
>> does spark determine which predicates to perform in the RDD and which
>> predicates to execute in the storage layer?  I guess I should just dig out
>> the JDBC data-frame reader code and see if I can make sense of that?  Or is
>> the predicate push-down stage independent of the readers?
>>
>> Thanks for helping me form a more accurate question!
>>
>> Gary
>>
>>
>>
>> On 19 October 2017 at 15:46, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> remember your indexes are in RDBMS. In this case MySQL. When you are
>>> reading from that table you have an 'id' column which I assume is an
>>> integer and you are making parallel threads through JDBC connection to that
>>> table. You can see the threads in MySQL if you query it. You can see
>>> multiple threads. You stated numPartitions but MySQL will decide how many
>>> parallel threads it can handle.
>>>
>>> So data is read into Spark to RDDs and you can se that through SPAK GUI
>>> (port 4040 by default). Then you create a DataFrame (DF) and convert it
>>> into a tempTable. tempTable will not have any indexes. This is happening in
>>> Spark space not MySQL. Once you start reading in your query and collect
>>> data then it will try to cache data in Spark memory. You can see this again
>>> through Spark GUI. You can see the optimizer by using explain() function.
>>> You will see that no index is used.
>>>
>>> Spark uses distributed data in memory to optimize the work. It does not
>>> use any index. In RDBMS an index is an ordered set of column or columns
>>> stored on the disk in B-tree format to improve the query where needed.
>>> Spark tempTable does not follow that method. So in summary your tempTable
>>> will benefit from more executors and memory if you want to improve the
>>> query performance.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 19 October 2017 at 23:29, lucas.g...@gmail.com <lucas.g...@gmail.com>
>>> wrote:
>>>
>>>> If the underlying table(s) have indexes on them.  Does spark use those
>>>> indexes to optimize the query?
>>>>
>>>> IE if I had a table in my JDBC data source (mysql in this case) had
>>>> several indexes and my query was filtering on one of the fields with an
>>>> index.  Would spark know to push that predicate to the database or is the
>>>> predicate push-down ignorant of the underlying storage layer details.
>>>>
>>>> Apologies if that still doesn't adequately explain my question.
>>>>
>>>> Gary Lucas
>>>>
>>>> On 19 October 2017 at 15:19, Mich Talebzadeh <mich.talebza...@gmail.com
>>>> > wrote:
>>>>
>>>>> sorry what do you mean my JDBC table has an index on it? Where are you
>>>>> reading the data from the table?
>>>>>
>>>>> I assume you are referring to "id" column on the table that you are
>>>>> reading through JDBC connection.
>>>>>
>>>>> Then you are creating a temp Table called "df". That temp table is
>>>>> created in temporary work space and does not have any index. That index
>>>>> "id" is used when doing parallel reads into RDDs not when querying the 
>>>>> temp
>>>>> Table.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 19 October 2017 at 23:10, lucas.g...@gmail.com <
>>>>> lucas.g...@gmail.com> wrote:
>>>>>
>>>>>> IE:  If my JDBC table has an index on it, will the optimizer consider
>>>>>> that when pushing predicates down?
>>>>>>
>>>>>> I noticed in a query like this:
>>>>>>
>>>>>> df = spark.hiveContext.read.jdbc(
>>>>>>   url=jdbc_url,
>>>>>>   table="schema.table",
>>>>>>   column="id",
>>>>>>   lowerBound=lower_bound_id,
>>>>>>   upperBound=upper_bound_id,
>>>>>>   numPartitions=numberPartitions
>>>>>> )
>>>>>> df.registerTempTable("df")
>>>>>>
>>>>>> filtered_df = spark.hiveContext.sql("""
>>>>>>     SELECT
>>>>>>         *
>>>>>>     FROM
>>>>>>         df
>>>>>>     WHERE
>>>>>>         type = 'type'
>>>>>>         AND action = 'action'
>>>>>>         AND audited_changes LIKE '---\ncompany_id:\n- %'
>>>>>> """)
>>>>>> filtered_audits.registerTempTable("filtered_df")
>>>>>>
>>>>>>
>>>>>> The queries sent to the DB look like this:
>>>>>> "Select fields from schema.table where type='type' and
>>>>>> action='action' and id > lower_bound and id <= upper_bound"
>>>>>>
>>>>>> And then it does the like ( LIKE '---\ncompany_id:\n- %') in memory,
>>>>>> which is great!
>>>>>>
>>>>>> However I'm wondering why it chooses that optimization.  In this case
>>>>>> there aren't any indexes on any of these except ID.
>>>>>>
>>>>>> So, does spark take into account JDBC indexes in it's query plan
>>>>>> where it can?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Gary Lucas
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

Reply via email to