Remember your indexes are in the RDBMS, in this case MySQL. When you are
reading from that table you have an 'id' column, which I assume is an
integer, and you are making parallel reads through JDBC connections to that
table. You can see those threads in MySQL if you query it (for example with
SHOW PROCESSLIST). You specified numPartitions, but MySQL will decide how
many parallel connections it can actually handle.
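
For illustration, here is a minimal sketch of such a partitioned JDBC read
in PySpark (the URL, credentials, bounds and partition count are
placeholders, not taken from your job). Spark splits the
lowerBound..upperBound range of 'id' into numPartitions slices and issues
one SELECT per slice, each with its own range predicate on 'id', which is
why MySQL sees several concurrent sessions for a single read:

# Minimal sketch of a partitioned JDBC read; all connection details and
# bounds below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc_read_sketch").getOrCreate()

df = spark.read.jdbc(
    url="jdbc:mysql://dbhost:3306/schema",   # placeholder URL
    table="schema.table",
    column="id",          # partitioning column, assumed to be an integer
    lowerBound=1,         # assumed min(id)
    upperBound=1000000,   # assumed max(id)
    numPartitions=4,
    properties={"user": "user", "password": "password"}
)

# One partition per id slice; while the read runs, MySQL shows one session
# per partition (SHOW PROCESSLIST).
print(df.rdd.getNumPartitions())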

So the data is read into Spark as RDDs and you can see that through the
Spark UI (port 4040 by default). Then you create a DataFrame (DF) and
register it as a tempTable. That tempTable will not have any indexes; this
is happening in Spark space, not in MySQL. Once you start running your
query and collecting data, Spark will try to cache the data in memory,
which you can again watch through the Spark UI. You can see what the
optimizer does by using the explain() function; you will see that no index
is used.
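
As a sketch (the column names type and action just follow the example in
this thread, and the exact plan text varies by Spark version), this is how
you would look at the plan. Any predicates handed to MySQL appear on the
JDBC scan node as PushedFilters, and nothing in the plan refers to a MySQL
index:

df.createOrReplaceTempView("df")

filtered_df = spark.sql("""
    SELECT *
    FROM df
    WHERE type = 'type'
      AND action = 'action'
""")

# Prints the parsed, analysed, optimised and physical plans.
filtered_df.explain(True)
# In the physical plan look for the JDBC scan, e.g.
#   *Scan JDBCRelation(schema.table) ... PushedFilters: [EqualTo(type,type), ...]
# Whether MySQL then uses an index to satisfy that pushed-down query is
# decided by MySQL itself, not by Spark.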

Spark optimises the work by distributing the data in memory across
executors. It does not use any index. In an RDBMS an index is an ordered
structure over one or more columns, typically stored on disk in B-tree
format, used to speed up queries where needed. A Spark tempTable does not
follow that model. So in summary, your tempTable will benefit from more
executors and more memory if you want to improve query performance.
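
If it helps, here is a sketch of what that tuning looks like in practice
(the numbers are arbitrary examples, not recommendations): cache the temp
table so repeated queries hit Spark memory instead of going back to MySQL,
and size the executors when you submit the job.

# Cache the registered temp table; the Storage tab of the Spark UI
# (port 4040) shows what has been cached and how much memory it uses.
spark.sql("CACHE TABLE df")
spark.sql("SELECT count(*) FROM df WHERE type = 'type'").show()

# Executor count and memory are set at submission time, for example
# (arbitrary numbers):
#   spark-submit --num-executors 8 --executor-memory 4g --executor-cores 2 my_job.py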

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 October 2017 at 23:29, lucas.g...@gmail.com <lucas.g...@gmail.com>
wrote:

> If the underlying table(s) have indexes on them, does Spark use those
> indexes to optimize the query?
>
> IE if a table in my JDBC data source (MySQL in this case) had several
> indexes and my query was filtering on one of the fields with an index,
> would Spark know to push that predicate to the database, or is the
> predicate push-down ignorant of the underlying storage layer details?
>
> Apologies if that still doesn't adequately explain my question.
>
> Gary Lucas
>
> On 19 October 2017 at 15:19, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Sorry, what do you mean by your JDBC table having an index on it? Where
>> are you reading the data from the table?
>>
>> I assume you are referring to the "id" column on the table that you are
>> reading through the JDBC connection.
>>
>> Then you are creating a temp table called "df". That temp table is
>> created in a temporary work space and does not have any index. The index
>> on "id" is used when doing parallel reads into RDDs, not when querying
>> the temp table.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> On 19 October 2017 at 23:10, lucas.g...@gmail.com <lucas.g...@gmail.com>
>> wrote:
>>
>>> IE:  If my JDBC table has an index on it, will the optimizer consider
>>> that when pushing predicates down?
>>>
>>> I noticed in a query like this:
>>>
>>> df = spark.hiveContext.read.jdbc(
>>>   url=jdbc_url,
>>>   table="schema.table",
>>>   column="id",
>>>   lowerBound=lower_bound_id,
>>>   upperBound=upper_bound_id,
>>>   numPartitions=numberPartitions
>>> )
>>> df.registerTempTable("df")
>>>
>>> filtered_df = spark.hiveContext.sql("""
>>>     SELECT
>>>         *
>>>     FROM
>>>         df
>>>     WHERE
>>>         type = 'type'
>>>         AND action = 'action'
>>>         AND audited_changes LIKE '---\ncompany_id:\n- %'
>>> """)
>>> filtered_df.registerTempTable("filtered_df")
>>>
>>>
>>> The queries sent to the DB look like this:
>>> "Select fields from schema.table where type='type' and action='action'
>>> and id > lower_bound and id <= upper_bound"
>>>
>>> And then it does the LIKE filter ( LIKE '---\ncompany_id:\n- %') in
>>> memory, which is great!
>>>
>>> However I'm wondering why it chooses that optimization.  In this case
>>> there aren't any indexes on any of these except ID.
>>>
>>> So, does Spark take into account JDBC indexes in its query plan where
>>> it can?
>>>
>>> Thanks!
>>>
>>> Gary Lucas
>>>
>>
>>
>
