Re: [DISCUSS] Incremental statistics collection

2023-09-06 Thread Rakesh Raushan
Hi all,

I would like to hear more from the community on this topic.
I believe it would significantly improve statistics collection in Spark.

Thanks
Rakesh



On Sat, 2 Sep 2023 at 10:36 AM, Rakesh Raushan 
wrote:

> Thanks all for your insights.
>
> @Mich
> I am not trying to introduce any sampling model here.
> This idea is about collecting the task write metrics while writing the
> data and aggregating them with the existing values present in the
> catalog (creating a new entry if it's a CTAS command).
> This approach is much simpler to implement, although it does introduce a
> limitation for external tables, where users can update data without Spark.
> Similarly, `ALTER TABLE ADD PARTITION` would also require scanning the
> added partitions, as currently that seems to be the only way of getting the
> metrics.
>
> @Chetan
> Sure, we can analyze multi-column stats as well. But to get them to update
> automatically, we need to get this one done first. That can be a future
> scope for this feature.
>
> I feel auto-gathering of stats would have been a better name for this one.
> We already have similar features in Hive and major DBMSs (SQL Server, MySQL).
>
> I would like to hear more from the dev community on this. Is the dev
> community in favour of having this feature in Spark?
> I have made the SPIP doc editable for further comments or questions on what I
> am trying to achieve or how I am going to implement it.
> Thanks,
> Rakesh
>
> On Wed, Aug 30, 2023 at 9:42 PM Mich Talebzadeh 
> wrote:
>
>> Sorry I missed this one
>>
>> In the context what has been changed we ought to have an additional
>> column timestamp
>>
>> In short we can have
>>
>> datachange(object_name, partition_name, colname, timestamp)
>>
>> timestamp is the point in time you want to compare against for changes.
>>
>> Example
>>
>> SELECT * FROM  WHERE datachange('', '2023-08-01 00:00:00')
>> = 1
>>
>>
>> This query should return all rows from the  table that have been
>> changed since June 1, 2023, 00:00:00.
>>
>> Let me know your thoughts
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.
>>
>>
>>
>>
>>
>>
>>
>> On Wed, 30 Aug 2023 at 10:19, Mich Talebzadeh 
>> wrote:
>>
>>> Another idea that came to my mind from the old days, is the concept of
>>> having a function called *datachange*
>>>
>>> This datachange function should measure the amount of change in the data
>>> distribution since ANALYZE STATISTICS last ran. Specifically, it should
>>> measure the number of inserts, updates and deletes that have occurred on
>>> the given object and helps us determine if running ANALYZE STATISTICS would
>>> benefit the query plan.
>>>
>>> something like
>>>
>>> select datachange(object_name, partition_name, colname)
>>>
>>> Where:
>>>
>>> object_name – is the object name. fully qualified objectname. The
>>> object_name cannot be null.
>>> partition_name – is the data partition name. This can be a null value.
>>> colname – is the column name for which the datachange is requested. This
>>> can be a null value (meaning all columns)
>>>
>>> This should be expressed as a percentage of the total number of rows in
>>> the table or partition (if the partition is specified). The percentage
>>> value can be greater than 100% because the number of changes to an object
>>> can be much greater than the number of rows in the table, particularly when
>>> the number of deletes and updates to a table is very high.
>>>
>>> So we can run this function to see if ANALYZE STATISTICS is required on
>>> a certain column.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical 

Re: [DISCUSS] Incremental statistics collection

2023-09-01 Thread Rakesh Raushan
Thanks all for your insights.

@Mich
I am not trying to introduce any sampling model here.
This idea is about collecting the task write metrics while writing the data
and aggregating them with the existing values present in the catalog
(creating a new entry if it's a CTAS command).
This approach is much simpler to implement, although it does introduce a
limitation for external tables, where users can update data without Spark.
Similarly, `ALTER TABLE ADD PARTITION` would also require scanning the
added partitions, as currently that seems to be the only way of getting the
metrics.
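
For illustration only, a minimal sketch of the aggregation step (CatalogStatistics is Spark's existing stats holder; the WriteMetrics case class and the aggregateWriteMetrics helper are hypothetical names, not part of the SPIP):

import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

// Hypothetical shape of the metrics reported by a completed write task.
case class WriteMetrics(numRows: Long, numBytes: Long)

// Fold a write's metrics into the stats already recorded in the catalog;
// a CTAS starts from None and simply creates a fresh entry. This models the
// append path only; overwrites, deletes and writes done outside Spark are
// exactly where this simple addition breaks down.
def aggregateWriteMetrics(existing: Option[CatalogStatistics],
                          written: WriteMetrics): CatalogStatistics =
  existing match {
    case Some(s) =>
      CatalogStatistics(
        sizeInBytes = s.sizeInBytes + BigInt(written.numBytes),
        rowCount    = s.rowCount.map(_ + BigInt(written.numRows)))
    case None =>
      CatalogStatistics(
        sizeInBytes = BigInt(written.numBytes),
        rowCount    = Some(BigInt(written.numRows)))
  }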

@Chetan
Sure, we can analyze multi-column stats as well. But to get them to update
automatically, we need to get this one done first. That can be a future
scope for this feature.

I feel auto-gathering of stats would have been a better name for this one.
We already have similar features in Hive and major DBMSs (SQL Server, MySQL).

I would like to hear more from the dev community on this. Is the dev
community in favour of having this feature in Spark?
I have made the SPIP doc editable for further comments or questions on what I
am trying to achieve or how I am going to implement it.
Thanks,
Rakesh

On Wed, Aug 30, 2023 at 9:42 PM Mich Talebzadeh 
wrote:

> Sorry I missed this one
>
> In the context what has been changed we ought to have an additional column
> timestamp
>
> In short we can have
>
> datachange(object_name, partition_name, colname, timestamp)
>
> timestamp is the point in time you want to compare against for changes.
>
> Example
>
> SELECT * FROM  WHERE datachange('', '2023-08-01 00:00:00') =
> 1
>
>
> This query should return all rows from the  table that have been
> changed since June 1, 2023, 00:00:00.
>
> Let me know your thoughts
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Wed, 30 Aug 2023 at 10:19, Mich Talebzadeh 
> wrote:
>
>> Another idea that came to my mind from the old days, is the concept of
>> having a function called *datachange*
>>
>> This datachange function should measure the amount of change in the data
>> distribution since ANALYZE STATISTICS last ran. Specifically, it should
>> measure the number of inserts, updates and deletes that have occurred on
>> the given object and helps us determine if running ANALYZE STATISTICS would
>> benefit the query plan.
>>
>> something like
>>
>> select datachange(object_name, partition_name, colname)
>>
>> Where:
>>
>> object_name – is the object name. fully qualified objectname. The
>> object_name cannot be null.
>> partition_name – is the data partition name. This can be a null value.
>> colname – is the column name for which the datachange is requested. This
>> can be a null value (meaning all columns)
>>
>> This should be expressed as a percentage of the total number of rows in
>> the table or partition (if the partition is specified). The percentage
>> value can be greater than 100% because the number of changes to an object
>> can be much greater than the number of rows in the table, particularly when
>> the number of deletes and updates to a table is very high.
>>
>> So we can run this function to see if ANALYZE STATISTICS is required on a
>> certain column.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.
>>
>>
>>

Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Sorry I missed this one

In the context of determining what has been changed, we ought to have an
additional timestamp parameter.

In short we can have

datachange(object_name, partition_name, colname, timestamp)

timestamp is the point in time you want to compare against for changes.

Example

SELECT * FROM <table_name> WHERE datachange('<table_name>', '2023-08-01 00:00:00') = 1


This query should return all rows from the <table_name> table that have been
changed since August 1, 2023, 00:00:00.
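
Purely as a sketch (no such function exists in Spark; the ChangeRecord log and its fields are assumptions), the timestamp parameter could narrow the count of recorded changes like this:

import java.sql.Timestamp

// Hypothetical change-log entry recorded for each DML operation on a table.
case class ChangeRecord(table: String, partition: Option[String],
                        column: Option[String], changedAt: Timestamp)

// datachange(object_name, partition_name, colname, timestamp): count the
// recorded changes at or after the given point in time; a null (None)
// partition or column means "all".
def datachange(log: Seq[ChangeRecord], table: String,
               partition: Option[String], column: Option[String],
               since: Timestamp): Long =
  log.count { r =>
    r.table == table &&
      partition.forall(p => r.partition.contains(p)) &&
      column.forall(c => r.column.contains(c)) &&
      !r.changedAt.before(since)
  }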

Let me know your thoughts

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom




Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







On Wed, 30 Aug 2023 at 10:19, Mich Talebzadeh 
wrote:

> Another idea that came to my mind from the old days, is the concept of
> having a function called *datachange*
>
> This datachange function should measure the amount of change in the data
> distribution since ANALYZE STATISTICS last ran. Specifically, it should
> measure the number of inserts, updates and deletes that have occurred on
> the given object and helps us determine if running ANALYZE STATISTICS would
> benefit the query plan.
>
> something like
>
> select datachange(object_name, partition_name, colname)
>
> Where:
>
> object_name – is the object name. fully qualified objectname. The
> object_name cannot be null.
> partition_name – is the data partition name. This can be a null value.
> colname – is the column name for which the datachange is requested. This
> can be a null value (meaning all columns)
>
> This should be expressed as a percentage of the total number of rows in
> the table or partition (if the partition is specified). The percentage
> value can be greater than 100% because the number of changes to an object
> can be much greater than the number of rows in the table, particularly when
> the number of deletes and updates to a table is very high.
>
> So we can run this function to see if ANALYZE STATISTICS is required on a
> certain column.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Wed, 30 Aug 2023 at 00:49, Chetan  wrote:
>
>> Thanks for the detailed explanation.
>>
>>
>> Regards,
>> Chetan
>>
>>
>>
>> On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh 
>> wrote:
>>
>>> OK, let us take a deeper look here
>>>
>>> ANALYSE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3*
>>>
>>> In above, we are *explicitly grouping columns c1 and c2 together for
>>> which we want to compute statistic*s. Additionally, we are also *computing
>>> statistics for column c3 independen*t*ly*. This approach *allows CBO to
>>> treat columns c1 and c2 as a group and compute joint statistics for them,
>>> while computing separate statistics for column c3.*
>>>
>>> If columns c1 and c2 are frequently used together in conditions, I
>>> concur it makes sense to compute joint statistics for them by using the
>>> above syntax. On the other hand, if each column has its own significance
>>> and the relationship between them is not crucial, we can use
>>>
>>> ANALYSE TABLE mytable COMPUTE STATISTICS FOR 

Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Another idea that came to my mind from the old days is the concept of
having a function called *datachange*.

This datachange function should measure the amount of change in the data
distribution since ANALYZE STATISTICS last ran. Specifically, it should
measure the number of inserts, updates and deletes that have occurred on
the given object and help us determine whether running ANALYZE STATISTICS
would benefit the query plan.

Something like:

select datachange(object_name, partition_name, colname)

Where:

object_name – the fully qualified object name; it cannot be null.
partition_name – the data partition name; this can be null.
colname – the column name for which the datachange is requested; this can be
null (meaning all columns).

This should be expressed as a percentage of the total number of rows in the
table or partition (if the partition is specified). The percentage value
can be greater than 100% because the number of changes to an object can be
much greater than the number of rows in the table, particularly when the
number of deletes and updates to a table is very high.

So we can run this function to see if ANALYZE STATISTICS is required on a
certain column.
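
As a rough sketch of the semantics described above (the DmlCounters store and the 100% convention for an empty table are assumptions; nothing like this exists in Spark today):

// Hypothetical per-object/partition/column counters of DML since the last
// ANALYZE; how they would be tracked is out of scope for this sketch.
case class DmlCounters(inserts: Long, updates: Long, deletes: Long)

// Changes expressed as a percentage of the current row count; the value can
// exceed 100% when a table is churned far more than its resident row count.
def datachangePercent(counters: DmlCounters, currentRowCount: Long): Double = {
  val changes = counters.inserts + counters.updates + counters.deletes
  if (currentRowCount == 0) 100.0
  else changes.toDouble / currentRowCount * 100.0
}

// e.g. re-analyze only when churn crosses some threshold:
// if (datachangePercent(counters, rowCount) > 20.0)
//   spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1")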

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







On Wed, 30 Aug 2023 at 00:49, Chetan  wrote:

> Thanks for the detailed explanation.
>
>
> Regards,
> Chetan
>
>
>
> On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh 
> wrote:
>
>> OK, let us take a deeper look here
>>
>> ANALYSE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3*
>>
>> In above, we are *explicitly grouping columns c1 and c2 together for
>> which we want to compute statistic*s. Additionally, we are also *computing
>> statistics for column c3 independen*t*ly*. This approach *allows CBO to
>> treat columns c1 and c2 as a group and compute joint statistics for them,
>> while computing separate statistics for column c3.*
>>
>> If columns c1 and c2 are frequently used together in conditions, I
>> concur it makes sense to compute joint statistics for them by using the
>> above syntax. On the other hand, if each column has its own significance
>> and the relationship between them is not crucial, we can use
>>
>> ANALYSE TABLE mytable COMPUTE STATISTICS FOR COLUMNS
>> *c1, c2, c3*
>>
>> This syntax can be used to compute separate statistics for each column.
>>
>> So your mileage varies.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 29 Aug 2023 at 12:14, Chetan  wrote:
>>
>>> Hi,
>>>
>>> If we are taking this up, then would ask can we support multicolumn
>>> stats such as :
>>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1,c2), c3
>>> This should help in estimating better for conditions involving c1 and c2
>>>
>>> Thanks.
>>>
>>> On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh 
>>> wrote:
>>>
 short answer on top of my head

 My point was with regard to  Cost Based Optimizer (CBO) in traditional
 databases. The concept of a rowkey in HBase is somewhat similar to that of
 a primary key in RDBMS.
 Now in databases with automatic deduplication features (i.e. ignore
 duplication of rowkey), inserting 100 rows with the same rowkey actually
 results in only one physical entry in the database due to deduplication.
 Therefore, the new statistical value added should be 1, reflecting 

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
Thanks for the detailed explanation.


Regards,
Chetan



On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh 
wrote:

> OK, let us take a deeper look here
>
> ANALYSE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3*
>
> In above, we are *explicitly grouping columns c1 and c2 together for
> which we want to compute statistic*s. Additionally, we are also *computing
> statistics for column c3 independen*t*ly*. This approach *allows CBO to
> treat columns c1 and c2 as a group and compute joint statistics for them,
> while computing separate statistics for column c3.*
>
> If columns c1 and c2 are frequently used together in conditions, I
> concur it makes sense to compute joint statistics for them by using the
> above syntax. On the other hand, if each column has its own significance
> and the relationship between them is not crucial, we can use
>
> ANALYSE TABLE mytable COMPUTE STATISTICS FOR COLUMNS
> *c1, c2, c3*
>
> This syntax can be used to compute separate statistics for each column.
>
> So your mileage varies.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 29 Aug 2023 at 12:14, Chetan  wrote:
>
>> Hi,
>>
>> If we are taking this up, then would ask can we support multicolumn stats
>> such as :
>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1,c2), c3
>> This should help in estimating better for conditions involving c1 and c2
>>
>> Thanks.
>>
>> On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh 
>> wrote:
>>
>>> short answer on top of my head
>>>
>>> My point was with regard to  Cost Based Optimizer (CBO) in traditional
>>> databases. The concept of a rowkey in HBase is somewhat similar to that of
>>> a primary key in RDBMS.
>>> Now in databases with automatic deduplication features (i.e. ignore
>>> duplication of rowkey), inserting 100 rows with the same rowkey actually
>>> results in only one physical entry in the database due to deduplication.
>>> Therefore, the new statistical value added should be 1, reflecting the
>>> distinct physical entry. If the rowkey is already present in HBase, the
>>> value would indeed be 0, indicating that no new physical entry was created.
>>> We need to take into account the underlying deduplication mechanism of the
>>> database in use to ensure that statistical values accurately represent the
>>> unique physical data entries.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 29 Aug 2023 at 02:07, Jia Fan  wrote:
>>>
 For those databases with automatic deduplication capabilities, such as
 hbase, we have inserted 100 rows with the same rowkey, but in fact there is
 only one in hbase. Is the new statistical value we added 100 or 1, or hbase
 already contains this rowkey, the value would be 0. How should we handle
 this situation?

 Mich Talebzadeh  于2023年8月29日周二 07:22写道:

> I have never been fond of the notion that measuring inserts, updates,
> and deletes (referred to as DML) is the sole criterion for signaling a
> necessity to update statistics for Spark's CBO. Nevertheless, in the
> absence of an alternative mechanism, it seems this is the only approach at
> our disposal (can we use AI for it ). Personally, I would prefer some
> form of indication regarding shifts in the distribution of values in the
> histogram, overall density, and similar indicators. The decision to 
> execute
> "ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
> column-level statistics, which is why I would tend to focus on monitoring
> individual column-level statistics to detect any signals warranting a
> statistics update.
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin 

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
OK, let us take a deeper look here

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3

In the above, we are explicitly grouping columns c1 and c2 together for which
we want to compute statistics. Additionally, we are also computing
statistics for column c3 independently. This approach allows the CBO to
treat columns c1 and c2 as a group and compute joint statistics for them,
while computing separate statistics for column c3.

If columns c1 and c2 are frequently used together in conditions, I concur it
makes sense to compute joint statistics for them by using the above syntax.
On the other hand, if each column has its own significance and the
relationship between them is not crucial, we can use

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3

This syntax can be used to compute separate statistics for each column.

So your mileage may vary.
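
Spark does not currently accept the grouped syntax above; as a rough, full-scan illustration of what such joint statistics would capture (assuming a spark-shell session and placeholder table/column names), the joint cardinality of (c1, c2) can be measured today with a plain aggregation:

// Illustrative only: the number of distinct (c1, c2) combinations is the
// kind of joint statistic the proposed grouping would precompute for the CBO.
val joint = spark.sql(
  """SELECT COUNT(*)               AS row_count,
    |       COUNT(DISTINCT c1)     AS ndv_c1,
    |       COUNT(DISTINCT c2)     AS ndv_c2,
    |       COUNT(DISTINCT c1, c2) AS ndv_c1_c2
    |FROM mytable""".stripMargin)
joint.show()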

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 29 Aug 2023 at 12:14, Chetan  wrote:

> Hi,
>
> If we are taking this up, then would ask can we support multicolumn stats
> such as :
> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1,c2), c3
> This should help in estimating better for conditions involving c1 and c2
>
> Thanks.
>
> On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh 
> wrote:
>
>> short answer on top of my head
>>
>> My point was with regard to  Cost Based Optimizer (CBO) in traditional
>> databases. The concept of a rowkey in HBase is somewhat similar to that of
>> a primary key in RDBMS.
>> Now in databases with automatic deduplication features (i.e. ignore
>> duplication of rowkey), inserting 100 rows with the same rowkey actually
>> results in only one physical entry in the database due to deduplication.
>> Therefore, the new statistical value added should be 1, reflecting the
>> distinct physical entry. If the rowkey is already present in HBase, the
>> value would indeed be 0, indicating that no new physical entry was created.
>> We need to take into account the underlying deduplication mechanism of the
>> database in use to ensure that statistical values accurately represent the
>> unique physical data entries.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 29 Aug 2023 at 02:07, Jia Fan  wrote:
>>
>>> For those databases with automatic deduplication capabilities, such as
>>> hbase, we have inserted 100 rows with the same rowkey, but in fact there is
>>> only one in hbase. Is the new statistical value we added 100 or 1, or hbase
>>> already contains this rowkey, the value would be 0. How should we handle
>>> this situation?
>>>
>>> Mich Talebzadeh  于2023年8月29日周二 07:22写道:
>>>
 I have never been fond of the notion that measuring inserts, updates,
 and deletes (referred to as DML) is the sole criterion for signaling a
 necessity to update statistics for Spark's CBO. Nevertheless, in the
 absence of an alternative mechanism, it seems this is the only approach at
 our disposal (can we use AI for it ). Personally, I would prefer some
 form of indication regarding shifts in the distribution of values in the
 histogram, overall density, and similar indicators. The decision to execute
 "ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
 column-level statistics, which is why I would tend to focus on monitoring
 individual column-level statistics to detect any signals warranting a
 statistics update.
 HTH

 Mich Talebzadeh,
 Distinguished Technologist, Solutions Architect & Engineer
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other 

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
Hi,

If we are taking this up, then I would ask: can we support multi-column stats
such as:
ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
This should help produce better estimates for conditions involving c1 and c2.

Thanks.

On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh 
wrote:

> short answer on top of my head
>
> My point was with regard to  Cost Based Optimizer (CBO) in traditional
> databases. The concept of a rowkey in HBase is somewhat similar to that of
> a primary key in RDBMS.
> Now in databases with automatic deduplication features (i.e. ignore
> duplication of rowkey), inserting 100 rows with the same rowkey actually
> results in only one physical entry in the database due to deduplication.
> Therefore, the new statistical value added should be 1, reflecting the
> distinct physical entry. If the rowkey is already present in HBase, the
> value would indeed be 0, indicating that no new physical entry was created.
> We need to take into account the underlying deduplication mechanism of the
> database in use to ensure that statistical values accurately represent the
> unique physical data entries.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 29 Aug 2023 at 02:07, Jia Fan  wrote:
>
>> For those databases with automatic deduplication capabilities, such as
>> hbase, we have inserted 100 rows with the same rowkey, but in fact there is
>> only one in hbase. Is the new statistical value we added 100 or 1, or hbase
>> already contains this rowkey, the value would be 0. How should we handle
>> this situation?
>>
>> Mich Talebzadeh  于2023年8月29日周二 07:22写道:
>>
>>> I have never been fond of the notion that measuring inserts, updates,
>>> and deletes (referred to as DML) is the sole criterion for signaling a
>>> necessity to update statistics for Spark's CBO. Nevertheless, in the
>>> absence of an alternative mechanism, it seems this is the only approach at
>>> our disposal (can we use AI for it ). Personally, I would prefer some
>>> form of indication regarding shifts in the distribution of values in the
>>> histogram, overall density, and similar indicators. The decision to execute
>>> "ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
>>> column-level statistics, which is why I would tend to focus on monitoring
>>> individual column-level statistics to detect any signals warranting a
>>> statistics update.
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh 
>>> wrote:
>>>
 Hi,

 Impressive, yet in the realm of classic DBMSs, it could be seen as a
 case of old wine in a new bottle. The objective, I assume, is to employ
 dynamic sampling to enhance the optimizer's capacity to create effective
 execution plans without the burden of complete I/O and in less time.

 For instance:
 ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent

 This approach could potentially aid in estimating deltas by utilizing
 sampling.

 HTH

 Mich Talebzadeh,
 Distinguished Technologist, Solutions Architect & Engineer
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH 
 wrote:

> Hi all,
>
> I would like to 

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
A short answer off the top of my head:

My point was with regard to the Cost Based Optimizer (CBO) in traditional
databases. The concept of a rowkey in HBase is somewhat similar to that of
a primary key in an RDBMS.
Now, in databases with automatic deduplication features (i.e. duplicate
rowkeys are ignored), inserting 100 rows with the same rowkey actually
results in only one physical entry in the database due to deduplication.
Therefore, the new statistical value added should be 1, reflecting the
distinct physical entry. If the rowkey is already present in HBase, the
value would indeed be 0, indicating that no new physical entry was created.
We need to take into account the underlying deduplication mechanism of the
database in use to ensure that statistical values accurately represent the
unique physical data entries.
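
A small sketch of the point above (names are hypothetical; it only shows that, for an upsert-style sink, the row-count delta to fold into statistics is the number of previously unseen keys, not the number of rows submitted):

// 100 submitted rows sharing one rowkey that already exists -> delta 0;
// 100 submitted rows sharing one rowkey that is new         -> delta 1.
def rowCountDelta(submittedKeys: Seq[String], existingKeys: Set[String]): Long =
  submittedKeys.distinct.count(k => !existingKeys.contains(k)).toLong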

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 29 Aug 2023 at 02:07, Jia Fan  wrote:

> For those databases with automatic deduplication capabilities, such as
> hbase, we have inserted 100 rows with the same rowkey, but in fact there is
> only one in hbase. Is the new statistical value we added 100 or 1, or hbase
> already contains this rowkey, the value would be 0. How should we handle
> this situation?
>
> Mich Talebzadeh  于2023年8月29日周二 07:22写道:
>
>> I have never been fond of the notion that measuring inserts, updates, and
>> deletes (referred to as DML) is the sole criterion for signaling a
>> necessity to update statistics for Spark's CBO. Nevertheless, in the
>> absence of an alternative mechanism, it seems this is the only approach at
>> our disposal (can we use AI for it ). Personally, I would prefer some
>> form of indication regarding shifts in the distribution of values in the
>> histogram, overall density, and similar indicators. The decision to execute
>> "ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
>> column-level statistics, which is why I would tend to focus on monitoring
>> individual column-level statistics to detect any signals warranting a
>> statistics update.
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> Impressive, yet in the realm of classic DBMSs, it could be seen as a
>>> case of old wine in a new bottle. The objective, I assume, is to employ
>>> dynamic sampling to enhance the optimizer's capacity to create effective
>>> execution plans without the burden of complete I/O and in less time.
>>>
>>> For instance:
>>> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>>>
>>> This approach could potentially aid in estimating deltas by utilizing
>>> sampling.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH 
>>> wrote:
>>>
 Hi all,

 I would like to propose the incremental collection of statistics in
 spark. SPARK-44817 
 has been raised for the same.

 Currently, spark invalidates the stats after data changing commands
 which would make CBO non-functional. To update these stats, user either
 needs to run `ANALYZE TABLE` command or turn
 `spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
 their 

Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Jia Fan
For databases with automatic deduplication capabilities, such as HBase:
suppose we have inserted 100 rows with the same rowkey, but in fact there is
only one row in HBase. Is the new statistical value we add 100 or 1? And if
HBase already contains this rowkey, would the value be 0? How should we
handle this situation?

Mich Talebzadeh  于2023年8月29日周二 07:22写道:

> I have never been fond of the notion that measuring inserts, updates, and
> deletes (referred to as DML) is the sole criterion for signaling a
> necessity to update statistics for Spark's CBO. Nevertheless, in the
> absence of an alternative mechanism, it seems this is the only approach at
> our disposal (can we use AI for it ). Personally, I would prefer some
> form of indication regarding shifts in the distribution of values in the
> histogram, overall density, and similar indicators. The decision to execute
> "ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
> column-level statistics, which is why I would tend to focus on monitoring
> individual column-level statistics to detect any signals warranting a
> statistics update.
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Impressive, yet in the realm of classic DBMSs, it could be seen as a case
>> of old wine in a new bottle. The objective, I assume, is to employ dynamic
>> sampling to enhance the optimizer's capacity to create effective execution
>> plans without the burden of complete I/O and in less time.
>>
>> For instance:
>> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>>
>> This approach could potentially aid in estimating deltas by utilizing
>> sampling.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH 
>> wrote:
>>
>>> Hi all,
>>>
>>> I would like to propose the incremental collection of statistics in
>>> spark. SPARK-44817 
>>> has been raised for the same.
>>>
>>> Currently, spark invalidates the stats after data changing commands
>>> which would make CBO non-functional. To update these stats, user either
>>> needs to run `ANALYZE TABLE` command or turn
>>> `spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
>>> their own drawbacks, executing `ANALYZE TABLE` command triggers full table
>>> scan while the other one only updates table and partition stats and can be
>>> costly in certain cases.
>>>
>>> The goal of this proposal is to collect stats incrementally while
>>> executing data changing commands by utilizing the framework introduced in
>>> SPARK-21669 .
>>>
>>> SPIP Document has been attached along with JIRA:
>>>
>>> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>>>
>>> Hive also supports automatic collection of statistics to keep the stats
>>> consistent.
>>> I can find multiple spark JIRAs asking for the same:
>>> https://issues.apache.org/jira/browse/SPARK-28872
>>> https://issues.apache.org/jira/browse/SPARK-33825
>>>
>>> Regards,
>>> Rakesh
>>>
>>


Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Mich Talebzadeh
I have never been fond of the notion that measuring inserts, updates, and
deletes (referred to as DML) is the sole criterion for signaling a
necessity to update statistics for Spark's CBO. Nevertheless, in the
absence of an alternative mechanism, it seems this is the only approach at
our disposal (can we use AI for it?). Personally, I would prefer some
form of indication regarding shifts in the distribution of values in the
histogram, overall density, and similar indicators. The decision to execute
"ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
column-level statistics, which is why I would tend to focus on monitoring
individual column-level statistics to detect any signals warranting a
statistics update.
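
As a sketch only (the drift measure, the 20% threshold and the way the stored distinct count is obtained are all assumptions; assumes a spark-shell session), the kind of column-level signal being suggested might look like:

import org.apache.spark.sql.functions.approx_count_distinct

// Compare a cheap fresh NDV estimate for a column against the value stored
// by the last ANALYZE run and flag the stats as stale if they diverge.
def columnStatsStale(table: String, column: String, storedNdv: Long,
                     driftThreshold: Double = 0.2): Boolean = {
  val freshNdv = spark.table(table)
    .agg(approx_count_distinct(column)).head.getLong(0)
  math.abs(freshNdv - storedNdv).toDouble / math.max(storedNdv, 1L) > driftThreshold
}

// if (columnStatsStale("mytable", "c1", storedNdv = 1000))
//   spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1")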
HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh 
wrote:

> Hi,
>
> Impressive, yet in the realm of classic DBMSs, it could be seen as a case
> of old wine in a new bottle. The objective, I assume, is to employ dynamic
> sampling to enhance the optimizer's capacity to create effective execution
> plans without the burden of complete I/O and in less time.
>
> For instance:
> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>
> This approach could potentially aid in estimating deltas by utilizing
> sampling.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH 
> wrote:
>
>> Hi all,
>>
>> I would like to propose the incremental collection of statistics in
>> spark. SPARK-44817 
>> has been raised for the same.
>>
>> Currently, spark invalidates the stats after data changing commands which
>> would make CBO non-functional. To update these stats, user either needs to
>> run `ANALYZE TABLE` command or turn
>> `spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
>> their own drawbacks, executing `ANALYZE TABLE` command triggers full table
>> scan while the other one only updates table and partition stats and can be
>> costly in certain cases.
>>
>> The goal of this proposal is to collect stats incrementally while
>> executing data changing commands by utilizing the framework introduced in
>> SPARK-21669 .
>>
>> SPIP Document has been attached along with JIRA:
>>
>> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>>
>> Hive also supports automatic collection of statistics to keep the stats
>> consistent.
>> I can find multiple spark JIRAs asking for the same:
>> https://issues.apache.org/jira/browse/SPARK-28872
>> https://issues.apache.org/jira/browse/SPARK-33825
>>
>> Regards,
>> Rakesh
>>
>


Re: [DISCUSS] Incremental statistics collection

2023-08-26 Thread Mich Talebzadeh
Hi,

Impressive, yet in the realm of classic DBMSs, it could be seen as a case
of old wine in a new bottle. The objective, I assume, is to employ dynamic
sampling to enhance the optimizer's capacity to create effective execution
plans without the burden of complete I/O and in less time.

For instance:
ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent

This approach could potentially aid in estimating deltas by utilizing
sampling.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH  wrote:

> Hi all,
>
> I would like to propose the incremental collection of statistics in spark.
> SPARK-44817  has been
> raised for the same.
>
> Currently, spark invalidates the stats after data changing commands which
> would make CBO non-functional. To update these stats, user either needs to
> run `ANALYZE TABLE` command or turn
> `spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
> their own drawbacks, executing `ANALYZE TABLE` command triggers full table
> scan while the other one only updates table and partition stats and can be
> costly in certain cases.
>
> The goal of this proposal is to collect stats incrementally while
> executing data changing commands by utilizing the framework introduced in
> SPARK-21669 .
>
> SPIP Document has been attached along with JIRA:
>
> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>
> Hive also supports automatic collection of statistics to keep the stats
> consistent.
> I can find multiple spark JIRAs asking for the same:
> https://issues.apache.org/jira/browse/SPARK-28872
> https://issues.apache.org/jira/browse/SPARK-33825
>
> Regards,
> Rakesh
>


[DISCUSS] Incremental statistics collection

2023-08-26 Thread RAKSON RAKESH
Hi all,

I would like to propose the incremental collection of statistics in Spark.
SPARK-44817 has been raised for the same.

Currently, Spark invalidates the stats after data-changing commands, which
makes the CBO non-functional. To update these stats, the user either needs to
run the `ANALYZE TABLE` command or turn on
`spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
their own drawbacks: executing the `ANALYZE TABLE` command triggers a full
table scan, while the other only updates table and partition size stats and
can be costly in certain cases.
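
For context, a minimal illustration of the two existing options mentioned above (mytable is a placeholder; assumes a spark-shell session):

// Option 1: explicit ANALYZE, which rescans the table (and, for column
// stats, recomputes them from scratch).
spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2")

// Option 2: automatic size-only updates after data-changing commands; this
// keeps sizeInBytes current but not row counts or column-level stats.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")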

The goal of this proposal is to collect stats incrementally while executing
data-changing commands by utilizing the framework introduced in SPARK-21669.

The SPIP document has been attached to the JIRA:
https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing

Hive also supports automatic collection of statistics to keep the stats
consistent.
I can find multiple Spark JIRAs asking for the same:
https://issues.apache.org/jira/browse/SPARK-28872
https://issues.apache.org/jira/browse/SPARK-33825

Regards,
Rakesh