Re: [DISCUSS] Incremental statistics collection

2023-09-06 Thread Rakesh Raushan
Hi all, I would like to hear more from community on this topic. I believe it would significantly improve statistics collection in spark. Thanks Rakesh On Sat, 2 Sep 2023 at 10:36 AM, Rakesh Raushan wrote: > Thanks all for all your insights. > > @Mich > I am not trying to introduce any sampli

Re: [DISCUSS] Incremental statistics collection

2023-09-01 Thread Rakesh Raushan
Thanks all for all your insights. @Mich I am not trying to introduce any sampling model here. This idea is about collecting the task write metrics while writing the data and aggregating it with the existing values present in the catalog(create a new entry if it's a CTAS command). This approach is

Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Sorry I missed this one In the context what has been changed we ought to have an additional column timestamp In short we can have datachange(object_name, partition_name, colname, timestamp) timestamp is the point in time you want to compare against for changes. Example SELECT * FROM WHERE da

Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Another idea that came to my mind from the old days, is the concept of having a function called *datachange* This datachange function should measure the amount of change in the data distribution since ANALYZE STATISTICS last ran. Specifically, it should measure the number of inserts, updates and d

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
Thanks for the detailed explanation. Regards, Chetan On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh wrote: > OK, let us take a deeper look here > > ANALYSE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3* > > In above, we are *explicitly grouping columns c1 and c2 together for > wh

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
OK, let us take a deeper look here ANALYSE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3* In above, we are *explicitly grouping columns c1 and c2 together for which we want to compute statistic*s. Additionally, we are also *computing statistics for column c3 independen*t*ly*. This ap

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Chetan
Hi, If we are taking this up, then would ask can we support multicolumn stats such as : ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1,c2), c3 This should help in estimating better for conditions involving c1 and c2 Thanks. On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh wrote: > short

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
short answer on top of my head My point was with regard to Cost Based Optimizer (CBO) in traditional databases. The concept of a rowkey in HBase is somewhat similar to that of a primary key in RDBMS. Now in databases with automatic deduplication features (i.e. ignore duplication of rowkey), inser

Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Jia Fan
For those databases with automatic deduplication capabilities, such as hbase, we have inserted 100 rows with the same rowkey, but in fact there is only one in hbase. Is the new statistical value we added 100 or 1, or hbase already contains this rowkey, the value would be 0. How should we handle thi

Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Mich Talebzadeh
I have never been fond of the notion that measuring inserts, updates, and deletes (referred to as DML) is the sole criterion for signaling a necessity to update statistics for Spark's CBO. Nevertheless, in the absence of an alternative mechanism, it seems this is the only approach at our disposal (

Re: [DISCUSS] Incremental statistics collection

2023-08-26 Thread Mich Talebzadeh
Hi, Impressive, yet in the realm of classic DBMSs, it could be seen as a case of old wine in a new bottle. The objective, I assume, is to employ dynamic sampling to enhance the optimizer's capacity to create effective execution plans without the burden of complete I/O and in less time. For instan