Hi all,
I would like to hear more from the community on this topic.
I believe it would significantly improve statistics collection in Spark.
Thanks
Rakesh
On Sat, 2 Sep 2023 at 10:36 AM, Rakesh Raushan wrote:
Thanks, all, for your insights.
@Mich
I am not trying to introduce any sampling model here.
The idea is to collect the task write metrics while writing the data and
aggregate them with the existing values present in the catalog (creating a
new entry if it's a CTAS command).
This approach is ...
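To illustrate the idea, a minimal Scala sketch, assuming a hypothetical
StatsCatalog helper (the listener API is real Spark; the catalog merge is an
assumption, not the actual proposal's internals):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import scala.collection.concurrent.TrieMap

    // Hypothetical in-memory stand-in for the catalog's (rowCount, sizeInBytes)
    // entry; in the real proposal this would update the session catalog.
    object StatsCatalog {
      private val stats = TrieMap.empty[String, (Long, Long)]
      def merge(table: String, rows: Long, bytes: Long): Unit = {
        val (r0, b0) = stats.getOrElse(table, (0L, 0L))
        stats.update(table, (r0 + rows, b0 + bytes)) // aggregate, don't rescan
      }
    }

    // Fold each finished task's output metrics into running totals, then
    // merge them into the catalog once the write job commits.
    class WriteStatsListener(table: String) extends SparkListener {
      private var rows = 0L
      private var bytes = 0L
      override def onTaskEnd(end: SparkListenerTaskEnd): Unit =
        Option(end.taskMetrics).foreach { m =>
          rows += m.outputMetrics.recordsWritten
          bytes += m.outputMetrics.bytesWritten
        }
      def onCommit(): Unit = StatsCatalog.merge(table, rows, bytes)
    }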
Sorry I missed this one
In the context of knowing what has changed, we ought to have an additional
timestamp column.
In short, we can have:
datachange(object_name, partition_name, colname, timestamp)
timestamp is the point in time you want to compare against for changes.
Example:
SELECT * FROM <table> WHERE datachange(...) ...
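To make the intent concrete, a rough Scala sketch of how such a function could
be backed (entirely hypothetical; for simplicity it tracks changes since the
last ANALYZE rather than against an arbitrary timestamp):

    import scala.collection.concurrent.TrieMap

    // Hypothetical backing store for datachange(): count DML per
    // (table, partition, column) and report it as a percentage of total rows.
    object DataChange {
      private case class Key(table: String, partition: String, column: String)
      private val dmlSinceAnalyze = TrieMap.empty[Key, Long]
      private val totalRows = TrieMap.empty[String, Long]

      def recordDml(table: String, partition: String, column: String, n: Long): Unit = {
        val k = Key(table, partition, column)
        dmlSinceAnalyze.update(k, dmlSinceAnalyze.getOrElse(k, 0L) + n)
      }

      // Percentage of the table changed since statistics were last computed.
      def datachange(table: String, partition: String, column: String): Double = {
        val changed = dmlSinceAnalyze.getOrElse(Key(table, partition, column), 0L)
        val total = math.max(totalRows.getOrElse(table, 1L), 1L)
        100.0 * changed / total
      }

      // ANALYZE resets the counters for the table.
      def onAnalyze(table: String): Unit =
        dmlSinceAnalyze.keys.filter(_.table == table).foreach(dmlSinceAnalyze.remove)
    }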
Another idea that came to my mind from the old days, is the concept of
having a function called *datachange*
This datachange function should measure the amount of change in the data
distribution since ANALYZE STATISTICS last ran. Specifically, it should
measure the number of inserts, updates and deletes ...
Thanks for the detailed explanation.
Regards,
Chetan
On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh wrote:
OK, let us take a deeper look here.
ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS *(c1, c2), c3*
In the above, we are *explicitly grouping columns c1 and c2 together for
which we want to compute statistics*. Additionally, we are also *computing
statistics for column c3 independently*. This approach ...
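For context, a small sketch of why the grouping matters, assuming hypothetical
data where c2 functionally depends on c1 (Spark today supports only
single-column stats, so the grouped syntax is illustrative):

    import org.apache.spark.sql.SparkSession

    // With only per-column stats, the optimizer treats c1 and c2 as
    // independent, so sel(c1 = 'CA' AND c2 = 'San Jose') is estimated as
    // sel(c1) * sel(c2). When c2 determines c1 (city -> state), that
    // underestimates badly; a joint (c1, c2) statistic would not.
    object MultiColumnStatsDemo extends App {
      val spark = SparkSession.builder()
        .master("local[*]").appName("multi-col-stats").getOrCreate()

      spark.sql("CREATE TABLE mytable (c1 STRING, c2 STRING, c3 INT) USING parquet")

      // What Spark's ANALYZE supports today: independent per-column stats.
      spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3")

      // Proposed in this thread (NOT valid Spark SQL today):
      // ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
    }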
Hi,
If we are taking this up, then I would ask: can we support multi-column stats
such as:
ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
This should help produce better estimates for conditions involving both c1 and c2.
Thanks.
On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh wrote:
Short answer off the top of my head:
My point was with regard to the Cost Based Optimizer (CBO) in traditional
databases. The concept of a rowkey in HBase is somewhat similar to that of
a primary key in an RDBMS.
Now in databases with automatic deduplication features (i.e. those that
ignore duplicate rowkeys), inserts ...
For databases with automatic deduplication capabilities, such as HBase:
suppose we insert 100 rows with the same rowkey, but in fact only one row
exists in HBase. Should the new statistics value we add be 100 or 1? And if
HBase already contains this rowkey, should the value be 0? How should we
handle this ...
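One way to see the ambiguity in numbers (hypothetical values; the point is
that records written is not the row-count delta under deduplication):

    // 100 puts, all with the same rowkey; HBase keeps a single row.
    val recordsWritten = 100L
    val rowsBefore     = 0L
    val rowsAfter      = 1L
    val naiveDelta  = recordsWritten           // 100: overstates the change
    val actualDelta = rowsAfter - rowsBefore   // 1: what the catalog should record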
I have never been fond of the notion that measuring inserts, updates, and
deletes (referred to as DML) is the sole criterion for signaling a
necessity to update statistics for Spark's CBO. Nevertheless, in the
absence of an alternative mechanism, it seems this is the only approach at
our disposal ...
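As a rough illustration of what such a mechanism could look like, a
hypothetical staleness rule (the names and the 10% threshold are assumptions,
not an existing Spark feature):

    // Decide whether statistics look stale from accumulated DML counts.
    final case class TableActivity(inserts: Long, updates: Long, deletes: Long)

    object StalenessCheck {
      // Re-analyze once changed rows exceed a fraction of the table, e.g. 10%.
      def needsAnalyze(activity: TableActivity, rowCount: Long,
                       threshold: Double = 0.10): Boolean = {
        val changed = activity.inserts + activity.updates + activity.deletes
        rowCount > 0 && changed.toDouble / rowCount > threshold
      }
    }

    // Example: 5300 changed rows out of 40000 (~13%) would trigger a re-analyze.
    // StalenessCheck.needsAnalyze(TableActivity(5000, 200, 100), rowCount = 40000)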
Hi,
Impressive, yet in the realm of classic DBMSs, it could be seen as a case
of old wine in a new bottle. The objective, I assume, is to employ dynamic
sampling to enhance the optimizer's capacity to create effective execution
plans without the burden of complete I/O and in less time.
For instance ...
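For what it's worth, a minimal sketch of sampling-based estimation, assuming a
SparkSession named spark, a table named mytable, and a 1% sample (the linear
scale-up of the distinct count is deliberately crude):

    // Estimate table-level statistics from a small sample instead of a full scan.
    val fraction = 0.01
    val sample = spark.table("mytable").sample(withReplacement = false, fraction)
    val approxRows = (sample.count() / fraction).toLong
    // Crude NDV estimate: distinct counts do not scale linearly in general,
    // which is why real systems use estimators such as HyperLogLog.
    val approxDistinctC1 =
      (sample.select("c1").distinct().count() / fraction).toLong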