Re: [DISCUSS] Incremental statistics collection

2023-09-01 Thread Rakesh Raushan
Thanks all for your insights.

@Mich
I am not trying to introduce any sampling model here.
The idea is to collect the task write metrics while writing the data and
aggregate them with the existing values in the catalog (creating a new
entry if it's a CTAS command).
This approach is much simpler to implement, although it does have a
limitation for external tables, where users can update data without Spark.
Similarly, `ALTER TABLE ADD PARTITION` would also require scanning the
added partitions, as that currently seems to be the only way of getting
the metrics.
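
To make the mechanism concrete, here is a rough sketch (not the SPIP's
actual design) of aggregating per-task write metrics into the catalog
entry. It assumes code living inside Spark itself, since SessionCatalog
and CatalogStatistics are internal Catalyst APIs:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.TableIdentifier
  import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

  class WriteMetricsAggregator(spark: SparkSession, table: TableIdentifier)
      extends SparkListener {
    private var bytesWritten = 0L
    private var rowsWritten = 0L

    // Accumulate the output metrics of every finished write task.
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val out = taskEnd.taskMetrics.outputMetrics
      bytesWritten += out.bytesWritten
      rowsWritten += out.recordsWritten
    }

    // Merge the accumulated deltas with the stats already in the catalog
    // (or start from zero, e.g. for a CTAS target).
    def commit(): Unit = {
      val catalog = spark.sessionState.catalog
      val existing = catalog.getTableMetadata(table).stats
      val merged = CatalogStatistics(
        sizeInBytes = existing.map(_.sizeInBytes).getOrElse(BigInt(0)) + bytesWritten,
        rowCount = Some(existing.flatMap(_.rowCount).getOrElse(BigInt(0)) + rowsWritten))
      catalog.alterTableStats(table, Some(merged))
    }
  }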

@Chetan
Sure we can analyze multi-column stats as well. But to get them to update
automatically, we need to get this one done first. That can be a future
scope for this feature.

I feel "auto gathering of stats" would have been a better name for this one.
Similar features already exist in Hive and major DBMSs (SQL Server, MySQL).

I would like to hear more from the dev community on this. Is the dev
community in favour of having this feature in Spark?
I have made the SPIP doc editable for further comments or questions on what
I am trying to achieve and how I am going to implement it.
Thanks,
Rakesh

On Wed, Aug 30, 2023 at 9:42 PM Mich Talebzadeh wrote:

> Sorry I missed this one
>
> In the context of detecting what has changed, we ought to have an
> additional parameter, timestamp.
>
> In short we can have
>
> datachange(object_name, partition_name, colname, timestamp)
>
> timestamp is the point in time you want to compare against for changes.
>
> Example
>
> SELECT * FROM  WHERE datachange('', '2023-08-01 00:00:00') = 1
>
>
> This query should return all rows from the table that have changed
> since August 1, 2023, 00:00:00.
>
> Let me know your thoughts
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Wed, 30 Aug 2023 at 10:19, Mich Talebzadeh wrote:
>
>> Another idea that came to my mind from the old days is the concept of
>> having a function called *datachange*.
>>
>> This datachange function should measure the amount of change in the data
>> distribution since ANALYZE STATISTICS last ran. Specifically, it should
>> measure the number of inserts, updates and deletes that have occurred on
>> the given object, and help us determine whether running ANALYZE STATISTICS
>> would benefit the query plan.
>>
>> something like
>>
>> select datachange(object_name, partition_name, colname)
>>
>> Where:
>>
>> object_name – the fully qualified object name; cannot be null.
>> partition_name – the data partition name; can be null.
>> colname – the column name for which the datachange is requested; can be
>> null (meaning all columns).
>>
>> The result should be expressed as a percentage of the total number of rows
>> in the table or partition (if a partition is specified). The percentage
>> value can be greater than 100% because the number of changes to an object
>> can be much greater than the number of rows in the table, particularly when
>> the number of deletes and updates is very high.
>>
>> So we can run this function to see if ANALYZE STATISTICS is required on a
>> certain column.
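>>
>> As a hypothetical illustration (names made up here, assuming we track DML
>> counters per object), the percentage could be computed like this:
>>
>>   case class ChangeCounters(inserts: Long, updates: Long, deletes: Long)
>>
>>   // datachange as a percentage of the object's row count; it can exceed
>>   // 100% when an object has been modified more times than it has rows.
>>   def dataChangePercent(c: ChangeCounters, rowCount: Long): Double = {
>>     require(rowCount > 0, "row count must be known and positive")
>>     100.0 * (c.inserts + c.updates + c.deletes) / rowCount
>>   }
>>
>>   // e.g. re-run ANALYZE STATISTICS once more than 20% of rows have changed
>>   val needsAnalyze =
>>     dataChangePercent(ChangeCounters(5000, 12000, 3000), 50000) > 20.0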
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-01 Thread Jungtaek Lim
My apologies, I have to add another ticket as a blocker: SPARK-45045. That
said, I'm -1 (non-binding).

SPARK-43183 made a behavioral change to the StreamingQuery API as a
side effect, while the intention was to change only the
StreamingQueryListener. I just got reports that the behavioral change to
the StreamingQuery API broke various tests in 3rd-party data sources. To
help 3rd-party ecosystems adopt 3.5 without hassle, I'd like to see this
fixed in 3.5.0.

There is no fix yet, but I'm working on it and will give an update here.
If I can't make progress in a couple of days, maybe we could lower the
priority and let the release go out with this described as a "known
issue". I'm sorry about that.

Thanks,
Jungtaek Lim

On Fri, Sep 1, 2023 at 12:12 PM Wenchen Fan wrote:

> Sorry for the last-minute bug report, but we found a regression in 3.5:
> the SQL INSERT command without a column list fills missing columns with
> NULL, while Spark 3.4 does not allow it. According to the SQL standard this
> shouldn't be allowed, so it is a regression in 3.5.
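>
> For illustration (a hypothetical two-column table, not taken from the
> actual report):
>
>   spark.sql("CREATE TABLE t (a INT, b INT)")
>   spark.sql("INSERT INTO t VALUES (1)") // 3.4 rejects this; the RC silently fills b with NULL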
>
> The fix has been merged, but one day after the RC3 cut:
> https://github.com/apache/spark/pull/42393 . I'm -1; let's include this
> fix in 3.5.
>
> Thanks,
> Wenchen
>
> On Thu, Aug 31, 2023 at 9:09 PM Ian Manning wrote:
>
>> +1 (non-binding)
>>
>> Using Spark Core, Spark SQL, Structured Streaming.
>>
>> On Tue, Aug 29, 2023 at 8:12 PM Yuanjian Li wrote:
>>
>>> Please vote on releasing the following candidate (RC3) as Apache Spark
>>> version 3.5.0.
>>>
>>> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.5.0-rc3 (commit
>>> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>>>
>>> https://github.com/apache/spark/tree/v3.5.0-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1447
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>>>
>>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>>
>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>
>>> This release is using the release script of the tag v3.5.0-rc3.
>>>
>>>
>>> FAQ
>>>
>>> =
>>>
>>> How can I help test this release?
>>>
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload, running it on this release candidate, and
>>> reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env, install the
>>> current RC, and see if anything important breaks. In Java/Scala, you can
>>> add the staging repository to your project's resolvers and test with the
>>> RC (make sure to clean up the artifact cache before/after so you don't
>>> end up building with an out-of-date RC going forward).
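>>>
>>> For example, a minimal sbt setup for testing against the staging
>>> repository might look like this (a sketch; the resolver name is
>>> arbitrary):
>>>
>>>   // point sbt at the RC3 staging repository listed above
>>>   resolvers += "Spark 3.5.0 RC3 staging" at
>>>     "https://repository.apache.org/content/repositories/orgapachespark-1447/"
>>>   // build against the RC artifacts
>>>   libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % Provided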
>>>
>>> ===
>>>
>>> What should happen to JIRA tickets still targeting 3.5.0?
>>>
>>> ===
>>>
>>> The current list of open tickets targeted at 3.5.0 can be found at:
>>>
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.5.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>>
>>> But my bug isn't fixed?
>>>
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted, please ping me or a committer to
>>> help target the issue.
>>>
>>> Thanks,
>>>
>>> Yuanjian Li
>>>
>>