[VOTE][RESULT] Release Spark 3.3.3 (RC1)

2023-08-14 Thread Yuming Wang
The vote passes with 7 +1s (4 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:

- Yuming Wang *
- Jie Yang
- Dongjoon Hyun *
- Liang-Chi Hsieh *
- Cheng Pan
- Mridul Muralidharan *
- Jia Fan

+0: None

-1: None


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-14 Thread Mich Talebzadeh
Thank you for your comments.

My vision of integrating machine learning (ML) into Spark Structured
Streaming (SSS) for capacity planning and performance optimization seems
promising. By leveraging ML techniques, I believe we can create predictive
models that improve the efficiency and resource allocation of our data
processing pipelines. Here are some potential benefits and considerations
for adding ML to SSS for capacity planning; that said, I stand to be
corrected.

   1. *Predictive Capacity Planning:* ML models can analyze historical data
   (that we discussed already), workloads, and trends to predict future
   resource needs accurately. This enables proactive scaling and allocation
   of resources, ensuring optimal performance during high-demand periods,
   such as times of heavy trading. (A rough sketch of this idea follows the
   list.)
   2. *Real-time Decision Making:* ML can be used to make real-time
   decisions on resource allocation (software and cluster) based on current
   data and conditions, allowing dynamic adjustments to meet processing
   demands.
   3. *Complex Data Analysis:* In a heterogeneous setup involving multiple
   databases, ML can analyze various factors such as data read and write
   times from the different databases, data volumes, and data distribution
   patterns to optimize the overall data processing flow.
   4. *Anomaly Detection:* ML models can identify unusual patterns or
   performance deviations, alerting us to potential issues before they
   impact the system.
   5. *Integration with Monitoring:* ML models can work alongside
   monitoring tools, gathering real-time data on various performance
   metrics and using this data to make intelligent decisions on capacity
   and resource allocation.
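
To make the first point a bit more concrete, here is a rough sketch of my
own (not an existing Spark facility) of fitting a simple regression on
historical micro-batch metrics with Spark ML. It assumes a DataFrame named
history with columns numInputRows and processingMs gathered from past query
progress events:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// history: DataFrame(numInputRows, processingMs) collected from earlier
// runs, e.g. from StreamingQueryProgress events (assumed to exist here).
val assembler = new VectorAssembler()
  .setInputCols(Array("numInputRows"))
  .setOutputCol("features")

val model = new LinearRegression()
  .setLabelCol("processingMs")
  .setFeaturesCol("features")
  .fit(assembler.transform(history))

The fitted model can then predict the processing time for an expected input
rate, and that prediction can be compared with the trigger interval when
deciding how many executors to ask for.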

However, there are some important considerations to keep in mind:

   1. *Model Training:* ML models require training and validation using
   relevant data. Our DS colleagues need to define appropriate features,
   select the right ML algorithms, and fine-tune the model parameters to
   achieve optimal performance.
   2. *Complexity:* Integrating ML adds complexity to our architecture.
   Moreover, we need to have the necessary expertise in both Spark
   Structured Streaming and machine learning to design, implement, and
   maintain the system effectively.
   3. *Resource Overhead:* ML algorithms can be resource-intensive. We
   ought to consider the additional computational requirements, especially
   during the model training and inference phases.
In summary, this idea of utilizing ML for capacity planning in Spark
Structured Streaming could hold significant potential for improving system
performance and resource utilization. Having said that, I totally agree
that we need to evaluate the feasibility, potential benefits, and
challenges, and we will need to involve experts in both Spark and machine
learning to ensure a successful outcome.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 14 Aug 2023 at 14:58, Martin Andersson 
wrote:

> IMO, using any kind of machine learning or AI for DRA is overkill. The
> effort involved would be considerable and likely counterproductive,
> compared to a more conventional approach of comparing the rate of incoming
> stream data with the effort of handling previous data rates.
> --
> *From:* Mich Talebzadeh 
> *Sent:* Tuesday, August 8, 2023 19:59
> *To:* Pavan Kotikalapudi 
> *Cc:* dev@spark.apache.org 
> *Subject:* Re: Dynamic resource allocation for structured streaming
> [SPARK-24815]
>
>
> EXTERNAL SENDER. Do not click links or open attachments unless you
> recognize the sender and know the content is safe. DO NOT provide your
> username or password.
>
> I am currently contemplating and sharing my thoughts openly. Considering
> our reliance on previously collected statistics (as mentioned earlier), it
> raises the question: why couldn't we integrate certain machine learning
> elements into Spark Structured Streaming? This might slightly deviate from
> our current topic, and I am not an expert in machine learning, but there
> are individuals who possess the expertise to assist us in exploring this
> avenue.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it 

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-14 Thread Martin Andersson
IMO, using any kind of machine learning or AI for DRA is overkill. The effort 
involved would be considerable and likely counterproductive, compared to a more 
conventional approach of comparing the rate of incoming stream data with the 
effort of handling previous data rates.

From: Mich Talebzadeh 
Sent: Tuesday, August 8, 2023 19:59
To: Pavan Kotikalapudi 
Cc: dev@spark.apache.org 
Subject: Re: Dynamic resource allocation for structured streaming [SPARK-24815]


EXTERNAL SENDER. Do not click links or open attachments unless you recognize 
the sender and know the content is safe. DO NOT provide your username or 
password.


I am currently contemplating and sharing my thoughts openly. Considering our 
reliance on previously collected statistics (as mentioned earlier), it raises 
the question: why couldn't we integrate certain machine learning elements into 
Spark Structured Streaming? This might slightly deviate from our current topic, 
and I am not an expert in machine learning, but there are individuals who 
possess the expertise to assist us in exploring this avenue.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


 
   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi <pkotikalap...@twilio.com> wrote:
Listeners are the best resource for the allocation manager, afaik. It already 
has a SparkListener that it utilizes. We can use it to extract more information 
(like processing times). The listener with more information regarding the 
streaming query resides in the sql module, though.
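
For reference, a minimal sketch of what reading processing times off the
sql-module listener (StreamingQueryListener) could look like; the class name
and log line are just illustrative:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Logs per-micro-batch input volume and processing time, the same figures
// an allocation heuristic could consume.
class ProcessingTimeListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // durationMs("triggerExecution") is the wall-clock duration of the micro-batch
    val processingMs = p.durationMs.get("triggerExecution")
    println(s"batch=${p.batchId} inputRows=${p.numInputRows} processingMs=$processingMs")
  }
}

// spark.streams.addListener(new ProcessingTimeListener)  // register on the session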

Thanks

Pavan

On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Hi Pavan or anyone else

Is there any way one can access the metrics displayed on the Spark GUI? For 
example, the readings for processing time? Can these be accessed?
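
For what it's worth, those figures are also exposed programmatically; a
minimal sketch, assuming query is the running StreamingQuery:

val progress = query.lastProgress                              // most recent micro-batch
val processingMs = progress.durationMs.get("triggerExecution") // processing time in ms
// query.recentProgress returns an array of the last few progress updates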

Thanks

For example,
Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


 
   view my Linkedin profile


 
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi <pkotikalap...@twilio.com> wrote:
Thanks for the review Mich,

Yes, the configuration parameters we end up setting would be based on the 
trigger interval.

> If you are going to have additional indicators why not look at scheduling 
> delay as well
Yes. The implementation is based on scheduling delays, not on pending tasks of 
the current stage but rather on pending tasks of all the stages in a micro-batch 
(hence the trigger interval).
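
For concreteness, a hypothetical sketch of one way such a trigger-interval-based
check could look; the thresholds and names are placeholders, not taken from the
design doc:

// Compare the last micro-batch's total processing time (all stages) against
// the trigger interval to pick a scaling direction. Thresholds are arbitrary.
def scalingDecision(processingMs: Long, triggerIntervalMs: Long): String = {
  val utilization = processingMs.toDouble / triggerIntervalMs
  if (utilization > 0.9) "request more executors"      // batches barely fit the interval
  else if (utilization < 0.5) "release idle executors" // ample headroom
  else "hold"
}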

> we ought to utilise the historical statistics collected under the 
> checkpointing directory to get more accurate statistics
You are right! This is just a simple implementation based on one factor; we 
should also look into other indicators as well if that would help build a 
better scaling algorithm.

Thank you,

Pavan

On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Hi,

I glanced over the design doc.

You are providing certain configuration parameters plus some settings based 

Unsubscribe

2023-08-14 Thread xu han



Fwd: Question about ARRAY_INSERT between Spark and Databricks

2023-08-14 Thread Ran Tao
> Forward to dev

Yes, Databricks Runtime 13.0, 13.1, and 13.2 are all OK and have the
same behavior as open source Apache Spark 3.4.x.
But I think the Databricks docs need to be updated [1]. It's confusing.

[1]
https://docs.databricks.com/en/sql/language-manual/functions/array_insert.html

Best Regards,
Ran Tao


Sean Owen  于2023年8月14日周一 11:58写道:

> Oh I get it. Let me report to the docs people (I'm at Databricks).
>
> On Sun, Aug 13, 2023, 10:56 PM Ran Tao  wrote:
>
>> Yes, the databricks runtime 13.0 and 13.1 and 13.2 are all ok and have
>> the same behavior with open source Apache Spark 3.4.x.
>> But I think the docs of databricks need to be updated. it's confusing.
>>
>> Thanks owen!
>>
>>
>> Sean Owen  于2023年8月14日周一 11:33写道:
>>
>>> That's just a bug fix, right? I'd imagine.
>>>
>>> On Sun, Aug 13, 2023, 10:28 PM Ran Tao  wrote:
>>>
 Thanks. It seems that the new Databricks runtime has the same behavior as
 open source Apache Spark.
 However, the Databricks runtime docs [1] may confuse users: they indicate the
 behavior is applicable in Databricks Runtime 13.0 and later, yet it does not
 have this behavior in 13.1 and later versions.

 Best Regards,
 Ran Tao
 https://github.com/chucheng92


 Sean Owen  于2023年8月14日周一 02:07写道:

> No, I'm saying I do not see any inconsistency between docs, OSS Spark,
> and Spark in Databricks. All the behavior looks the same and correct.
> I am using the latest Spark and DBR runtime though, and you didn't say
> what version of Spark you use in DB.
> The most likely answer is it's an older version with some kind of bug,
> not sure.
>
> On Sun, Aug 13, 2023 at 1:01 PM Ran Tao  wrote:
>
>> Thanks. However, I have checked the latest Spark versions 3.4.0 and
>> 3.4.1, and the result is what I listed above. Did you mean the Databricks
>> doc is outdated or wrong? Which one should I respect?
>>
>>
>>
>> Sean Owen 于2023年8月13日 周日22:35写道:
>>
>>> There shouldn't be any difference here. In fact, I get the results
>>> you list for 'spark' from Databricks. It's possible the difference is a
>>> bug fix along the way that is in the Spark version you are using locally
>>> but not in the DBR you are using. But, yeah, it seems to work as you say.
>>>
>>> If you're asking about the Spark semantics being 1-indexed vs
>>> 0-indexed, there are some comments here:
>>> https://github.com/apache/spark/pull/38867#discussion_r1097054656
>>>
>>>
>>> On Sun, Aug 13, 2023 at 7:28 AM Ran Tao 
>>> wrote:
>>>
 Hi, devs.

 I found that the ARRAY_INSERT [1] function (from Spark 3.4.0) has
 different semantics from Databricks [2].

 e.g.

 // spark
 SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
  ["a","b","z","c"]

 // databricks
 SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
  ["a","b","c","z"]

 // spark
 SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
 ["z",null,null,"a","b","c"]

 // databricks
 SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
  ["z",NULL,"a","b","c"]

 It looks like inserting at a negative index is more reasonable in
 Databricks.

 Of course, I read the source code of Spark and I can understand its
 logic, but my question is whether Spark was designed like this on
 purpose.


 [1]
 https://spark.apache.org/docs/latest/api/sql/index.html#array_insert
 [2]
 https://docs.databricks.com/en/sql/language-manual/functions/array_insert.html


 Best Regards,
 Ran Tao
 https://github.com/chucheng92

>>> --
>> Best Regards,
>> Ran Tao
>> https://github.com/chucheng92
>>
>