Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Mich Talebzadeh
Hi Prem,

Regarding your question about writing Parquet v2 files with Spark 3.2.0:

Spark 3.2.0 limitations: Spark 3.2.0 doesn't have a built-in way to
explicitly force Parquet v2 encoding. As we saw previously, even Spark 3.4
created a file written by parquet-mr with format_version 1.0, i.e. v1 encoding.

Dremio v2 Support: As I understand, Dremio versions 24.3 and later can read
Parquet v2 files with delta encodings.

Parquet v2 status in Spark: as Ryan alluded to, the v2 spec has not been
finalized and Spark does not officially support writing Parquet v2.

In the meantime, you can try excluding parquet-mr from your dependencies
and upgrading the Parquet library (if possible) to see whether that
indirectly enables v2 writing with Spark 3.2.0.
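
If you want to experiment anyway, below is a minimal PySpark sketch. It
assumes (and this is an assumption, not a documented Spark feature) that the
Hadoop-level parquet-mr property parquet.writer.version is honoured by the
Parquet build on your Spark 3.2.0 classpath; verify the resulting footer with
parquet-tools before relying on it.

from pyspark.sql import SparkSession

# Experimental: request v2 data pages from the underlying parquet-mr writer by
# forwarding the Hadoop property parquet.writer.version. Whether this takes
# effect depends entirely on the parquet-mr build on the classpath.
spark = (
    SparkSession.builder
    .appName("ParquetV2Experiment")
    .config("spark.hadoop.parquet.writer.version", "PARQUET_2_0")
    .getOrCreate()
)

df = spark.createDataFrame([("London", 8974432)], ["city", "population"])
df.write.mode("overwrite").parquet("parquet_v2_experiment")

# Then run parquet-tools inspect on the output files and check format_version.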

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Wed, 17 Apr 2024 at 20:20, Prem Sahoo  wrote:

> Hello Ryan,
> May I know how you can write Parquet V2 encoding from spark 3.2.0 ?  As
> per my knowledge Dremio is creating and reading Parquet V2.
> "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by
> engines that write Parquet data, supports delta encodings. However, these
> encodings were not previously supported by Dremio's vectorized Parquet
> reader, resulting in decreased speed. Now, in version 24.3 and Dremio
> Cloud, when you use the Dremio SQL query engine on Parquet datasets, you’ll
> receive best-in-class performance."
>
> Could you let me know where Parquet Community is not recommending Parquet
> V2 ?
>
>
>
> On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue  wrote:
>
>> Prem, as I said earlier, v2 is not a finalized spec so you should not use
>> it. That's why it is not the default. You can get Spark to write v2 files,
>> but it isn't recommended by the Parquet community.
>>
>> On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo  wrote:
>>
>>> Hello Community,
>>> Could anyone shed more light on this (Spark Supporting Parquet V2)?
>>>
>>> On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi Prem,
>>>>
>>>> Regrettably this is not my area of speciality. I trust
>>>> another colleague will have a more informed idea. Alternatively you may
>>>> raise an SPIP for it.
>>>>
>>>> Spark Project Improvement Proposals (SPIP) | Apache Spark
>>>> <https://spark.apache.org/improvement-proposals.html>
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>> expert opinions (Werner
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>
>>>>
>>>> On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:
>>>>
>>>>> Hello Mich,
>>>>> Thanks for example.
>>>>> I have the same parquet-mr version which creates Parquet version 1. We
>>>>> need to create V2 as it is more optimized. We have Dremio where if we use
>>>>> Parquet V2 it is 75% better than Parquet V1 in case of read and 25 % 
>>>>> better
>>>>> in case of write . so we are inclined towards this way.  Please let us 
>>>>> know
>>>>> why Spark is not going towards Parquet V2 ?
>>>>> Sent from my iPhone
>>>>>
>>>>> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Hi Prem,

Regrettably this is not my area of speciality. I trust another colleague
will have a more informed idea. Alternatively you may raise an SPIP for it.

Spark Project Improvement Proposals (SPIP) | Apache Spark
<https://spark.apache.org/improvement-proposals.html>

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:

> Hello Mich,
> Thanks for example.
> I have the same parquet-mr version which creates Parquet version 1. We
> need to create V2 as it is more optimized. We have Dremio where if we use
> Parquet V2 it is 75% better than Parquet V1 in case of read and 25 % better
> in case of write . so we are inclined towards this way.  Please let us know
> why Spark is not going towards Parquet V2 ?
> Sent from my iPhone
>
> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh 
> wrote:
>
> 
> Well let us do a test in PySpark.
>
> Take this code and create a default parquet file. My spark is 3.4
>
> cat parquet_check.py
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>
> data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
> 21893000)]
> df = spark.createDataFrame(data, ["city", "population"])
>
> df.write.mode("overwrite").parquet("parquet_example")  # creates files in an
> HDFS directory
>
> Use a tool called parquet-tools (downloadable using pip from
> https://pypi.org/project/parquet-tools/)
>
> Copy the Parquet files from HDFS to the current directory, say:
>
> hdfs dfs -get /user/hduser/parquet_example .
> cd ./parquet_example
> do an ls and pick one of the part files like the one below to inspect:
>  parquet-tools inspect
> part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet
>
> Now this is the output
>
>  file meta data 
> created_by: parquet-mr version 1.12.3 (build
> f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
> num_columns: 2
> num_rows: 1
> num_row_groups: 1
> format_version: 1.0
> serialized_size: 563
>
>
>  Columns 
> name
> age
>
>  Column(name) 
> name: name
> path: name
> max_definition_level: 1
> max_repetition_level: 0
> physical_type: BYTE_ARRAY
> logical_type: String
> converted_type (legacy): UTF8
> compression: SNAPPY (space_saved: -5%)
>
>  Column(age) 
> name: age
> path: age
> max_definition_level: 1
> max_repetition_level: 0
> physical_type: INT64
> logical_type: None
> converted_type (legacy): NONE
> compression: SNAPPY (space_saved: -5%)
>
> File Information:
>
>- format_version: 1.0: This line explicitly states that the format
>version of the Parquet file is 1.0, which corresponds to Parquet version 1.
>- created_by: parquet-mr version 1.12.3: While this doesn't directly
>    specify the format version, it is generally accepted that older versions of
>parquet-mr like 1.12.3 typically write Parquet version 1 files.
>
> Since in this case Spark 3.4 is capable of reading both versions (1 and
> 2), you don't necessarily need to modify your Spark code to access this
> file. However, if you want to create Parquet files in version 2 using
> Spark, you might need to consider additional changes like excluding
> parquet-mr or upgrading the Parquet libraries and doing a custom build of
> Spark. However, given the law of diminishing returns, I would not advise
> that either. You can of course use gzip for compression if that is more
> suitable for your needs.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Well let us do a test in PySpark.

Take this code and create a default parquet file. My spark is 3.4

cat parquet_check.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()

data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
21893000)]
df = spark.createDataFrame(data, ["city", "population"])

df.write.mode("overwrite").parquet("parquet_example")  # creates files in an
HDFS directory

Use a tool called parquet-tools (downloadable using pip from
https://pypi.org/project/parquet-tools/)

Copy the Parquet files from HDFS to the current directory, say:

hdfs dfs -get /user/hduser/parquet_example .
cd ./parquet_example
do an ls and pick one of the part files like the one below to inspect:
 parquet-tools inspect
part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet

Now this is the output

 file meta data 
created_by: parquet-mr version 1.12.3 (build
f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
num_columns: 2
num_rows: 1
num_row_groups: 1
format_version: 1.0
serialized_size: 563


 Columns 
name
age

 Column(name) 
name: name
path: name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -5%)

 Column(age) 
name: age
path: age
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -5%)

File Information:

   - format_version: 1.0: This line explicitly states that the format
   version of the Parquet file is 1.0, which corresponds to Parquet version 1.
   - created_by: parquet-mr version 1.12.3: While this doesn't directly
   specify the format version, it is generally accepted that older versions of
   parquet-mr like 1.12.3 typically write Parquet version 1 files.

Since in this case Spark 3.4 is capable of reading both versions (1 and 2),
you don't necessarily need to modify your Spark code to access this file.
However, if you want to create Parquet files in version 2 using Spark, you
might need to consider additional changes like excluding parquet-mr or
upgrading the Parquet libraries and doing a custom build of Spark. However,
given the law of diminishing returns, I would not advise that either. You can
of course use gzip for compression if that is more suitable for your needs.
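
As a cross-check, a small pyarrow sketch (assuming pyarrow is installed, e.g.
pip install pyarrow) reads the same footer fields that parquet-tools reports:

import pyarrow.parquet as pq

# Inspect the footer of one of the part files copied from HDFS; the file name
# is the example from the output above.
meta = pq.ParquetFile(
    "part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet"
).metadata

print(meta.format_version)  # "1.0" for the file inspected above
print(meta.created_by)      # "parquet-mr version 1.12.3 ..."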

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 16 Apr 2024 at 15:00, Prem Sahoo  wrote:

> Hello Community,
> Could any of you shed some light on below questions please ?
> Sent from my iPhone
>
> On Apr 15, 2024, at 9:02 PM, Prem Sahoo  wrote:
>
> 
> Any specific reason spark does not support or community doesn't want to go
> to Parquet V2 , which is more optimized and read and write is too much
> faster (form other component which I am using)
>
> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue  wrote:
>
>> Spark will read data written with v2 encodings just fine. You just don't
>> need to worry about making Spark produce v2. And you should probably also
>> not produce v2 encodings from other systems.
>>
>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo  wrote:
>>
>>> oops but so spark does not support parquet V2  atm ?, as We have a use
>>> case where we need parquet V2 as  one of our components uses Parquet V2 .
>>>
>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue  wrote:
>>>
>>>> Hi Prem,
>>>>
>>>> Parquet v1 is the default because v2 has not been finalized and adopted
>>>> by the community. I highly recommend not using v2 encodings at this time.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo 
>>>> wrote:
>>>>
>>>>> I am using spark 3.2.0 . but my spark package comes with parquet-mr
>>>>> 1.2.1 which writes in parquet version 1 not version version 2:(. so I was
>>>>> looking how to write in Parquet version2 ?
>>>>>
>>>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
>>>>> mich.talebza

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Sorry, you have a point there. It was released in version 3.0.0. What version
of Spark are you using?

Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 15 Apr 2024 at 21:33, Prem Sahoo  wrote:

> Thank you so much for the info! But do we have any release notes where it
> says spark2.4.0 onwards supports parquet version 2. I was under the
> impression Spark3.0 onwards it started supporting .
>
>
>
>
> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh 
> wrote:
>
>> Well if I am correct, Parquet version 2 support was introduced in Spark
>> version 2.4.0. Therefore, any version of Spark starting from 2.4.0 supports
>> Parquet version 2. Assuming that you are using Spark version  2.4.0 or
>> later, you should be able to take advantage of Parquet version 2 features.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo  wrote:
>>
>>> Thank you for the information!
>>> I can use any version of parquet-mr to produce parquet file.
>>>
>>> regarding 2nd question .
>>> Which version of spark is supporting parquet version 2?
>>> May I get the release notes where parquet versions are mentioned ?
>>>
>>>
>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Parquet-mr is a Java library that provides functionality for working
>>>> with Parquet files with hadoop. It is therefore  more geared towards
>>>> working with Parquet files within the Hadoop ecosystem, particularly using
>>>> MapReduce jobs. There is no definitive way to check exact compatible
>>>> versions within the library itself. However, you can have a look at this
>>>>
>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>> expert opinions (Werner
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>
>>>>
>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo  wrote:
>>>>
>>>>> Hello Team,
>>>>> May I know how to check which version of parquet is supported by
>>>>> parquet-mr 1.2.1 ?
>>>>>
>>>>> Which version of parquet-mr is supporting parquet version 2 (V2) ?
>>>>>
>>>>> Which version of spark is supporting parquet version 2?
>>>>> May I get the release notes where parquet versions are mentioned ?
>>>>>
>>>>


Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Well, if I am correct, Parquet version 2 support was introduced in Spark
version 2.4.0. Therefore, any version of Spark starting from 2.4.0 supports
Parquet version 2. Assuming that you are using Spark version 2.4.0 or
later, you should be able to take advantage of Parquet version 2 features.

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 15 Apr 2024 at 20:53, Prem Sahoo  wrote:

> Thank you for the information!
> I can use any version of parquet-mr to produce parquet file.
>
> regarding 2nd question .
> Which version of spark is supporting parquet version 2?
> May I get the release notes where parquet versions are mentioned ?
>
>
> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh 
> wrote:
>
>> Parquet-mr is a Java library that provides functionality for working
>> with Parquet files with hadoop. It is therefore  more geared towards
>> working with Parquet files within the Hadoop ecosystem, particularly using
>> MapReduce jobs. There is no definitive way to check exact compatible
>> versions within the library itself. However, you can have a look at this
>>
>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo  wrote:
>>
>>> Hello Team,
>>> May I know how to check which version of parquet is supported by
>>> parquet-mr 1.2.1 ?
>>>
>>> Which version of parquet-mr is supporting parquet version 2 (V2) ?
>>>
>>> Which version of spark is supporting parquet version 2?
>>> May I get the release notes where parquet versions are mentioned ?
>>>
>>


Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Parquet-mr is a Java library that provides functionality for working with
Parquet files in Hadoop. It is therefore more geared towards working with
Parquet files within the Hadoop ecosystem, particularly using MapReduce
jobs. There is no definitive way to check exact compatible versions within
the library itself. However, you can have a look at this:

https://github.com/apache/parquet-mr/blob/master/CHANGES.md
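
As a quick local check, you can also list the Parquet jars bundled with a
Spark distribution; the parquet-mr version is embedded in the jar names. A
small sketch, assuming SPARK_HOME points at your installation:

import glob
import os

spark_home = os.environ["SPARK_HOME"]
# e.g. parquet-column-1.12.2.jar, parquet-hadoop-1.12.2.jar, ...
for jar in sorted(glob.glob(os.path.join(spark_home, "jars", "parquet-*.jar"))):
    print(os.path.basename(jar))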

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 15 Apr 2024 at 18:59, Prem Sahoo  wrote:

> Hello Team,
> May I know how to check which version of parquet is supported by
> parquet-mr 1.2.1 ?
>
> Which version of parquet-mr is supporting parquet version 2 (V2) ?
>
> Which version of spark is supporting parquet version 2?
> May I get the release notes where parquet versions are mentioned ?
>


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-14 Thread Mich Talebzadeh
+1 for me

It makes Spark more compatible with other ANSI SQL-compliant products.
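
To illustrate what changes in practice, here is a minimal, illustrative
PySpark sketch (run against a local session): with spark.sql.ansi.enabled set
to true, an invalid cast raises an error instead of silently returning NULL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AnsiModeDemo").getOrCreate()

# Legacy (non-ANSI) behaviour: the bad cast silently yields NULL
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT) AS v").show()

# ANSI behaviour: the same query now fails at runtime
spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT CAST('abc' AS INT) AS v").show()
except Exception as e:
    print(type(e).__name__)  # an invalid-cast error is raised under ANSI mode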

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Sun, 14 Apr 2024 at 00:39, Dongjoon Hyun  wrote:

> Please vote on SPARK-4 to use ANSI SQL mode by default.
> The technical scope is defined in the following PR which is
> one line of code change and one line of migration guide.
>
> - DISCUSSION:
> https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
> - JIRA: https://issues.apache.org/jira/browse/SPARK-4
> - PR: https://github.com/apache/spark/pull/46013
>
> The vote is open until April 17th 1AM (PST) and passes
> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Use ANSI SQL mode by default
> [ ] -1 Do not use ANSI SQL mode by default because ...
>
> Thank you in advance.
>
> Dongjoon
>


Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-10 Thread Mich Talebzadeh
I read the SPIP. I have a number of points, if I may:

- Maturity of Gluten: as the excerpt mentions, Gluten is still under active
development without a stable release, and its feature set and stability IMO
are not yet mature. Integrating a non-core component could introduce risks
if it is not fully mature.
- Complexity: integrating Gluten's functionality into Spark might add
complexity to the codebase, potentially increasing maintenance overhead.
Users might need to learn about Gluten's functionality and potential
limitations for effective utilization.
- Performance overhead: the plan conversion process itself could introduce
some overhead compared to native Spark execution. The effectiveness of
performance optimizations from Gluten might vary depending on the specific
engine and workload.
- Potential compatibility issues: not all data processing engines might
have complete support for the Substrait standard, potentially limiting
the universality of the approach. There could be edge cases where plan
conversion or execution on a specific engine leads to unexpected behavior.
- Security: if other engines have different security models or access
controls, integrating them with Spark might require additional security
considerations.
- Integration and support in the cloud would also need consideration.

HTH

Technologist | Solutions Architect | Data Engineer  | Generative AI
Mich Talebzadeh,
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Wed, 10 Apr 2024 at 12:33, Wenchen Fan  wrote:

> It's good to reduce duplication between different native accelerators of
> Spark, and AFAIK there is already a project trying to solve it:
> https://substrait.io/
>
> I'm not sure why we need to do this inside Spark, instead of doing
> the unification for a wider scope (for all engines, not only Spark).
>
>
> On Wed, Apr 10, 2024 at 10:11 AM Holden Karau 
> wrote:
>
>> I like the idea of improving flexibility of Sparks physical plans and
>> really anything that might reduce code duplication among the ~4 or so
>> different accelerators.
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for sharing, Jia.
>>>
>>> I have the same questions like the previous Weiting's thread.
>>>
>>> Do you think you can share the future milestone of Apache Gluten?
>>> I'm wondering when the first stable release will come and how we can
>>> coordinate across the ASF communities.
>>>
>>> > This project is still under active development now, and doesn't have a
>>> stable release.
>>> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>>>
>>> In the Apache Spark community, Apache Spark 3.2 and 3.3 is the end of
>>> support.
>>> And, 3.4 will have 3.4.3 next week and 3.4.4 (another EOL release) is
>>> scheduled in October.
>>>
>>> For the SPIP, I guess it's applicable for Apache Spark 4.0.0 only if
>>> there is something we need to do from Spark side.
>>>
>> +1 I think any changes need to target 4.0
>>
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:
>>>
>>>> Apache Spark currently lacks an official mechanism to support
>>>> cross-platform execution of physical plans. The Gluten project offers a
>>>> mechanism that utilizes the Substrait standard to convert and optimize
>>>> Spark's physical plans. By introducing Gluten's plan conversion,
>>>> validation, and fallback mechanisms into Spark, we can significantly
>>>> enhance the portability and interoperability of Spark's physical plans,
>>>> enabling them to operate across a broader spectrum of execution
>>>> environments without requiring users to migrate, while also improving
>>>> Spark's execution efficiency through the utilization of Gluten's advanced
>>>> optimization techniques. And the integration of Gluten into Spark has
>>>> already shown significant performance improvements with ClickHouse and
>>>> Velox backends and has been successfully deployed in production by several
>>>> customers.
>>>>
>>>> References:
>>>> JIAR Ticket <https://issues.apache.org/jira/browse/SPARK-47773>
>>>> SPIP Doc
>>>> <https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing>
>>>>
>>>> Your feedback and comments are welcome and appreciated.  Thanks.
>>>>
>>>> Thanks,
>>>> Jia Ke
>>>>
>>>


Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi,

First thanks everyone for their contributions

I was going to reply to @Enrico Minack but noticed additional info. As I
understand it, Apache Uniffle, for example, is an incubating project aimed at
providing a pluggable shuffle service for Spark. So basically, what all these
"external shuffle services" have in common is that they offload shuffle data
management to external services, thus reducing the memory and CPU overhead on
Spark executors. That is great. While Uniffle and others enhance shuffle
performance and scalability, it would be great to integrate them with the
Spark UI. This may require additional development effort. I suppose the
interest would be to have these external metrics incorporated into Spark with
one look and feel. This may require customizing the UI to fetch and display
metrics or statistics from the external shuffle services. Has any project
done this?

Thanks

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 8 Apr 2024 at 14:19, Vakaris Baškirov 
wrote:

> I see that both Uniffle and Celebron support S3/HDFS backends which is
> great.
> In the case someone is using S3/HDFS, I wonder what would be the
> advantages of using Celebron or Uniffle vs IBM shuffle service plugin
> <https://github.com/IBM/spark-s3-shuffle> or Cloud Shuffle Storage Plugin
> from AWS
> <https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html>
> ?
>
> These plugins do not require deploying a separate service. Are there any
> advantages to using Uniffle/Celebron in the case of using S3 backend, which
> would require deploying a separate service?
>
> Thanks
> Vakaris
>
> On Mon, Apr 8, 2024 at 10:03 AM roryqi  wrote:
>
>> Apache Uniffle (incubating) may be another solution.
>> You can see
>> https://github.com/apache/incubator-uniffle
>>
>> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>>
>>> Mich Talebzadeh  wrote on Mon, 8 Apr 2024 at 07:15:
>>
>>> Splendid
>>>
>>> The configurations below can be used with k8s deployments of Spark.
>>> Spark applications running on k8s can utilize these configurations to
>>> seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3.
>>>
>>> For Google GCS we may have
>>>
>>> spark_config_gcs = {
>>> "spark.kubernetes.authenticate.driver.serviceAccountName":
>>> "service_account_name",
>>> "spark.hadoop.fs.gs.impl":
>>> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>>> "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>>> "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
>>> "/path/to/keyfile.json",
>>> }
>>>
>>> For Amazon S3 similar
>>>
>>> spark_config_s3 = {
>>> "spark.kubernetes.authenticate.driver.serviceAccountName":
>>> "service_account_name",
>>> "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>> "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>>> "spark.hadoop.fs.s3a.secret.key": "secret_key",
>>> }
>>>
>>>
>>> To implement these configurations and enable Spark applications to
>>> interact with GCS and S3, I guess we can approach it this way
>>>
>>> 1) Spark Repository Integration: These configurations need to be added
>>> to the Spark repository as part of the supported configuration options for
>>> k8s deployments.
>>>
>>> 2) Configuration Settings: Users need to specify these configurations
>>> when submitting Spark applications to a Kubernetes cluster. They can
>>> include these configurations in the Spark application code or pass them as
>>> command-line arguments or environment variables during application
>>> submission.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>>
>>> Technologist | Solutions Architect | D

Fwd: Apache Spark 3.4.3 (?)

2024-04-07 Thread Mich Talebzadeh
Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


-- Forwarded message -----
From: Mich Talebzadeh 
Date: Sun, 7 Apr 2024 at 11:56
Subject: Re: Apache Spark 3.4.3 (?)
To: Dongjoon Hyun 


Yes given that a good number of people are using some flavour of 3.4.n,
this will be a good fit.

+1 for me


Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Sat, 6 Apr 2024 at 23:02, Dongjoon Hyun  wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
>
> Dongjoon.
>


Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look.

Cheers

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Sun, 7 Apr 2024 at 15:08, Cheng Pan  wrote:

> Instead of External Shuffle Shufle, Apache Celeborn might be a good option
> as a Remote Shuffle Service for Spark on K8s.
>
> There are some useful resources you might be interested in.
>
> [1] https://celeborn.apache.org/
> [2] https://www.youtube.com/watch?v=s5xOtG6Venw
> [3] https://github.com/aws-samples/emr-remote-shuffle-service
> [4] https://github.com/apache/celeborn/issues/2140
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 6, 2024, at 21:41, Mich Talebzadeh 
> wrote:
> >
> > I have seen some older references for shuffle service for k8s,
> > although it is not clear they are talking about a generic shuffle
> > service for k8s.
> >
> > Anyhow with the advent of genai and the need to allow for a larger
> > volume of data, I was wondering if there has been any more work on
> > this matter. Specifically larger and scalable file systems like HDFS,
> > GCS , S3 etc, offer significantly larger storage capacity than local
> > disks on individual worker nodes in a k8s cluster, thus allowing
> > handling much larger datasets more efficiently. Also the degree of
> > parallelism and fault tolerance  with these files systems come into
> > it. I will be interested in hearing more about any progress on this.
> >
> > Thanks
> > .
> >
> > Mich Talebzadeh,
> >
> > Technologist | Solutions Architect | Data Engineer  | Generative AI
> >
> > London
> > United Kingdom
> >
> >
> >   view my Linkedin profile
> >
> >
> > https://en.everybodywiki.com/Mich_Talebzadeh
> >
> >
> >
> > Disclaimer: The information provided is correct to the best of my
> > knowledge but of course cannot be guaranteed . It is essential to note
> > that, as with any advice, quote "one test result is worth one-thousand
> > expert opinions (Werner Von Braun)".
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>


Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Splendid

The configurations below can be used with k8s deployments of Spark. Spark
applications running on k8s can utilize these configurations to seamlessly
access data stored in Google Cloud Storage (GCS) and Amazon S3.

For Google GCS we may have

spark_config_gcs = {
    "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
}

For Amazon S3 similar

spark_config_s3 = {
    "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.access.key": "s3_access_key",
    "spark.hadoop.fs.s3a.secret.key": "secret_key",
}


To implement these configurations and enable Spark applications to interact
with GCS and S3, I guess we can approach it this way

1) Spark Repository Integration: These configurations need to be added to
the Spark repository as part of the supported configuration options for k8s
deployments.

2) Configuration Settings: Users need to specify these configurations when
submitting Spark applications to a Kubernetes cluster. They can include
these configurations in the Spark application code or pass them as
command-line arguments or environment variables during application
submission.
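
As a minimal sketch of how these settings would be applied when building a
session (the service account name, key file path and bucket URI above and
below are placeholders):

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("K8sObjectStoreAccess")
# Apply one of the dictionaries above, e.g. spark_config_gcs or spark_config_s3
for key, value in spark_config_gcs.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()

# Read directly from the object store; use s3a://... for the S3 variant
df = spark.read.parquet("gs://my-bucket/path/to/data")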

HTH

Mich Talebzadeh,

Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov 
wrote:

> There is an IBM shuffle service plugin that supports S3
> https://github.com/IBM/spark-s3-shuffle
>
> Though I would think a feature like this could be a part of the main Spark
> repo. Trino already has out-of-box support for s3 exchange (shuffle) and
> it's very useful.
>
> Vakaris
>
> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh 
> wrote:
>
>>
>> Thanks for your suggestion that I take it as a workaround. Whilst this
>> workaround can potentially address storage allocation issues, I was more
>> interested in exploring solutions that offer a more seamless integration
>> with large distributed file systems like HDFS, GCS, or S3. This would
>> ensure better performance and scalability for handling larger datasets
>> efficiently.
>>
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
>> wrote:
>>
>>> You can make a PVC on K8S call it 300GB
>>>
>>> make a folder in yours dockerfile
>>> WORKDIR /opt/spark/work-dir
>>> RUN chmod g+w /opt/spark/work-dir
>>>
>>> start spark with adding this
>>>
>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
>>> "300gb") \
>>>
>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
>>> "/opt/spark/work-dir") \
>>>
>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>>> "False") \
>>>
>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
>>> "300gb"

Re: External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
Thanks for your suggestion, which I take as a workaround. Whilst this
workaround can potentially address storage allocation issues, I was more
interested in exploring solutions that offer a more seamless integration
with large distributed file systems like HDFS, GCS, or S3. This would
ensure better performance and scalability for handling larger datasets
efficiently.


Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
wrote:

> You can make a PVC on K8S call it 300GB
>
> make a folder in yours dockerfile
> WORKDIR /opt/spark/work-dir
> RUN chmod g+w /opt/spark/work-dir
>
> start spark with adding this
>
> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
> "300gb") \
>
> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
> "/opt/spark/work-dir") \
>
> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
> "False") \
>
> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
> "300gb") \
>
> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
> "/opt/spark/work-dir") \
>
> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
> "False") \
>   .config("spark.local.dir", "/opt/spark/work-dir")
>
>
>
>
> lør. 6. apr. 2024 kl. 15:45 skrev Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> I have seen some older references for shuffle service for k8s,
>> although it is not clear they are talking about a generic shuffle
>> service for k8s.
>>
>> Anyhow with the advent of genai and the need to allow for a larger
>> volume of data, I was wondering if there has been any more work on
>> this matter. Specifically larger and scalable file systems like HDFS,
>> GCS , S3 etc, offer significantly larger storage capacity than local
>> disks on individual worker nodes in a k8s cluster, thus allowing
>> handling much larger datasets more efficiently. Also the degree of
>> parallelism and fault tolerance  with these files systems come into
>> it. I will be interested in hearing more about any progress on this.
>>
>> Thanks
>> .
>>
>> Mich Talebzadeh,
>>
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
I have seen some older references for shuffle service for k8s,
although it is not clear they are talking about a generic shuffle
service for k8s.

Anyhow with the advent of genai and the need to allow for a larger
volume of data, I was wondering if there has been any more work on
this matter. Specifically, larger and scalable file systems like HDFS,
GCS, S3, etc. offer significantly larger storage capacity than local
disks on individual worker nodes in a k8s cluster, thus allowing much
larger datasets to be handled more efficiently. Also, the degree of
parallelism and fault tolerance offered by these file systems comes into
it. I will be interested in hearing about any progress on this.

Thanks
.

Mich Talebzadeh,

Technologist | Solutions Architect | Data Engineer  | Generative AI

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Scheduling jobs using FAIR pool

2024-04-01 Thread Mich Talebzadeh
Hi,

Have you put this question to the Databricks forum?

Data Engineering - Databricks
<https://community.databricks.com/t5/data-engineering/bd-p/data-engineering>
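
For reference, the core open-source Spark mechanics behind these questions
look roughly like this (a minimal sketch; the pool name and allocation file
path are placeholders):

from pyspark.sql import SparkSession

# FAIR scheduling is enabled application-wide; pools are defined in an
# allocation file and selected per thread via a local property.
spark = (
    SparkSession.builder
    .appName("FairPoolsDemo")
    .config("spark.scheduler.mode", "FAIR")
    .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    .getOrCreate()
)

# Jobs submitted from this thread go to the "streaming" pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "streaming")
spark.range(10_000_000).count()

# Subsequent jobs from this thread go back to the default pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "default")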


Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 1 Apr 2024 at 07:22, Varun Shah  wrote:

> Hi Community,
>
> I am currently exploring the best use of "Scheduler Pools" for executing
> jobs in parallel, and require clarification and suggestions on a few points.
>
> The implementation consists of executing "Structured Streaming" jobs on
> Databricks using AutoLoader. Each stream is executed with trigger =
> 'AvailableNow', ensuring that the streams don't keep running for the
> source. (we have ~4000 such streams, with no continuous stream from source,
> hence not keeping the streams running infinitely using other triggers).
>
> One way to achieve parallelism in the jobs is to use "MultiThreading", all
> using same SparkContext, as quoted from official docs: "Inside a given
> Spark application (SparkContext instance), multiple parallel jobs can run
> simultaneously if they were submitted from separate threads."
>
> There's also a availability of "FAIR Scheduler", which instead of FIFO
> Scheduler (default), assigns executors in Round-Robin fashion, ensuring the
> smaller jobs that were submitted later do not starve due to bigger jobs
> submitted early consuming all resources.
>
> Here are my questions:
> 1. The Round-Robin distribution of executors only work in case of empty
> executors (achievable by enabling dynamic allocation). In case the jobs
> (part of the same pool) requires all executors, second jobs will still need
> to wait.
> 2. If we create dynamic pools for submitting each stream (by setting spark
> property -> "spark.scheduler.pool" to a dynamic value as
> spark.sparkContext.setLocalProperty("spark.scheduler.pool", "<pool name>"), how does executor allocation happen? Since all pools created
> are created dynamically, they share equal weight. Does this also work the
> same way as submitting streams to a single pool as a FAIR scheduler ?
> 3. Official docs quote "inside each pool, jobs run in FIFO order.". Is
> this true for the FAIR scheduler also ? By definition, it does not seem
> right, but it's confusing. It says "By Default" , so does it mean for FIFO
> scheduler or by default for both scheduling types ?
> 4. Are there any overhead for spark driver while creating / using a
> dynamically created spark pool vs pre-defined pools ?
>
> Apart from these, any suggestions or ways you have implemented
> auto-scaling for such loads ? We are currently trying to auto-scale the
> resources based on requests, but scaling down is an issue (known already
> for which SPIP is already in discussion, but it does not cater to
> submitting multiple streams in a single cluster.
>
> Thanks for reading !! Looking forward to your suggestions
>
> Regards,
> Varun Shah
>
>
>
>
>


Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread Mich Talebzadeh
Looks fine, except that processing all Unicode whitespace characters might
add overhead to the parsing process, potentially impacting performance,
although I think this is a moot point.

+1

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Wed, 27 Mar 2024 at 22:57, Gengliang Wang  wrote:

> +1, this is a reasonable change.
>
> Gengliang
>
> On Wed, Mar 27, 2024 at 9:54 AM serge rielau.com  wrote:
>
>> Going once, going twice, …. last call for objections
>> On Mar 23, 2024 at 5:29 PM -0700, serge rielau.com ,
>> wrote:
>>
>> Hello,
>>
>> I have a PR https://github.com/apache/spark/pull/45620  ready to go that
>> will extend the definition of whitespace (what separates token) from the
>> small set of ASCII characters space, tab, linefeed to those defined in
>> Unicode.
>> While this is a small and safe change, it is one where we would have a
>> hard time changing our minds about later.
>> It is also a change that, AFAIK, cannot be controlled under a config.
>>
>> What does the community think?
>>
>> Cheers
>> Serge
>> SQL Architect at Databricks
>>
>>


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-26 Thread Mich Talebzadeh
Hi Pavan,

Thanks for initiating this proposal. It looks like the proposal is ready and
has enough votes to be implemented. Having a shepherd will make it more
fruitful.

I will leave it to @Jungtaek Lim  's
capable hands to drive it forward.

Will be there to help if needed.

Cheers

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 26 Mar 2024 at 10:02, Pavan Kotikalapudi 
wrote:

> Hi Bhuwan,
>
> Glad to hear back from you! Very much appreciate your help on reviewing
> the design doc/PR and endorsing this proposal.
>
> Thank you so much @Jungtaek Lim  , @Mich
> Talebzadeh   for graciously agreeing to
> mentor/shepherd this effort.
>
> Regarding Twilio copyright in Notice binary file:
> Twilio Opensource counsel was involved all through the process, I have
> placed it in the project file prior to Twilio signing a CCLA for the spark
> project contribution( Aug '23).
>
> Since the CCLA is signed now, I have removed the twilio copyright from
> that file. I didn't get a chance to update the PR after github-actions
> closed it.
>
> Please let me know of next steps needed to bring this draft PR/effort to
> completion.
>
> Thank you,
>
> Pavan
>
>
> On Tue, Mar 26, 2024 at 12:01 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> I'm happy to, but it looks like I need to check one more thing about the
>> license, according to the WIP PR
>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352__;!!NCc8flgU!a1C5BeYxzO7gVVrGZ56kzunhigqd4SeXMg3dHddtkIdIpO5UwFH3dxzNpK3bc53vuAkFYJ3goLU8Hxev8npLyDrA6JBQ8S0$>
>> .
>>
>> @Pavan Kotikalapudi 
>> I see you've added the copyright of Twilio in the NOTICE-binary file,
>> which makes me wonder if Twilio had filed CCLA to the Apache Software
>> Foundation.
>>
>> PMC members can correct me if I'm mistaken, but from my understanding
>> (and experiences of PMC member in other ASF project), code contribution is
>> considered as code donation and copyright belongs to ASF. That's why you
>> can't find the copyright of employers for contributors in the codebase.
>> What you see copyrights in NOTICE-binary is due to the fact we have binary
>> dependency and their licenses may require to explicitly mention about
>> copyright. It's not about direct code contribution.
>>
>> Is Twilio aware of this? Also, if Twilio did not file CCLA in prior,
>> could you please engage with a relevant group in the company (could be a
>> legal team, or similar with OSS advocate team if there is any) and ensure
>> that CCLA is filed? The copyright issue is a legal issue, so we have to be
>> conservative and 100% sure that the employer is aware of what is the
>> meaning of donating the code to ASF via reviewing CCLA and relevant doc,
>> and explicitly express that they are OK with it via filing CCLA.
>>
>> You can read the description of agreements on contribution and ICLA/CCLA
>> form from this page.
>> https://www.apache.org/licenses/contributor-agreements.html
>> <https://urldefense.com/v3/__https://www.apache.org/licenses/contributor-agreements.html__;!!NCc8flgU!a1C5BeYxzO7gVVrGZ56kzunhigqd4SeXMg3dHddtkIdIpO5UwFH3dxzNpK3bc53vuAkFYJ3goLU8Hxev8npLyDrAktmm6BY$>
>>
>> Please let me know if this is resolved. This seems to me as a blocker to
>> move on. Please also let me know if the contribution is withdrawn from the
>> employer.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Mon, Mar 25, 2024 at 11:47 PM Bhuwan Sahni
>>  wrote:
>>
>>> Hi Pavan,
>>>
>>> I looked at the PR, and the changes look simple and contained. It would
>>> be useful to add dynamic resource allocation to Spark Structured Streaming.
>>>
>>> Jungtaek. Would you be able to shepherd this change?
>>>
>>>
>>> On Tue, Mar 19, 2024 at 10:38 AM Bhuwan Sahni <
>>> bhuwan.sa...@databricks.com> wrote:
>>>
>>>> Thanks a lot for creating the risk table Pavan. My apologies. I was
>>>> tied up with high priority items for the last couple weeks and could not

Re: Improved Structured Streaming Documentation Proof-of-Concept

2024-03-25 Thread Mich Talebzadeh
Hi,

Your intended work on improving the Structured Streaming documentation is
great! Clear and well-organized instructions are important for everyone
using Spark, beginners and experts alike.
Having said that, Spark Structured Streaming, much like other specialist
Spark topics (say, k8s), cannot be mastered by documentation alone. These
topics require a considerable amount of practice and, so to speak, trench
warfare to master. Suffice it to say that I agree with the proposal of
adding examples. However, it is an area that many try to master but fail at
(judging by the typical issues brought up in the user group and elsewhere).
Perhaps a section such as the proposed "Knowledge Sharing Hub" may become
more relevant here. Moreover, the examples have to reflect real-life
scenarios; otherwise they will be of limited use.

HTH

Mich Talebzadeh,
Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 25 Mar 2024 at 21:19, Neil Ramaswamy  wrote:

> Hi all,
>
> I recently started an effort to improve the Structured Streaming
> documentation. I thought that the current documentation, while very
> comprehensive, could be improved in terms of organization, clarity, and
> presence of examples.
>
> You can view the repo here
> <https://github.com/neilramaswamy/structured-streaming>, and you can see
> a preview of the site here <https://structured-streaming.vercel.app/>.
> It's almost at full parity with the programming guide, and it also has
> additional content, like a guide on unit testing and an in-depth
> explanation of watermarks. I think it's at a point where we can bring this
> to completion if it's something that the community wants.
>
> I'd love to hear feedback from everyone: is this something that we would
> want to move forward with? As it borrows certain parts from the programming
> guide, it has an Apache License, so I'd be more than happy if it is adopted
> by an official Spark repo.
>
> Best,
> Neil
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
I concur. Whilst Databricks' (a commercial entity) Knowledge Sharing Hub
can be a useful resource for sharing knowledge and engaging with their
respective community, the ASF likely prioritizes platforms and channels that
align more closely with its principles of open source and vendor
neutrality.

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 19 Mar 2024 at 21:14, Steve Loughran 
wrote:

>
> ASF will be unhappy about this. and stack overflow exists. otherwise:
> apache Confluent and linkedIn exist; LI is the option I'd point at
>
> On Mon, 18 Mar 2024 at 10:59, Mich Talebzadeh 
> wrote:
>
>> Some of you may be aware that Databricks community Home | Databricks
>> have just launched a knowledge sharing hub. I thought it would be a
>> good idea for the Apache Spark user group to have the same, especially
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>> Streaming, Spark MLlib and so forth.
>>
>> Apache Spark user and dev groups have been around for a good while.
>> They are serving their purpose. We went through creating a Slack
>> community that managed to create more heat than light. This is
>> what the Databricks community came up with, and I quote:
>>
>> "Knowledge Sharing Hub
>> Dive into a collaborative space where members like YOU can exchange
>> knowledge, tips, and best practices. Join the conversation today and
>> unlock a wealth of collective wisdom to enhance your experience and
>> drive success."
>>
>> I don't know the logistics of setting it up, but I am sure that should
>> not be that difficult. If anyone is supportive of this proposal, let
>> the usual +1, 0, -1 decide
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
One option that comes to my mind is that, given the cyclic nature of these
types of proposals in these two forums, we should be able to use
Databricks' existing knowledge sharing hub, Knowledge Sharing Hub -
Databricks
<https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>,
as well.

The majority of topics will be of interest to their audience as well. In
addition, they seem to invite everyone to contribute. Unless you have an
overriding concern about why we should not take this approach, I can enquire
with the Databricks community managers whether they can entertain this idea.
They seem to have a well-defined structure for hosting topics.

Let me know your thoughts

Thanks
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 19 Mar 2024 at 08:25, Joris Billen 
wrote:

> +1
>
>
> On 18 Mar 2024, at 21:53, Mich Talebzadeh 
> wrote:
>
> Well as long as it works.
>
> Please all check this link from Databricks and let us know your thoughts.
> Will something similar work for us? Of course, Databricks has much deeper
> pockets than our ASF community. Will it require moderation on our side to
> block spam and nutcases?
>
> Knowledge Sharing Hub - Databricks
> <https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>
>
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
> wrote:
>
>> something like this  Spark community · GitHub
>> <https://github.com/Spark-community>
>>
>>
>> On Mon, 18 Mar 2024 at 17:26, Parsian, Mahmoud wrote:
>>
>>> Good idea. Will be useful
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From: *ashok34...@yahoo.com.INVALID 
>>> *Date: *Monday, March 18, 2024 at 6:36 AM
>>> *To: *user @spark , Spark dev list <
>>> dev@spark.apache.org>, Mich Talebzadeh 
>>> *Cc: *Matei Zaharia 
>>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>>> Apache Spark Community
>>>
>>> External message, be mindful when clicking links or attachments
>>>
>>>
>>>
>>> Good idea. Will be useful
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>>
>>>
>>>
>>> Some of you may be aware that Databricks community Home | Databricks
>>>
>>> have just launched a knowledge sharing hub. I thought it would be a
>>>
>>> good idea for the Apache Spark user group to have the same, especially
>>>
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>>
>>> Streaming, Spark MLlib and so forth.
>>>
>>>
>>>
>>> Apache Spark user and dev groups have been around for a good while.
>>>
>>> They are serving their purpose. We went through creating a Slack
>>>
>>> community that managed to create more heat than light. This is
>>>
>>> what Databricks community came up with and I quote
>>>
>>>
>>>
>>> "Knowledge Sharing Hub
>>>
>>> Dive into a collaborative space where members like YOU can ex

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
OK thanks for the update.

What does "officially blessed" signify here? Can we have and run it as a
sister site? The reason this comes to my mind is that the interested
parties should have easy access to this site (from ISUG Spark sites) as a
reference repository. I guess the advice would be that the information
(topics) is provided as best effort and cannot be guaranteed.

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 18 Mar 2024 at 21:04, Reynold Xin  wrote:

> One of the problem in the past when something like this was brought up was
> that the ASF couldn't have officially blessed venues beyond the already
> approved ones. So that's something to look into.
>
> Now of course you are welcome to run unofficial things unblessed as long
> as they follow trademark rules.
>
>
>
> On Mon, Mar 18, 2024 at 1:53 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Well as long as it works.
>>
>> Please all check this link from Databricks and let us know your thoughts.
>> Will something similar work for us? Of course, Databricks has much deeper
>> pockets than our ASF community. Will it require moderation on our side to
>> block spam and nutcases?
>>
>> Knowledge Sharing Hub - Databricks
>> <https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
>> wrote:
>>
>>> something like this  Spark community · GitHub
>>> <https://github.com/Spark-community>
>>>
>>>
>>> On Mon, 18 Mar 2024 at 17:26, Parsian, Mahmoud <
>>> mpars...@illumina.com.invalid> wrote:
>>>
>>>> Good idea. Will be useful
>>>>
>>>>
>>>>
>>>> +1
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From: *ashok34...@yahoo.com.INVALID 
>>>> *Date: *Monday, March 18, 2024 at 6:36 AM
>>>> *To: *user @spark , Spark dev list <
>>>> dev@spark.apache.org>, Mich Talebzadeh 
>>>> *Cc: *Matei Zaharia 
>>>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>>>> Apache Spark Community
>>>>
>>>> External message, be mindful when clicking links or attachments
>>>>
>>>>
>>>>
>>>> Good idea. Will be useful
>>>>
>>>>
>>>>
>>>> +1
>>>>
>>>>
>>>>
>>>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Some of you may be aware that Databricks community Home | Databricks
>>>>
>>>> have just launched a knowledge sharing hub. I thought it would be a
>>>>
>>>> good idea for the Apache Spark user group to have the same, especially
>>>>
>>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>>>
>>>> Streaming, Spark MLlib and so forth.
>>>>
>>>>
>>>>
>>>> Apache Spark user and dev groups have been around for a good while.
>>>>
>>>> They are serving their purpose . We went through c

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
Well as long as it works.

Please all check this link from Databricks and let us know your thoughts.
Will something similar work for us? Of course, Databricks has much deeper
pockets than our ASF community. Will it require moderation on our side to
block spam and nutcases?

Knowledge Sharing Hub - Databricks
<https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
wrote:

> something like this  Spark community · GitHub
> <https://github.com/Spark-community>
>
>
> On Mon, 18 Mar 2024 at 17:26, Parsian, Mahmoud wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> dev@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark MLlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose. We went through creating a Slack
>>
>> community that managed to create more heat than light. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up, but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> <https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
+1 for me

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud 
wrote:

> Good idea. Will be useful
>
>
>
> +1
>
>
>
>
>
>
>
> *From: *ashok34...@yahoo.com.INVALID 
> *Date: *Monday, March 18, 2024 at 6:36 AM
> *To: *user @spark , Spark dev list <
> dev@spark.apache.org>, Mich Talebzadeh 
> *Cc: *Matei Zaharia 
> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache
> Spark Community
>
> External message, be mindful when clicking links or attachments
>
>
>
> Good idea. Will be useful
>
>
>
> +1
>
>
>
> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
>
>
> Some of you may be aware that Databricks community Home | Databricks
>
> have just launched a knowledge sharing hub. I thought it would be a
>
> good idea for the Apache Spark user group to have the same, especially
>
> for repeat questions on Spark core, Spark SQL, Spark Structured
>
> Streaming, Spark MLlib and so forth.
>
>
>
> Apache Spark user and dev groups have been around for a good while.
>
> They are serving their purpose. We went through creating a Slack
>
> community that managed to create more heat than light. This is
>
> what Databricks community came up with and I quote
>
>
>
> "Knowledge Sharing Hub
>
> Dive into a collaborative space where members like YOU can exchange
>
> knowledge, tips, and best practices. Join the conversation today and
>
> unlock a wealth of collective wisdom to enhance your experience and
>
> drive success."
>
>
>
> I don't know the logistics of setting it up, but I am sure that should
>
> not be that difficult. If anyone is supportive of this proposal, let
>
> the usual +1, 0, -1 decide
>
>
>
> HTH
>
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>
>
>   view my Linkedin profile
>
>
>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
> <https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>
>
>
>
>
>
>
>
> Disclaimer: The information provided is correct to the best of my
>
> knowledge but of course cannot be guaranteed . It is essential to note
>
> that, as with any advice, quote "one test result is worth one-thousand
>
> expert opinions (Werner Von Braun)".
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark MLlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose. We went through creating a Slack
community that managed to create more heat than light. This is
what the Databricks community came up with, and I quote:

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up, but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Enhanced Console Sink for Structured Streaming

2024-03-12 Thread Mich Talebzadeh
OK, I have just been working on a Databricks engineering question raised by
a user:

Monitoring structure streaming in external sink
<https://community.databricks.com/t5/data-engineering/monitoring-structure-streaming-in-externar-sink/td-p/63069>

In practice there is an option to use *StreamingQueryListener* from
*pyspark.sql.streaming* (import DataStreamWriter, StreamingQueryListener)
to get the metrics out for each batch.

For example, onQueryProgress receives microbatch_data such as:

{
  "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
  "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
  "name" : null,
  "timestamp" : "2024-03-10T09:21:27.233Z",
  "batchId" : 21,
  "numInputRows" : 1,
  "inputRowsPerSecond" : 100.0,
  "processedRowsPerSecond" : 5.347593582887701,
  "durationMs" : {
    "addBatch" : 37,
    "commitOffsets" : 41,
    "getBatch" : 0,
    "latestOffset" : 0,
    "queryPlanning" : 5,
    "triggerExecution" : 187,
    "walCommit" : 104
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=default",
    etc
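
For illustration only, here is a minimal PySpark sketch of wiring up such a
listener. It assumes Spark 3.4+ (where StreamingQueryListener is exposed to
Python) and uses the rate source purely as a stand-in; the ProgressPrinter
name is mine, not something from the thread above.

from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

class ProgressPrinter(StreamingQueryListener):
    """Illustrative listener that prints the per-microbatch progress JSON."""
    def onQueryStarted(self, event):
        print(f"query started: {event.id}")
    def onQueryProgress(self, event):
        # event.progress carries the fields shown above
        # (batchId, numInputRows, durationMs, sources, ...)
        print(event.progress.json)
    def onQueryIdle(self, event):
        pass
    def onQueryTerminated(self, event):
        print(f"query terminated: {event.id}")

spark = SparkSession.builder.appName("listener-demo").getOrCreate()
spark.streams.addListener(ProgressPrinter())

stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
query = stream.writeStream.format("console").outputMode("append").start()
query.awaitTermination(30)   # let it run for ~30 seconds
query.stop()
spark.stop()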

Will that help?

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Fri, 9 Feb 2024 at 22:39, Neil Ramaswamy 
wrote:

> Thanks for the comments, Anish and Jerry. To summarize so far, we are in
> agreement that:
>
> 1. Enhanced console sink is a good tool for new users to understand
> Structured Streaming semantics
> 2. It should be opt-in via an option (unlike my original proposal)
> 3. Out of the 2 modes of verbosity I proposed, we're fine with the first
> mode for now (print sink data with event-time metadata and state data for
> stateful queries, with duration-rendered timestamps, with just the
> KeyWithIndexToValue state store for joins, and with a state table for every
> stateful operator, if there are multiple).
>
> I think the last pending suggestion (from Raghu, Anish, and Jerry) is how
> to structure the output so that it's clear what is data and what is
> metadata. Here's my proposal:
>
> -------------------------------------
> BATCH: 1
> -------------------------------------
>
> +-----------------------------------+
> |       ROWS WRITTEN TO SINK        |
> +---------------------------+-------+
> | window                    | count |
> +---------------------------+-------+
> | {10 seconds, 20 seconds}  |   2   |
> +---------------------------+-------+
>
> +-----------------------------------+
> |        EVENT TIME METADATA        |
> +-----------------------------------+
> | watermark -> 21 seconds           |
> | numDroppedRows -> 0               |
> +-----------------------------------+
>
> +-----------------------------------+
> |        ROWS IN STATE STORE        |
> +---------------------------+-------+
> | key                       | value |
> +---------------------------+-------+
> | {30 seconds, 40 seconds}  |  {1}  |
> +---------------------------+-------+
>
> If there are no more major concerns, I think we can discuss smaller
> details in the JIRA ticket or PR itself. I don't think a SPIP is needed for
> a flag-gated benign change like this, but please let me know if you
> disagree.
>
> Best,
> Neil
>
> On Thu, Feb 8, 2024 at 5:37 PM Jerry Peng 
> wrote:
>
>> I am generally a +1 on this as we can use this information in our docs to
>> demonstrate certains concepts to potential users.
>>
>> I am in agreement with other reviewers that we should keep the existing
>> default behavior of the console sink.  This new style of output should be
>> enabled behind a flag.
>>
>> As for the output of this "new mode" in the console sink, can we be more
>> explicit about what is the actual output and what is the metadata?  It is
>> not clear from the logged output.
>>
>> On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
>>  wrote:
>>
>>> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose
>>&

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Mich Talebzadeh
+1

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 11 Mar 2024 at 09:27, Hyukjin Kwon  wrote:

> +1
>
> On Mon, 11 Mar 2024 at 18:11, yangjie01 
> wrote:
>
>> +1
>>
>>
>>
>> Jie Yang
>>
>>
>>
>> *From: *Haejoon Lee 
>> *Date: *Monday, 11 March 2024, 17:09
>> *To: *Gengliang Wang 
>> *Cc: *dev 
>> *Subject: *Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark
>>
>>
>>
>> +1
>>
>>
>>
>> On Mon, Mar 11, 2024 at 10:36 AM Gengliang Wang  wrote:
>>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Structured Logging Framework for
>> Apache Spark
>>
>>
>> References:
>>
>>- JIRA ticket
>>
>> <https://mailshield.baidu.com/check?q=godVZoGJGzagfL5fHFKDXe8FOsAuf3UaY0E7uyGx6HVUGGWsmD%2fgOW2x6J1A1XYt8pai0Y8FBhY%3d>
>>- SPIP doc
>>
>> <https://mailshield.baidu.com/check?q=qnzij19o7FucfHJ%2f4C2cBnMVM2kxjtEi9Gv4zA05b3oPw5UX986BZOwzaJ30UdGRMv%2fix31TYpjtazJC5uyypG0pZVBCfSjQGqlzkUoZozkFtgMXfpmRMSSp1%2bq83gkbLyrm1g%3d%3d>
>>- Discussion thread
>>
>> <https://mailshield.baidu.com/check?q=6PGfLtMnDpsSvIF5SlbpQ4%2bwdg53GCedx5r%2b7AOnYMjYwomNs%2fBioZOabP9Ml3b%2bE8jzqXF0xR3j607DdbjV0JOnlvU%3d>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>>
>> Gengliang Wang
>>
>>


Re: [DISCUSS] SPIP: Structured Spark Logging

2024-03-09 Thread Mich Talebzadeh
Splendid. Thanks Gengliang

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Sat, 9 Mar 2024 at 18:10, Gengliang Wang  wrote:

> Hi Mich,
>
> Thanks for your suggestions. I agree that we should avoid confusion with
> Spark Structured Streaming.
>
> So, I'll go with "Structured Logging Framework for Apache Spark". This
> keeps the standard term "Structured Logging" and distinguishes it from
> "Structured Streaming" clearly.
>
> Thanks for helping shape this!
>
> Best,
> Gengliang
>
> On Sat, Mar 2, 2024 at 12:19 PM Mich Talebzadeh 
> wrote:
>
>> Hi Gengliang,
>>
>> Thanks for taking the initiative to improve the Spark logging system.
>> Transitioning to structured logs seems like a worthy way to enhance the
>> ability to analyze and troubleshoot Spark jobs and hopefully  the future
>> integration with cloud logging systems. While "Structured Spark Logging"
>> sounds good, I was wondering if we could consider an alternative name.
>> Since we already use "Spark Structured Streaming", there might be a slight
>> initial confusion with the terminology. I must confess it was my initial
>> reaction so to speak.
>>
>> Here are a few alternative names I came up with if I may
>>
>>- Spark Log Schema Initiative
>>- Centralized Logging with Structured Data for Spark
>>- Enhanced Spark Logging with Queryable Format
>>
>> These options all highlight the key aspects of your proposal namely;
>> schema, centralized logging and queryability and might be even clearer for
>> everyone at first glance.
>>
>> Cheers
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Fri, 1 Mar 2024 at 10:07, Gengliang Wang  wrote:
>>
>>> Hi All,
>>>
>>> I propose to enhance our logging system by transitioning to structured
>>> logs. This initiative is designed to tackle the challenges of analyzing
>>> distributed logs from drivers, workers, and executors by allowing them to
>>> be queried using a fixed schema. The goal is to improve the informativeness
>>> and accessibility of logs, making it significantly easier to diagnose
>>> issues.
>>>
>>> Key benefits include:
>>>
>>>- Clarity and queryability of distributed log files.
>>>- Continued support for log4j, allowing users to switch back to
>>>traditional text logging if preferred.
>>>
>>> The improvement will simplify debugging and enhance productivity without
>>> disrupting existing logging practices. The implementation is estimated to
>>> take around 3 months.
>>>
>>> *SPIP*:
>>> https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing
>>> *JIRA*: SPARK-47240 <https://issues.apache.org/jira/browse/SPARK-47240>
>>>
>>> Your comments and feedback would be greatly appreciated.
>>>
>>


SPARK-44951, Improve Spark Dynamic Allocation

2024-03-08 Thread Mich Talebzadeh
Hi all,

On this ticket, Improve Spark Dynamic Allocation
<https://issues.apache.org/jira/browse/SPARK-44951>,
I see no movement since it was opened back in August 2023.

I may be wrong, of course.


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-03-05 Thread Mich Talebzadeh
Hi Jason,

I read your notes and the code simulating the problem at the link
https://issues.apache.org/jira/browse/SPARK-38388, and the specific
repartition issue (SPARK-38388) that this code aims to demonstrate.

The code below is from the above Jira:

import scala.sys.process._
import org.apache.spark.TaskContext

case class TestObject(id: Long, value: Double)

val ds = spark.range(0, 1000 * 1000, 1)
  .repartition(100, $"id")
  .withColumn("val", rand())
  .repartition(100)
  .map { row =>
    if (TaskContext.get.stageAttemptNumber == 0 &&
        TaskContext.get.attemptNumber == 0 &&
        TaskContext.get.partitionId > 97) {
      throw new Exception("pkill -f java".!!)
    }
    TestObject(row.getLong(0), row.getDouble(1))
  }

ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")

spark.sql("select count(distinct id) from tmp.test_table").show

This code *contains a potential security risk* by using scala.sys.process to
execute the pkill -f java command. While the code aims to demonstrate the
repartition issue, using pkill is IMO unnecessary and risky; it could
potentially terminate critical processes on the cluster as well. Instead of
throwing an exception based on partition ID, you can try to filter out
unwanted partitions before applying the map transformation, like below:

val filteredDS = ds.filter($"id".lt(98)) // Filter out partitions with ID
>= 98
filteredDS.map { row => TestObject(row.getLong(0), row.getDouble(1)) }

By using filteredDS for subsequent transformations or actions, you avoid
redundant processing and potential complications from the conditional logic
in the original map transformation. This approach is a safer simulation of
the repartition issue by only working with the filtered dataset, representing
the partitions that would have hypothetically succeeded.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".

On Mon, 4 Mar 2024 at 18:26, Jason Xu  wrote:

> Hi Prem,
>
> From the symptom of shuffle fetch failure and few duplicate data and few
> missing data, I think you might run into this correctness bug:
> https://issues.apache.org/jira/browse/SPARK-38388.
>
> Node/shuffle failure is hard to avoid, I wonder if you have
> non-deterministic logic and calling repartition() (round robin
> partitioning) in your code? If you can avoid either of these, you can avoid
> the issue from happening for now. To root fix the issue, it requires a
> non-trivial effort, I don't think there's a solution available yet.
>
> I have heard that there are community efforts to solve this issue, but I
> lack detailed information. Hopefully, someone with more knowledge can
> provide further insight.
>
> Best,
> Jason
>
> On Mon, Mar 4, 2024 at 9:41 AM Prem Sahoo  wrote:
>
>> super :(
>>
>> On Mon, Mar 4, 2024 at 6:19 AM Mich Talebzadeh 
>> wrote:
>>
>>> "... in a nutshell  if fetchFailedException occurs due to data node
>>> reboot then it  can create duplicate / missing data  .   so this is more of
>>> hardware(env issue ) rather than spark issue ."
>>>
>>> As an overall conclusion your point is correct but again the answer is
>>> not binary.
>>>
>>> Spark core relies on a distributed file system to store data across data
>>> nodes. When Spark needs to process data, it fetches the required blocks
>>> from the data nodes.* FetchFailedException*: means  that Spark
>>> encountered an error while fetching data blocks from a data node. If a data
>>> node reboots unexpectedly, it becomes unavailable to Spark for a
>>> period. During this time, Spark might attempt to fetch data blocks from the
>>> unavailable node, resulting in the FetchFailedException.. Depending on the
>>> timing and nature of the reboot and data access, this exception can lead
>>> to:the following:
>>>
>>>- Duplicate Data: If Spark retries the fetch operation successfully
>>>after the reboot, it might end up processing the same data twice, leading
>>>to duplicates.
>>>- Missing Data: If Spark cannot fetch all required data blocks due
>>>to the unavailable data node, some data might be miss

Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-03-04 Thread Mich Talebzadeh
"... in a nutshell  if fetchFailedException occurs due to data node reboot
then it  can create duplicate / missing data  .   so this is more of
hardware(env issue ) rather than spark issue ."

As an overall conclusion your point is correct but again the answer is not
binary.

Spark core relies on a distributed file system to store data across data
nodes. When Spark needs to process data, it fetches the required blocks
from the data nodes. *FetchFailedException* means that Spark encountered
an error while fetching data blocks from a data node. If a data node
reboots unexpectedly, it becomes unavailable to Spark for a period. During
this time, Spark might attempt to fetch data blocks from the unavailable
node, resulting in the FetchFailedException. Depending on the timing and
nature of the reboot and data access, this exception can lead to the
following:

   - Duplicate Data: If Spark retries the fetch operation successfully
   after the reboot, it might end up processing the same data twice, leading
   to duplicates.
   - Missing Data: If Spark cannot fetch all required data blocks due to
   the unavailable data node, some data might be missing from the processing
   results.

The root cause of this issue lies in the data node reboot itself. So we can
conclude that it is not a problem with Spark core functionality but rather
an environmental issue within the distributed storage system. You need to
ensure that your nodes are stable and minimise unexpected reboots for
whatever reason. Look at the host logs or run /usr/bin/dmesg to see what
happened.

Good luck

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 4 Mar 2024 at 01:30, Prem Sahoo  wrote:

> thanks Mich, in a nutshell  if fetchFailedException occurs due to data
> node reboot then it  can create duplicate / missing data  .   so this is
> more of hardware(env issue ) rather than spark issue .
>
>
>
> On Sat, Mar 2, 2024 at 7:45 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> It seems to me that there are issues related to below
>>
>> * I think when a task failed in between  and retry task started and
>> completed it may create duplicate as failed task has some data + retry task
>> has  full data.  but my question is why spark keeps delta data or
>> according to you if speculative and original task completes generally spark
>> kills one of the tasks to get rid of dups data.  when a data node is
>> rebooted then spark fault tolerant should go to other nodes isn't it ? then
>> why it has missing data.*
>>
>> Spark is designed to be fault-tolerant through lineage and recomputation.
>> However, there are scenarios where speculative execution or task retries
>> might lead to duplicated or missing data. So what are these?
>>
>> - Task Failure and Retry: You are correct that a failed task might have
>> processed some data before encountering the FetchFailedException. If a
>> retry succeeds, it would process the entire data partition again, leading
>> to duplicates. When a task fails, Spark may recompute the lost data by
>> recomputing the lost task on another node.  The output of the retried task
>> is typically combined with the output of the original task during the final
>> stage of the computation. This combination is done to handle scenarios
>> where the original task partially completed and generated some output
>> before failing. Spark does not intentionally store partially processed
>> data. However, due to retries and speculative execution, duplicate
>> processing can occur. To the best of my knowledge, Spark itself doesn't
>> have a mechanism to identify and eliminate duplicates automatically. While
>> Spark might sometimes kill speculative tasks if the original one finishes,
>> it is not a guaranteed behavior. This depends on various factors like
>> scheduling and task dependencies.
>>
>> - Speculative Execution: Spark supports speculative execution, where the
>> same task is launched on multiple executors simultaneously. The result of
>> the first completed task is used, and the others are usually killed to
>> avoid duplicated results. However, speculative execution might introduce
>> some duplication in the final output if tasks 

Re: [DISCUSS] SPIP: Structured Spark Logging

2024-03-02 Thread Mich Talebzadeh
Hi Gengliang,

Thanks for taking the initiative to improve the Spark logging system.
Transitioning to structured logs seems like a worthy way to enhance the
ability to analyze and troubleshoot Spark jobs and, hopefully, future
integration with cloud logging systems. While "Structured Spark Logging"
sounds good, I was wondering if we could consider an alternative name.
Since we already use "Spark Structured Streaming", there might be some
initial confusion with the terminology. I must confess that was my initial
reaction, so to speak.

Here are a few alternative names I came up with, if I may:

   - Spark Log Schema Initiative
   - Centralized Logging with Structured Data for Spark
   - Enhanced Spark Logging with Queryable Format

These options all highlight the key aspects of your proposal, namely
schema, centralized logging and queryability, and might be even clearer for
everyone at first glance.
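
Purely to illustrate the queryability aspect (and not the actual schema,
which the SPIP defines), here is a rough PySpark sketch of what reading such
JSON logs back could look like; the path and the field names "level" and
"logger" are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("structured-log-query").getOrCreate()

# Hypothetical location of JSON-formatted driver/executor logs; the real
# layout is whatever the SPIP ends up specifying.
logs = spark.read.json("/var/log/spark/structured/*.json")

(logs.filter(F.col("level") == "ERROR")    # assumed field name
     .groupBy("logger")                    # assumed field name
     .count()
     .orderBy(F.desc("count"))
     .show(truncate=False))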

Cheers

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Fri, 1 Mar 2024 at 10:07, Gengliang Wang  wrote:

> Hi All,
>
> I propose to enhance our logging system by transitioning to structured
> logs. This initiative is designed to tackle the challenges of analyzing
> distributed logs from drivers, workers, and executors by allowing them to
> be queried using a fixed schema. The goal is to improve the informativeness
> and accessibility of logs, making it significantly easier to diagnose
> issues.
>
> Key benefits include:
>
>- Clarity and queryability of distributed log files.
>- Continued support for log4j, allowing users to switch back to
>traditional text logging if preferred.
>
> The improvement will simplify debugging and enhance productivity without
> disrupting existing logging practices. The implementation is estimated to
> take around 3 months.
>
> *SPIP*:
> https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing
> *JIRA*: SPARK-47240 <https://issues.apache.org/jira/browse/SPARK-47240>
>
> Your comments and feedback would be greatly appreciated.
>


Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-03-02 Thread Mich Talebzadeh
Hi,

It seems to me that there are issues related to below

* I think when a task failed in between  and retry task started and
completed it may create duplicate as failed task has some data + retry task
has  full data.  but my question is why spark keeps delta data or
according to you if speculative and original task completes generally spark
kills one of the tasks to get rid of dups data.  when a data node is
rebooted then spark fault tolerant should go to other nodes isn't it ? then
why it has missing data.*

Spark is designed to be fault-tolerant through lineage and recomputation.
However, there are scenarios where speculative execution or task retries
might lead to duplicated or missing data. So what are these?

- Task Failure and Retry: You are correct that a failed task might have
processed some data before encountering the FetchFailedException. If a
retry succeeds, it would process the entire data partition again, leading
to duplicates. When a task fails, Spark may recompute the lost data by
re-running the failed task on another node. The output of the retried task
is typically combined with the output of the original task during the final
stage of the computation. This combination is done to handle scenarios
where the original task partially completed and generated some output
before failing. Spark does not intentionally store partially processed
data. However, due to retries and speculative execution, duplicate
processing can occur. To the best of my knowledge, Spark itself doesn't
have a mechanism to identify and eliminate duplicates automatically. While
Spark might sometimes kill speculative tasks if the original one finishes,
it is not a guaranteed behavior. This depends on various factors like
scheduling and task dependencies.

- Speculative Execution: Spark supports speculative execution, where the
same task is launched on multiple executors simultaneously. The result of
the first completed task is used, and the others are usually killed to
avoid duplicated results. However, speculative execution might introduce
some duplication in the final output if tasks on different executors
complete successfully.

- Node Reboots and Fault Tolerance: If the data node reboot leads to data
corruption or loss, that data might be unavailable to Spark. Even with
fault tolerance, Spark cannot recover completely missing data. Fault
tolerance focuses on recovering from issues like executor failures, not
data loss on storage nodes. Overall, Spark's fault tolerance is designed to
handle executor failures by rescheduling tasks on other available executors
and temporary network issues by retrying fetches based on configuration.

Here are some things to consider:

- Minimize retries: Adjust spark.shuffle.io.maxRetries to a lower value
such as 1 or 2 to reduce the chance of duplicate processing attempts, if
retries are suspected to be a source.
- Disable speculative execution if needed: Consider disabling speculative
execution (spark.speculation=false) if duplicates are a major concern.
However, this might impact performance.
- Data persistence: As mentioned in the previous reply, persist
intermediate data to reliable storage (HDFS, GCS, etc.) if data integrity
is critical. This ensures data availability even during node failures.
- Data validation checks: Implement data validation checks after processing
to identify potential duplicates or missing data.
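
For what it is worth, a minimal PySpark sketch of the first two knobs plus a
simple post-run duplicate check; the config values, path and key column are
purely illustrative and need tuning for your cluster.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shuffle-retry-tuning")
         .config("spark.shuffle.io.maxRetries", "2")    # default is 3
         .config("spark.shuffle.io.retryWait", "10s")   # wait between retries
         .config("spark.speculation", "false")          # no speculative tasks
         .getOrCreate())

df = spark.read.parquet("path/to/job_output")           # placeholder path

# Simple validation: any key appearing more than once is a duplicate.
dup_keys = df.groupBy("id").count().filter("count > 1")
print(f"duplicate keys found: {dup_keys.count()}")
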
HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Sat, 2 Mar 2024 at 01:43, Prem Sahoo  wrote:

> Hello Mich,
> thanks for your reply.
>
> As an engineer I can chip in. You may have partial execution and retries
> meaning when spark encounters a *FetchFailedException*, it  may retry
> fetching the data from the unavailable (the one being rebooted) node a few
> times before marking it permanently unavailable. However, if the rebooted
> node recovers quickly within this retry window, some executors might
> successfully fetch the data after a retry. *This leads to duplicate
> processing of the same data partition*.
>
>  data node reboot is taking more than 20 mins and our config
> spark.network.timeout=300s so we don't have dupls for the above reason.
> I am not sure this one applies to your spark version but spark may
> speculatively execute tasks on different executors to improve
> performance. If a task fails due to the *FetchFailedException*, a
> speculative 

Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-03-01 Thread Mich Talebzadeh
Hi,

Your point -> "When Spark job shows FetchFailedException it creates few
duplicate data and  we see few data also missing , please explain why. We
have scenario when  spark job complains *FetchFailedException as one of the
data node got ** rebooted middle of job running ."*

As an engineer I can chip in. You may have partial execution and retries,
meaning that when Spark encounters a *FetchFailedException*, it may retry
fetching the data from the unavailable (the one being rebooted) node a few
times before marking it permanently unavailable. However, if the rebooted
node recovers quickly within this retry window, some executors might
successfully fetch the data after a retry. *This leads to duplicate
processing of the same data partition*.

I am not sure this one applies to your spark version but spark may
speculatively execute tasks on different executors to improve
performance. If a task fails due to the *FetchFailedException*, a
speculative task might be launched on another executor. This is where fun
and games start. If the unavailable node recovers before the speculative
task finishes, both the original and speculative tasks might complete
successfully, *resulting in duplicates*. With regard to missing data, if
the data node reboot leads to data corruption or loss, some data partitions
might be completely unavailable. In this case, spark may skip processing
that missing data, leading to missing data in the final output.

Potential remedies: Spark offers some features to mitigate these issues,
but it might not guarantee complete elimination of duplicates or data
loss. You can adjust parameters like *spark.shuffle.io.retryWait* and
*spark.speculation* to control retry attempts and speculative execution
behavior. Lineage tracking is there to help: Spark can track data lineage,
allowing you to identify potentially corrupted or missing data in some
cases. You can also consider persisting intermediate data results to
reliable storage (like HDFS or GCS or another cloud storage) to avoid data
loss in case of node failures. Your mileage varies as it adds additional
processing overhead, but it can ensure data integrity.
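
As a rough sketch of the persistence point (paths, columns and the filter
are hypothetical; in practice point these at HDFS/GCS/S3 locations you
control):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-intermediate").getOrCreate()

raw = spark.read.parquet("hdfs:///data/raw/events")      # placeholder input

# Materialise the intermediate result on reliable storage ...
intermediate = raw.filter("event_type = 'purchase'")
intermediate.write.mode("overwrite").parquet("hdfs:///data/tmp/stage1")

# ... and read it back, so downstream stages start from durable files
# instead of recomputing the lineage through a possibly failed node.
stage1 = spark.read.parquet("hdfs:///data/tmp/stage1")
result = stage1.groupBy("user_id").count()
result.write.mode("overwrite").parquet("hdfs:///data/out/purchase_counts")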

HTH

Mich Talebzadeh,
Dad | Technologist
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Fri, 1 Mar 2024 at 20:56, Prem Sahoo  wrote:

> Hello All,
> in the list of JIRAs i didn't find anything related to
> fetchFailedException.
>
> as mentioned above
>
> "When Spark job shows FetchFailedException it creates few duplicate data
> and we see few data also missing , please explain why. We have a scenario
> when spark job complains FetchFailedException as one of the data nodes got
> rebooted in the middle of job running .
> Now due to this we have few duplicate data and few missing data . Why is
> spark not handling this scenario correctly ? kind of we shouldn't miss any
> data and we shouldn't create duplicate data . "
>
> We have to rerun the job again to fix this data quality issue . Please let
> me know why this case is not handled properly by Spark ?
>
> On Thu, Feb 29, 2024 at 9:50 PM Dongjoon Hyun 
> wrote:
>
>> Please use the url as thr full string including '()' part.
>>
>> Or you can seach directly at ASF Jira with 'Spark' project and three
>> labels, 'Correctness', 'correctness' and 'data-loss'.
>>
>> Dongjoon
>>
>> On Thu, Feb 29, 2024 at 11:54 Prem Sahoo  wrote:
>>
>>> Hello Dongjoon,
>>> Thanks for emailing me.
>>> Could you please share a list of fixes  as the link provided by you is
>>> not working.
>>>
>>> On Thu, Feb 29, 2024 at 11:27 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> If you are observing correctness issues, you may hit some old (and
>>>> fixed) correctness issues.
>>>>
>>>> For example, from Apache Spark 3.2.1 to 3.2.4, we fixed 31 correctness
>>>> issues.
>>>>
>>>>
>>>> https://issues.apache.org/jira/issues/?filter=12345390=project%20%3D%20SPARK%20AND%20fixVersion%20in%20(3.2.1%2C%203.2.2%2C%203.2.3%2C%203.2.4)%20AND%20labels%20in%20(Correctness%2C%20correctness%2C%20data-loss)
>>>>
>>>> There are more fixes in 3.3 and 3.4 and 3.5, too.
>>>>
>>>> Please use the latest version, Apache Spark 3.5.1, because Apache

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-01 Thread Mich Talebzadeh
Hi Bhuwan et al,

Thank you for passing on the DataBricks Structured Streaming team's review
of the SPIP document. FYI, I work closely with Pavan and other members to
help deliver this piece of work. We appreciate your insights, especially
regarding the cost savings potential from the PoC.

Pavan already furnished you with some additional info. Your team's point
about the SPIP currently addressing a specific use case (single streaming
query with Processing Time trigger) is well-taken. We agree that
maintaining simplicity is key, particularly as we explore more general
resource allocation mechanisms in the future. To address the concerns and
foster open discussion, the Databricks team is invited to directly add
their comments and suggestions to the Jira itself:

[SPARK-24815] Structured Streaming should support dynamic allocation - ASF
JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-24815>
This will ensure everyone involved can benefit from your team's expertise
and facilitate further collaboration.

Thanks

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Fri, 1 Mar 2024 at 19:59, Pavan Kotikalapudi
 wrote:

> Thanks Bhuwan and rest of the databricks team for the reviews,
>
> I appreciate your reviews; they were very helpful in evaluating a few
> options that were overlooked earlier (especially about mixed Spark apps
> running on notebooks). Regarding the use cases, it could handle multiple
> streaming queries provided they run on the same processing-time trigger
> interval (very similar to how the current batch DRA is set up), but I feel
> it would be beneficial to separate out streaming queries when setting up
> production pipelines.
>
> Regarding the implementation, here is the draft PR
> https://github.com/apache/spark/pull/42352. (already mentioned in ticket
> SPARK-24815 <https://issues.apache.org/jira/browse/SPARK-24815>)
>
> I have built it on top of the current Dynamic Resource Allocation (DRA)
> algorithm
> <https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>.
> While the current DRA is geared towards batch jobs, this implementation
> makes a few changes to that algorithm to:
> - do gradual scale-back. The remove policy still applies (it uses 2 of the
> existing configs), but we now remove a few executors per round of
> evaluation (I have added 2 configs to tune that);
> - keep the scale-out process on the same request policy (again using 2 of
> the existing configs);
> - while we reuse the old configs for both scale-out and scale-back, the
> difference is that we now derive their values from the trigger interval,
> which is our north star.
>
> This implementation only changes 2 files, and I have kept the changes
> minimal and limited to the core module of the Spark repo:
> 1) to make sure it is applied on the same primitives (task, stage, job)
> that the current DRA already works on (this will let us think about other
> cases later; for example, the default and continuous trigger modes could
> still work provided we have a target processing-time range we want to
> achieve);
> 2) we reuse ExecutorAllocationClient, ExecutorMonitor and the listeners,
> which are already well tested and working well for the batch job use case.
>
> Internally (in the company) we have also added helpers so that there are
> fewer configs to tune. I can contribute that as well if it makes the dev
> experience better.
>
> Feel free to review the PR; once we agree the direction is right, I will
> start adding the tests as well.
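
(A minimal sketch for anyone who wants to experiment: the batch DRA baseline
that the proposal reuses is driven by the existing Spark configs below. The
values and the application name are illustrative only; the new
streaming-specific knobs themselves live in the draft PR linked above.)

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dra-baseline-sketch")  # hypothetical app name
    # existing batch DRA switches that the streaming proposal reuses
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # request (scale-out) policy knobs
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
    .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
    # remove (scale-back) policy knobs
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "120s")
    .getOrCreate()
)

(The proposal keeps these request/remove policies but derives their values
from the trigger interval, as described above.)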
>
> On a side note, maybe we should consider, as future work, having a DRA
> algorithm per query (batch, streaming, mixed, etc.) rather than per Spark
> context.
>
> Thank you,
>
> Pavan
>
>
> On Fri, Mar 1, 2024 at 9:06 AM Bhuwan Sahni
>  wrote:
>
>> Hi Pavan,
>>
>> I am from the DataBricks Structured Streaming team, and we did a review
>> of the SPIP internally. Wanted to pass on the points discussed in the
>> meeting.
>>
>> Thanks for putting together the SPIP document. It's useful to have
>> dynamic resource allocation for Streaming queries, and it's exciting to see
>> the cost saving numbers from your PoC. However, in general we discovered
>

Please unlock Jira ticket for SPARK-24815, Dynamic resource allocation for structured streaming

2024-02-26 Thread Mich Talebzadeh
Hi,

Can a committer please unlock this SPIP? It is for dynamic resource
allocation for Structured Streaming and has received 6 votes. It was locked
because of inactivity by GitHub Actions.

[SPARK-24815] Structured Streaming should support dynamic allocation - ASF
JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-24815>

For now I have volunteered to mentor the team until a committer volunteers
to take it over. Hopefully this will not be too strenuous.

Thanks

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


Proposal about moving on from the Shepherd terminology in SPIPs

2024-02-23 Thread Mich Talebzadeh
We had a discussion a few hours ago about getting a Shepherd to assist with
the Structured Streaming SPIP.

As an active member, I am proposing that we replace the current
terminology "SPIP Shepherd" with the more respectful and inclusive term
"SPIP Mentor." Over the past few years we have tried to replace some past
terminologies with more acceptable ones.

While some may not find "Shepherd" offensive, it can unintentionally imply
passivity or dependence on community members, which might not accurately
reflect their expertise and contributions. Additionally, the shepherd-sheep
dynamic might be interpreted as hierarchical, which does not align with the
collaborative and open nature of the Spark community.

*"SPIP Mentor"* better emphasizes the collaborative nature of the process,
focusing on supporting and guiding members while respecting their strengths
and contributions. It also avoids any potentially offensive or hierarchical
connotations.
It would be great if you could share your thoughts, consider this proposal,
and discuss any potential challenges or solutions during the transition
period in the SPIP process (assuming we accept this or another alternative
proposal).

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-23 Thread Mich Talebzadeh
Hi Pavan and those who kindly voted for this SPIP,

Great to have 6+ votes and no -1s or 0s. The critical mass is there. The
rest is an admin matter of how to drive the project forward, and yes, there
is more than one way of skinning the cat. I think we need some flexibility
in the rules given the dwindling (IMO) number of committers who are willing
to or actively participating. For example, on a similar matter I approached
Cody Koeninger, who was one of the founders of Spark Streaming, to shepherd
a project almost a year back. Sadly he is no longer active and, quoting
him, "I haven't been involved lately and would be missing a lot of
context." So we need to improvise and see how best we can drive this and
similar efforts. Let us wait a short while for a response; otherwise I am
happy to give a hand if needed and work with you guys to drive this. It is
something worthwhile.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Fri, 23 Feb 2024 at 17:41, Pavan Kotikalapudi
 wrote:

> Thanks for the pointers Mich, will wait for Jungtaek Lee or any other PMC
> members to respond.
>
> aggregating upvotes to this email thread
>
> +6
> Mich Talebzadeh
> Adam Hobbs
> Pavan Kotikalapudi
> Krystal Mitchell
> Sona Torosyan
> Aaron Kern
>
> Thank you,
>
> Pavan
>
> On Thu, Feb 22, 2024 at 3:07 PM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> please check this doc
>>
>> Spark Project Improvement Proposals (SPIP) | Apache Spark
>> <https://urldefense.com/v3/__https://spark.apache.org/improvement-proposals.html__;!!NCc8flgU!dJHLBpsBdsmdGt7dGsV2kyUhjpah0Z3g27vaxbmk2IA8gKdE4x_RgGK9V4wFOK7k2sZNMxzBz_9MHb9C5YHtjL5qy0rbHA$>
>>
>> and specifically the below extract
>>
>> Discussing an SPIP
>>
>> All discussion of an SPIP should take place in a public forum, preferably
>> the discussion attached to the Jira. Any discussions that happen offline
>> should be made available online for the public via meeting notes
>> summarizing the discussions.(done)
>>
>> During this discussion, one or more shepherds should be identified among
>> PMC members. (outstanding)
>>
>> Once the discussion settles, the shepherd(s) should call for a vote on
>> the SPIP moving forward on the dev@ list. The vote should be open for at
>> least 72 hours and follows the typical Apache vote process and passes upon
>> consensus (at least 3 +1 votes from PMC members and no -1 votes from PMC
>> members). dev@ should be notified of the vote result.
>>
>> If there does not exist at least one PMC member that is committed to
>> shepherding the change within a month, the SPIP is rejected.
>>
>> If a committer does not think a SPIP aligns with long-term project goals,
>> or is not practical at the point of proposal, the committer should -1 the
>> SPIP explicitly and give technical justifications.
>> OK a shepherd from PMC members is required. Maybe Jungtaek Lee can kindly
>> help the process
>>
>> cheers
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!dJHLBpsBdsmdGt7dGsV2kyUhjpah0Z3g27vaxbmk2IA8gKdE4x_RgGK9V4wFOK7k2sZNMxzBz_9MHb9C5YHtjL6nGmLi3g$>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!dJHLBpsBdsmdGt7dGsV2kyUhjpah0Z3g27vaxbmk2IA8gKdE4x_RgGK9V4wFOK7k2sZNMxzBz_9MHb9C5YHtjL5rLq6E3w$>
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://urldefense.com/v3/__https://en.wikipedia.org/wiki/Wernher_von_Braun__;!!NCc8flgU!dJHLBpsBdsmdGt7dGsV2kyUhjpah0Z3g27vaxbmk2IA8gKdE4x_RgGK9V4wFOK7k2sZNMxzBz_9MHb9C5YHtjL4exCs1_Q$>Von
>> Braun
>> <https://urldefense.com/v3/__https://en.wikipedia.org/wiki/

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-23 Thread Mich Talebzadeh
+1 for me

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Fri, 23 Feb 2024 at 16:05, Aaron Kern  wrote:

> +1
>


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-22 Thread Mich Talebzadeh
Hi,

please check this doc

Spark Project Improvement Proposals (SPIP) | Apache Spark
<https://spark.apache.org/improvement-proposals.html>

and specifically the below extract

Discussing an SPIP

All discussion of an SPIP should take place in a public forum, preferably
the discussion attached to the Jira. Any discussions that happen offline
should be made available online for the public via meeting notes
summarizing the discussions.(done)

During this discussion, one or more shepherds should be identified among
PMC members. (outstanding)

Once the discussion settles, the shepherd(s) should call for a vote on the
SPIP moving forward on the dev@ list. The vote should be open for at least
72 hours and follows the typical Apache vote process and passes upon
consensus (at least 3 +1 votes from PMC members and no -1 votes from PMC
members). dev@ should be notified of the vote result.

If there does not exist at least one PMC member that is committed to
shepherding the change within a month, the SPIP is rejected.

If a committer does not think a SPIP aligns with long-term project goals,
or is not practical at the point of proposal, the committer should -1 the
SPIP explicitly and give technical justifications.
OK a shepherd from PMC members is required. Maybe Jungtaek Lee can kindly
help the process

cheers

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Thu, 22 Feb 2024 at 21:52, Pavan Kotikalapudi
 wrote:

> Hi Mich,
>
> We have
>
> five  +1s till now.
>
> Mich Talebzadeh
> Adam Hobbs
> Pavan Kotikalapudi
> Krystal Mitchell
> Sona Torosyan
> (few more in github pr)
> +0: None
>
> -1: None
>
> Does it pass the required condition as approved?
>
>
> Not sure of that though, nothing about minimum required is mentioned in
> the past emails.
>
> I would request spark PMC members or any others who have done this in the
> past to understand the process better.
>
> Thank you,
>
> Pavan
>
> On Thu, Feb 22, 2024 at 3:20 AM Mich Talebzadeh 
> wrote:
>
>> Hi Pavan,
>>
>> Do you have a list of votes for this feature by any chance? Does it pass
>> the required condition as approved?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!d1kZcsoBaeESUOMsb65wLw8dWRZEP3M2DyjVC4M4ie4NbCcMm9jETo-zSzhl3hcGLSFKRzsfReUfos7lbV5t0A1aYWcDAg$>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!d1kZcsoBaeESUOMsb65wLw8dWRZEP3M2DyjVC4M4ie4NbCcMm9jETo-zSzhl3hcGLSFKRzsfReUfos7lbV5t0A0gQVKWXw$>
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://urldefense.com/v3/__https://en.wikipedia.org/wiki/Wernher_von_Braun__;!!NCc8flgU!d1kZcsoBaeESUOMsb65wLw8dWRZEP3M2DyjVC4M4ie4NbCcMm9jETo-zSzhl3hcGLSFKRzsfReUfos7lbV5t0A0P4WA5mw$>Von
>> Braun
>> <https://urldefense.com/v3/__https://en.wikipedia.org/wiki/Wernher_von_Braun__;!!NCc8flgU!d1kZcsoBaeESUOMsb65wLw8dWRZEP3M2DyjVC4M4ie4NbCcMm9jETo-zSzhl3hcGLSFKRzsfReUfos7lbV5t0A0P4WA5mw$>
>> )".
>>
>>
>> On Thu, 22 Feb 2024 at 10:04, Pavan Kotikalapudi
>>  wrote:
>>
>>> Yes. The PR was closed due to inactivity by github actions..
>>>
>>> The msg
>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352*issuecomment-1865306284__;Iw!!NCc8flgU!d1kZcsoBaeESUOMsb65wLw8dWRZEP3M2DyjVC4M4ie4NbCcMm9jETo-zSzhl3hcGLSFKRzsfReUfos7lbV5t0A113artKQ$>
>>>  also
>>> says
>>>
>>> > If you'd like to revive this PR, please reopen it and ask a committer
>>> to remove the Stale tag!
>>>
>>> On Thu, Feb 22, 2024 at 1:09 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-22 Thread Mich Talebzadeh
Hi Pavan,

Do you have a list of votes for this feature by any chance? Does it pass
the required condition as approved?

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Thu, 22 Feb 2024 at 10:04, Pavan Kotikalapudi
 wrote:

> Yes. The PR was closed due to inactivity by github actions..
>
> The msg
> <https://github.com/apache/spark/pull/42352#issuecomment-1865306284> also
> says
>
> > If you'd like to revive this PR, please reopen it and ask a committer to
> remove the Stale tag!
>
> On Thu, Feb 22, 2024 at 1:09 AM Mich Talebzadeh 
> wrote:
>
>> I can see it was closed. Was it because of inactivity?
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ay85y5IRZ-bv2v2dR8HP7lChTidWLK_bsLQVbOqng9bwhC30-WY-SKIUNTIJCJaVCLHGgHDJOCmJ11L9pU6yO7lCFDAOXA$>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ay85y5IRZ-bv2v2dR8HP7lChTidWLK_bsLQVbOqng9bwhC30-WY-SKIUNTIJCJaVCLHGgHDJOCmJ11L9pU6yO7kBRUgBOQ$>
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://urldefense.com/v3/__https://en.wikipedia.org/wiki/Wernher_von_Braun__;!!NCc8flgU!ay85y5IRZ-bv2v2dR8HP7lChTidWLK_bsLQVbOqng9bwhC30-WY-SKIUNTIJCJaVCLHGgHDJOCmJ11L9pU6yO7lSMcDbbg$>Von
>> Braun
>> <https://urldefense.com/v3/__https://en.wikipedia.org/wiki/Wernher_von_Braun__;!!NCc8flgU!ay85y5IRZ-bv2v2dR8HP7lChTidWLK_bsLQVbOqng9bwhC30-WY-SKIUNTIJCJaVCLHGgHDJOCmJ11L9pU6yO7lSMcDbbg$>
>> )".
>>
>>
>> On Thu, 22 Feb 2024 at 06:58, Pavan Kotikalapudi
>>  wrote:
>>
>>> Hi Spark PMC members,
>>>
>>> I think we have few upvotes for this effort here and more people are
>>> showing interest (see  PR comments
>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352*issuecomment-1955238640__;Iw!!NCc8flgU!ay85y5IRZ-bv2v2dR8HP7lChTidWLK_bsLQVbOqng9bwhC30-WY-SKIUNTIJCJaVCLHGgHDJOCmJ11L9pU6yO7k0wc9hCg$>
>>> .)
>>>
>>> Is anyone interested in mentoring and reviewing this effort?
>>>
>>> Also can the repository admin/owner re-open the PR?  ( I guess people
>>> only with admin access to the repository can do that).
>>>
>>> Thank you,
>>>
>>> Pavan
>>>
>>> On Tue, Feb 20, 2024 at 2:08 PM Krystal Mitchell
>>>  wrote:
>>>
>>>> +1
>>>>
>>>> On 2024/01/17 17:49:32 Pavan Kotikalapudi wrote:
>>>> > Thanks for proposing and voting for the feature Mich.
>>>> >
>>>> > adding some references to the thread.
>>>> >
>>>> >- Jira ticket - SPARK-24815
>>>> ><https://issues.apache.org/jira/browse/SPARK-24815>
>>>> <https://urldefense.com/v3/__https://issues.apache.org/jira/browse/SPARK-24815*3E__;JQ!!NCc8flgU!b8v0cnobIeWmrtrGvm7r3lY83cOCZBDfHYW8xGj1tzG-9XYCnzsQoebrCmyMCJBXU52BSm3phgntc1HXve-r64f0rbw$>
>>>> >- Design Doc
>>>> ><
>>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing>
>>>> <https://urldefense.com/v3/__https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing*3E__;JQ!!NCc8flgU!b8v0cnobIeWmrtrGvm7r3lY83cOCZBDfHYW8xGj1tzG-9XYCnzsQoebrCmyMCJBXU52BSm3phgntc1HXve-r44a1rO8$>
>>>> >
>>>> >- discussion thread
>>>> ><https://lists.apache.org/thread/9yx0jnk9h1234joymwlzfx2gh2m8b9bo>
>>>> <https://urldefense.com/v3/__https://lists.apache.org/thread/9yx0jnk9h1234joymwlzfx2g

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-22 Thread Mich Talebzadeh
I can see it was closed. Was it because of inactivity?


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Thu, 22 Feb 2024 at 06:58, Pavan Kotikalapudi
 wrote:

> Hi Spark PMC members,
>
> I think we have a few upvotes for this effort here and more people are
> showing interest (see the PR comments
> <https://github.com/apache/spark/pull/42352#issuecomment-1955238640>).
>
> Is anyone interested in mentoring and reviewing this effort?
>
> Also, can the repository admin/owner re-open the PR? (I guess only people
> with admin access to the repository can do that.)
>
> Thank you,
>
> Pavan
>
> On Tue, Feb 20, 2024 at 2:08 PM Krystal Mitchell
>  wrote:
>
>> +1
>>
>> On 2024/01/17 17:49:32 Pavan Kotikalapudi wrote:
>> > Thanks for proposing and voting for the feature Mich.
>> >
>> > adding some references to the thread.
>> >
>> >- Jira ticket - SPARK-24815
>> ><https://issues.apache.org/jira/browse/SPARK-24815>
>> <https://urldefense.com/v3/__https://issues.apache.org/jira/browse/SPARK-24815*3E__;JQ!!NCc8flgU!b8v0cnobIeWmrtrGvm7r3lY83cOCZBDfHYW8xGj1tzG-9XYCnzsQoebrCmyMCJBXU52BSm3phgntc1HXve-r64f0rbw$>
>> >- Design Doc
>> ><
>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing>
>> <https://urldefense.com/v3/__https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing*3E__;JQ!!NCc8flgU!b8v0cnobIeWmrtrGvm7r3lY83cOCZBDfHYW8xGj1tzG-9XYCnzsQoebrCmyMCJBXU52BSm3phgntc1HXve-r44a1rO8$>
>> >
>> >- discussion thread
>> ><https://lists.apache.org/thread/9yx0jnk9h1234joymwlzfx2gh2m8b9bo>
>> <https://urldefense.com/v3/__https://lists.apache.org/thread/9yx0jnk9h1234joymwlzfx2gh2m8b9bo*3E__;JQ!!NCc8flgU!b8v0cnobIeWmrtrGvm7r3lY83cOCZBDfHYW8xGj1tzG-9XYCnzsQoebrCmyMCJBXU52BSm3phgntc1HXve-rkLpTOYM$>
>> >- PR with initial implementation -
>> >https://github.com/apache/spark/pull/42352
>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352__;!!NCc8flgU!b8v0cnobIeWmrtrGvm7r3lY83cOCZBDfHYW8xGj1tzG-9XYCnzsQoebrCmyMCJBXU52BSm3phgntc1HXve-rZAZFOls$>
>> >
>> > Please vote with:
>> >
>> > [ ] +1: Accept the proposal and start with the development.
>> > [ ] +0
>> > [ ] -1: I don’t think this is a good idea because …
>> >
>> > Thank you,
>> >
>> > Pavan
>> >
>> > On Wed, Jan 17, 2024 at 9:52 PM Mich Talebzadeh 
>> > wrote:
>> >
>> > >
>> > > +1 for me  (non binding)
>> > >
>> > >
>> > >
>> > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any
>> > > loss, damage or destruction of data or any other property which may
>> arise
>> > > from relying on this email's technical content is explicitly
>> disclaimed.
>> > > The author will in no case be liable for any monetary damages arising
>> from
>> > > such loss, damage or destruction.
>> > >
>> > >
>> > >
>> >
>>
>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Mich Talebzadeh
Ok thanks for your clarifications

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 19 Feb 2024 at 17:24, Chao Sun  wrote:

> Hi Mich,
>
> > Also have you got some benchmark results from your tests that you can
> possibly share?
>
> We only have some partial benchmark results internally so far. Once
> shuffle and better memory management have been introduced, we plan to
> publish the benchmark results (at least TPC-H) in the repo.
>
> > Compared to standard Spark, what kind of performance gains can be
> expected with Comet?
>
> Currently, users could benefit from Comet in a few areas:
> - Parquet read: a few improvements have been made against reading from S3
> in particular, so users can expect better scan performance in this scenario
> - Hash aggregation
> - Columnar shuffle
> - Decimals (Java's BigDecimal is pretty slow)
>
> > Can one use Comet on k8s in conjunction with something like a Volcano
> addon?
>
> I think so. Comet is mostly orthogonal to the Spark scheduler framework.
>
> Chao
>
>
>
>
>
>
> On Fri, Feb 16, 2024 at 5:39 AM Mich Talebzadeh 
> wrote:
>
>> Hi Chao,
>>
>> As a cool feature
>>
>>
>>- Compared to standard Spark, what kind of performance gains can be
>>expected with Comet?
>>-  Can one use Comet on k8s in conjunction with something like a
>>Volcano addon?
>>
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge, sourced from both personal expertise and other resources but of
>> course cannot be guaranteed . It is essential to note that, as with any
>> advice, one verified and tested result holds more weight than a thousand
>> expert opinions.
>>
>>
>> On Tue, 13 Feb 2024 at 20:42, Chao Sun  wrote:
>>
>>> Hi all,
>>>
>>> We are very happy to announce that Project Comet, a plugin to
>>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>>> has now been open sourced under the Apache Arrow umbrella. Please
>>> check the project repo
>>> https://github.com/apache/arrow-datafusion-comet for more details if
>>> you are interested. We'd love to collaborate with people from the open
>>> source community who share similar goals.
>>>
>>> Thanks,
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: ASF board report draft for February

2024-02-18 Thread Mich Talebzadeh
Np, thanks for addressing the point promptly

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, one test result is worth one-thousand expert
opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Sun, 18 Feb 2024 at 17:22, Matei Zaharia  wrote:

> Thanks for the clarification. I updated it to say Comet is in the process
> of being open sourced.
>
> On Feb 18, 2024, at 1:55 AM, Mich Talebzadeh 
> wrote:
>
> Hi Matei,
>
> With regard to your last point
>
> "- Project Comet, a plugin designed to accelerate Spark query execution by
> leveraging DataFusion and Arrow, has been open-sourced under the Apache
> Arrow project. For more information, visit
> https://github.com/apache/arrow-datafusion-comet."
>
> If my understanding is correct (as of 15th February), I don't think the
> full project is open sourced yet, and I quote a response from the thread
> owner Chao Sun:
>
> "Note that we haven't open sourced several features yet including shuffle
> support, which the aggregate operation depends on. Please stay tuned!"
>
> I would be inclined to leave that line out for now. The rest is fine.
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, one verified and tested result holds more weight
> than a thousand expert opinions.
>
>
> On Sat, 17 Feb 2024 at 19:23, Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> I missed some reminder emails about our board report this month, but here
>> is my draft. I’ll submit it tomorrow if that’s ok.
>>
>> ==
>>
>> Issues for the board:
>>
>> - None
>>
>> Project status:
>>
>> - We made two patch releases: Spark 3.3.4 (EOL release) on December 16,
>> 2023, and Spark 3.4.2 on November 30, 2023.
>> - We have begun voting for a Spark 3.5.1 maintenance release.
>> - The vote on "SPIP: Structured Streaming - Arbitrary State API v2" has
>> passed.
>> - We transitioned to an ASF-hosted analytics service, Matomo. For
>> details, visit
>> https://analytics.apache.org/index.php?module=CoreHome=index=yesterday=day=40
>> .
>> - Project Comet, a plugin designed to accelerate Spark query execution by
>> leveraging DataFusion and Arrow, has been open-sourced under the Apache
>> Arrow project. For more information, visit
>> https://github.com/apache/arrow-datafusion-comet.
>>
>> Trademarks:
>>
>> - No changes since the last report.
>>
>> Latest releases:
>>
>> - Spark 3.3.4 was released on December 16, 2023
>> - Spark 3.4.2 was released on November 30, 2023
>> - Spark 3.5.0 was released on September 13, 2023
>>
>> Committers and PMC:
>>
>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
>> Yikun Jiang).
>>
>> ==
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: ASF board report draft for February

2024-02-18 Thread Mich Talebzadeh
Hi Matei,

With regard to your last point

"- Project Comet, a plugin designed to accelerate Spark query execution by
leveraging DataFusion and Arrow, has been open-sourced under the Apache
Arrow project. For more information, visit
https://github.com/apache/arrow-datafusion-comet."

If my understanding is correct (as of 15th February), I don't think the
full project is open sourced yet, and I quote a response from the thread
owner Chao Sun:

"Note that we haven't open sourced several features yet including shuffle
support, which the aggregate operation depends on. Please stay tuned!"

I would be inclined to leave that line out for now. The rest is fine.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, one verified and tested result holds more weight
than a thousand expert opinions.


On Sat, 17 Feb 2024 at 19:23, Matei Zaharia  wrote:

> Hi all,
>
> I missed some reminder emails about our board report this month, but here
> is my draft. I’ll submit it tomorrow if that’s ok.
>
> ==
>
> Issues for the board:
>
> - None
>
> Project status:
>
> - We made two patch releases: Spark 3.3.4 (EOL release) on December 16,
> 2023, and Spark 3.4.2 on November 30, 2023.
> - We have begun voting for a Spark 3.5.1 maintenance release.
> - The vote on "SPIP: Structured Streaming - Arbitrary State API v2" has
> passed.
> - We transitioned to an ASF-hosted analytics service, Matomo. For details,
> visit
> https://analytics.apache.org/index.php?module=CoreHome=index=yesterday=day=40
> .
> - Project Comet, a plugin designed to accelerate Spark query execution by
> leveraging DataFusion and Arrow, has been open-sourced under the Apache
> Arrow project. For more information, visit
> https://github.com/apache/arrow-datafusion-comet.
>
> Trademarks:
>
> - No changes since the last report.
>
> Latest releases:
>
> - Spark 3.3.4 was released on December 16, 2023
> - Spark 3.4.2 was released on November 30, 2023
> - Spark 3.5.0 was released on September 13, 2023
>
> Committers and PMC:
>
> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and
> Yikun Jiang).
>
> ==
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-16 Thread Mich Talebzadeh
Hi Chao,

As a cool feature


   - Compared to standard Spark, what kind of performance gains can be
   expected with Comet?
   -  Can one use Comet on k8s in conjunction with something like a Volcano
   addon?


HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge, sourced from both personal expertise and other resources but of
course cannot be guaranteed . It is essential to note that, as with any
advice, one verified and tested result holds more weight than a thousand
expert opinions.


On Tue, 13 Feb 2024 at 20:42, Chao Sun  wrote:

> Hi all,
>
> We are very happy to announce that Project Comet, a plugin to
> accelerate Spark query execution via leveraging DataFusion and Arrow,
> has now been open sourced under the Apache Arrow umbrella. Please
> check the project repo
> https://github.com/apache/arrow-datafusion-comet for more details if
> you are interested. We'd love to collaborate with people from the open
> source community who share similar goals.
>
> Thanks,
> Chao
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-15 Thread Mich Talebzadeh
Hi,

I gather from the replies that the plugin is not yet available in the form
expected, although I am aware of the shell script.

Also, do you have some benchmark results from your tests that you could
share?

Thanks,

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge, sourced from both personal expertise and other resources but of
course cannot be guaranteed . It is essential to note that, as with any
advice, one verified and tested result holds more weight than a thousand
expert opinions.


On Thu, 15 Feb 2024 at 01:18, Chao Sun  wrote:

> Hi Praveen,
>
> We will add a "Getting Started" section in the README soon, but basically
> comet-spark-shell
> <https://github.com/apache/arrow-datafusion-comet/blob/main/bin/comet-spark-shell>
>  in
> the repo should provide a basic tool to build Comet and launch a Spark
> shell with it.
>
> Note that we haven't open sourced several features yet including shuffle
> support, which the aggregate operation depends on. Please stay tuned!
>
> Chao
>
>
> On Wed, Feb 14, 2024 at 2:44 PM praveen sinha 
> wrote:
>
>> Hi Chao,
>>
>> Is there any example app/gist/repo which can help me use this plugin. I
>> wanted to try out some realtime aggregate performance on top of parquet and
>> spark dataframes.
>>
>> Thanks and Regards
>> Praveen
>>
>>
>> On Wed, Feb 14, 2024 at 9:20 AM Chao Sun  wrote:
>>
>>> > Out of interest what are the differences in the approach between this
>>> and Glutten?
>>>
>>> Overall they are similar, although Gluten supports multiple backends
>>> including Velox and Clickhouse. One major difference is (obviously)
>>> Comet is based on DataFusion and Arrow, and written in Rust, while
>>> Gluten is mostly C++.
>>> I haven't looked very deep into Gluten yet, but there could be other
>>> differences such as how strictly the engine follows Spark's semantics,
>>> table format support (Iceberg, Delta, etc), fallback mechanism
>>> (coarse-grained fallback on stage level or more fine-grained fallback
>>> within stages), UDF support (Comet hasn't started on this yet),
>>> shuffle support, memory management, etc.
>>>
>>> Both engines are backed by very strong and vibrant open source
>>> communities (Velox, Clickhouse, Arrow & DataFusion) so it's very
>>> exciting to see how the projects will grow in future.
>>>
>>> Best,
>>> Chao
>>>
>>> On Tue, Feb 13, 2024 at 10:06 PM John Zhuge  wrote:
>>> >
>>> > Congratulations! Excellent work!
>>> >
>>> > On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
>>> >>
>>> >> Absolutely thrilled to see the project going open-source! Huge
>>> congrats to Chao and the entire team on this milestone!
>>> >>
>>> >> Yufei
>>> >>
>>> >>
>>> >> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>>> >>>
>>> >>> Hi all,
>>> >>>
>>> >>> We are very happy to announce that Project Comet, a plugin to
>>> >>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>>> >>> has now been open sourced under the Apache Arrow umbrella. Please
>>> >>> check the project repo
>>> >>> https://github.com/apache/arrow-datafusion-comet for more details if
>>> >>> you are interested. We'd love to collaborate with people from the
>>> open
>>> >>> source community who share similar goals.
>>> >>>
>>> >>> Thanks,
>>> >>> Chao
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>> >
>>> >
>>> > --
>>> > John Zhuge
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Mich Talebzadeh
Sure, thanks for the clarification. I gather what you are alluding to is
that, in a distributed environment, operations that involve shuffling or
repartitioning of data give no guarantee about the order in which the data
is processed across partitions. So when repartitioning a DataFrame, the data
is redistributed across partitions, each partition may process its portion
of the data independently, and that is what makes debugging distributed
systems challenging.

I hope that makes sense.

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 13 Feb 2024 at 21:25, Jack Goodson  wrote:

> Apologies if it wasn't clear, I was meaning the difficulty of debugging,
> not floating point precision :)
>
> On Wed, Feb 14, 2024 at 2:03 AM Mich Talebzadeh 
> wrote:
>
>> Hi Jack,
>>
>> "  most SQL engines suffer from the same issue... ""
>>
>> Sure. This behavior is not a bug, but rather a consequence of the
>> limitations of floating-point precision. The numbers involved in the
>> example (see SPIP [SPARK-47024] Sum of floats/doubles may be incorrect
>> depending on partitioning - ASF JIRA (apache.org)
>> <https://issues.apache.org/jira/browse/SPARK-47024> exceed the precision
>> of the double-precision floating-point representation used by default in
>> Spark and others Interesting to have a look and test the code
>>
>> This is the code
>>
>> SUM_EXAMPLE = [
>>     (1.0,),
>>     (0.0,),
>>     (1.0,),
>>     (9007199254740992.0,),
>> ]
>>
>> spark = (
>>     SparkSession.builder
>>     .config("spark.log.level", "ERROR")
>>     .getOrCreate()
>> )
>>
>> def compare_sums(data, num_partitions):
>>     df = spark.createDataFrame(data, "val double").coalesce(1)
>>     result1 = df.agg(sum(col("val"))).collect()[0][0]
>>     df = spark.createDataFrame(data, "val double").repartition(num_partitions)
>>     result2 = df.agg(sum(col("val"))).collect()[0][0]
>>     assert result1 == result2, f"{result1}, {result2}"
>>
>> if __name__ == "__main__":
>>     print(compare_sums(SUM_EXAMPLE, 2))
>> In Python, floating-point numbers are implemented using the IEEE 754
>> standard,
>> <https://stackoverflow.com/questions/73340696/how-is-pythons-decimal-and-other-precise-decimal-libraries-implemented-and-wh>which
>> has a limited precision. When one performs operations with very large
>> numbers or numbers with many decimal places, one may encounter precision
>> errors.
>>
>>     print(compare_sums(SUM_EXAMPLE, 2))
>>   File "issue01.py", line 23, in compare_sums
>>     assert result1 == result2, f"{result1}, {result2}"
>> AssertionError: 9007199254740994.0, 9007199254740992.0
>> In the aforementioned case, the result of the aggregation (sum) is
>> affected by the precision limits of floating-point representation. The
>> difference between 9007199254740994.0, 9007199254740992.0. is within the
>> expected precision limitations of double-precision floating-point numbers.
>>
>> The likely cause in this scenario in this example
>>
>> When one performs an aggregate operation like sum on a DataFrame, the
>> operation may be affected by the order of the data.and the case here, the
>> order of data can be influenced by the number of partitions in
>> Spark..result2 above creates a new DataFrame df with the same data but
>> explicitly repartition it into two partitions
>> (repartition(num_partitions)). Repartitioning can shuffle the data
>> across partitions, introducing a different order for the subsequent
>> aggregation. The sum operation is then performed on the data in a
>> different order, leading to a slightly different result from result1
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which 

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Mich Talebzadeh
Hi Jack,

"  most SQL engines suffer from the same issue... ""

Sure. This behavior is not a bug, but rather a consequence of the
limitations of floating-point precision. The numbers involved in the
example (see [SPARK-47024] Sum of floats/doubles may be incorrect
depending on partitioning - ASF JIRA (apache.org)
<https://issues.apache.org/jira/browse/SPARK-47024>) exceed the precision of
the double-precision floating-point representation used by default in Spark
and elsewhere. It is interesting to have a look and test the code.

This is the code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

SUM_EXAMPLE = [
    (1.0,),
    (0.0,),
    (1.0,),
    (9007199254740992.0,),
]

spark = (
    SparkSession.builder
    .config("spark.log.level", "ERROR")
    .getOrCreate()
)

def compare_sums(data, num_partitions):
    # result1: all values summed within a single partition
    df = spark.createDataFrame(data, "val double").coalesce(1)
    result1 = df.agg(sum(col("val"))).collect()[0][0]
    # result2: the same values summed after repartitioning
    df = spark.createDataFrame(data, "val double").repartition(num_partitions)
    result2 = df.agg(sum(col("val"))).collect()[0][0]
    assert result1 == result2, f"{result1}, {result2}"

if __name__ == "__main__":
    print(compare_sums(SUM_EXAMPLE, 2))

In Python, floating-point numbers are implemented using the IEEE 754 standard
<https://stackoverflow.com/questions/73340696/how-is-pythons-decimal-and-other-precise-decimal-libraries-implemented-and-wh>,
which has limited precision. When one performs operations with very large
numbers or numbers with many decimal places, one may encounter precision
errors.

Running the script fails with:

    print(compare_sums(SUM_EXAMPLE, 2))
  File "issue01.py", line 23, in compare_sums
    assert result1 == result2, f"{result1}, {result2}"
AssertionError: 9007199254740994.0, 9007199254740992.0
In the aforementioned case, the result of the aggregation (sum) is affected
by the precision limits of floating-point representation. The difference
between 9007199254740994.0 and 9007199254740992.0 is within the expected
precision limitations of double-precision floating-point numbers.

The likely cause in this example:

When one performs an aggregate operation like sum on a DataFrame, the
operation may be affected by the order of the data, and in this case the
order of the data can be influenced by the number of partitions in Spark.
result2 above creates a new DataFrame df with the same data but explicitly
repartitions it into two partitions (repartition(num_partitions)).
Repartitioning can shuffle the data across partitions, introducing a
different order for the subsequent aggregation. The sum operation is then
performed on the data in a different order, leading to a slightly different
result from result1.
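
A quick way to see the same effect outside Spark, in plain Python (nothing
assumed here beyond standard IEEE 754 doubles): 2**53 = 9007199254740992 is
the point at which doubles stop representing every integer exactly, so the
order of the additions changes the rounded result.

a = 9007199254740992.0  # 2**53

print(a + 1.0 + 1.0)   # 9007199254740992.0 -- each +1.0 is rounded away
print(1.0 + 1.0 + a)   # 9007199254740994.0 -- 2.0 + 2**53 is representable

print(a + 1.0 + 1.0 == 1.0 + 1.0 + a)  # False: float addition is not associative

This is exactly why the single-partition sum and the repartitioned sum above
can legitimately differ.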

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 13 Feb 2024 at 03:06, Jack Goodson  wrote:

> I may be ignorant of other debugging methods in Spark but the best success
> I've had is using smaller datasets (if runs take a long time) and adding
> intermediate output steps. This is quite different from application
> development in non-distributed systems where a debugger is trivial to
> attach but I believe it's one of the trade offs on using a system like
> Spark for data processing, most SQL engines suffer from the same issue. If
> you do believe there is a bug in Spark using the explain function like
> Herman mentioned helps as well as looking at the Spark plan in the Spark UI
>
> On Tue, Feb 13, 2024 at 9:24 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> OK, I figured it out. The details are in SPARK-47024
>> <https://issues.apache.org/jira/browse/SPARK-47024> for anyone who’s
>> interested.
>>
>> It turned out to be a floating point arithmetic “bug”. The main reason I
>> was able to figure it out was because I’ve been investigating another,
>> unrelated bug (a real bug) related to floats, so these weird float corner
>> cases have been top of mind.
>>
>> If it weren't for that, I wonder how much progress I would have made.
>> Though I could inspect the generated code, I couldn’t figure out how to get
>> logging statements placed in the generated code to print somewhere I could
>> see them.
>>
>> Depending on how often we find ourselves debugging aggregates like this,
>> it would be really helpful if we added some way to trace the aggregation
>> buffer.
>>

Re: Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
The full code is available from the link below

https://github.com/michTalebzadeh/Event_Driven_Real_Time_data_processor_with_SSS_and_API_integration

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 9 Feb 2024 at 16:16, Mich Talebzadeh 
wrote:

> I would appreciate your thoughts on this. Personally, I think Spark
> Structured Streaming can be used effectively in an Event Driven
> Architecture (as well as in continuous streaming).
>
> From the link here
> <https://www.linkedin.com/posts/activity-7161748945801617409-v29V?utm_source=share_medium=member_desktop>
>
> HTH,
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Pyspark Write Batch Streaming Data to Snowflake Fails with more columns

2024-02-09 Thread Mich Talebzadeh
Hi Varun,

I am no expert on Snowflake, however, the issue you are facing,
particularly if it involves data trimming in a COPY statement and potential
data mismatch, is likely related to how Snowflake handles data ingestion
rather than being directly tied to PySpark. The COPY command in Snowflake
is used to load data from external files (like those in s3) into Snowflake
tables. Possible causes for data truncation or mismatch could include
differences in data types, column lengths, or encoding between your source
data and the Snowflake table schema. It could also be related to the way
your PySpark application is formatting or providing data to Snowflake.

Check these:

   - Schema matching: ensure that the data types, lengths, and encoding of
   the columns in your Snowflake table match the corresponding columns in
   your PySpark DataFrame.
   - Column mapping: explicitly map the columns in your PySpark DataFrame
   to the corresponding columns in the Snowflake table during the write
   operation. This can help avoid any implicit mappings that might be
   causing issues (a rough sketch is shown below).
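
A rough sketch of the second point (the connector options, table and column
names below are placeholders rather than anything from your job, and the
one-row DataFrame simply stands in for the batch you write, for example
inside foreachBatch):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("snowflake-write-sketch").getOrCreate()

# Stand-in for the DataFrame produced by your streaming job
df = spark.createDataFrame(
    [(1, "2024-02-09 10:00:00", "hello")],
    ["id", "event_ts", "payload"],
)

# Placeholder connection options for the Snowflake Spark connector
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

# Explicitly select, cast and order the columns to match the target table DDL
aligned_df = df.select(
    col("id").cast("long").alias("ID"),
    col("event_ts").cast("timestamp").alias("EVENT_TS"),
    col("payload").cast("string").alias("PAYLOAD"),
)

(aligned_df.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "MY_TABLE")
    .mode("append")
    .save())

Keeping the select list in the same order, with the same names and types as
the target table DDL, removes any reliance on implicit mapping by the
connector.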


HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 9 Feb 2024 at 13:06, Varun Shah  wrote:

> Hi Team,
>
> We currently have implemented pyspark spark-streaming application on
> databricks, where we read data from s3 and write to the snowflake table
> using snowflake connector jars (net.snowflake:snowflake-jdbc v3.14.5 and
> net.snowflake:spark-snowflake v2.12:2.14.0-spark_3.3) .
>
> Currently facing an issue where if we give a large number of columns, it
> trims the data in a copy statement, thereby unable to write to the
> snowflake as the data mismatch happens.
>
> Using databricks 11.3 LTS with Spark 3.3.0 and Scala 2.12 version.
>
> Can you please help on how I can resolve this issue ? I tried searching
> online, but did not get any such articles.
>
> Looking forward to hearing from you.
>
> Regards,
> Varun Shah
>
>
>


Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
I would appreciate your thoughts on this. Personally, I think Spark
Structured Streaming can be used effectively in an Event Driven Architecture
(as well as in continuous streaming).

From the link here
<https://www.linkedin.com/posts/activity-7161748945801617409-v29V?utm_source=share_medium=member_desktop>
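
As a minimal illustration of that pattern (the Kafka topic, REST endpoint
and checkpoint path are made-up placeholders, the requests package is an
extra dependency, and the per-row POST with collect() is kept only for
brevity):

from pyspark.sql import SparkSession
import requests  # hypothetical downstream API client

spark = SparkSession.builder.appName("event-driven-sketch").getOrCreate()

# Events arriving on a Kafka topic (broker and topic names are made up)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

def post_batch(batch_df, batch_id):
    # Each micro-batch is handed to ordinary batch code, which can call an
    # external REST API, write to a warehouse, and so on.
    rows = batch_df.selectExpr("CAST(value AS STRING) AS payload").collect()
    for row in rows:  # fine for a sketch; avoid collect() on large batches
        requests.post("https://api.example.com/ingest", json={"event": row.payload})

query = (events.writeStream
         .foreachBatch(post_batch)
         .trigger(processingTime="30 seconds")
         .option("checkpointLocation", "/tmp/checkpoints/event-driven-sketch")
         .start())
query.awaitTermination()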

HTH,

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Shuffle write and read phase optimizations for parquet+zstd write

2024-02-08 Thread Mich Talebzadeh
Hi,

... Most of our jobs end up with a shuffle stage based on a partition
column value before writing into a parquet, and most of the time we have
data skewness in partitions

Have you considered the causes of these recurring issues and some potential
alternative strategies?

   - Tuning the Spark configuration related to shuffle operations: settings
   such as spark.sql.shuffle.partitions, spark.reducer.maxSizeInFlight,
   spark.memory.fraction and the spark.shuffle.spill.* options.
   - Partitioning strategy: it may help to review and optimize the
   partitioning strategy to minimize data skewness by first looking at the
   causes of the skew, for example:

   SELECT column_name, COUNT(column_name) AS count FROM ABC GROUP BY
   column_name ORDER BY count DESC

   Then you can try things like salting or bucketing to distribute data
   more evenly (a minimal salting sketch is shown below).
   - Caching frequently accessed data: if certain data is frequently
   accessed, you may consider caching it in memory to reduce the need for
   repeated shuffling.
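
For illustration only (partition_col, the salt count and the paths are
assumptions rather than details of the original job), a simple salting pass
in PySpark could look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, floor, rand

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

num_salts = 16  # tune to the observed skew
df = spark.read.parquet("s3://bucket/input/")  # hypothetical input path

# Spread each hot partition-column value across num_salts buckets
salted = (df
          .withColumn("salt", floor(rand() * num_salts).cast("int"))
          .withColumn("salted_key",
                      concat_ws("_",
                                col("partition_col").cast("string"),
                                col("salt").cast("string"))))

# Shuffle on the salted key instead of the raw, skewed column
(salted.repartition("salted_key")
       .write.mode("overwrite")
       .partitionBy("partition_col")
       .parquet("s3://bucket/output/"))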

The feasibility of your proposal depends on the specific requirements,
characteristics of your data, and the downstream processes that consume
that data. If downstream tools or processes expect data in a specific
format, the serialized format may require additional processing or
conversion, impacting compatibility.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 7 Feb 2024 at 18:59, satyajit vegesna 
wrote:

> Hi Community,
>
> Can someone please help validate the idea below and suggest pros/cons.
>
> Most of our jobs end up with a shuffle stage based on a partition column
> value before writing into parquet, and most of the time we have data skew
> ness in partitions.
>
> Currently most of the problems happen at shuffle read stage and we face
> several issues like below,
>
>1. Executor lost
>2. Node lost
>3. Shuffle Fetch erros
>
> *And I have been thinking about ways to completely avoid de-serializing
> data during shuffle read phase and one way to be able to do it in our case
> is by,*
>
>1. *Serialize the shuffle write in parquet + zstd format*
>2. *Just move the data files into partition folders from shuffle
>blocks locally written to executors  (This avoids trying to de-serialize
>the data into memory and disk and then write into parquet)*
>
> Please confirm on the feasibility here and any pros/cons on the above
> approach.
>
> Regards.
>
>
>
>


Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Mich Talebzadeh
I don't think adding this to the streaming flow (at the micro-batch level)
will be that useful.

However, this could be added to the Spark UI as an enhancement to the
Streaming Query Statistics page.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 6 Feb 2024 at 03:49, Raghu Angadi 
wrote:

> Agree, the default behavior does not need to change.
>
> Neil, how about separating it into two sections:
>
>- Actual rows in the sink (same as current output)
>- Followed by metadata data
>
>


Re: Enhanced Console Sink for Structured Streaming

2024-02-03 Thread Mich Talebzadeh
Hi,

As I understand it, the proposal you mentioned suggests adding event-time
and state store metadata to the console sink to better highlight the
semantics of the Structured Streaming engine. While I agree this enhancement
can provide valuable insights into the engine's behavior, especially for
newcomers, there are potential challenges that we need to be aware of:

- Including additional metadata in the console sink output can increase the
volume of information printed. This might result in a more verbose console
output, making it harder to distinguish the actual data from the metadata,
especially in scenarios with high data throughput.
- Added verbosity: the proposed additional metadata may make the console
output more verbose, potentially affecting its readability, especially for
users who are primarily interested in the processed data rather than the
internal engine details.
- Users unfamiliar with the internal workings of Structured Streaming might
misinterpret the metadata as part of the actual data, leading to confusion.
- The act of printing additional metadata to the console may introduce some
overhead, especially in scenarios where high-frequency updates occur. While
this overhead might be minimal, it is worth considering in
performance-sensitive applications.
- While the proposal aims to make it easier for beginners to understand
concepts like watermarks, operator state, and output rows, it could
potentially increase the learning curve due to the introduction of
additional terminology and information.
- Users might benefit from the ability to selectively enable or disable the
display of certain metadata elements to tailor the console output to their
specific needs. However, this introduces additional complexity.

As usual with these things, your mileage may vary. Whilst the proposed
enhancements offer valuable insights into the behavior of Structured
Streaming, we ought to think about the potential downsides, particularly in
terms of increased verbosity, complexity, and the impact on user experience.

HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 3 Feb 2024 at 01:32, Neil Ramaswamy
 wrote:

> Hi all,
>
> I'd like to propose the idea of enhancing Structured Streaming's console
> sink to print event-time metrics and state store data, in addition to the
> sink's rows.
>
> I've noticed beginners often struggle to understand how watermarks,
> operator state, and output rows are all intertwined. By printing all of
> this information in the same place, I think that this sink will make it
> easier for users to see—and our docs to explain—how these concepts work
> together.
>
> For example, our docs could walk the users through a query with a
> 10-second tumbling window aggregation (e.g. with a .count()) and a 15
> second watermark. After processing something like (foo, 17) and (bar, 15),
> writing another record (baz, 36) to the source would cause the following to
> print for batch 2:
>
> +------------------------------------+
> |     WRITES TO SINK (Batch = 2)     |
> +--------------------------+---------+
> |          window          |  count  |
> +--------------------------+---------+
> | {10 seconds, 20 seconds} |    2    |
> +--------------------------+---------+
> |             EVENT TIME             |
> +------------------------------------+
> |      watermark -> 21 seconds       |
> |        numDroppedRows -> 0         |
> +------------------------------------+
> |             STATE ROWS             |
> +--------------------------+---------+
> |           key            |  value  |
> +--------------------------+---------+
> | {30 seconds, 40 seconds} |   {1}   |
> +--------------------------+---------+
>
> From this (especially with expository help), it would be more apparent
> that the record at 36 seconds did three things: it advanced the watermark
> to 36-15 = 21 seconds, caused the [10, 20] window to close, and was put
> into the state for [30, 40].
>
> One valid concern is that this sink would now be printing *metadata*, not
> just data: will users think that Structured Streaming writes metadata to
> sinks? Perhaps. But I think that we can clarify that in the documentation
> of
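
For reference, a minimal PySpark sketch of the kind of query described above
(a 10-second tumbling window count with a 15-second watermark writing to the
console sink). The socket source, host and port are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("console-sink-sketch").getOrCreate()

# Illustrative source: lines of "word,epoch_seconds" read from a local socket.
raw = (spark.readStream.format("socket")
       .option("host", "localhost").option("port", 9999).load())

events = raw.select(
    F.split("value", ",").getItem(0).alias("word"),
    F.split("value", ",").getItem(1).cast("long").cast("timestamp").alias("event_time"))

counts = (events
          .withWatermark("event_time", "15 seconds")
          .groupBy(F.window("event_time", "10 seconds"))
          .count())

# Append mode only emits a window once the watermark has passed its end.
(counts.writeStream
       .outputMode("append")
       .format("console")
       .option("truncate", "false")
       .start()
       .awaitTermination())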

Re: [QUESTION] Legal dependency with Oracle JDBC driver

2024-01-30 Thread Mich Talebzadeh
Hi Alex,
Well, that is just Justin's opinion on this matter. It is different from
mine. Bottom line: you can always refer to Oracle or a copyright expert on
this matter and see what they suggest.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 29 Jan 2024 at 22:05, Alex Porcelli  wrote:

> Hi Mich,
>
> Thank you for the prompt response.
>
> Looks like Justin Mclean has a slightly different perspective on Oracle's
> license, as you can see in [3].
>
>
> On Mon, Jan 29, 2024 at 4:17 PM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> This is not an official response and should not be taken as an
>> official view. It is my own opinion.
>>
>> Looking at the reference [1], I can see a host of inclusions of other
>> JDBC vendors' drivers, such as IBM DB2 and MS SQL Server.
>>
>> With regard to link [2], it was closed more than three years ago, and it
>> is assumed that these references are provided as a "convenience". There is
>> no implication that JDBC drivers are included in these releases, modified
>> or not modified.
>> Oracle provides multiple JDBC drivers such as ojdbc5.jar, ojdbc6.jar,
>> ojdbc7.jar, ojdbc11.jar and so forth, free to download and use within the
>> license (you need a valid Oracle login).
>>
>> This is what it says with regard to license
>>
>> Governed by the No-clickthrough FDHUT license
>> <https://download.oracle.com/otn-pub/otn_software/jdbc/FDHUT_LICENSE.txt>
>>
>> I glanced through the license and did not find anything that
>> contravenes the Spark references in [1], namely:
>>
>>- spark <https://github.com/apache/spark/tree/master>
>>- /sql <https://github.com/apache/spark/tree/master/sql>
>>- /core <https://github.com/apache/spark/tree/master/sql/core>
>>
>>
>> /pom.xml
>>
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 29 Jan 2024 at 16:16, Alex Porcelli  wrote:
>>
>>> Hi Spark Devs,
>>>
>>> I'm reaching out to understand how you managed to include the Oracle
>>> JDBC as one of your dependencies [1]. According to legal tickets
>>> [2][3], this is considered a Category X dependency and is not allowed.
>>>
>>> (I'm part of the Apache KIE podling, and we are struggling with such a
>>> dependency, and it has been pointed out that you may have a solution to
>>> share).
>>>
>>> [1] - https://github.com/apache/spark/blob/master/sql/core/pom.xml#L187
>>> [2] - https://issues.apache.org/jira/browse/LEGAL-526
>>> [3] - https://issues.apache.org/jira/browse/LEGAL-663
>>>
>>> Regards,
>>>
>>> Alex
>>> Apache KIE
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [QUESTION] Legal dependency with Oracle JDBC driver

2024-01-29 Thread Mich Talebzadeh
Hi,

This is not an official response and should not be taken as an
official view. It is my own opinion.

Looking at the reference [1], I can see a host of inclusions of other JDBC
vendors' drivers, such as IBM DB2 and MS SQL Server.

With regard to link [2], it was closed more than three years ago, and it is
assumed that these references are provided as a "convenience". There is no
implication that JDBC drivers are included in these releases, modified or
not modified.
Oracle provides multiple JDBC drivers such as ojdbc5.jar, ojdbc6.jar,
ojdbc7.jar, ojdbc11.jar and so forth, free to download and use within the
license (you need a valid Oracle login).

This is what it says with regard to license

Governed by the No-clickthrough FDHUT license
<https://download.oracle.com/otn-pub/otn_software/jdbc/FDHUT_LICENSE.txt>

I glanced through the license and did not find anything that contravenes
the Spark references in [1], namely:
- spark <https://github.com/apache/spark/tree/master>
- /sql <https://github.com/apache/spark/tree/master/sql>
- /core <https://github.com/apache/spark/tree/master/sql/core>
/pom.xml


HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 29 Jan 2024 at 16:16, Alex Porcelli  wrote:

> Hi Spark Devs,
>
> I'm reaching out to understand how you managed to include the Oracle
> JDBC as one of your dependencies [1]. According to legal tickets
> [2][3], this is considered a Category X dependency and is not allowed.
>
> (I'm part of the Apache KIE podling, and we are struggling with such a
> dependency, and it has been pointed out that you may have a solution to
> share).
>
> [1] - https://github.com/apache/spark/blob/master/sql/core/pom.xml#L187
> [2] - https://issues.apache.org/jira/browse/LEGAL-526
> [3] - https://issues.apache.org/jira/browse/LEGAL-663
>
> Regards,
>
> Alex
> Apache KIE
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [EXTERNAL] Re: Spark Kafka Rack Aware Consumer

2024-01-26 Thread Mich Talebzadeh
Ok I made a request to access this document

Thanks

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>




 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 26 Jan 2024 at 15:48, Schwager, Randall <
randall.schwa...@charter.com> wrote:

> Hi Mich,
>
>
>
> Thanks for responding. In the JIRA issue, the design doc you’re referring
> to describes the prior work.
>
>
>
> This is the design doc for the proposed change:
> https://docs.google.com/document/d/1RoEk_mt8AUh9sTQZ1NfzIuuYKf1zx6BP1K3IlJ2b8iM/edit#heading=h.pbt6pdb2jt5c
>
>
>
> I’ll re-word the description to make that distinction more clear.
>
>
>
> Sincerely,
>
>
>
> Randall
>
>
>
> *From: *Mich Talebzadeh 
> *Date: *Friday, January 26, 2024 at 04:30
> *To: *"Schwager, Randall" 
> *Cc: *"dev@spark.apache.org" 
> *Subject: *[EXTERNAL] Re: Spark Kafka Rack Aware Consumer
>
>
>
> *CAUTION:* The e-mail below is from an external source. Please exercise
> caution before opening attachments, clicking links, or following guidance.
>
> Your design doc
>
> Structured Streaming Kafka Source - Design Doc - Google Docs
> <https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit#heading=h.k36c6oyz89xw>
>
>
>
> seems to have been around since 2016. Reading the comments, it was decided
> not to progress with it. What has changed since then, please?
>
>
>
> Are you implying that this doc is still relevant?
>
>
>
> HTH
>
>
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 25 Jan 2024 at 20:10, Schwager, Randall <
> randall.schwa...@charter.com> wrote:
>
> Bump.
>
> Am I asking these questions in the wrong place? Or should I forego design
> input and just write the PR?
>
>
>
> *From: *"Schwager, Randall" 
> *Date: *Monday, January 22, 2024 at 17:02
> *To: *"dev@spark.apache.org" 
> *Subject: *Re: Spark Kafka Rack Aware Consumer
>
>
>
> Hello Spark Devs!
>
>
>
> After doing some detective work, I’d like to revisit this idea in earnest.
> My understanding now is that setting `client.rack` dynamically on the
> executor will do nothing. This is because the driver assigns Kafka
> partitions to executors. I’ve summarized a design to enable rack awareness
> and other location assignment patterns more generally in SPARK-46798
> <https://issues.apache.org/jira/browse/SPARK-46798>.
>
>
>
> Since this is my first go at contributing to Spark, could I ask for a
> committer to help shepherd this JIRA issue along?
>
>
>
> Sincerely,
>
>
>
> Randall
>
>
>
> *From: *"Schwager, Randall" 
> *Date: *Wednesday, January 10, 2024 at 19:39
> *To: *"dev@spark.apache.org" 
> *Subject: *Spark Kafka Rack Aware Consumer
>
>
>
> Hello Spark Devs!
>
>
>
> Has there been discussion around adding the ability to dynamically set the
> ‘client.rack’ Kafka parameter at the executor?
>
> The Kafka SQL connector code on master doesn’t seem to support this
> feature. One can easily set the ‘client.rack’ parameter at the driver, but
> that just sets all executors to have the same rack. It seems that if we
> want each executor to set the correct rack, each executor will have to
> produce the setting dynamically on start-up.
>
>
>
> Would this be a good area to consider contributing new functionality?
>
>
>
> Sincerely,
>
>
>
> Randall
>
>
>
>


Re: Spark Kafka Rack Aware Consumer

2024-01-26 Thread Mich Talebzadeh
Your design doc

Structured Streaming Kafka Source - Design Doc - Google Docs
<https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit#heading=h.k36c6oyz89xw>

seems to have been around since 2016. Reading the comments, it was decided
not to progress with it. What has changed since then, please?

Are you implying that this doc is still relevant?

HTH


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 25 Jan 2024 at 20:10, Schwager, Randall <
randall.schwa...@charter.com> wrote:

> Bump.
>
> Am I asking these questions in the wrong place? Or should I forego design
> input and just write the PR?
>
>
>
> *From: *"Schwager, Randall" 
> *Date: *Monday, January 22, 2024 at 17:02
> *To: *"dev@spark.apache.org" 
> *Subject: *Re: Spark Kafka Rack Aware Consumer
>
>
>
> Hello Spark Devs!
>
>
>
> After doing some detective work, I’d like to revisit this idea in earnest.
> My understanding now is that setting `client.rack` dynamically on the
> executor will do nothing. This is because the driver assigns Kafka
> partitions to executors. I’ve summarized a design to enable rack awareness
> and other location assignment patterns more generally in SPARK-46798
> <https://issues.apache.org/jira/browse/SPARK-46798>.
>
>
>
> Since this is my first go at contributing to Spark, could I ask for a
> committer to help shepherd this JIRA issue along?
>
>
>
> Sincerely,
>
>
>
> Randall
>
>
>
> *From: *"Schwager, Randall" 
> *Date: *Wednesday, January 10, 2024 at 19:39
> *To: *"dev@spark.apache.org" 
> *Subject: *Spark Kafka Rack Aware Consumer
>
>
>
> Hello Spark Devs!
>
>
>
> Has there been discussion around adding the ability to dynamically set the
> ‘client.rack’ Kafka parameter at the executor?
>
> The Kafka SQL connector code on master doesn’t seem to support this
> feature. One can easily set the ‘client.rack’ parameter at the driver, but
> that just sets all executors to have the same rack. It seems that if we
> want each executor to set the correct rack, each executor will have to
> produce the setting dynamically on start-up.
>
>
>
> Would this be a good area to consider contributing new functionality?
>
>
>
> Sincerely,
>
>
>
> Randall
>
>
>
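
For context, a minimal sketch of the limitation discussed above: passing
client.rack through the Kafka source options on the driver, which applies
the same value to every executor's consumer. The broker address, topic and
rack id are illustrative (and the Kafka connector package must be on the
classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-rack-sketch").getOrCreate()

# Options prefixed with "kafka." are passed through to the underlying consumer.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("kafka.client.rack", "use1-az1")  # one value reaches all executors
      .load())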


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-19 Thread Mich Talebzadeh
Everyone's vote matters whether they are PMC or not. There is no monopoly
here

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 19 Jan 2024 at 11:55, Pavan Kotikalapudi
 wrote:

> +1
> If my vote counts.
>
> Does only spark PMC votes count?
>
> Thanks,
>
> Pavan
>
> On Thu, Jan 18, 2024 at 3:19 AM Adam Hobbs
>  wrote:
>
>> +1
>> --
>> *From:* Pavan Kotikalapudi 
>> *Sent:* Thursday, January 18, 2024 4:19:32 AM
>> *To:* Spark dev list 
>> *Subject:* Re: Vote on Dynamic resource allocation for structured
>> streaming [SPARK-24815]
>>
>>
>> CAUTION: This email originated from outside of the organisation. Do not
>> click links or open attachments unless you recognise the sender's full
>> email address and know the content is safe.
>>
>> Thanks for proposing and voting for the feature Mich.
>>
>> adding some references to the thread.
>>
>>- Jira ticket - SPARK-24815
>>
>> <https://urldefense.com/v3/__https://issues.apache.org/jira/browse/SPARK-24815__;!!OkoFT9xN!M8RjO-4PxxtSXLdZ72VEqpLZr9IE1m1Gj4YHrjSKR-6ZwOH-1RMbh-d9RZlvDvxwMrhtlDCGv7l6zFvILPwy_fEyuSdA5k0zCn0_Z1lI$>
>>- Design Doc
>>
>> <https://urldefense.com/v3/__https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing__;!!OkoFT9xN!M8RjO-4PxxtSXLdZ72VEqpLZr9IE1m1Gj4YHrjSKR-6ZwOH-1RMbh-d9RZlvDvxwMrhtlDCGv7l6zFvILPwy_fEyuSdA5k0zCuAyVt8y$>
>>
>>- discussion thread
>>
>> <https://urldefense.com/v3/__https://lists.apache.org/thread/9yx0jnk9h1234joymwlzfx2gh2m8b9bo__;!!OkoFT9xN!M8RjO-4PxxtSXLdZ72VEqpLZr9IE1m1Gj4YHrjSKR-6ZwOH-1RMbh-d9RZlvDvxwMrhtlDCGv7l6zFvILPwy_fEyuSdA5k0zCqHoXny8$>
>>- PR with initial implementation -
>>https://github.com/apache/spark/pull/42352
>>
>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352__;!!OkoFT9xN!M8RjO-4PxxtSXLdZ72VEqpLZr9IE1m1Gj4YHrjSKR-6ZwOH-1RMbh-d9RZlvDvxwMrhtlDCGv7l6zFvILPwy_fEyuSdA5k0zCisLiWaP$>
>>
>> Please vote with:
>>
>> [ ] +1: Accept the proposal and start with the development.
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thank you,
>>
>> Pavan
>>
>> On Wed, Jan 17, 2024 at 9:52 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> +1 for me  (non binding)
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> 
>>
>> This communication is intended only for use of the addressee and may
>> contain legally privileged and confidential information.
>> If you are not the addressee or intended recipient, you are notified that
>> any dissemination, copying or use of any of the information is unauthorised.
>>
>> The legal privilege and confidentiality attached to this e-mail is not
>> waived, lost or destroyed by reason of a mistaken delivery to you.
>> If you have received this message in error, we would appreciate an
>> immediate notification via e-mail to contac...@bendigoadelaide.com.au or
>> by phoning 1300 BENDIGO (1300 236 344), and ask that the e-mail be
>> permanently deleted from your system.
>>
>> Bendigo and Adelaide Bank Limited ABN 11 068 049 178
>>
>>
>> 
>>
>


Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-17 Thread Mich Talebzadeh
+1 for me  (non binding)



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-17 Thread Mich Talebzadeh
I think we have discussed this enough and I consider it a useful feature. I
propose a vote on it.

+ 1 for me

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
 wrote:

> Hi Spark Dev,
>
> I have extended traditional DRA to work for structured streaming
> use-case.
>
> Here is an initial Implementation draft PR
> https://github.com/apache/spark/pull/42352 and design doc:
> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>
> Please review and let me know what you think.
>
> Thank you,
>
> Pavan
>


Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread Mich Talebzadeh
Hi Ashok,

Thanks for pointing out the databricks article Scalable Spark Structured
Streaming for REST API Destinations | Databricks Blog
<https://www.databricks.com/blog/scalable-spark-structured-streaming-rest-api-destinations>

I browsed it, and it is broadly similar to what many of us do with Spark
Structured Streaming and *foreachBatch*. This article and mine both mention
a REST API as part of the architecture. However, there are notable
differences, I believe.

In my proposed approach:

   1. Event-Driven Model:


   - Spark Streaming waits until Flask REST API makes a request for events
   to be generated within PySpark.
   - Messages are generated and then fed into any sink based on the Flask
   REST API's request.
   - This creates a more event-driven model where Spark generates data when
   prompted by external requests.





In the Databricks article scenario:

Continuous Data Stream:

   - There is an incoming stream of data from sources like Kafka, AWS
   Kinesis, or Azure Event Hub handled by foreachBatch
   - As messages flow off this stream, calls are made to a REST API with
   some or all of the message data.
   - This suggests a continuous flow of data where messages are sent to a
   REST API as soon as they are available in the streaming source.


*Benefits of Event-Driven Model:*


   1. Responsiveness: Ideal for scenarios where data generation needs to be
   aligned with specific events or user actions.
   2. Resource Optimization: Can reduce resource consumption by processing
   data only when needed.
   3. Flexibility: Allows for dynamic control over data generation based on
   external triggers.

*Benefits of Continuous Data Stream Mode with foreachBatch:*

   1. Real-Time Processing: Facilitates immediate analysis and action on
   incoming data.
   2. Handling High Volumes: Well-suited for scenarios with
   continuous, high-volume data streams.
   3. Low-Latency Applications: Essential for applications requiring near
   real-time responses.

*Potential Use Cases for my approach:*

   - On-Demand Data Generation: Generating data for
   simulations, reports, or visualizations based on user requests.
   - Triggered Analytics: Executing specific analytics tasks only when
   certain events occur, such as detecting anomalies or reaching thresholds
   (say, fraud detection).
   - Custom ETL Processes: Facilitating data
   extraction, transformation, and loading workflows based on external events
   or triggers


Something to note on latency: event-driven models like mine can potentially
introduce slight latency compared to continuous processing, as data
generation depends on API calls.

So my approach is more event-triggered and responsive to external requests,
while the foreachBatch scenario is more continuous and real-time, processing
and sending data as it becomes available.

In summary, both approaches have their merits and are suited to different
use cases depending on the nature of the data flow and processing
requirements.
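
To make the comparison concrete, here is a minimal sketch of the
foreachBatch pattern described in the article. The endpoint URL and the rate
source are illustrative, and collecting a micro-batch to the driver is only
reasonable for small batches:

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rest-sink-sketch").getOrCreate()

def post_batch(batch_df, batch_id):
    # Illustrative endpoint; in practice add retries, batching and error handling.
    for row in batch_df.toJSON().collect():
        requests.post("https://example.com/ingest", data=row,
                      headers={"Content-Type": "application/json"}, timeout=5)

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

(stream.writeStream
       .foreachBatch(post_batch)
       .start()
       .awaitTermination())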

Cheers

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 9 Jan 2024 at 19:11, ashok34...@yahoo.com 
wrote:

> Hey Mich,
>
> Thanks for this introduction on your forthcoming proposal "Spark
> Structured Streaming and Flask REST API for Real-Time Data Ingestion and
> Analytics". I recently came across an article by Databricks with title 
> Scalable
> Spark Structured Streaming for REST API Destinations
> <https://www.databricks.com/blog/scalable-spark-structured-streaming-rest-api-destinations>
> . Their use case is similar to your suggestion but what they are saying
> is that they have incoming stream of data from sources like Kafka, AWS
> Kinesis, or Azure Event Hub. In other words, a continuous flow of data
> where messages are sent to a REST API as soon as they are available in the
> streaming source. Their approach is practical but wanted to get your
> thoughts on their article with a better understanding on your proposal and
> differences.
>
> Thanks
>
>
> On Tuesday, 9 January 2024 at 00:24:19 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Please also note that Flask, by default, is a single-threaded web
> framework. While it is suitable for development and small-scale
> applications, it may not handle concurrent requests efficiently in a
> production environment.
> In production, one can utilise Gunicorn

Re: AutoReply: Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-09 Thread Mich Talebzadeh
Hi,

Please stop this acknowledgement email. It is spamming the forum
unnecessarily!

Thanks

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 9 Jan 2024 at 09:44, laglanyue  wrote:

> Thanks for your email; I have received it.


Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-09 Thread Mich Talebzadeh
+1 for me as well


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 9 Jan 2024 at 03:24, Anish Shrigondekar
 wrote:

> Thanks Jungtaek for creating the Vote thread.
>
> +1 (non-binding) from my side too.
>
> Thanks,
> Anish
>
> On Tue, Jan 9, 2024 at 6:09 AM Jungtaek Lim 
> wrote:
>
>> Starting with my +1 (non-binding). Thanks!
>>
>> On Tue, Jan 9, 2024 at 9:37 AM Jungtaek Lim 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Structured Streaming - Arbitrary
>>> State API v2.
>>>
>>> References:
>>>
>>>- JIRA ticket <https://issues.apache.org/jira/browse/SPARK-45939>
>>>- SPIP doc
>>>
>>> <https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing>
>>>- Discussion thread
>>><https://lists.apache.org/thread/3jyjdgk1m5zyqfmrocnt6t415703nc8l>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>


Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
Please also note that Flask, by default, is a single-threaded web
framework. While it is suitable for development and small-scale
applications, it may not handle concurrent requests efficiently in a
production environment.
In production, one can utilise Gunicorn (Green Unicorn), a WSGI (Web Server
Gateway Interface) HTTP server that is commonly used to serve Flask
applications. It provides multiple worker processes, each capable of
handling one request at a time, which makes Gunicorn suitable for handling
multiple simultaneous requests and improves the concurrency and performance
of your Flask application.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 8 Jan 2024 at 19:30, Mich Talebzadeh 
wrote:

> Thought it might be useful to share my idea with fellow forum members.  During
> the breaks, I worked on the *seamless integration of Spark Structured
> Streaming with Flask REST API for real-time data ingestion and analytics*.
> The use case revolves around a scenario where data is generated through
> REST API requests in real time. The Flask REST API
> <https://en.wikipedia.org/wiki/Flask_(web_framework)> efficiently
> captures and processes this data, saving it to a Spark Structured Streaming
> DataFrame. Subsequently, the processed data could be channelled into any
> sink of your choice including Kafka pipeline, showing a robust end-to-end
> solution for dynamic and responsive data streaming. I will delve into the
> architecture, implementation, and benefits of this combination, enabling
> one to build an agile and efficient real-time data application. I will put
> the code in GitHub for everyone's benefit. Hopefully your comments will
> help me to improve it.
>
> Cheers
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
Thought it might be useful to share my idea with fellow forum members.  During
the breaks, I worked on the *seamless integration of Spark Structured
Streaming with Flask REST API for real-time data ingestion and analytics*.
The use case revolves around a scenario where data is generated through
REST API requests in real time. The Flask REST API
<https://en.wikipedia.org/wiki/Flask_(web_framework)> efficiently captures
and processes this data, saving it to a Spark Structured Streaming
DataFrame. Subsequently, the processed data could be channelled into any
sink of your choice including Kafka pipeline, showing a robust end-to-end
solution for dynamic and responsive data streaming. I will delve into the
architecture, implementation, and benefits of this combination, enabling
one to build an agile and efficient real-time data application. I will put
the code in GitHub for everyone's benefit. Hopefully your comments will
help me to improve it.

Cheers

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-05 Thread Mich Talebzadeh
Hi Pavan,

Thanks for your answers.

Given these responses, it seems like you have already taken a
comprehensive approach to addressing the challenges associated with dynamic
scaling in Spark Structured Streaming. IMO, it would also be beneficial to
engage with other members to gather additional feedback and
perspectives, especially from those with experience in dynamic resource
allocation in Spark. Having said that, the discussion above demonstrates a
good understanding of the challenges involved in enhancing Spark Structured
Streaming's resource management capabilities.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 5 Jan 2024 at 13:43, Pavan Kotikalapudi 
wrote:

> Hi Mich,
>
> As always thanks for looking keenly on the design, really appreciate your
> inputs on this Ticket. Would love to improve this further and cover more
> edge-cases if any.
>
> I can answer the concerns you have below. I believe I have covered some of
> them in the proposal, If at all I missed out on anything.
>
>
>- Implementation Complexity: Integrating dynamic scaling into Spark's
>resource management framework requires careful design and implementation to
>ensure effectiveness and stability.
>I have drafted a PR with initial implementation
>https://github.com/apache/spark/pull/42352
>
> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352__;!!NCc8flgU!f57o0p_8gfCLFNDpC01KL-ol2cIFY9ToRmVSpnKl8EzBHNF7tqnvFzcGx94xjl2DzrNQSBnFrtE44gyMDwT9slb8WuoTPA$>,
>made sure that we just utilize Spark's stable resource management
>framework of batch jobs and extended it to work for our streaming
>use-cases. As structured streaming is a micro-batch at the lowest level, I
>tuned the scaling actions based on micro-batches.
>Would appreciate it if anybody in the dev community who has worked on
>dynamic resource allocation (DRA) implementation can take a look at this as
>well.
>
>- Heuristic Accuracy: This proposal effectiveness depends heavily on
>the accuracy of the trigger interval heuristics used to guide scaling
>decisions.
>Yes. Though the scaling guidelines of the app are determined by the
>trigger interval, The guidelines will just provide values to the
>request/remove policy of the already existing DRA solution
>
> <https://spark.apache.org/docs/latest/job-scheduling.html#resource-allocation-policy>.
>
>The current dra is targeted towards batch use cases; it will
>constantly scale out/back per stage of the job. That makes it unstable for
>streaming jobs. I have tweaked it to scale by micro-batches. That said, I
>am still looking for any suggestions on other stats which will be helpful
>in effective scaling of the streaming apps
>
>- Overhead: Monitoring and scaling processes themselves introduce some
>overhead, which needs to be balanced against the potential performance
>gains. For example, how we can utilise Input Rate, process rate and 
> Operation
>Duration from Streaming Query Statistics page etc
>We already have all of the events in the Listener Bus spark framework.
>We are making sure we don't add anything more to the framework but rather
>just consume that information to scale. So the solution shouldn't
>compromise any performance, it will definitely yield better resource
>utilization for uneven traffic patterns of the day.
>Regarding the utilization of `Streaming Query Statistics`, it would
>fall under the spark-sql sub-module of the project which will steer towards
>creating a new algorithm in that module separate from current DRA
>implementation. Since the current design doesn't require any of those stats
>I kept it to the core module stats, but if other stats like input rate will
>help in building better scaling accuracy would definitely look into it.
>
>- We ought to consider the potential impact on latency. Scaling
>operations, especially scaling up, may introduce some latency. Ensuring
>minimal impact on the processing time is crucial
>Since structured streaming apps tend to be latency sensitive at times
>the scaling algorithm aggressively scales to add more resources. The scale
>

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-02 Thread Mich Talebzadeh
Hi Pavan,

Thanks for putting this request forward.

I am generally supportive of it. In a nutshell, I believe this proposal
holds significant promise for optimizing resource utilization and enhancing
performance in Spark Structured Streaming.

Having said that, there are potential challenges and considerations from my
experience of Spark Structured Streaming (SSS), which I summarise below:

   - Implementation Complexity: Integrating dynamic scaling into Spark's
   resource management framework requires careful design and implementation to
   ensure effectiveness and stability.
   - Heuristic Accuracy: This proposal's effectiveness depends heavily on the
   accuracy of the trigger interval heuristics used to guide scaling
   decisions.
   - Overhead: Monitoring and scaling processes themselves introduce some
   overhead, which needs to be balanced against the potential performance
   gains. For example, how we can utilise Input Rate, Process Rate and
   Operation Duration from the Streaming Query Statistics page, etc.
   - We ought to consider the potential impact on latency. Scaling
   operations, especially scaling up, may introduce some latency. Ensuring
   minimal impact on the processing time is crucial
   - Implementing mechanisms for graceful scaling operations, avoiding
   abrupt changes, can contribute to a smoother user experience.

I do not know whether some of these points are already considered in your
proposal?

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 1 Jan 2024 at 10:34, Pavan Kotikalapudi
 wrote:

> Hi PMC members,
>
> Bumping this idea for one last time to see if there are any approvals to
> take it forward.
>
> Here is an initial Implementation draft PR
> https://github.com/apache/spark/pull/42352 and design doc:
> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>
>
> Thank you,
>
> Pavan
>
> On Mon, Nov 13, 2023 at 6:57 AM Pavan Kotikalapudi <
> pkotikalap...@twilio.com> wrote:
>
>>
>>
>> Here is an initial Implementation draft PR
>> https://github.com/apache/spark/pull/42352 and design doc:
>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>>
>>
>> On Sun, Nov 12, 2023 at 5:24 PM Pavan Kotikalapudi <
>> pkotikalap...@twilio.com> wrote:
>>
>>> Hi Dev community,
>>>
>>> Just bumping to see if there are more reviews to evaluate this idea of
>>> adding auto-scaling to structured streaming.
>>>
>>> Thanks again,
>>>
>>> Pavan
>>>
>>> On Wed, Aug 23, 2023 at 2:49 PM Pavan Kotikalapudi <
>>> pkotikalap...@twilio.com> wrote:
>>>
>>>> Thanks for the review Mich.
>>>>
>>>> I have updated the Q4 with as concise information as possible and left
>>>> the detailed explanation to Appendix.
>>>>
>>>> here is the updated answer to the Q4
>>>> <https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit#heading=h.xe0x4i9gc1dg>
>>>>
>>>> Thank you,
>>>>
>>>> Pavan
>>>>
>>>> On Wed, Aug 23, 2023 at 2:46 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Pavan,
>>>>>
>>>>> I started reading your SPIP but have difficulty understanding it in
>>>>> detail.
>>>>>
>>>>> Specifically under Q4, " What is new in your approach and why do you
>>>>> think it will be successful?", I believe it would be better to remove the
>>>>> plots and focus on "what this proposed solution is going to add to the
>>>>> current play". At this stage a concise briefing would be appreciated and
>>>>> the specific plots should be left to the Appendix.
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Distinguished Technologist, Solutions Architect & Engineer
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>

Re: Validate spark sql

2023-12-24 Thread Mich Talebzadeh
Yes, you can validate the syntax of your PySpark SQL queries without
connecting to an actual dataset or running the queries on a cluster.
PySpark provides a method for syntax validation without executing the
query. Something like the following:
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
Spark context Web UI available at http://rhes75:4040
Spark context available as 'sc' (master = local[*], app id =
local-1703410019374).
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("validate").getOrCreate()
23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only
runtime SQL configurations will take effect.
>>> sql = "SELECT * FROM <table_name> WHERE <column_name> = some value"
>>> try:
...   spark.sql(sql)
...   print("is working")
... except Exception as e:
...   print(f"Syntax error: {e}")
...
Syntax error:
[PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)

== SQL ==
SELECT * FROM <table_name> WHERE <column_name> = some value
--------------^^^

Here we only check for syntax errors, not the semantics of the query; we
are not validating table or column existence.

This method is useful when you want to catch obvious syntax errors before
submitting your PySpark job to a cluster, especially when you don't have
access to the actual data.

In summary

   - This method validates syntax but will not catch semantic errors.
   - If you need more comprehensive validation, consider using a testing
   framework and a small dataset (see the sketch below).
   - For complex queries, using a linter or code analysis tool can help
   identify potential issues.
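
As a sketch of the second bullet, you can also catch analysis errors
(missing tables or columns) by registering a tiny stand-in dataset; the
table name and schema here are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("semantic-check").getOrCreate()

# A tiny stand-in dataset so table/column resolution can run without real data.
spark.createDataFrame([(1, "a")], ["id", "name"]).createOrReplaceTempView("ABC")

try:
    # spark.sql() parses and analyses the query; explain() prints the plan.
    spark.sql("SELECT id, name FROM ABC WHERE id = 1").explain()
    print("query is valid against the test schema")
except Exception as e:
    print(f"error: {e}")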

HTH


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 24 Dec 2023 at 07:57, ram manickam  wrote:

> Hello,
> Is there a way to validate pyspark sql to validate only syntax errors?. I
> cannot connect do actual data set to perform this validation.  Any
> help would be appreciated.
>
>
> Thanks
> Ram
>


Re: ShuffleManager and Speculative Execution

2023-12-21 Thread Mich Talebzadeh
Interesting point.

As I understand, the key point is the ShuffleManager ensures that only one
map output file is processed by the reduce task, even when multiple
attempts succeed. So it is not a random selection process. At the reduce
stage, only one copy of the map output needs to be read by the reduce task.
As to which copy: if I am correct, Spark uses the copy from the attempt
that completes and registers its map output first. That first completed
attempt's output will be used, and the output from the other speculative
attempts will be ignored. This makes sense, as the reduce stage can proceed
with the earliest available data, minimizing the impact of speculative
execution on job completion time, which is another important factor.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 21 Dec 2023 at 17:51, Enrico Minack  wrote:

> Hi Spark devs,
>
> I have a question around ShuffleManager: With speculative execution, one
> map output file is being created multiple times (by multiple task
> attempts). If both attempts succeed, which is to be read by the reduce
> task in the next stage? Is any map output as good as any other?
>
> Thanks for clarification,
> Enrico
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Mich Talebzadeh
You are right. By default, CBO is not enabled. Whilst CBO has been
available for several releases, it is AQE that is enabled by default in
recent Spark releases.

spark.sql.cbo.strategy

As I understand it, the spark.sql.cbo.strategy configuration property
specifies the optimizer strategy used by Spark SQL to generate query
execution plans. There are two main optimizer strategies available:

   -

   CBO (Cost-Based Optimization): The default optimizer strategy, which
   analyzes the query plan and estimates the execution costs associated with
   each operation. It uses statistics to guide its decisions, selecting the
   plan with the lowest estimated cost.
   -

   CBO-Like (Cost-Based Optimization-Like): A simplified optimizer strategy
   that mimics some of the CBO's logic, but without the ability to estimate
   costs. This strategy is faster than CBO for simple queries, but may not
   produce the most efficient plan for complex queries.

The spark.sql.cbo.strategy property can be set to either CBO or CBO-Like.
The default value is AUTO, which means that Spark will automatically choose
the most appropriate strategy based on the complexity of the query and the
availability of statistics.


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas 
wrote:

>
> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh 
> wrote:
>
> By default, the CBO is enabled in Spark.
>
>
> Note that this is not correct. AQE is enabled
> <https://github.com/apache/spark/blob/8235f1d56bf232bb713fe24ff6f2ffdaf49d2fcc/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L664-L669>
>  by
> default, but CBO isn’t
> <https://github.com/apache/spark/blob/8235f1d56bf232bb713fe24ff6f2ffdaf49d2fcc/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2694-L2699>
> .
>


Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Mich Talebzadeh
Some of these concepts, like CBO and RBO, have been around outside of Spark
for years, but I concur that they have a place in Spark's docs.

Simply put, statistics  provide insights into the characteristics of data,
such as distribution, skewness, and cardinalities, which help the optimizer
make informed decisions about data partitioning, aggregation strategies,
and join order.

Not so differently, Spark utilizes statistics to:

   - Partition Data Effectively: Spark partitions data into smaller chunks
   to distribute and parallelize computations across worker nodes. Accurate
   statistics enable the optimizer to choose the most appropriate partitioning
   strategy for each data set, considering factors like data distribution and
   skewness.
   - Optimize Join Operations: Spark employs statistics to determine the
   most efficient join order, considering the join factors and their
   respective cardinalities. This helps reduce the amount of data shuffled
   during joins, improving performance and minimizing data transfer overhead.
   - Choose Optimal Aggregation Strategies: When performing aggregations,
   Spark uses statistics to determine the most efficient aggregation algorithm
   based on the data distribution and the desired aggregation functions. This
   ensures that aggregations are performed efficiently without compromising
   accuracy.


With regard to type of statistics:


   - Catalog Statistics: These are pre-computed statistics that are stored
   in the Spark SQL catalog and associated with table or dataset metadata.
   They are typically gathered using the ANALYZE TABLE statement or through
   data source-specific mechanisms.
   - Data Source Statistics: These statistics are computed by the data
   source itself, such as Parquet or Hive, and are associated with the
   internal format of the data. Spark can access and utilize these statistics
   when working with external data sources.
   - Runtime Statistics: These are statistics that are dynamically computed
   during query execution. Spark can gather runtime statistics for certain
   operations, such as aggregations or joins, to refine its optimization
   decisions based on the actual data encountered.

It is important to mention Cost-Based Optimization (CBO). CBO in Spark
analyzes the query plan and estimates the execution costs associated with
each operation. It uses statistics to guide its decisions, selecting the
plan with the lowest estimated cost. I do not know of any RDBMS that uses a
rule-based optimizer (RBO) anymore.

By default, the CBO is enabled in Spark. However, you can explicitly enable
or disable it using the following options:

   -

   spark.sql.cbo.enabled: Set to true to enable the CBO, or false to
   disable it.
   -

   spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default
   optimizer, or NONE to disable it completely.
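
For illustration, a minimal sketch of gathering catalog statistics and
enabling cost-based optimization; the table and column names are
hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cbo-sketch").getOrCreate()

# Collect table-level and column-level statistics into the catalog.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

# Enable cost-based optimization and join reordering, then inspect estimated costs.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
spark.sql("SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id").explain(mode="cost")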

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 11 Dec 2023 at 02:36, Nicholas Chammas 
wrote:

> I’ve done some reading and have a slightly better understanding of
> statistics now.
>
> Every implementation of LeafNode.computeStats
> <https://github.com/apache/spark/blob/7cea52c96f5be1bc565a033bfd77370ab5527a35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L210>
>  offers
> its own way to get statistics:
>
>- LocalRelation
>
> <https://github.com/apache/spark/blob/8ff6b7a04cbaef9c552789ad5550ceab760cb078/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala#L97>
>  estimates
>the size of the relation directly from the row count.
>- HiveTableRelation
>
> <https://github.com/apache/spark/blob/8e95929ac4238d02dca379837ccf2fbc1cd1926d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L923-L929>
>  pulls
>those statistics from the catalog o

Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-11 Thread Mich Talebzadeh
Thanks Zhou for your response to my points raised (private communication)

If we start with a base model and a cluster with a minimal footprint for the
tool, we can then establish the operational parameters needed. So +1 from me
too.

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 10 Nov 2023 at 05:02, Zhou Jiang  wrote:

> Hi Spark community,
>
> I'm reaching out to initiate a conversation about the possibility of
> developing a Java-based Kubernetes operator for Apache Spark. Following the
> operator pattern (
> https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark
> users may manage applications and related components seamlessly using
> native tools like kubectl. The primary goal is to simplify the Spark user
> experience on Kubernetes, minimizing the learning curve and operational
> complexities and therefore enable users to focus on the Spark application
> development.
>
> Although there are several open-source Spark on Kubernetes operators
> available, none of them are officially integrated into the Apache Spark
> project. As a result, these operators may lack active support and
> development for new features. Within this proposal, our aim is to introduce
> a Java-based Spark operator as an integral component of the Apache Spark
> project. This solution has been employed internally at Apple for multiple
> years, operating millions of executors in real production environments. The
> use of Java in this solution is intended to accommodate a wider user and
> contributor audience, especially those who are familiar with Scala.
>
> Ideally, this operator should have its dedicated repository, similar to
> Spark Connect Golang or Spark Docker, allowing it to maintain a loose
> connection with the Spark release cycle. This model is also followed by the
> Apache Flink Kubernetes operator.
>
> We believe that this project holds the potential to evolve into a thriving
> community project over the long run. A comparison can be drawn with the
> Flink Kubernetes Operator: Apple has open-sourced internal Flink Kubernetes
> operator, making it a part of the Apache Flink project (
> https://github.com/apache/flink-kubernetes-operator). This move has
> gained wide industry adoption and contributions from the community. In a
> mere year, the Flink operator has garnered more than 600 stars and has
> attracted contributions from over 80 contributors. This showcases the level
> of community interest and collaborative momentum that can be achieved in
> similar scenarios.
>
> More details can be found at SPIP doc : Spark Kubernetes Operator
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>
> Thanks,
> --
> *Zhou JIANG*
>
>


Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-10 Thread Mich Talebzadeh
Hi,

Looks like a good idea, but before committing myself I have a number of
design questions, having looked at the SPIP itself:


   1. Would the name "Standard add-on Kubernetes operator for Spark"
   describe it better?
   2. We are still struggling with improving Spark driver start-up time.
   What would be the footprint of this add-on on driver start-up time?
   3. In a commercial setting, will there be a static image for this besides
   the base image maintained in the container registry (ECR, GCR, etc.)? It
   takes time to upload these images. Will this be a static image
   (Dockerfile), or would the alternative be a Dockerfile created by the user
   through a set of scripts?


These are the things that come into my mind.

HTH


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 10 Nov 2023 at 14:19, Bjørn Jørgensen 
wrote:

> +1
>
> fre. 10. nov. 2023 kl. 08:39 skrev Nan Zhu :
>
>> just curious what happened on google’s spark operator?
>>
>> On Thu, Nov 9, 2023 at 19:12 Ilan Filonenko  wrote:
>>
>>> +1
>>>
>>> On Thu, Nov 9, 2023 at 7:43 PM Ryan Blue  wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Nov 9, 2023 at 4:23 PM Hussein Awala  wrote:
>>>>
>>>>> +1 for creating an official Kubernetes operator for Apache Spark
>>>>>
>>>>> On Fri, Nov 10, 2023 at 12:38 AM huaxin gao 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>
>>>>>> On Thu, Nov 9, 2023 at 3:14 PM DB Tsai  wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> To be completely transparent, I am employed in the same department
>>>>>>> as Zhou at Apple.
>>>>>>>
>>>>>>> I support this proposal, provided that we witness community adoption
>>>>>>> following the release of the Flink Kubernetes operator, streamlining 
>>>>>>> Flink
>>>>>>> deployment on Kubernetes.
>>>>>>>
>>>>>>> A well-maintained official Spark Kubernetes operator is essential
>>>>>>> for our Spark community as well.
>>>>>>>
>>>>>>> DB Tsai  |  https://www.dbtsai.com/
>>>>>>>  |  PGP 42E5B25A8F7A82C1
>>>>>>>
>>>>>>> On Nov 9, 2023, at 12:05 PM, Zhou Jiang 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Spark community,
>>>>>>> I'm reaching out to initiate a conversation about the possibility of
>>>>>>> developing a Java-based Kubernetes operator for Apache Spark. Following 
>>>>>>> the
>>>>>>> operator pattern (
>>>>>>> https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
>>>>>>> ),
>>>>>>> Spark users may manage applications and related components seamlessly 
>>>>>>> using
>>>>>>> native tools like kubectl. The primary goal is to simplify the Spark 
>>>>>>> user
>>>>>>> experience on Kubernetes, minimizing the learning curve and operational
>>>>>>> complexities and therefore enable users

Re: Spark 3.2.1 parquet read error

2023-10-30 Thread Mich Talebzadeh
Hi,

The error message when reading Parquet data in Spark 3.2.1 is due to a
schema mismatch between the Parquet file and the Spark table schema. The
Parquet file stores the ss_sold_time_sk column as INT32, while the Spark
schema expects BIGINT. This mismatch is what causes the error.

This is the likely cause.

Parquet column cannot be converted in file
obj_store_location/store_sales/ss_sold_date_sk=2451121/part-00440-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet.
Column: [ss_sold_time_sk], Expected: bigint, Found: INT32

Does this read work with Spark 3.0.2?
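
If re-creating the table with matching types is not an option, one workaround
is to read the files directly (bypassing the table definition) so the schema is
inferred from the Parquet footers, and then cast the column yourself. A minimal
PySpark sketch, assuming an active SparkSession named spark:

from pyspark.sql.functions import col

# read the files directly; the inferred schema matches the physical INT32
df = spark.read.parquet("obj_store_location/store_sales")

# cast up to the type the table definition expects before further processing
df = df.withColumn("ss_sold_time_sk", col("ss_sold_time_sk").cast("bigint"))
df.select("ss_sold_time_sk").show(5)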

HTH


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 30 Oct 2023 at 14:11, Suryansh Agnihotri 
wrote:

> Hello spark-dev
> I have loaded tpcds data in parquet format using spark *3.0.2* and while
> reading it from spark *3.2.1* , my query is failing with below error.
>
> Later I set spark.sql.parquet.enableVectorizedReader=false my but it
> resulted in a different error. I am also providing output of parquet-tools
> below.
>
> spark-sql> select * from store_sales limit 100;
>
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> obj_store_location/store_sales/ss_sold_date_sk=2451121/part-00440-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet.
>  Column: [ss_sold_time_sk], Expected: bigint, Found: INT32
> at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:570)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:195)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> Caused by: 
> org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1104)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:181)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:161)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:298)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1

Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-24 Thread Mich Talebzadeh
LOL,

Hindsight is a very good thing, and one often learns these lessons through
experience. Once told off because strict ordering was not maintained, the
lesson is never forgotten!

HTH


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 23 Sept 2023 at 13:29, Steve Loughran 
wrote:

>
> Now, if you are ruthless it'd make sense to randomise the order of results
> if someone left out the order by, to stop complacency.
>
> like that time sun changed the ordering that methods were returned in a
> Class.listMethods() call and everyone's junit test cases failed if they'd
> assumed that ordering was that of the source file -which it was until then,
> even though the language spec said "no guarantees".
>
> People code for what works, not what is documented in places they don't
> read. (this is also why anyone writing network code should really have a
> flaky network connection to keep themselves honest)
>
> On Sat, 23 Sept 2023 at 11:00, beliefer  wrote:
>
>> AFAIK, The order is free whether it's SQL without spcified ORDER BY
>> clause or  DataFrame without sort. The behavior is consistent between them.
>>
>>
>>
>> At 2023-09-18 23:47:40, "Nicholas Chammas" 
>> wrote:
>>
>> I’ve always considered DataFrames to be logically equivalent to SQL
>> tables or queries.
>>
>> In SQL, the result order of any query is implementation-dependent without
>> an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
>> table;` 10 times in a row and get 10 different orderings.
>>
>> I thought the same applied to DataFrames, but the docstring for the
>> recently added method DataFrame.offset
>> <https://github.com/apache/spark/pull/40873/files#diff-4ff57282598a3b9721b8d6f8c2fea23a62e4bc3c0f1aa5444527549d1daa38baR1293-R1301>
>>  implies
>> otherwise.
>>
>> This example will work fine in practice, of course. But if DataFrames are
>> technically unordered without an explicit ordering clause, then in theory a
>> future implementation change may result in “Bob" being the “first” row in
>> the DataFrame, rather than “Tom”. That would make the example incorrect.
>>
>> Is that not the case?
>>
>> Nick
>>
>>


Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Mich Talebzadeh
These are good points. In traditional RDBMSs, SQL query results without an
explicit *ORDER BY* clause may vary in order due to optimization,
especially when no clustered index is defined. In contrast, systems like
Hive and Spark SQL, which are based on distributed file storage, do not
rely on physical data order (co-location of data blocks). They deploy
techniques like columnar storage and predicate pushdown instead of
traditional indexing due to the distributed nature of their storage
systems.
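
In practice this means any code that relies on the "first" or "nth" row should
state its ordering explicitly. A minimal PySpark sketch, assuming an active
SparkSession named spark (the data is made up for illustration):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("Tom", 21), ("Bob", 19), ("Ann", 25)], ["name", "age"]
)

# without an explicit ordering, the row returned by head() is not guaranteed
maybe_any_row = df.head()

# with an explicit ordering, the result is deterministic
youngest = df.orderBy(F.col("age").asc()).head()
print(youngest["name"])   # Bob, regardless of partitioning or future optimizations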

HTH


On Mon, 18 Sept 2023 at 20:19, Sean Owen  wrote:

> I think it's the same, and always has been - yes you don't have a
> guaranteed ordering unless an operation produces a specific ordering. Could
> be the result of order by, yes; I believe you would be guaranteed that
> reading input files results in data in the order they appear in the file,
> etc. 1:1 operations like map() don't change ordering. But not the result of
> a shuffle, for example. So yeah anything like limit or head might give
> different results in the future (or simply on different cluster setups with
> different parallelism, etc). The existence of operations like offset
> doesn't contradict that. Maybe that's totally fine in some situations (ex:
> I just want to display some sample rows) but otherwise yeah you've always
> had to state your ordering for "first" or "nth" to have a guaranteed result.
>
> On Mon, Sep 18, 2023 at 10:48 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I’ve always considered DataFrames to be logically equivalent to SQL
>> tables or queries.
>>
>> In SQL, the result order of any query is implementation-dependent without
>> an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
>> table;` 10 times in a row and get 10 different orderings.
>>
>> I thought the same applied to DataFrames, but the docstring for the
>> recently added method DataFrame.offset
>> 
>>  implies
>> otherwise.
>>
>> This example will work fine in practice, of course. But if DataFrames are
>> technically unordered without an explicit ordering clause, then in theory a
>> future implementation change may result in “Bob" being the “first” row in
>> the DataFrame, rather than “Tom”. That would make the example incorrect.
>>
>> Is that not the case?
>>
>> Nick
>>
>>


Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Mich Talebzadeh
Hi Nicholas,

Your point

"In SQL, the result order of any query is implementation-dependent without
an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
table;` 10 times in a row and get 10 different orderings."

yes I concur my understanding is the same.

In SQL, the result order of any query is implementation-dependent without
an explicit ORDER BY clause. Basically this means that the database engine
is free to return the results in any order that it sees fit. This is
because SQL does not guarantee a specific order for results unless an ORDER
BY clause is used.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 18 Sept 2023 at 16:58, Reynold Xin 
wrote:

> It should be the same as SQL. Otherwise it takes away a lot of potential
> future optimization opportunities.
>
>
> On Mon, Sep 18 2023 at 8:47 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I’ve always considered DataFrames to be logically equivalent to SQL
>> tables or queries.
>>
>> In SQL, the result order of any query is implementation-dependent without
>> an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
>> table;` 10 times in a row and get 10 different orderings.
>>
>> I thought the same applied to DataFrames, but the docstring for the
>> recently added method DataFrame.offset
>> <https://github.com/apache/spark/pull/40873/files#diff-4ff57282598a3b9721b8d6f8c2fea23a62e4bc3c0f1aa5444527549d1daa38baR1293-R1301>
>>  implies
>> otherwise.
>>
>> This example will work fine in practice, of course. But if DataFrames are
>> technically unordered without an explicit ordering clause, then in theory a
>> future implementation change may result in “Bob" being the “first” row in
>> the DataFrame, rather than “Tom”. That would make the example incorrect.
>>
>> Is that not the case?
>>
>> Nick
>>
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-09 Thread Mich Talebzadeh
Apologies that should read ... release 3.5.0 (RC4) plus ..

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 9 Sept 2023 at 15:58, Mich Talebzadeh 
wrote:

> Hi,
>
> Can you please confirm that this cut is release 3.4.0 plus the resolved
> Jira  https://issues.apache.org/jira/browse/SPARK-44805 which was already
> fixed yesterday?
>
> Nothing else I believe?
>
> Thanks
>
> Mich
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 9 Sept 2023 at 15:42, Yuanjian Li  wrote:
>
>> Please vote on releasing the following candidate(RC5) as Apache Spark
>> version 3.5.0.
>>
>> The vote is open until 11:59pm Pacific time Sep 11th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.5.0-rc5 (commit
>> ce5ddad990373636e94071e7cef2f31021add07b):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc5
>>
>> The release files, including signatures, digests, etc. can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1449
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>> This release is using the release script of the tag v3.5.0-rc5.
>>
>>
>> FAQ
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>>
>> an existing Spark workload and running on this release candidate, then
>>
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>>
>> the current RC and see if anything important breaks, in the Java/Scala
>>
>> you can add the staging repository to your projects resolvers and test
>>
>> with the RC (make sure to clean up the artifact cache before/after so
>>
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>>
>> What should happen to JIRA tickets still targeting 3.5.0?
>>
>> ===
>>
>> The current list of open tickets targeted at 3.5.0 can be found at:
>>
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.5.0
>>
>> Committers should look at those and triage. Extremely important bug
>>
>> fixes, documentation, and API tweaks that impact compatibility should
>>
>> be worked on immediately. Everything else please retarget to an
>>
>> appropriate release.
>>
>> ==
>>
>> But my bug isn't fixed?
>>
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>>
>> release unless the bug in question is a regression from the previous
>>
>> release. That being said, if there is something which is a regression
>>
>> that has not been correctly targeted please ping me or a committer to
>>
>> help target the issue.
>>
>> Thanks,
>>
>> Yuanjian Li
>>
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-09 Thread Mich Talebzadeh
Hi,

Can you please confirm that this cut is release 3.4.0 plus the resolved
Jira  https://issues.apache.org/jira/browse/SPARK-44805 which was already
fixed yesterday?

Nothing else I believe?

Thanks

Mich


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 9 Sept 2023 at 15:42, Yuanjian Li  wrote:

> Please vote on releasing the following candidate(RC5) as Apache Spark
> version 3.5.0.
>
> The vote is open until 11:59pm Pacific time Sep 11th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.5.0-rc5 (commit
> ce5ddad990373636e94071e7cef2f31021add07b):
>
> https://github.com/apache/spark/tree/v3.5.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1449
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
> This release is using the release script of the tag v3.5.0-rc5.
>
>
> FAQ
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
> Thanks,
>
> Yuanjian Li
>


Re: [DISCUSS] SPIP: Python Stored Procedures

2023-09-06 Thread Mich Talebzadeh
Thanks Alison for your explanation.

   1. As a matter of interest, what does "sessionCatalog.resolveProcedure" do?
   Does it recompile the stored procedure (SP)?
   2. If the SP references an underlying table and the table schema is
   changed, then by definition the SP's compiled plan will be invalidated.
   3. When using the command sessionCatalog.createProcedure, we should add
   optional syntax for creating the SP with a recompile option, to allow a
   new execution plan to be generated that reflects the current state of the
   metadata.
   4. Since an SP is compiled once and used many times, we ought to provide
   an API to recompile existing SPs.


Regards,

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 6 Sept 2023 at 00:38, Allison Wang 
wrote:

> Hi Mich,
>
> Thank you for your comments! I've left some comments on the SPIP, but
> let's continue the discussion here.
>
> You've highlighted the potential advantages of Python stored procedures,
> and I'd like to emphasize two important aspects:
>
>1. *Versatility*: Integrating Python into SQL provides remarkable
>versatility to the SQL workflow. By leveraging Spark Connect, it's even
>possible to execute Spark queries within a Python stored procedure.
>2. *Reusability*: Stored procedures, once saved in the catalog (e.g.,
>HMS), can be reused across various users and sessions.
>
> This initiative will also pave the way for supporting other procedural
> languages in the future.
>
> Regarding the cons you mentioned, I'd like to shed some light on the
> potential implementation of Python stored procedures. The plan is to
> leverage the existing Python UDF implementation. I.e the Python stored
> procedural logic will be executed inside a Python worker. As @Sean Owen
> mentioned, many of the challenges are shared with the current way of
> executing Python logic in Spark, whether for UDFs/UDTFs or Python stored
> procedures. We should think more about them, esp regarding error handling.
>
> For storage options, regardless of the chosen storage solution, we need to
> expose these APIs for stored procedures to integrate with Spark:
>
>- sessionCatalog.createProcedure:  create a new stored procedure
>- sessionCatalog.dropProcedure: drop a stored procedure
>- sessionCatalog.resolveProcedure: resolve a stored procedure given
>the identifier
>
> Stored procedures are similar to functions, and we can leverage HMS
> function interface to support storing stored procedures (by serializing
> them into strings and placing them into the resource field of the
> CatalogFunction). We could also make these APIs compatible with other
> storage systems in the future, whether they are 3rd party or native storage
> solutions, but for the short term, HMS remains a decent option.
>
> I'd appreciate your thoughts on this, and I am more than willing to delve
> deeper or clarify any aspect :)
>
> Thanks,
> Allison
>
> On Sat, Sep 2, 2023 at 8:27 AM Mich Talebzadeh 
> wrote:
>
>>
>> I have noticed an worthy discussion in the SPIP comments regarding the
>> definition of "stored procedure" in the context of Spark, and I believe it
>> is an important point to address.
>>
>> To provide some historical context, Sybase
>> <https://www.referenceforbusiness.com/history2/49/Sybase-Inc.html>, a
>> relational database vendor (which later co-licensed their code to Microsoft
>> for SQL Server), introduced the concept of stored procedures while
>> positioning themselves as a client-server company. During this period, they
>> were in competition with Oracle, particularly in the realm of front-office
>> trading systems. The introduction of stored procedures, stored on the
>> server-side within the database, allowed Sybase to modularize frequently
>> used code. This move significantly reduced network overhead and latency.
>> Stored procedures were first introduced in the mid-1980s and proved to be a
>> profitable innovation. It is important to note that they had a robust
>> database to rely on during this process.
>>
>> Now, as we contemplate the implementation of stored procedures in Spark,
>> we must think strategically about where 

Re: Feature to restart Spark job from previous failure point

2023-09-05 Thread Mich Talebzadeh
Hi Dipayan,

You ought to maintain data source consistency by minimising upstream changes.
Spark is not a Swiss Army knife :)

Anyhow, we already do this in Spark Structured Streaming with the concept
of checkpointing. You can do so by implementing:


   - Checkpointing
   - Stateful processing in Spark.
   - Retry mechanism:


In Pyspark you can use

spark.sparkContext.setCheckpointDir("hdfs://")  # set the checkpoint directory first
rdd.checkpoint()                                # checkpointing an RDD

or

dataframe.write.mode("overwrite").option("path",
"hdfs://").saveAsTable("checkpointed_table")
# checkpointing a DF by persisting it as a table

Retry mechanism

something like below

def myfunction(input_file_path, checkpoint_directory, max_retries):
    retries = 0
    while retries < max_retries:
        try:
            # ... do the actual work here ...
            break  # success, no need to retry
        except Exception as e:
            print(f"Error: {str(e)}")
            retries += 1
            if retries < max_retries:
                print(f"Retrying... (Retry {retries}/{max_retries})")
            else:
                print("Max retries reached. Exiting.")
                break

Remember that checkpointing incurs I/O and is expensive! You can use cloud
buckets for checkpointing as well.
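
For the Structured Streaming case mentioned above, recovery after a failure
comes largely for free once a checkpoint location is set. A minimal sketch
using the built-in rate source; the sink, bucket and paths are illustrative
only:

stream_df = (spark.readStream
             .format("rate")              # built-in test source emitting rows per second
             .option("rowsPerSecond", 10)
             .load())

query = (stream_df.writeStream
         .format("parquet")
         .option("path", "gs://my-bucket/streaming/output")           # hypothetical bucket
         .option("checkpointLocation", "gs://my-bucket/checkpoints/demo")
         .outputMode("append")
         .start())

# restarting with the same checkpointLocation resumes from where the query left off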

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 5 Sept 2023 at 10:12, Dipayan Dev  wrote:

> Hi Team,
>
> One of the biggest pain points we're facing is when Spark reads upstream
> partition data and during Action, the upstream also gets refreshed and the
> application fails with 'File not exists' error. It could happen that the
> job has already spent a reasonable amount of time, and re-running the
> entire application is unwanted.
>
> I know the general solution to this is to handle how the upstream is
> managing the data, but is there a way to tackle this problem from the Spark
> applicable side? One approach I was thinking of is to at least save some
> state of operations done by Spark job till that point, and on a retry,
> resume the operation from that point?
>
>
>
> With Best Regards,
>
> Dipayan Dev
>


Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-09-03 Thread Mich Talebzadeh
On the subject of launching both the driver and the executors using lazy
executor IDs: this can introduce complexity, but it could potentially be a
viable strategy in certain scenarios. Basically, your mileage will vary.

Pros:

   1. Faster Startup: launching the driver and initial executors
   simultaneously can reduce startup time by not waiting for the driver to
   allocate executor IDs dynamically.
   2. Better workload distribution by running initial executors alongside
   the driver.
   3. Simplified Configuration: preallocation of resources.

Cons:

   1. Complexity.
   2. Resource Overhead: contention at cluster start-up time.
   3. If the driver dies, all resources will be wasted. The executors will
   be waiting and have to be terminated manually.
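
For reference, the behaviour being discussed sits on top of the standard
dynamic allocation settings. A sketch of how these are typically configured
for Spark on Kubernetes today; the executor bounds are arbitrary examples:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic_allocation_demo")
         .config("spark.dynamicAllocation.enabled", "true")
         # shuffle tracking is needed on k8s, where there is no external shuffle service
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .config("spark.dynamicAllocation.initialExecutors", "2")
         .config("spark.dynamicAllocation.minExecutors", "1")
         .config("spark.dynamicAllocation.maxExecutors", "10")
         .getOrCreate())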

HTH

On Thu, 24 Aug 2023 at 03:07, Holden Karau  wrote:

> One option could be to initially launch both drivers and initial executors
> (using the lazy executor ID allocation), but it would introduce a lot of
> complexity.
>
> On Wed, Aug 23, 2023 at 6:44 PM Qian Sun  wrote:
>
>> Hi Mich
>>
>> I agree with your opinion that the startup time of the Spark on
>> Kubernetes cluster needs to be improved.
>>
>> Regarding the fetching image directly, I have utilized ImageCache to
>> store the images on the node, eliminating the time required to pull images
>> from a remote repository, which does indeed lead to a reduction in
>> overall time, and the effect becomes more pronounced as the size of the
>> image increases.
>>
>> Additionally, I have observed that the driver pod takes a significant
>> amount of time from running to attempting to create executor pods, with an
>> estimated time expenditure of around 75%. We can also explore optimization
>> options in this area.
>>
>> On Thu, Aug 24, 2023 at 12:58 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> On this conversion, one of the issues I brought up was the driver
>>> start-up time. This is especially true in k8s. As spark on k8s is modeled
>>> on Spark on standalone schedler, Spark on k8s consist of a
>>> single-driver pod (as master on standalone”) and a  number of executors
>>> (“workers”). When executed on k8s, the driver and executors are
>>> executed on separate pods
>>> <https://spark.apache.org/docs/latest/running-on-kubernetes.html>. First
>>> the driver pod is launched, then the driver pod itself launches the
>>> executor pods. From my observation, in an auto scaling cluster, the driver
>>> pod may take up to 40 seconds followed by executor pods. This is a
>>> considerable time for customers and it is painfully slow. Can we actually
>>> move away from dependency on standalone mode and try to speed up k8s
>>> cluster formation.
>>>
>>> Another naive question, when the docker image is pulled from the
>>> container registry to the driver itself, this takes finite time. The docker
>>> image for executors could be different from that of the driver
>>> docker image. Since spark-submit presents this at the time of submission,
>>> can we save time by fetching the docker images straight away?
>>>
>>> Thanks
>>>
>>> Mich
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 8 Aug 2023 at 18:25, Mich Talebzadeh 
>>> wrote:
>>>
>>>> Splendid idea. 
>>>>
>>>> Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from 

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-09-02 Thread Mich Talebzadeh
I have noticed a worthy discussion in the SPIP comments regarding the
definition of "stored procedure" in the context of Spark, and I believe it
is an important point to address.

To provide some historical context, Sybase
<https://www.referenceforbusiness.com/history2/49/Sybase-Inc.html>, a
relational database vendor (which later co-licensed their code to Microsoft
for SQL Server), introduced the concept of stored procedures while
positioning themselves as a client-server company. During this period, they
were in competition with Oracle, particularly in the realm of front-office
trading systems. The introduction of stored procedures, stored on the
server-side within the database, allowed Sybase to modularize frequently
used code. This move significantly reduced network overhead and latency.
Stored procedures were first introduced in the mid-1980s and proved to be a
profitable innovation. It is important to note that they had a robust
database to rely on during this process.

Now, as we contemplate the implementation of stored procedures in Spark, we
must think strategically about where these procedures will be stored and
how they will be reused. Some colleagues have suggested using HMS (Derby)
by default, but it is worth noting that the embedded Derby metastore only
supports a single connection at a time. If we intend to leverage stored
procedures extensively, should we consider
establishing "a native" storage solution? This approach not only aligns
with good architectural practices but also has the potential for broader
applications beyond Spark. While empowering users to choose their preferred
database for this purpose might sound appealing, it may not be the most
realistic or practical approach. This discussion highlights the importance
of clarifying terminologies and establishing a solid foundation for this
feature.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 31 Aug 2023 at 18:19, Mich Talebzadeh 
wrote:

> I concur with the view point raised by @Sean Owen
>
> While this might introduce some challenges related to compatibility and
> environment issues, it is not fundamentally different from how the users
> currently import and use common code in Python. The main difference is that
> now this shared code would be stored as stored procedures in the catalog of
> user choice -> probably Hive Metastore
>
> HTH
>
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 31 Aug 2023 at 16:41, Sean Owen  wrote:
>
>> I think you're talking past Hyukjin here.
>>
>> I think the response is: none of that is managed by Pyspark now, and this
>> proposal does not change that. Your current interpreter and environment is
>> used to execute the stored procedure, which is just Python code. It's on
>> you to bring an environment that runs the code correctly. This is just the
>> same as how running any python code works now.
>>
>> I think you have exactly the same problems with UDFs now, and that's all
>> a real problem, just not something Spar

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Mich Talebzadeh
I concur with the view point raised by @Sean Owen

While this might introduce some challenges related to compatibility and
environment issues, it is not fundamentally different from how the users
currently import and use common code in Python. The main difference is that
now this shared code would be stored as stored procedures in the catalog of
user choice -> probably Hive Metastore

HTH



Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 31 Aug 2023 at 16:41, Sean Owen  wrote:

> I think you're talking past Hyukjin here.
>
> I think the response is: none of that is managed by Pyspark now, and this
> proposal does not change that. Your current interpreter and environment is
> used to execute the stored procedure, which is just Python code. It's on
> you to bring an environment that runs the code correctly. This is just the
> same as how running any python code works now.
>
> I think you have exactly the same problems with UDFs now, and that's all a
> real problem, just not something Spark has ever tried to solve for you.
> Think of this as exactly like: I have a bit of python code I import as a
> function and share across many python workloads. Just, now that chunk is
> stored as a 'stored procedure'.
>
> I agree this raises the same problem in new ways - now, you are storing
> and sharing a chunk of code across many workloads. There is more potential
> for compatibility and environment problems, as all of that is simply punted
> to the end workloads. But, it's not different from importing common code
> and the world doesn't fall apart.
>
> On Wed, Aug 30, 2023 at 11:16 PM Alexander Shorin 
> wrote:
>
>>
>> Which Python version will run that stored procedure?
>>>
>>> All Python versions supported in PySpark
>>>
>>
>> Where in stored procedure defines the exact python version which will run
>> the code? That was the question.
>>
>>
>>> How to manage external dependencies?
>>>
>>> Existing way we have
>>> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
>>> .
>>> In fact, this will use the external dependencies within your Python
>>> interpreter so you can use all existing conda or venvs.
>>>
>> Current proposal solves this issue nohow (the stored code doesn't provide
>> any manifest about its dependencies and what is required to run it). So
>> feels like it's better to stay with UDF since they are under control and
>> their behaviour is predictable. Did I miss something?
>>
>> How to test it via a common CI process?
>>>
>>> Existing way of PySpark unittests, see
>>> https://github.com/apache/spark/tree/master/python/pyspark/tests
>>>
>> Sorry, but this wouldn't work since stored procedure thing requires some
>> specific definition and this code will not be stored as regular python
>> code. Do you have any examples how to test stored python procedures as a
>> unit e.g. without spark?
>>
>> How to manage versions and do upgrades? Migrations?
>>>
>>> This is a new feature so no migration is needed. We will keep the
>>> compatibility according to the sember we follow.
>>>
>> Question was not about spark, but about stored procedures itself. Any
>> guidelines which will not copy flaws of other systems?
>>
>> Current Python UDF solution handles these problems in a good way since
>>> they delegate them to project level.
>>>
>>> Current UDF solution cannot handle stored procedures because UDF is on
>>> the worker side. This is Driver side.
>>>
>> How so? Currently it works and we never faced such issue. May be you
>> should have the same Python code also on the driver side? But such trivial
>> idea doesn't require new feature on Spark since you already have to ship
>> that code somehow.
>>
>> --
>> ,,,^..^,,,
>>
>


Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Mich Talebzadeh
These are my initial thoughts:

As usual, your mileage may vary. Depending on the use case, introducing
support for stored procedures (SPs) in Spark SQL with Python as the
procedural language has the following pros and cons:

*Pros*

   - Can potentially provide more flexibility and capabilities in the
   respective SQL workflows. We can seamlessly integrate Python code with SQL
   workflows, thus enabling ourselves to perform a wider range of tasks
   directly within Spark SQL.
   - SPs as usual will enable more modular and reusable coding. Users can
   build their own libraries of stored procedures and remember these are
   compiled once and used thereafter.
   - With SPs, one can potentially perform advanced analytics in Spark SQL
   through Python packages
   - Restricted access and enhanced security by hiding sensitive code in
   SPs, only accessible through SP
   - Build your own Catalog and enhance it

*Cons*

   - Performance implications due to the need to serialize and deserialize
   data between Spark and Python, especially for large datasets (illustrated
   in the sketch below)
   - Additional resource utilisation
   - Error handling will require more thought
   - Compatibility with different versions of Spark and Python libraries
   - Client-side and server-side Python compatibility
   - If the underlying table schema changes, the SP code will often be
   invalidated and have to be recompiled
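
To make the first con concrete: the SPIP discussion in this thread notes the
plan is to leverage the existing Python UDF implementation, and with today's
Python UDFs every value is shipped to a Python worker and back. A rough sketch
of the difference, assuming an active SparkSession named spark (names are made
up):

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

df = spark.range(1_000_000).withColumnRenamed("id", "amount")

# built-in expression: evaluated inside the JVM, no Python round trip
native = df.select((F.col("amount") * 2).alias("doubled"))

# equivalent Python UDF: each value is serialized to a Python worker and back
double_udf = F.udf(lambda x: x * 2, LongType())
with_udf = df.select(double_udf(F.col("amount")).alias("doubled"))

native.agg(F.sum("doubled")).show()
with_udf.agg(F.sum("doubled")).show()   # same result, but pays the ser/de cost noted above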

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







On Thu, 31 Aug 2023 at 09:45, Mich Talebzadeh 
wrote:

> Thanks Allison!
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 31 Aug 2023 at 01:26, Allison Wang 
> wrote:
>
>> Hi Mich,
>>
>> I've updated the permissions on the document. Please feel free to leave
>> comments.
>> Thanks,
>> Allison
>>
>> On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Great. Please allow edit access on SPIP or ability to comment.
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 30 Aug 2023 at 23:29, Allison Wang
>>>  wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to start a discussion on “Python Stored Procedures".
>>>>
>>>> This proposal aims to extend Spark SQL by introducing support for
>>>> stored procedures, starting with Python as the procedural language. This
>>>> will enable users to run complex logic using Python within their SQL
>>>> workflows and save these routines in catalogs like HMS for future use.
>>>>
>>>> *SPIP*:
>>>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>>>
>>>> Looking forward to your feedback!
>>>>
>>>> Thanks,
>>>> Allison
>>>>
>>>>


Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Mich Talebzadeh
Thanks Allison!

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 31 Aug 2023 at 01:26, Allison Wang 
wrote:

> Hi Mich,
>
> I've updated the permissions on the document. Please feel free to leave
> comments.
> Thanks,
> Allison
>
> On Wed, Aug 30, 2023 at 3:44 PM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Great. Please allow edit access on SPIP or ability to comment.
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 30 Aug 2023 at 23:29, Allison Wang
>>  wrote:
>>
>>> Hi all,
>>>
>>> I would like to start a discussion on “Python Stored Procedures".
>>>
>>> This proposal aims to extend Spark SQL by introducing support for stored
>>> procedures, starting with Python as the procedural language. This will
>>> enable users to run complex logic using Python within their SQL workflows
>>> and save these routines in catalogs like HMS for future use.
>>>
>>> *SPIP*:
>>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>>>
>>> Looking forward to your feedback!
>>>
>>> Thanks,
>>> Allison
>>>
>>>


Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-30 Thread Mich Talebzadeh
Hi,

Great. Please allow edit access on SPIP or ability to comment.

Thanks

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 30 Aug 2023 at 23:29, Allison Wang
 wrote:

> Hi all,
>
> I would like to start a discussion on “Python Stored Procedures".
>
> This proposal aims to extend Spark SQL by introducing support for stored
> procedures, starting with Python as the procedural language. This will
> enable users to run complex logic using Python within their SQL workflows
> and save these routines in catalogs like HMS for future use.
>
> *SPIP*:
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing
> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45023
>
> Looking forward to your feedback!
>
> Thanks,
> Allison
>
>


Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Sorry I missed this one

In the context of what has been changed, we ought to have an additional
argument, timestamp.

In short we can have

datachange(object_name, partition_name, colname, timestamp)

timestamp is the point in time you want to compare against for changes.

Example

SELECT * FROM <table_name> WHERE datachange('<table_name>', '2023-08-01 00:00:00') = 1


This query should return all rows from the table that have been changed
since August 1, 2023, 00:00:00.
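
Since no datachange() function exists in Spark today, here is a minimal
PySpark sketch of how the same timestamp-based check could be approximated,
assuming a hypothetical table my_table whose ingestion job maintains a
last_modified_ts column (both names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("datachange-timestamp-sketch").getOrCreate()

# Rows touched since the reference point (relies on the pipeline maintaining last_modified_ts)
changed = spark.table("my_table").filter(
    F.col("last_modified_ts") >= F.lit("2023-08-01 00:00:00").cast("timestamp"))

changed_count = changed.count()
total_count = spark.table("my_table").count()

# Rough "datachange" signal: fraction of rows touched since the timestamp
change_ratio = changed_count / max(total_count, 1)
print(f"rows changed since 2023-08-01: {changed_count} ({change_ratio:.1%} of the table)")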

Let me know your thoughts

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 30 Aug 2023 at 10:19, Mich Talebzadeh 
wrote:

> Another idea that came to my mind from the old days, is the concept of
> having a function called *datachange*
>
> This datachange function should measure the amount of change in the data
> distribution since ANALYZE STATISTICS last ran. Specifically, it should
> measure the number of inserts, updates and deletes that have occurred on
> the given object and helps us determine if running ANALYZE STATISTICS would
> benefit the query plan.
>
> something like
>
> select datachange(object_name, partition_name, colname)
>
> Where:
>
> object_name – is the object name. fully qualified objectname. The
> object_name cannot be null.
> partition_name – is the data partition name. This can be a null value.
> colname – is the column name for which the datachange is requested. This
> can be a null value (meaning all columns)
>
> This should be expressed as a percentage of the total number of rows in
> the table or partition (if the partition is specified). The percentage
> value can be greater than 100% because the number of changes to an object
> can be much greater than the number of rows in the table, particularly when
> the number of deletes and updates to a table is very high.
>
> So we can run this function to see if ANALYZE STATISTICS is required on a
> certain column.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 30 Aug 2023 at 00:49, Chetan  wrote:
>
>> Thanks for the detailed explanation.
>>
>>
>> Regards,
>> Chetan
>>
>>
>>
>> On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh 
>> wrote:
>>
>>> OK, let us take a deeper look here
>>>
>>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
>>>
>>> In the above, we are explicitly grouping columns c1 and c2 together for
>>> which we want to compute statistics. Additionally, we are also computing
>>> statistics for column c3 independently. This approach allows the CBO to
>>> treat columns c1 and c2 as a group and compute joint statistics for them,
>>> while computing separate statistics for column c3.
Re: [DISCUSS] Incremental statistics collection

2023-08-30 Thread Mich Talebzadeh
Another idea that came to my mind from the old days, is the concept of
having a function called *datachange*

This datachange function should measure the amount of change in the data
distribution since ANALYZE STATISTICS last ran. Specifically, it should
measure the number of inserts, updates and deletes that have occurred on
the given object and help us determine whether running ANALYZE STATISTICS
would benefit the query plan.

something like

select datachange(object_name, partition_name, colname)

Where:

object_name – the fully qualified object name; it cannot be null.
partition_name – the data partition name; this can be a null value.
colname – the column name for which the datachange is requested; this can
be a null value (meaning all columns).

This should be expressed as a percentage of the total number of rows in the
table or partition (if the partition is specified). The percentage value
can be greater than 100% because the number of changes to an object can be
much greater than the number of rows in the table, particularly when the
number of deletes and updates to a table is very high.

So we can run this function to see if ANALYZE STATISTICS is required on a
certain column.
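
As a rough approximation with what Spark exposes today, one could compare
the table's current row count against the row count recorded at the last
ANALYZE TABLE ... COMPUTE STATISTICS and use the gap as the trigger. A
minimal PySpark sketch, assuming an illustrative table name and an arbitrary
20 percent threshold (it only captures net row-count drift, not in-place
updates):

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datachange-delta-sketch").getOrCreate()
table = "my_table"  # illustrative name

# Row count recorded by the last ANALYZE TABLE ... COMPUTE STATISTICS, if any
stats_rows = (spark.sql(f"DESCRIBE EXTENDED {table}")
              .filter("col_name = 'Statistics'")
              .collect())
m = re.search(r"(\d+)\s+rows", stats_rows[0]["data_type"]) if stats_rows else None
analyzed_rows = int(m.group(1)) if m else 0

current_rows = spark.table(table).count()

# Percentage change relative to the analyzed snapshot; can exceed 100% as noted above
change_pct = abs(current_rows - analyzed_rows) / max(analyzed_rows, 1) * 100
if change_pct > 20:  # illustrative threshold
    spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR ALL COLUMNS")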

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 30 Aug 2023 at 00:49, Chetan  wrote:

> Thanks for the detailed explanation.
>
>
> Regards,
> Chetan
>
>
>
> On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh 
> wrote:
>
>> OK, let us take a deeper look here
>>
>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
>>
>> In the above, we are explicitly grouping columns c1 and c2 together for
>> which we want to compute statistics. Additionally, we are also computing
>> statistics for column c3 independently. This approach allows the CBO to
>> treat columns c1 and c2 as a group and compute joint statistics for them,
>> while computing separate statistics for column c3.
>>
>> If columns c1 and c2 are frequently used together in conditions, I
>> concur it makes sense to compute joint statistics for them by using the
>> above syntax. On the other hand, if each column has its own significance
>> and the relationship between them is not crucial, we can use
>>
>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3
>>
>> This syntax can be used to compute separate statistics for each column.
>>
>> So your mileage varies.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 29 Aug 2023 at 12:14, Chetan  wrote:
>>
>>> Hi,
>>>
>>> If we are taking this up, then would ask can we support multicolumn
>>> stats such as :
>>> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1,c2), c3
>>> This should help in estimating better for conditions involving c1 and c2
>>>
>>> Thanks.
>>>
>>> On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh 
>>> wrote:
>>>
>>>> short answer on top of my head
>>>>

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
OK, let us take a deeper look here

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3

In the above, we are explicitly grouping columns c1 and c2 together for which
we want to compute statistics. Additionally, we are also computing
statistics for column c3 independently. This approach allows the CBO to
treat columns c1 and c2 as a group and compute joint statistics for them,
while computing separate statistics for column c3.

If columns c1 and c2 are frequently used together in conditions, I
concur it makes sense to compute joint statistics for them by using the
above syntax. On the other hand, if each column has its own significance
and the relationship between them is not crucial, we can use

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3

This syntax can be used to compute separate statistics for each column.

So your mileage varies.
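
To my knowledge the grouped (c1, c2) form above would be new syntax; what
Spark supports today is per-column statistics, which can be computed and
then inspected to see exactly what the CBO will work with. A minimal PySpark
sketch with illustrative table and column names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-stats-sketch").getOrCreate()

# Per-column statistics as supported today
spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3")

# Inspect what was collected for one column: min, max, num_nulls, distinct_count,
# avg/max column length, and a histogram if spark.sql.statistics.histogram.enabled=true
spark.sql("DESCRIBE EXTENDED mytable c1").show(truncate=False)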

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 29 Aug 2023 at 12:14, Chetan  wrote:

> Hi,
>
> If we are taking this up, then would ask can we support multicolumn stats
> such as :
> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1,c2), c3
> This should help in estimating better for conditions involving c1 and c2
>
> Thanks.
>
> On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh 
> wrote:
>
>> short answer on top of my head
>>
>> My point was with regard to  Cost Based Optimizer (CBO) in traditional
>> databases. The concept of a rowkey in HBase is somewhat similar to that of
>> a primary key in RDBMS.
>> Now in databases with automatic deduplication features (i.e. ignore
>> duplication of rowkey), inserting 100 rows with the same rowkey actually
>> results in only one physical entry in the database due to deduplication.
>> Therefore, the new statistical value added should be 1, reflecting the
>> distinct physical entry. If the rowkey is already present in HBase, the
>> value would indeed be 0, indicating that no new physical entry was created.
>> We need to take into account the underlying deduplication mechanism of the
>> database in use to ensure that statistical values accurately represent the
>> unique physical data entries.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 29 Aug 2023 at 02:07, Jia Fan  wrote:
>>
>>> For those databases with automatic deduplication capabilities, such as
>>> hbase, we have inserted 100 rows with the same rowkey, but in fact there is
>>> only one in hbase. Is the new statistical value we added 100 or 1, or hbase
>>> already contains this rowkey, the value would be 0. How should we handle
>>> this situation?
>>>
>>> Mich Talebzadeh  于2023年8月29日周二 07:22写道:
>>>
>>>> I have never been fond of the notion that measuring inserts, updates,
>>>> and deletes (referred to as DML) is the sole criterion for signaling a
>>>> necessity to update statistics for Spark's CBO. Nevertheless, in the
>>>> absence of an alternative mechanism, it seems this is the only approach at
>>>> our disposal (can we use AI for it ). Personally, I would prefer some
>>>> form of indication regarding shifts in the distribution of values in the
>>>> histogram, overall density, and similar indicators. The decision to execute
>>>> "ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
>>>> column-level statistics, which is why I would tend to focus on monitoring
>>>> individual column-level statistics to detect any signals warranting a
>>>> statistics update.

Re: [DISCUSS] Incremental statistics collection

2023-08-29 Thread Mich Talebzadeh
Short answer off the top of my head:

My point was with regard to the Cost-Based Optimizer (CBO) in traditional
databases. The concept of a rowkey in HBase is somewhat similar to that of
a primary key in an RDBMS.
Now in databases with automatic deduplication features (i.e. ignore
duplication of rowkey), inserting 100 rows with the same rowkey actually
results in only one physical entry in the database due to deduplication.
Therefore, the new statistical value added should be 1, reflecting the
distinct physical entry. If the rowkey is already present in HBase, the
value would indeed be 0, indicating that no new physical entry was created.
We need to take into account the underlying deduplication mechanism of the
database in use to ensure that statistical values accurately represent the
unique physical data entries.
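
One way to make the statistics increment reflect physical rows in a
deduplicating sink is to count only the rowkeys that do not already exist
there. A minimal PySpark sketch, assuming an illustrative staging table for
the incoming batch and a readable mirror of the sink keyed by rowkey:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-aware-delta-sketch").getOrCreate()

# Illustrative inputs: the batch about to be written and the current sink contents
incoming = spark.table("staging_batch").select("rowkey").dropDuplicates(["rowkey"])
existing = spark.table("hbase_mirror").select("rowkey")

# Rowkeys that will actually create new physical entries after deduplication
new_keys = incoming.join(existing, on="rowkey", how="left_anti")
delta_rows = new_keys.count()  # 0 if every rowkey already exists, 1 per genuinely new rowkey

print(f"statistics increment after deduplication: {delta_rows}")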

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 29 Aug 2023 at 02:07, Jia Fan  wrote:

> For those databases with automatic deduplication capabilities, such as
> hbase, we have inserted 100 rows with the same rowkey, but in fact there is
> only one in hbase. Is the new statistical value we added 100 or 1, or hbase
> already contains this rowkey, the value would be 0. How should we handle
> this situation?
>
> Mich Talebzadeh  于2023年8月29日周二 07:22写道:
>
>> I have never been fond of the notion that measuring inserts, updates, and
>> deletes (referred to as DML) is the sole criterion for signaling a
>> necessity to update statistics for Spark's CBO. Nevertheless, in the
>> absence of an alternative mechanism, it seems this is the only approach at
>> our disposal (can we use AI for it ). Personally, I would prefer some
>> form of indication regarding shifts in the distribution of values in the
>> histogram, overall density, and similar indicators. The decision to execute
>> "ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
>> column-level statistics, which is why I would tend to focus on monitoring
>> individual column-level statistics to detect any signals warranting a
>> statistics update.
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> Impressive, yet in the realm of classic DBMSs, it could be seen as a
>>> case of old wine in a new bottle. The objective, I assume, is to employ
>>> dynamic sampling to enhance the optimizer's capacity to create effective
>>> execution plans without the burden of complete I/O and in less time.
>>>
>>> For instance:
>>> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>>>
>>> This approach could potentially aid in estimating deltas by utilizing
>>> sampling.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.

Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Mich Talebzadeh
I have never been fond of the notion that measuring inserts, updates, and
deletes (referred to as DML) is the sole criterion for signaling a
necessity to update statistics for Spark's CBO. Nevertheless, in the
absence of an alternative mechanism, it seems this is the only approach at
our disposal (can we use AI for it?). Personally, I would prefer some
form of indication regarding shifts in the distribution of values in the
histogram, overall density, and similar indicators. The decision to execute
"ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
column-level statistics, which is why I would tend to focus on monitoring
individual column-level statistics to detect any signals warranting a
statistics update.
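
As an illustration of that kind of column-level signal, here is a minimal
PySpark sketch that compares the distinct_count recorded by the last ANALYZE
with a fresh approximate count and re-analyzes the column when the drift
exceeds an arbitrary 30 percent (table, column and threshold are
illustrative; it assumes the info_name/info_value layout that DESCRIBE
EXTENDED <table> <column> produces in recent Spark releases):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ndv-drift-sketch").getOrCreate()
table, column = "mytable", "c1"  # illustrative names

# distinct_count recorded by the last ANALYZE ... FOR COLUMNS (reported as NULL if never analyzed)
desc = spark.sql(f"DESCRIBE EXTENDED {table} {column}").collect()
info = {row["info_name"]: row["info_value"] for row in desc}
raw = info.get("distinct_count")
old_ndv = int(raw) if raw and raw.isdigit() else 0

# Cheap current estimate
new_ndv = spark.table(table).agg(F.approx_count_distinct(column)).first()[0]

# Re-analyze the column if the NDV has drifted noticeably
if old_ndv == 0 or abs(new_ndv - old_ndv) / old_ndv > 0.3:
    spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column}")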
HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh 
wrote:

> Hi,
>
> Impressive, yet in the realm of classic DBMSs, it could be seen as a case
> of old wine in a new bottle. The objective, I assume, is to employ dynamic
> sampling to enhance the optimizer's capacity to create effective execution
> plans without the burden of complete I/O and in less time.
>
> For instance:
> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>
> This approach could potentially aid in estimating deltas by utilizing
> sampling.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH 
> wrote:
>
>> Hi all,
>>
>> I would like to propose the incremental collection of statistics in
>> spark. SPARK-44817 <https://issues.apache.org/jira/browse/SPARK-44817>
>> has been raised for the same.
>>
>> Currently, spark invalidates the stats after data changing commands which
>> would make CBO non-functional. To update these stats, user either needs to
>> run `ANALYZE TABLE` command or turn
>> `spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
>> their own drawbacks, executing `ANALYZE TABLE` command triggers full table
>> scan while the other one only updates table and partition stats and can be
>> costly in certain cases.
>>
>> The goal of this proposal is to collect stats incrementally while
>> executing data changing commands by utilizing the framework introduced in
>> SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>.
>>
>> SPIP Document has been attached along with JIRA:
>>
>> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>>
>> Hive also supports automatic collection of statistics to keep the stats
>> consistent.
>> I can find multiple spark JIRAs asking for the same:
>> https://issues.apache.org/jira/browse/SPARK-28872
>> https://issues.apache.org/jira/browse/SPARK-33825
>>
>> Regards,
>> Rakesh
>>
>


Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-28 Thread Mich Talebzadeh
Thanks Qian for your feedback.

I will have a look

Regards,

Mich


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 28 Aug 2023 at 02:32, Qian Sun  wrote:

> Hi Mich,
>
> ImageCache is an alibaba cloud ECI feature[1]. An image cache is a
> cluster-level resource that you can use to accelerate the creation of pods
> in different namespaces.
>
> If need to update the spark image, imagecache will be created in the
> cluster. And specify pod annotation to use image cache[2].
>
>
> ref:
> 1.
> https://www.alibabacloud.com/help/en/elastic-container-instance/latest/overview-of-the-image-cache-feature?spm=a2c63.p38356.0.0.19977f3e9Xpq4E#topic-2131957
> 2.
> https://www.alibabacloud.com/help/en/ack/serverless-kubernetes/user-guide/use-image-caches-to-accelerate-the-creation-of-pods#section-3e8-8n8-hdh
>
> On Fri, Aug 25, 2023 at 10:08 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Qian,
>>
>> How in practice have you implemented image caching for the driver and
>> executor pods respectively?
>>
>> Thanks
>>
>> On Thu, 24 Aug 2023 at 02:44, Qian Sun  wrote:
>>
>>> Hi Mich
>>>
>>> I agree with your opinion that the startup time of the Spark on
>>> Kubernetes cluster needs to be improved.
>>>
>>> Regarding the fetching image directly, I have utilized ImageCache to
>>> store the images on the node, eliminating the time required to pull images
>>> from a remote repository, which does indeed lead to a reduction in
>>> overall time, and the effect becomes more pronounced as the size of the
>>> image increases.
>>>
>>> Additionally, I have observed that the driver pod takes a significant
>>> amount of time from running to attempting to create executor pods, with an
>>> estimated time expenditure of around 75%. We can also explore optimization
>>> options in this area.
>>>
>>> On Thu, Aug 24, 2023 at 12:58 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> On this conversion, one of the issues I brought up was the driver
>>>> start-up time. This is especially true in k8s. As spark on k8s is modeled
>>>> on Spark on standalone schedler, Spark on k8s consist of a
>>>> single-driver pod (as master on standalone”) and a  number of executors
>>>> (“workers”). When executed on k8s, the driver and executors are
>>>> executed on separate pods
>>>> <https://spark.apache.org/docs/latest/running-on-kubernetes.html>. First
>>>> the driver pod is launched, then the driver pod itself launches the
>>>> executor pods. From my observation, in an auto scaling cluster, the driver
>>>> pod may take up to 40 seconds followed by executor pods. This is a
>>>> considerable time for customers and it is painfully slow. Can we actually
>>>> move away from dependency on standalone mode and try to speed up k8s
>>>> cluster formation.
>>>>
>>>> Another naive question, when the docker image is pulled from the
>>>> container registry to the driver itself, this takes finite time. The docker
>>>> image for executors could be different from that of the driver
>>>> docker image. Since spark-submit presents this at the time of submission,
>>>> can we save time by fetching the docker images straight away?
>>>>
>>>> Thanks
>>>>
>>>> Mich
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>

Re: [DISCUSS] Incremental statistics collection

2023-08-26 Thread Mich Talebzadeh
Hi,

Impressive, yet in the realm of classic DBMSs, it could be seen as a case
of old wine in a new bottle. The objective, I assume, is to employ dynamic
sampling to enhance the optimizer's capacity to create effective execution
plans without the burden of complete I/O and in less time.

For instance:
ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent

This approach could potentially aid in estimating deltas by utilizing
sampling.
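
As far as I know Spark's ANALYZE TABLE has no sampling clause today, but a
comparable cheap estimate can be taken with TABLESAMPLE and approximate
aggregates. A minimal PySpark sketch with illustrative table and column
names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sampled-stats-sketch").getOrCreate()

# Approximate statistics from a 5 percent sample
sampled = spark.sql("""
    SELECT count(*)                  AS sampled_rows,
           approx_count_distinct(c1) AS approx_ndv_c1,
           min(c1)                   AS min_c1,
           max(c1)                   AS max_c1
    FROM mytable TABLESAMPLE (5 PERCENT)
""")
sampled.show()

# Scale the sampled row count back up for a rough table-level estimate
row_estimate = sampled.first()["sampled_rows"] * 20
print(f"estimated row count: {row_estimate}")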

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH  wrote:

> Hi all,
>
> I would like to propose the incremental collection of statistics in spark.
> SPARK-44817 <https://issues.apache.org/jira/browse/SPARK-44817> has been
> raised for the same.
>
> Currently, spark invalidates the stats after data changing commands which
> would make CBO non-functional. To update these stats, user either needs to
> run `ANALYZE TABLE` command or turn
> `spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
> their own drawbacks, executing `ANALYZE TABLE` command triggers full table
> scan while the other one only updates table and partition stats and can be
> costly in certain cases.
>
> The goal of this proposal is to collect stats incrementally while
> executing data changing commands by utilizing the framework introduced in
> SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>.
>
> SPIP Document has been attached along with JIRA:
>
> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>
> Hive also supports automatic collection of statistics to keep the stats
> consistent.
> I can find multiple spark JIRAs asking for the same:
> https://issues.apache.org/jira/browse/SPARK-28872
> https://issues.apache.org/jira/browse/SPARK-33825
>
> Regards,
> Rakesh
>


Two new tickets for Spark on K8s

2023-08-26 Thread Mich Talebzadeh
Hi,

@holden Karau recently created two Jiras that deal with two items of
interest namely:


   1. Improve Spark Driver Launch Time SPARK-44950
   <https://issues.apache.org/jira/browse/SPARK-44950>
   2. Improve Spark Dynamic Allocation SPARK-44951
   <https://issues.apache.org/jira/browse/SPARK-44951>

These are both very much in demand (at least IMO)

These topics have been discussed a few times, most recently in the spark-dev
thread *Improving Dynamic Allocation Logic for Spark 4+* on 7th August.

Both tickets are still pretty much in skeleton form. Add your vote and
comment if you are interested.

Thanks,


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Clarification on ExecutorRoll Plugin & Ignore Decommission Fetch Failure

2023-08-25 Thread Mich Talebzadeh
Hi,

The crux of the matter here, as I understand it, is "how should I be using
Executor Rolling without triggering stage failures?"

The objective of executor rolling in k8s is to replace decommissioning
executors with new ones while minimizing the impact on running tasks and
stages.

As mentioned

spark.plugins: "org.apache.spark.scheduler.cluster.k8s.ExecutorRollPlugin"
spark.kubernetes.executor.rollInterval: "1800s"
spark.kubernetes.executor.rollPolicy: "OUTLIER_NO_FALLBACK"
spark.kubernetes.executor.minTasksPerExecutorBeforeRolling: "100"

You will need to ensure that the decommissioning of executors is done
gracefully. As per classic Spark, data and tasks being handled by a
decommissioned executor should be properly redistributed to active
executors before the decommissioned executor is removed, otherwise you are
going to have issues. You also need to keep an eye on fetch failures during
rolling. These can happen if tasks attempt to fetch data from decommissioned
executors before the data is redistributed. A possible remedy is to set
"spark.stage.ignoreDecommissionFetchFailure" to "true" (as you have
correctly pointed out) to tell Spark to ignore fetch failures from
decommissioned executors and retry the tasks on the remaining active
executors as per the norm. This will incur additional computation, as
expected, but will ensure data integrity.

In general, other parameter settings such as
"spark.kubernetes.executor.minTasksPerExecutorBeforeRolling" need to be
tuned for your workload; it is practically impossible to guess optimum
values. This parameter controls the minimum number of tasks that should be
completed before an executor is rolled.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 25 Aug 2023 at 17:48, Arun Ravi  wrote:

> Hi Team,
> I am running Apache Spark  3.4.1 Application on K8s with the below
> configuration related to executor rolling and Ignore Decommission Fetch
> Failure.
>
> spark.plugins: "org.apache.spark.scheduler.cluster.k8s.ExecutorRollPlugin"
> spark.kubernetes.executor.rollInterval: "1800s"
> spark.kubernetes.executor.rollPolicy: "OUTLIER_NO_FALLBACK"
> spark.kubernetes.executor.minTasksPerExecutorBeforeRolling: "100"
>
> spark.stage.ignoreDecommissionFetchFailure: "true"
> spark.scheduler.maxRetainedRemovedDecommissionExecutors: "20"
>
> spark.decommission.enabled: "true"
> spark.storage.decommission.enabled: "true"
> spark.storage.decommission.fallbackStorage.path: "some-s3-path"
> spark.storage.decommission.shuffleBlocks.maxThreads: "16"
>
> When an executor is decommissioned in the middle of the stage, I notice
> that there are shuffle fetch failures in tasks and the above ignore
> decommission configurations are not respected. The stage will go into
> retry. The decommissioned executor logs clearly show the decommission was
> fully graceful and blocks were replicated to other active
> executors/fallback.
>
> May I know how I should be using Executor Rolling, without triggering
> stage failures? I am using executor rolling to avoid executors being
> removed by K8s due to memory pressure or oom issues as my spark job is
> heavy on shuffling and has a lot of window functions. Any help will be
> super useful.
>
>
>
> Arun Ravi M V
> B.Tech (Batch: 2010-2014)
>
> Computer Science and Engineering
>
> Govt. Model Engineering College
> Cochin University Of Science And Technology
> Kochi
>


Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-25 Thread Mich Talebzadeh
Hi Qian,

How in practice have you implemented image caching for the driver and
executor pods respectively?

Thanks

On Thu, 24 Aug 2023 at 02:44, Qian Sun  wrote:

> Hi Mich
>
> I agree with your opinion that the startup time of the Spark on Kubernetes
> cluster needs to be improved.
>
> Regarding the fetching image directly, I have utilized ImageCache to store
> the images on the node, eliminating the time required to pull images from a
> remote repository, which does indeed lead to a reduction in overall time,
> and the effect becomes more pronounced as the size of the image increases.
>
>
> Additionally, I have observed that the driver pod takes a significant
> amount of time from running to attempting to create executor pods, with an
> estimated time expenditure of around 75%. We can also explore optimization
> options in this area.
>
> On Thu, Aug 24, 2023 at 12:58 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi all,
>>
>> On this conversion, one of the issues I brought up was the driver
>> start-up time. This is especially true in k8s. As spark on k8s is modeled
>> on Spark on standalone schedler, Spark on k8s consist of a single-driver
>> pod (as master on standalone”) and a  number of executors (“workers”). When 
>> executed
>> on k8s, the driver and executors are executed on separate pods
>> <https://spark.apache.org/docs/latest/running-on-kubernetes.html>. First
>> the driver pod is launched, then the driver pod itself launches the
>> executor pods. From my observation, in an auto scaling cluster, the driver
>> pod may take up to 40 seconds followed by executor pods. This is a
>> considerable time for customers and it is painfully slow. Can we actually
>> move away from dependency on standalone mode and try to speed up k8s
>> cluster formation.
>>
>> Another naive question, when the docker image is pulled from the
>> container registry to the driver itself, this takes finite time. The docker
>> image for executors could be different from that of the driver
>> docker image. Since spark-submit presents this at the time of submission,
>> can we save time by fetching the docker images straight away?
>>
>> Thanks
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 8 Aug 2023 at 18:25, Mich Talebzadeh 
>> wrote:
>>
>>> Splendid idea. 
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 8 Aug 2023 at 18:10, Holden Karau  wrote:
>>>
>>>> The driver it’s self is probably another topic, perhaps I’ll make a
>>>> “faster spark star time” JIRA and a DA JIRA and we can explore both.
>>>>
>>>> On Tue, Aug 8, 2023 at 10:07 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> From my own perspective faster execution time especially with Spark on
>>>>> tin boxes (Dataproc & EC2) and Spark on k8s is something that customers
>>>>> often bring up.
>>>>>
>>>>> Poor time to onboard with autoscaling seems to be particularly singled
>>>>> out for heavy ETL jobs that use Spark. I am disappointed to see the poor
>>>>> performance of Spark on k8s autopilot with timelines starting the driver
>>>>> itself and moving from Pending to Running phase (Spark 4.3.1 with Java 11)

Fwd:  Wednesday: Join 6 Members at "Ofir Press | Complementing Scale: Novel Guidance Methods for Improving LMs"

2023-08-24 Thread Mich Talebzadeh
They recently combined the Apache Spark and AI meetups in London.

An online session worth attending for some?

HTH

Mich

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




-- Forwarded message -
From: Apache Spark+AI London 
Date: Thu, 24 Aug 2023 at 20:01
Subject:  Wednesday: Join 6 Members at "Ofir Press | Complementing Scale:
Novel Guidance Methods for Improving LMs"
To: 


Apache Spark+AI London invites you to keep connecting

Wednesday
Ofir Press | Complementing Scale: Novel Guidance Methods for Improving LMs

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-23 Thread Mich Talebzadeh
Hi all,

On this conversation, one of the issues I brought up was the driver start-up
time. This is especially true in k8s. As Spark on k8s is modeled on the
Spark standalone scheduler, Spark on k8s consists of a single driver pod
(the “master” in standalone terms) and a number of executor pods (the
“workers”). When executed on k8s, the driver and executors run on separate
pods
<https://spark.apache.org/docs/latest/running-on-kubernetes.html>. First
the driver pod is launched, then the driver pod itself launches the
executor pods. From my observation, in an auto-scaling cluster, the driver
pod may take up to 40 seconds to start, followed by the executor pods. This
is a considerable time for customers and it is painfully slow. Can we
actually move away from the dependency on standalone mode and try to speed
up k8s cluster formation?

Another naive question: when the docker image is pulled from the container
registry to the driver itself, this takes a finite amount of time. The
docker image for the executors could be different from the driver's docker
image. Since spark-submit presents this information at the time of
submission, can we save time by fetching the docker images straight away?

Thanks

Mich


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 8 Aug 2023 at 18:25, Mich Talebzadeh 
wrote:

> Splendid idea. 
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 8 Aug 2023 at 18:10, Holden Karau  wrote:
>
>> The driver it’s self is probably another topic, perhaps I’ll make a
>> “faster spark star time” JIRA and a DA JIRA and we can explore both.
>>
>> On Tue, Aug 8, 2023 at 10:07 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> From my own perspective faster execution time especially with Spark on
>>> tin boxes (Dataproc & EC2) and Spark on k8s is something that customers
>>> often bring up.
>>>
>>> Poor time to onboard with autoscaling seems to be particularly singled
>>> out for heavy ETL jobs that use Spark. I am disappointed to see the poor
>>> performance of Spark on k8s autopilot with timelines starting the driver
>>> itself and moving from Pending to Running phase (Spark 4.3.1 with Java 11)
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 8 Aug 2023 at 15:49, kalyan  wrote:
>>>
>>>> +1 to enhancements in DEA. Long time due!
>>>>
>>>> There were a few things that I was thinking along the same lines for
>>>> some time now(few overlap with @holden 's points)
>>>> 1. How to reduce wastage on the RM side? Sometimes the driver asks for
>>>> some units of resources. But when RM provisions them, the driver cancels
>>>> it.
>>>> 2. How to make the resource available when it is needed.
>>>> 3. Cost Vs AppRunTime: A good DEA algo should allow the developer to
>>>> choose between cost and runtime. Sometimes developers might be ok to pay
>>>> higher costs for faster execution.