Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
This indeed looks like a bug. I will take some time to look into it.
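
Until that is resolved, a possible workaround sketch (not from the thread, and untested): keep both the watermark and the windowed aggregation in the DataFrame API, and use spark.sql() only for the final stateless projection over the aggregated view. Names such as agg_view are placeholders, and session and schema are assumed from the snippets quoted below.

```
from pyspark.sql.functions import col, from_json, window, sum as sum_

# Watermark is applied before the aggregation, on the raw event-time column.
parsed = (
    session.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "payment_msg")
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
    .select("parsed_value.*")
    .withWatermark("createTime", "10 seconds")
)

# The stateful part (windowed aggregation) stays in the DataFrame API.
agg = (
    parsed
    .groupBy(col("provinceId"), window(col("createTime"), "1 hour", "30 minutes"))
    .agg(sum_(col("payAmount")).alias("totalPayAmount"))
)
agg.createOrReplaceTempView("agg_view")

# spark.sql() only does a stateless projection. No ORDER BY here: sorting on a
# streaming DataFrame is not supported in append output mode.
query = (
    session.sql("""
        SELECT to_json(named_struct(
                 'start', window.start, 'end', window.end,
                 'provinceId', provinceId, 'totalPayAmount', totalPayAmount)) AS value
        FROM agg_view
    """)
    .writeStream
    .format("kafka")
    .option("checkpointLocation", "checkpoint")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "sink")
    .start()
)
```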

Mich Talebzadeh wrote on Wed, 3 Apr 2024 at 01:55:

>
> hm. you are getting below
>
> AnalysisException: Append output mode not supported when there are
> streaming aggregations on streaming DataFrames/DataSets without watermark;
>
> The problem seems to be that you are using the append output mode when
> writing the streaming query results to Kafka. This mode is designed for
> scenarios where you want to append new data to an existing dataset at the
> sink (in this case, the "sink" topic in Kafka). However, your query
> involves a streaming aggregation: group by provinceId, window('createTime',
> '1 hour', '30 minutes'). The problem is that Spark Structured Streaming
> requires a watermark to ensure exactly-once processing when using
> aggregations with append mode. Your code already defines a watermark on the
> "createTime" column with a delay of 10 seconds (withWatermark("createTime",
> "10 seconds")). However, the error message indicates it is missing on the
> start column. Try adding a watermark to the "start" column: modify your
> code as below to include a watermark on the "start" column generated by
> the window function:
>
> from pyspark.sql.functions import col, expr, from_json, explode, window, sum
>
> streaming_df = session.readStream \
>   .format("kafka") \
>   .option("kafka.bootstrap.servers", "localhost:9092") \
>   .option("subscribe", "payment_msg") \
>   .option("startingOffsets", "earliest") \
>   .load() \
>   .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value")) \
>   .select("parsed_value.*") \
>   .withWatermark("createTime", "10 seconds")  # Existing watermark on
> createTime
>
> *# Modified section with watermark on 'start' column*
> streaming_df = streaming_df.groupBy(
>   col("provinceId"),
>   window(col("createTime"), "1 hour", "30 minutes")
> ).agg(
>   sum(col("payAmount")).alias("totalPayAmount")
> ).withWatermark(expr("start"), "10 seconds")  # Watermark on
> window-generated 'start'
>
> # Rest of the code remains the same
> streaming_df.createOrReplaceTempView("streaming_df")
>
> spark.sql("""
> SELECT
>   window.start, window.end, provinceId, totalPayAmount
> FROM streaming_df
> ORDER BY window.start
> """) \
> .writeStream \
> .format("kafka") \
> .option("checkpointLocation", "checkpoint") \
> .option("kafka.bootstrap.servers", "localhost:9092") \
> .option("topic", "sink") \
> .start()
>
> Try and see how it goes
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Solutions Architect | Data Engineer  | Generative AI
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
>
> On Tue, 2 Apr 2024 at 22:43, Chloe He 
> wrote:
>
>> Hi Mich,
>>
>> Thank you so much for your response. I really appreciate your help!
>>
>> You mentioned "defining the watermark using the withWatermark function on
>> the streaming_df before creating the temporary view” - I believe this is
>> what I’m doing and it’s not working for me. Here is the exact code snippet
>> that I’m running:
>>
>> ```
>> >>> streaming_df = session.readStream\
>> .format("kafka")\
>> .option("kafka.bootstrap.servers", "localhost:9092")\
>> .option("subscribe", "payment_msg")\
>> .option("startingOffsets","earliest")\
>> .load()\
>> .select(from_json(col("value").cast("string"),
>> schema).alias("parsed_value"))\
>> .select("parsed_value.*")\
>> .withWatermark("createTime", "10 seconds")
>>
>> >>> streaming_df.createOrReplaceTempView("streaming_df")
>>
>> >>> spark.sql("""
>> SELECT
>> window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
>> FROM streaming_df
>> GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
>> ORDER BY window.start
>> """)\
>>   .withWatermark("start", "10 seconds")\
>>   .writeStream\
>>   .format("kafka") \
>>   .option("checkpointLocation", "checkpoint") \
>>   .option("kafka.bootstrap.servers", "localhost:9092") \
>>   .option("topic", "sink") \
>>   .start()
>>
>> AnalysisException: Append output mode not 

Re: [External] Re: Issue of spark with antlr version

2024-04-06 Thread Bjørn Jørgensen
[[VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)](
https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb)

Apache Spark 4.0.0 Release Plan
===

1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch.

2. Creating `branch-4.0` on April 1st, 2024.

3. Apache Spark 4.0.0 RC1 on May 1st, 2024.

4. Apache Spark 4.0.0 Release in June, 2024.

On Tue, 2 Apr 2024 at 12:06, Chawla, Parul wrote:

> ++ Ashima
>
> --
> *From:* Chawla, Parul 
> *Sent:* Tuesday, April 2, 2024 9:56 AM
> *To:* Bjørn Jørgensen ; user@spark.apache.org <
> user@spark.apache.org>
> *Cc:* Sahni, Ashima ;
> user@spark.apache.org ; Misra Parashar, Jyoti <
> jyoti.misra.paras...@accenture.com>
> *Subject:* Re: [External] Re: Issue of spark with antlr version
>
> Hi Team,
> Any update on the below query: when will Spark 4.x be released to Maven? On
> the Maven site I can only see Spark Core 3.5.1.
>
> Regards,
> Parul
>
> --
> *From:* Chawla, Parul 
> *Sent:* Monday, April 1, 2024 4:20 PM
> *To:* Bjørn Jørgensen 
> *Cc:* Sahni, Ashima ;
> user@spark.apache.org ; Misra Parashar, Jyoti <
> jyoti.misra.paras...@accenture.com>; Mekala, Rajesh <
> r.mek...@accenture.com>; Grandhi, Venkatesh <
> venkatesh.a.gran...@accenture.com>; George, Rejish <
> rejish.geo...@accenture.com>; Tayal, Aayushi 
> *Subject:* Re: [External] Re: Issue of spark with antlr version
>
> Hi Team,
>
> Can you let us know when Spark 4.x will be released to Maven.
>
> regards,
> Parul
>
> Get Outlook for iOS 
> --
> *From:* Bjørn Jørgensen 
> *Sent:* Wednesday, February 28, 2024 5:06:54 PM
> *To:* Chawla, Parul 
> *Cc:* Sahni, Ashima ;
> user@spark.apache.org ; Misra Parashar, Jyoti <
> jyoti.misra.paras...@accenture.com>; Mekala, Rajesh <
> r.mek...@accenture.com>; Grandhi, Venkatesh <
> venkatesh.a.gran...@accenture.com>; George, Rejish <
> rejish.geo...@accenture.com>; Tayal, Aayushi 
> *Subject:* Re: [External] Re: Issue of spark with antlr version
>
> [image: image.png]
>
> On Wed, 28 Feb 2024 at 11:28, Chawla, Parul <
> parul.cha...@accenture.com> wrote:
>
>
> Hi,
> Can we get the Spark version in which this is resolved?
> --
> *From:* Bjørn Jørgensen 
> *Sent:* Tuesday, February 27, 2024 7:05:36 PM
> *To:* Sahni, Ashima 
> *Cc:* Chawla, Parul ; user@spark.apache.org <
> user@spark.apache.org>; Misra Parashar, Jyoti <
> jyoti.misra.paras...@accenture.com>; Mekala, Rajesh <
> r.mek...@accenture.com>; Grandhi, Venkatesh <
> venkatesh.a.gran...@accenture.com>; George, Rejish <
> rejish.geo...@accenture.com>; Tayal, Aayushi 
> *Subject:* [External] Re: Issue of spark with antlr version
>
> *CAUTION:* External email. Be cautious with links and attachments.
> [SPARK-44366][BUILD] Upgrade antlr4 to 4.13.1
> 
>
>
> On Tue, 27 Feb 2024 at 13:25, Sahni, Ashima
> wrote:
>
> Hi Team,
>
>
>
> Can you please let us know the update on below.
>
>
>
> Thanks,
>
> Ashima
>
>
>
> *From:* Chawla, Parul 
> *Sent:* Sunday, February 25, 2024 11:57 PM
> *To:* user@spark.apache.org
> *Cc:* Sahni, Ashima ; Misra Parashar, Jyoti <
> jyoti.misra.paras...@accenture.com>
> *Subject:* Issue of spark with antlr version
>
>
>
> Hi Spark Team,
>
>
>
>
>
> Our application is currently using Spring Framework 5.3.31. To upgrade it
> to 6.x, as per application dependencies we must upgrade the Spark and
> Hibernate jars as well.
>
> With Hibernate compatible upgrade, the dependent Antlr4 jar version has
> been upgraded to 4.10.1 but there’s no Spark version available with the
> upgraded Antlr4 jar.
>
> Could you please let us know when an updated Spark version with the
> upgraded Antlr4 dependency will be available?
>
>
>
>
>
> Regards,
>
> Parul
>

Unsubscribe

2024-04-06 Thread rau-jannik
Unsubscribe

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
I have seen some older references for shuffle service for k8s,
although it is not clear they are talking about a generic shuffle
service for k8s.

Anyhow, with the advent of GenAI and the need to handle larger volumes
of data, I was wondering if there has been any more work on this
matter. Specifically, larger and scalable file systems like HDFS, GCS,
S3 etc. offer significantly larger storage capacity than local disks on
individual worker nodes in a k8s cluster, thus allowing much larger
datasets to be handled more efficiently. The degree of parallelism and
fault tolerance offered by these file systems also comes into play. I
will be interested in hearing about any progress on this.

Thanks

Mich Talebzadeh,

Technologist | Solutions Architect | Data Engineer  | Generative AI

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




Clarification on what "[id=#]" refers to in Physical Plan Exchange hashpartitioning

2024-04-04 Thread Tahj Anderson
Hello,

While looking through Spark physical plans generated by the Spark history
server log to find any bottlenecks in my code, I stumbled across an ID that
shows up in a partitioning stage.
My goal is to use the history server log to provide meaningful analysis of my
Spark system performance. With this goal in mind, I am trying to connect Spark
physical plans to StageIDs, which house useful information that I can tie back
to my code. Below is a snippet from one of the physical plans.
+- *(2) Sort [Column#46 ASC NULLS FIRST], true, 0
+- Exchange hashpartitioning(ColumnId#329, 200), ENSURE_REQUIREMENTS, [id=#278]


What exactly does [id=#278] refer to?
I have seen some examples that say this ID is a reference to a specific 
partition, a stage id, or a plan_id but I have not been able to confirm which 
one it is.

Thank you,
Tahj
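
Not an answer from the thread, but one way to cross-reference such tags is to dump the same plan in several renderings; the query below is a hypothetical stand-in, and the operator numbers printed by formatted mode are assigned per plan, so they do not necessarily match the [id=#...] tag in the simple string form.

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plan-inspection-sketch").getOrCreate()

# Hypothetical query that forces an Exchange (shuffle) similar to the one above.
df = (
    spark.range(0, 100_000)
    .withColumn("ColumnId", F.col("id") % 100)
    .groupBy("ColumnId")
    .count()
    .orderBy("ColumnId")
)

# Simple string form: Exchange nodes carry a numeric tag (shown as [id=#...] or
# [plan_id=...] depending on the Spark version).
df.explain()

# Formatted form: every operator is numbered and detailed separately, which is
# often easier to line up with the SQL tab of the UI / history server.
df.explain(mode="formatted")

# Extended form: parsed, analyzed, optimized and physical plans together.
df.explain(mode="extended")
```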



Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
I don't really understand how Iceberg and the hadoop libraries can coexist in a 
deployment.

The latest spark (3.5.1) base image contains the hadoop-client*-3.3.4.jar. The 
AWS v2 SDK is only supported in hadoop*-3.4.0.jar and onward.
The Iceberg AWS integration states that the AWS v2 SDK is required.

Does anyone have a working combination of pyspark, iceberg and hadoop? Or, is 
there an alternative way to use pyspark to 
spark.read.parquet("s3a:///.parquet") such that I don't need the 
hadoop dependencies?

Kind regards,
Dan

From: Oxlade, Dan 
Sent: 03 April 2024 15:49
To: Oxlade, Dan ; Aaron Grubb 
; user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix

Swapping out the iceberg-aws-bundle for the very latest aws provided sdk 
('software.amazon.awssdk:bundle:2.25.23') produces an incompatibility from a 
slightly different code path:

java.lang.NoSuchMethodError: 'void org.apache.hadoop.util.SemaphoredDelegatingExecutor.(java.util.concurrent.ExecutorService, int, boolean, org.apache.hadoop.fs.statistics.DurationTrackerFactory)'
at org.apache.hadoop.fs.s3a.S3AFileSystem.executeOpen(S3AFileSystem.java:1767)
at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1717)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
at org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:774)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:53)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:429)




From: Oxlade, Dan 
Sent: 03 April 2024 14:33
To: Aaron Grubb ; user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix


[sorry; replying all this time]

With hadoop-*-3.3.6 in place of the 3.4.0 below I get 
java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

I think that the below iceberg-aws-bundle version supplies the v2 sdk.

Dan


From: Aaron Grubb 
Sent: 03 April 2024 13:52
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix

Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should 
probably be considered as breaking for tools that build on < 3.4.0 while using 
AWS.

From: Oxlade, Dan 
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org 
Subject: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix


Hi all,



I’ve struggled with this for quite some time.

My requirement is to read a parquet file from s3 to a Dataframe then append to 
an existing iceberg table.



In order to read the parquet I need the hadoop-aws dependency for s3a:// . In 
order to write to iceberg I need the iceberg dependency. Both of these 
dependencies have a transitive dependency on the aws SDK. I can’t find versions 
for Spark 3.4 that work 

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Swapping out the iceberg-aws-bundle for the very latest aws provided sdk 
('software.amazon.awssdk:bundle:2.25.23') produces an incompatibility from a 
slightly different code path:

java.lang.NoSuchMethodError: 'void 
org.apache.hadoop.util.SemaphoredDelegatingExecutor.(java.util.concurrent.ExecutorService,
 int, boolean, org.apache.hadoop.fs.statistics.DurationTrackerFactory)'
at org.apache.hadoop.fs.s3a.S3AFileSystem.executeOpen(S3AFileSystem.java:1767)
at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1717)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
at 
org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:774)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:53)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:429)




From: Oxlade, Dan 
Sent: 03 April 2024 14:33
To: Aaron Grubb ; user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix


[sorry; replying all this time]

With hadoop-*-3.3.6 in place of the 3.4.0 below I get 
java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

I think that the below iceberg-aws-bundle version supplies the v2 sdk.

Dan


From: Aaron Grubb 
Sent: 03 April 2024 13:52
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix

Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should 
probably be considered as breaking for tools that build on < 3.4.0 while using 
AWS.

From: Oxlade, Dan 
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org 
Subject: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix


Hi all,



I’ve struggled with this for quite some time.

My requirement is to read a parquet file from s3 to a Dataframe then append to 
an existing iceberg table.



In order to read the parquet I need the hadoop-aws dependency for s3a:// . In 
order to write to iceberg I need the iceberg dependency. Both of these 
dependencies have a transitive dependency on the aws SDK. I can’t find versions 
for Spark 3.4 that work together.





Current Versions:

Spark 3.4.1

iceberg-spark-runtime-3.4-2.12:1.4.1

iceberg-aws-bundle:1.4.1

hadoop-aws:3.4.0

hadoop-common:3.4.0



I’ve tried a number of combinations of the above and their respective versions 
but all fall over with their assumptions on the aws sdk version with class not 
found exceptions or method not found etc.



Is there a compatibility matrix somewhere that someone could point me to?



Thanks

Dan


Participate in the ASF 25th Anniversary Campaign

2024-04-03 Thread Brian Proffitt
Hi everyone,

As part of The ASF’s 25th anniversary campaign[1], we will be celebrating
projects and communities in multiple ways.

We invite all projects and contributors to participate in the following
ways:

* Individuals - submit your first contribution:
https://news.apache.org/foundation/entry/the-asf-launches-firstasfcontribution-campaign
* Projects - share your public good story:
https://docs.google.com/forms/d/1vuN-tUnBwpTgOE5xj3Z5AG1hsOoDNLBmGIqQHwQT6k8/viewform?edit_requested=true
* Projects - submit a project spotlight for the blog:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278466116
* Projects - contact the Voice of Apache podcast (formerly Feathercast) to
be featured: https://feathercast.apache.org/help/
*  Projects - use the 25th anniversary template and the #ASF25Years hashtag
on social media:
https://docs.google.com/presentation/d/1oDbMol3F_XQuCmttPYxBIOIjRuRBksUjDApjd8Ve3L8/edit#slide=id.g26b0919956e_0_13

If you have questions, email the Marketing & Publicity team at
mark...@apache.org.

Peace,
BKP

[1] https://apache.org/asf25years/

[NOTE: You are receiving this message because you are a contributor to an
Apache Software Foundation project. The ASF will very occasionally send out
messages relating to the Foundation to contributors and members, such as
this one.]

Brian Proffitt
VP, Marketing & Publicity
VP, Conferences


Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan

[sorry; replying all this time]

With hadoop-*-3.3.6 in place of the 3.4.0 below I get 
java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

I think that the below iceberg-aws-bundle version supplies the v2 sdk.

Dan


From: Aaron Grubb 
Sent: 03 April 2024 13:52
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix

Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should 
probably be considered as breaking for tools that build on < 3.4.0 while using 
AWS.

From: Oxlade, Dan 
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org 
Subject: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix


Hi all,



I’ve struggled with this for quite some time.

My requirement is to read a parquet file from s3 to a Dataframe then append to 
an existing iceberg table.



In order to read the parquet I need the hadoop-aws dependency for s3a:// . In 
order to write to iceberg I need the iceberg dependency. Both of these 
dependencies have a transitive dependency on the aws SDK. I can’t find versions 
for Spark 3.4 that work together.





Current Versions:

Spark 3.4.1

iceberg-spark-runtime-3.4-2.12:1.4.1

iceberg-aws-bundle:1.4.1

hadoop-aws:3.4.0

hadoop-common:3.4.0



I’ve tried a number of combinations of the above and their respective versions 
but all fall over with their assumptions on the aws sdk version with class not 
found exceptions or method not found etc.



Is there a compatibility matrix somewhere that someone could point me to?



Thanks

Dan



Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should 
probably be considered as breaking for tools that build on < 3.4.0 while using 
AWS.
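
Not from the thread, but for illustration: a minimal PySpark sketch of pinning one matching set of coordinates via spark.jars.packages, following the downgrade suggestion above. The exact versions, bucket, catalog and table names below are assumptions rather than a verified compatibility statement, and S3 credential configuration is omitted.

```
from pyspark.sql import SparkSession

# Illustrative coordinates only: Spark 3.4.x ships a Hadoop 3.3.x client, so the
# s3a connector is kept on hadoop-aws 3.3.x (AWS SDK v1 line) rather than 3.4.0.
packages = ",".join([
    "org.apache.hadoop:hadoop-aws:3.3.4",
    "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.1",
    "org.apache.iceberg:iceberg-aws-bundle:1.4.1",
])

spark = (
    SparkSession.builder
    .appName("iceberg-s3a-sketch")
    .config("spark.jars.packages", packages)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Hypothetical Iceberg catalog backed by a warehouse path on S3.
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hadoop")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Read raw parquet over s3a, then append to an existing Iceberg table.
df = spark.read.parquet("s3a://my-bucket/raw/some-file.parquet")
df.writeTo("my_catalog.db.my_table").append()
```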

From: Oxlade, Dan 
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org 
Subject: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix


Hi all,



I’ve struggled with this for quite some time.

My requirement is to read a parquet file from s3 to a Dataframe then append to 
an existing iceberg table.



In order to read the parquet I need the hadoop-aws dependency for s3a:// . In 
order to write to iceberg I need the iceberg dependency. Both of these 
dependencies have a transitive dependency on the aws SDK. I can’t find versions 
for Spark 3.4 that work together.





Current Versions:

Spark 3.4.1

iceberg-spark-runtime-3.4-2.12:1.4.1

iceberg-aws-bundle:1.4.1

hadoop-aws:3.4.0

hadoop-common:3.4.0



I’ve tried a number of combinations of the above and their respective versions 
but all fall over with their assumptions on the aws sdk version with class not 
found exceptions or method not found etc.



Is there a compatibility matrix somewhere that someone could point me to?



Thanks

Dan



[Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Hi all,

I've struggled with this for quite some time.
My requirement is to read a parquet file from s3 to a Dataframe then append to 
an existing iceberg table.

In order to read the parquet I need the hadoop-aws dependency for s3a:// . In 
order to write to iceberg I need the iceberg dependency. Both of these 
dependencies have a transitive dependency on the aws SDK. I can't find versions 
for Spark 3.4 that work together.


Current Versions:
Spark 3.4.1
iceberg-spark-runtime-3.4-2.12:1.4.1
iceberg-aws-bundle:1.4.1
hadoop-aws:3.4.0
hadoop-common:3.4.0

I've tried a number of combinations of the above and their respective versions 
but all fall over with their assumptions on the aws sdk version with class not 
found exceptions or method not found etc.

Is there a compatibility matrix somewhere that someone could point me to?

Thanks
Dan


Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
hm. you are getting below

AnalysisException: Append output mode not supported when there are
streaming aggregations on streaming DataFrames/DataSets without watermark;

The problem seems to be that you are using the append output mode when
writing the streaming query results to Kafka. This mode is designed for
scenarios where you want to append new data to an existing dataset at the
sink (in this case, the "sink" topic in Kafka). However, your query
involves a streaming aggregation: group by provinceId, window('createTime',
'1 hour', '30 minutes'). The problem is that Spark Structured Streaming
requires a watermark to ensure exactly-once processing when using
aggregations with append mode. Your code already defines a watermark on the
"createTime" column with a delay of 10 seconds (withWatermark("createTime",
"10 seconds")). However, the error message indicates it is missing on the
start column. Try adding a watermark to the "start" column: modify your code
as below to include a watermark on the "start" column generated by the window
function:

from pyspark.sql.functions import col, expr, from_json, explode, window, sum

streaming_df = session.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "payment_msg") \
  .option("startingOffsets", "earliest") \
  .load() \
  .select(from_json(col("value").cast("string"),
schema).alias("parsed_value")) \
  .select("parsed_value.*") \
  .withWatermark("createTime", "10 seconds")  # Existing watermark on
createTime

*# Modified section with watermark on 'start' column*
streaming_df = streaming_df.groupBy(
  col("provinceId"),
  window(col("createTime"), "1 hour", "30 minutes")
).agg(
  sum(col("payAmount")).alias("totalPayAmount")
).withWatermark(expr("start"), "10 seconds")  # Watermark on
window-generated 'start'

# Rest of the code remains the same
streaming_df.createOrReplaceTempView("streaming_df")

spark.sql("""
SELECT
  window.start, window.end, provinceId, totalPayAmount
FROM streaming_df
ORDER BY window.start
""") \
.writeStream \
.format("kafka") \
.option("checkpointLocation", "checkpoint") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sink") \
.start()

Try and see how it goes

HTH

Mich Talebzadeh,

Technologist | Solutions Architect | Data Engineer  | Generative AI

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed . It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions
(Werner Von Braun)".



On Tue, 2 Apr 2024 at 22:43, Chloe He  wrote:

> Hi Mich,
>
> Thank you so much for your response. I really appreciate your help!
>
> You mentioned "defining the watermark using the withWatermark function on
> the streaming_df before creating the temporary view” - I believe this is
> what I’m doing and it’s not working for me. Here is the exact code snippet
> that I’m running:
>
> ```
> >>> streaming_df = session.readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "payment_msg")\
> .option("startingOffsets","earliest")\
> .load()\
> .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value"))\
> .select("parsed_value.*")\
> .withWatermark("createTime", "10 seconds")
>
> >>> streaming_df.createOrReplaceTempView("streaming_df")
>
> >>> spark.sql("""
> SELECT
> window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
> FROM streaming_df
> GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
> ORDER BY window.start
> """)\
>   .withWatermark("start", "10 seconds")\
>   .writeStream\
>   .format("kafka") \
>   .option("checkpointLocation", "checkpoint") \
>   .option("kafka.bootstrap.servers", "localhost:9092") \
>   .option("topic", "sink") \
>   .start()
>
> AnalysisException: Append output mode not supported when there are
> streaming aggregations on streaming DataFrames/DataSets without watermark;
> EventTimeWatermark start#37: timestamp, 10 seconds
> ```
>
> I’m using pyspark 3.5.1. Please let me know if I missed something. Thanks
> again!
>
> Best,
> Chloe
>
>
> On 2024/04/02 20:32:11 Mich Talebzadeh wrote:
> > ok let us 

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hi Mich,

Thank you so much for your response. I really appreciate your help!

You mentioned "defining the watermark using the withWatermark function on the 
streaming_df before creating the temporary view” - I believe this is what I’m 
doing and it’s not working for me. Here is the exact code snippet that I’m 
running:

```
>>> streaming_df = session.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "payment_msg")\
.option("startingOffsets","earliest")\
.load()\
.select(from_json(col("value").cast("string"), 
schema).alias("parsed_value"))\
.select("parsed_value.*")\
.withWatermark("createTime", "10 seconds")

>>> streaming_df.createOrReplaceTempView("streaming_df")

>>> spark.sql("""
SELECT
window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
FROM streaming_df
GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
ORDER BY window.start
""")\
  .withWatermark("start", "10 seconds")\
  .writeStream\
  .format("kafka") \
  .option("checkpointLocation", "checkpoint") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("topic", "sink") \
  .start()

AnalysisException: Append output mode not supported when there are streaming 
aggregations on streaming DataFrames/DataSets without watermark;
EventTimeWatermark start#37: timestamp, 10 seconds
```

I’m using pyspark 3.5.1. Please let me know if I missed something. Thanks again!

Best,
Chloe


On 2024/04/02 20:32:11 Mich Talebzadeh wrote:
> ok let us take it for a test.
> 
> The original code of mine
> 
> def fetch_data(self):
> self.sc.setLogLevel("ERROR")
> schema = StructType() \
>  .add("rowkey", StringType()) \
>  .add("timestamp", TimestampType()) \
>  .add("temperature", IntegerType())
> checkpoint_path = "file:///ssd/hduser/avgtemperature/chkpt"
> try:
> 
> # construct a streaming dataframe 'streamingDataFrame' that
> subscribes to topic temperature
> streamingDataFrame = self.spark \
> .readStream \
> .format("kafka") \
> .option("kafka.bootstrap.servers",
> config['MDVariables']['bootstrapServers'],) \
> .option("schema.registry.url",
> config['MDVariables']['schemaRegistryURL']) \
> .option("group.id", config['common']['appName']) \
> .option("zookeeper.connection.timeout.ms",
> config['MDVariables']['zookeeperConnectionTimeoutMs']) \
> .option("rebalance.backoff.ms",
> config['MDVariables']['rebalanceBackoffMS']) \
> .option("zookeeper.session.timeout.ms",
> config['MDVariables']['zookeeperSessionTimeOutMs']) \
> .option("auto.commit.interval.ms",
> config['MDVariables']['autoCommitIntervalMS']) \
> .option("subscribe", "temperature") \
> .option("failOnDataLoss", "false") \
> .option("includeHeaders", "true") \
> .option("startingOffsets", "earliest") \
> .load() \
> .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value"))
> 
> 
> resultC = streamingDataFrame.select( \
>  col("parsed_value.rowkey").alias("rowkey") \
>, col("parsed_value.timestamp").alias("timestamp") \
>, col("parsed_value.temperature").alias("temperature"))
> 
> """
> We work out the window and the AVG(temperature) in the window's
> timeframe below
> This should return back the following Dataframe as struct
> 
>  root
>  |-- window: struct (nullable = false)
>  ||-- start: timestamp (nullable = true)
>  ||-- end: timestamp (nullable = true)
>  |-- avg(temperature): double (nullable = true)
> 
> """
> resultM = resultC. \
>  withWatermark("timestamp", "5 minutes"). \
>  groupBy(window(resultC.timestamp, "5 minutes", "5
> minutes")). \
>  avg('temperature')
> 
> # We take the above DataFrame and flatten it to get the columns
> aliased as "startOfWindowFrame", "endOfWindowFrame" and "AVGTemperature"
> resultMF = resultM. \
>select( \
> F.col("window.start").alias("startOfWindow") \
>   , F.col("window.end").alias("endOfWindow") \
>   ,
> F.col("avg(temperature)").alias("AVGTemperature"))
> 
> # Kafka producer requires a key, value pair. We generate UUID
> key as the unique identifier of Kafka record
> uuidUdf= F.udf(lambda : str(uuid.uuid4()),StringType())
> 
> """
> We take DataFrame resultMF containing temperature info and
> write it to Kafka. The uuid is serialized as a 

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
ok let us take it for a test.

The original code of mine

def fetch_data(self):
self.sc.setLogLevel("ERROR")
schema = StructType() \
 .add("rowkey", StringType()) \
 .add("timestamp", TimestampType()) \
 .add("temperature", IntegerType())
checkpoint_path = "file:///ssd/hduser/avgtemperature/chkpt"
try:

# construct a streaming dataframe 'streamingDataFrame' that
subscribes to topic temperature
streamingDataFrame = self.spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \
.option("schema.registry.url",
config['MDVariables']['schemaRegistryURL']) \
.option("group.id", config['common']['appName']) \
.option("zookeeper.connection.timeout.ms",
config['MDVariables']['zookeeperConnectionTimeoutMs']) \
.option("rebalance.backoff.ms",
config['MDVariables']['rebalanceBackoffMS']) \
.option("zookeeper.session.timeout.ms",
config['MDVariables']['zookeeperSessionTimeOutMs']) \
.option("auto.commit.interval.ms",
config['MDVariables']['autoCommitIntervalMS']) \
.option("subscribe", "temperature") \
.option("failOnDataLoss", "false") \
.option("includeHeaders", "true") \
.option("startingOffsets", "earliest") \
.load() \
.select(from_json(col("value").cast("string"),
schema).alias("parsed_value"))


resultC = streamingDataFrame.select( \
 col("parsed_value.rowkey").alias("rowkey") \
   , col("parsed_value.timestamp").alias("timestamp") \
   , col("parsed_value.temperature").alias("temperature"))

"""
We work out the window and the AVG(temperature) in the window's
timeframe below
This should return back the following Dataframe as struct

 root
 |-- window: struct (nullable = false)
 ||-- start: timestamp (nullable = true)
 ||-- end: timestamp (nullable = true)
 |-- avg(temperature): double (nullable = true)

"""
resultM = resultC. \
 withWatermark("timestamp", "5 minutes"). \
 groupBy(window(resultC.timestamp, "5 minutes", "5
minutes")). \
 avg('temperature')

# We take the above DataFrame and flatten it to get the columns
aliased as "startOfWindowFrame", "endOfWindowFrame" and "AVGTemperature"
resultMF = resultM. \
   select( \
F.col("window.start").alias("startOfWindow") \
  , F.col("window.end").alias("endOfWindow") \
  ,
F.col("avg(temperature)").alias("AVGTemperature"))

# Kafka producer requires a key, value pair. We generate UUID
key as the unique identifier of Kafka record
uuidUdf= F.udf(lambda : str(uuid.uuid4()),StringType())

"""
We take DataFrame resultMF containing temperature info and
write it to Kafka. The uuid is serialized as a string and used as the key.
We take all the columns of the DataFrame and serialize them as
a JSON string, putting the results in the "value" of the record.
"""
result = resultMF.withColumn("uuid",uuidUdf()) \
 .selectExpr("CAST(uuid AS STRING) AS key",
"to_json(struct(startOfWindow, endOfWindow, AVGTemperature)) AS value") \
 .writeStream \
 .outputMode('complete') \
 .format("kafka") \
 .option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \
 .option("topic", "avgtemperature") \
 .option('checkpointLocation', checkpoint_path) \
 .queryName("avgtemperature") \
 .start()

except Exception as e:
print(f"""{e}, quitting""")
sys.exit(1)

#print(result.status)
#print(result.recentProgress)
#print(result.lastProgress)

result.awaitTermination()

Now try to use SQL for the entire transformation and aggregation

#import this and anything else needed
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StringType,IntegerType,
FloatType, TimestampType


# Define the schema for the JSON data
schema = ... # Replace with your schema definition

# construct a streaming dataframe 'streamingDataFrame' that
subscribes to topic temperature
streamingDataFrame = self.spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \

[Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hello!

I am attempting to write a streaming pipeline that would consume data from a 
Kafka source, manipulate the data, and then write results to a downstream sink 
(Kafka, Redis, etc). I want to write fully formed SQL instead of using the 
function API that Spark offers. I read a few guides on how to do this and my 
understanding is that I need to create a temp view in order to execute my raw 
SQL queries via spark.sql(). 

However, I’m having trouble defining watermarks on my source. There doesn’t seem
to be a way to introduce a watermark in the raw SQL that Spark supports,
so I’m using the .withWatermark() function. However, this watermark does not
carry over to the temp view.

Example code:
```
streaming_df.select(from_json(col("value").cast("string"),
schema).alias("parsed_value")).select("parsed_value.*").withWatermark("createTime",
 "10 seconds")

json_df.createOrReplaceTempView("json_df")

session.sql("""
SELECT
window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
FROM json_df
GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
ORDER BY window.start
""")\
.writeStream\
.format("kafka") \
.option("checkpointLocation", "checkpoint") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sink") \
.start()
```
This throws
```
AnalysisException: Append output mode not supported when there are streaming 
aggregations on streaming DataFrames/DataSets without watermark;
```

If I switch out the SQL query and write it in the function API instead, 
everything seems to work fine.

How can I use .sql() in conjunction with watermarks?

Best,
Chloe

Re: [External] Re: Issue of spark with antlr version

2024-04-01 Thread Chawla, Parul
Hi Team,

Can you let us know when Spark 4.x will be released to Maven.

regards,
Parul

Get Outlook for iOS

From: Bjørn Jørgensen 
Sent: Wednesday, February 28, 2024 5:06:54 PM
To: Chawla, Parul 
Cc: Sahni, Ashima ; user@spark.apache.org 
; Misra Parashar, Jyoti 
; Mekala, Rajesh ; 
Grandhi, Venkatesh ; George, Rejish 
; Tayal, Aayushi 
Subject: Re: [External] Re: Issue of spark with antlr version

[image.png]

On Wed, 28 Feb 2024 at 11:28, Chawla, Parul 
mailto:parul.cha...@accenture.com>> wrote:

Hi,
Can we get the Spark version in which this is resolved?

From: Bjørn Jørgensen 
mailto:bjornjorgen...@gmail.com>>
Sent: Tuesday, February 27, 2024 7:05:36 PM
To: Sahni, Ashima 
Cc: Chawla, Parul 
mailto:parul.cha...@accenture.com>>; 
user@spark.apache.org 
mailto:user@spark.apache.org>>; Misra Parashar, Jyoti 
mailto:jyoti.misra.paras...@accenture.com>>;
 Mekala, Rajesh mailto:r.mek...@accenture.com>>; 
Grandhi, Venkatesh 
mailto:venkatesh.a.gran...@accenture.com>>; 
George, Rejish 
mailto:rejish.geo...@accenture.com>>; Tayal, 
Aayushi mailto:aayushi.ta...@accenture.com>>
Subject: [External] Re: Issue of spark with antlr version

CAUTION: External email. Be cautious with links and attachments.

[SPARK-44366][BUILD] Upgrade antlr4 to 
4.13.1


On Tue, 27 Feb 2024 at 13:25, Sahni, Ashima
wrote:

Hi Team,



Can you please let us know the update on below.



Thanks,

Ashima



From: Chawla, Parul 
mailto:parul.cha...@accenture.com>>
Sent: Sunday, February 25, 2024 11:57 PM
To: user@spark.apache.org
Cc: Sahni, Ashima 
mailto:ashima.sa...@accenture.com>>; Misra 
Parashar, Jyoti 
mailto:jyoti.misra.paras...@accenture.com>>
Subject: Issue of spark with antlr version



Hi Spark Team,





Our application is currently using Spring Framework 5.3.31. To upgrade it to
6.x, as per application dependencies we must upgrade the Spark and Hibernate
jars as well.

With Hibernate compatible upgrade, the dependent Antlr4 jar version has been 
upgraded to 4.10.1 but there’s no Spark version available with the upgraded 
Antlr4 jar.

Could you please let us know when an updated Spark version with the upgraded
Antlr4 dependency will be available?





Regards,

Parul





--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297




Apache Spark integration with Spring Boot 3.0.0+

2024-03-28 Thread Szymon Kasperkiewicz
Hello,

I've got a project which has to use the newest versions of both Apache Spark
and Spring Boot due to vulnerability issues. I build my project using Gradle,
and when I try to run it I get an unsatisfied dependency exception about
javax/servlet/Servlet. I've tried adding the Jakarta servlet, an older javax
version, etc. None of them worked. The only solution I saw was to downgrade
Spring Boot, but unfortunately I can't do that. Is there any known option to
use both Apache Spark and Spring Boot in a project?

Best regards,
Szymon


Community Over Code NA 2024 Travel Assistance Applications now open!

2024-03-27 Thread Gavin McDonald
Hello to all users, contributors and Committers!

[ You are receiving this email as a subscriber to one or more ASF project
dev or user
  mailing lists and is not being sent to you directly. It is important that
we reach all of our
  users and contributors/committers so that they may get a chance to
benefit from this.
  We apologise in advance if this doesn't interest you but it is on topic
for the mailing
  lists of the Apache Software Foundation; and it is important please that
you do not
  mark this as spam in your email client. Thank You! ]

The Travel Assistance Committee (TAC) are pleased to announce that
travel assistance applications for Community over Code NA 2024 are now
open!

We will be supporting Community over Code NA, Denver Colorado in
October 7th to the 10th 2024.

TAC exists to help those that would like to attend Community over Code
events, but are unable to do so for financial reasons. For more info
on this year's applications and qualifying criteria, please visit the
TAC website at < https://tac.apache.org/ >. Applications are already
open on https://tac-apply.apache.org/, so don't delay!

The Apache Travel Assistance Committee will only be accepting
applications from those people that are able to attend the full event.

Important: Applications close on Monday 6th May, 2024.

Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as
required to efficiently and accurately process their request), this
will enable TAC to announce successful applications shortly
afterwards.

As usual, TAC expects to deal with a range of applications from a
diverse range of backgrounds; therefore, we encourage (as always)
anyone thinking about sending in an application to do so ASAP.

For those that will need a Visa to enter the Country - we advise you apply
now so that you have enough time in case of interview delays. So do not
wait until you know if you have been accepted or not.

We look forward to greeting many of you in Denver, Colorado , October 2024!

Kind Regards,

Gavin

(On behalf of the Travel Assistance Committee)


Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Dongjoon Hyun
Hi, Cheng.

Thank you for the suggestion. Your suggestion seems to have at least two
themes.

A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
LTS Versions Support.
B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)

And, it brings me three questions.

1. For (A), do you mean MySQL LTS versions are not supported by Apache
Spark releases properly due to the improper test suite?
2. For (B), why does Apache Spark need to drop non-LTS MySQL support?
3. What about MariaDB? Do we need to stick to some versions?

To be clear, if needed, we can have daily GitHub Action CIs easily like
Python CI (Python 3.8/3.10/3.11/3.12).

-
https://github.com/apache/spark/blob/master/.github/workflows/build_python.yml

Thanks,
Dongjoon.


On Sun, Mar 24, 2024 at 10:29 PM Cheng Pan  wrote:

> Hi, Spark community,
>
> I noticed that the Spark JDBC connector MySQL dialect is testing against
> the 8.3.0[1] now, a non-LTS version.
>
> MySQL changed the version policy recently[2], which is now very similar to
> the Java version policy. In short, 5.5, 5.6, 5.7, 8.0 is the LTS version,
> 8.1, 8.2, 8.3 is non-LTS, and the next LTS version is 8.4.
>
> I would say that MySQL is one of the most important infrastructures today,
> I checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version
> support policy, and both only support 5.7 and 8.0.
>
> Also, Spark officially only supports LTS Java versions, like JDK 17 and
> 21, but not 22. I would recommend using MySQL 8.0 for testing until the
> next MySQL LTS version (8.4) is available.
>
> Additional discussion can be found at [3]
>
> [1] https://issues.apache.org/jira/browse/SPARK-47453
> [2]
> https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
> [3] https://github.com/apache/spark/pull/45581
> [4] https://aws.amazon.com/rds/mysql/
> [5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy
>
> Thanks,
> Cheng Pan
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[DISCUSS] MySQL version support policy

2024-03-24 Thread Cheng Pan
Hi, Spark community,

I noticed that the Spark JDBC connector MySQL dialect is testing against the 
8.3.0[1] now, a non-LTS version.

MySQL changed the version policy recently[2], which is now very similar to the 
Java version policy. In short, 5.5, 5.6, 5.7, 8.0 is the LTS version, 8.1, 8.2, 
8.3 is non-LTS, and the next LTS version is 8.4. 

I would say that MySQL is one of the most important infrastructures today, I 
checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version support 
policy, and both only support 5.7 and 8.0.

Also, Spark officially only supports LTS Java versions, like JDK 17 and 21, but 
not 22. I would recommend using MySQL 8.0 for testing until the next MySQL LTS 
version (8.4) is available.

Additional discussion can be found at [3]

[1] https://issues.apache.org/jira/browse/SPARK-47453
[2] 
https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
[3] https://github.com/apache/spark/pull/45581
[4] https://aws.amazon.com/rds/mysql/
[5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy

Thanks,
Cheng Pan



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Is one Spark partition mapped to one and only Spark Task ?

2024-03-24 Thread Sreyan Chakravarty
I am trying to understand the Spark Architecture for my upcoming
certification, however there seems to be conflicting information available.

https://stackoverflow.com/questions/47782099/what-is-the-relationship-between-tasks-and-partitions

Does Spark assign a Spark partition to one and only one corresponding Spark
task?

In other words, is the number of Spark tasks for a job equal to the number
of Spark partitions ? (Provided of course there are no shuffles)

If so, a following question is :

1) Is this the reason we can get OOMs in Spark? Because a partition may not
fit into RAM (provided it's coming from an intermediate step like a groupBy)?

2) What is the purpose of spark.task.cpus? It does not make sense for more
than one thread (or more than one CPU) to be working on a single
partition of data, so this number should always be 1, right?

Need some help. Thanks.
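
For reference, a minimal check of what I mean (a sketch, assuming a local
SparkSession; the output path is made up):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("partition-vs-task-check")
    .getOrCreate()
)

# 8 partitions requested; with no shuffle, the write stage runs 8 tasks,
# one per partition (visible under the Stages tab of the UI on port 4040).
df = spark.range(0, 1_000_000, numPartitions=8)
print(df.rdd.getNumPartitions())  # -> 8

df.write.mode("overwrite").parquet("/tmp/partition_task_check")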

-- 
Regards,
Sreyan Chakravarty


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Winston Lai
+1

--
Thank You & Best Regards
Winston Lai

From: Jay Han 
Date: Sunday, 24 March 2024 at 08:39
To: Kiran Kumar Dusi 
Cc: Farshid Ashouri , Matei Zaharia 
, Mich Talebzadeh , Spark 
dev list , user @spark 
Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark 
Community
+1. It sounds awesome!

Kiran Kumar Dusi mailto:kirankumard...@gmail.com>> 
于2024年3月21日周四 14:16写道:
+1

On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri 
mailto:farsheed.asho...@gmail.com>> wrote:
+1

On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, 
mailto:mich.talebza...@gmail.com>> wrote:
Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark MLlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose. We went through creating a Slack
community that managed to create more heat than light. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up, but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Jay Han
+1. It sounds awesome!

Kiran Kumar Dusi  于2024年3月21日周四 14:16写道:

> +1
>
> On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri <
> farsheed.asho...@gmail.com> wrote:
>
>> +1
>>
>> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, 
>> wrote:
>>
>>> Some of you may be aware that Databricks community Home | Databricks
>>> have just launched a knowledge sharing hub. I thought it would be a
>>> good idea for the Apache Spark user group to have the same, especially
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>> Streaming, Spark Mlib and so forth.
>>>
>>> Apache Spark user and dev groups have been around for a good while.
>>> They are serving their purpose . We went through creating a slack
>>> community that managed to create more more heat than light.. This is
>>> what Databricks community came up with and I quote
>>>
>>> "Knowledge Sharing Hub
>>> Dive into a collaborative space where members like YOU can exchange
>>> knowledge, tips, and best practices. Join the conversation today and
>>> unlock a wealth of collective wisdom to enhance your experience and
>>> drive success."
>>>
>>> I don't know the logistics of setting it up.but I am sure that should
>>> not be that difficult. If anyone is supportive of this proposal, let
>>> the usual +1, 0, -1 decide
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> Disclaimer: The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner Von Braun)".
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
Sorry from this link

Leveraging Generative AI with Apache Spark: Transforming Data Engineering |
LinkedIn


Mich Talebzadeh,
Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Fri, 22 Mar 2024 at 16:16, Mich Talebzadeh 
wrote:

> You may find this link of mine in Linkedin for the said article. We
> can use Linkedin for now.
>
> Leveraging Generative AI with Apache Spark: Transforming Data
> Engineering | LinkedIn
>
>
> Mich Talebzadeh,
>
> Technologist | Data | Generative AI | Financial Fraud
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
You may find this link of mine in Linkedin for the said article. We
can use Linkedin for now.

Leveraging Generative AI with Apache Spark: Transforming Data
Engineering | LinkedIn


Mich Talebzadeh,

Technologist | Data | Generative AI | Financial Fraud

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re:

2024-03-21 Thread Mich Talebzadeh
You can try this

val kafkaReadStream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", broker)
  .option("subscribe", topicName)
  .option("startingOffsets", startingOffsetsMode)
  .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
  .load()

kafkaReadStream
  .writeStream
  .foreachBatch((df: DataFrame, batchId: Long) => sendToSink(df, batchId))
  .trigger(Trigger.ProcessingTime(s"${triggerProcessingTime} seconds"))
  .option("checkpointLocation", checkpoint_path)
  .start()
  .awaitTermination()

Notice the function sendToSink

The foreachBatch method ensures that the sendToSink function is called for
each micro-batch, regardless of whether the DataFrame contains data or not.

Let us look at that function

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def sendToSink(df: DataFrame, batchId: Long): Unit = {

  if (!df.isEmpty) {
    println(s"From sendToSink, batchId is $batchId, at ${java.time.LocalDateTime.now()}")
    // df.show(100, false)
    df.persist()
    // Write to BigQuery batch table
    // s.writeTableToBQ(df, "append",
    //   config.getString("MDVariables.targetDataset"),
    //   config.getString("MDVariables.targetTable"))
    df.unpersist()
    // println("wrote to DB")
  } else {
    println("DataFrame df is empty")
  }
}

If the DataFrame is empty, it just prints a message saying so. You can of
course adapt it for your case.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Thu, 21 Mar 2024 at 23:14, Рамик И  wrote:

>
> Hi!
> I want to execute code inside foreachBatch that will trigger regardless of
> whether there is data in the batch or not.
>
>
> val kafkaReadStream = spark
> .readStream
> .format("kafka")
> .option("kafka.bootstrap.servers", broker)
> .option("subscribe", topicName)
> .option("startingOffsets", startingOffsetsMode)
> .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
> .load()
>
>
> kafkaReadStream
> .writeStream
> .trigger(Trigger.ProcessingTime(s"$triggerProcessingTime seconds"))
> .foreachBatch {
>
> 
> }
> .start()
> .awaitTermination()
>


Bug in org.apache.spark.util.sketch.BloomFilter

2024-03-21 Thread Nathan Conroy
Hi All,

I believe that there is a bug that affects the Spark BloomFilter implementation 
when creating a bloom filter with large n. Since this implementation uses 
integer hash functions, it doesn’t work properly when the number of bits 
exceeds MAX_INT.
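
For context (and hedging a little: this is the generic Bloom filter sizing
formula, not a claim about Spark's exact internals), the optimal bit count
m = -n * ln(p) / (ln 2)^2 already overflows a signed 32-bit integer for fairly
ordinary inputs:

import math

def optimal_num_bits(n, p):
    # Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2
    return int(-n * math.log(p) / (math.log(2) ** 2))

INT_MAX = 2**31 - 1

for n in (100_000_000, 500_000_000, 1_000_000_000):
    m = optimal_num_bits(n, 0.01)  # 1% target false-positive rate
    print(f"n={n:>13,}  bits={m:>14,}  exceeds Int.MaxValue: {m > INT_MAX}")

# n = 1e9 at p = 0.01 needs roughly 9.6e9 bits, well past 2^31 - 1, so any bit
# index that is computed or stored as a 32-bit int will wrap around.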

I asked a question about this on stackoverflow, but didn’t get a satisfactory 
answer. I believe I know what is causing the bug and have documented my 
reasoning there as well:

https://stackoverflow.com/questions/78162973/why-is-observed-false-positive-rate-in-spark-bloom-filter-higher-than-expected

I would just go ahead and create a Jira ticket on the spark jira board, but I’m 
still waiting to hear back regarding getting my account set up.

Huge thanks if anyone can help!

-N


[no subject]

2024-03-21 Thread Рамик И
Hi!
I want to execute code inside foreachBatch that will trigger regardless of
whether there is data in the batch or not.


val kafkaReadStream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", broker)
.option("subscribe", topicName)
.option("startingOffsets", startingOffsetsMode)
.option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
.load()


kafkaReadStream
.writeStream
.trigger(Trigger.ProcessingTime(s"$triggerProcessingTime seconds"))
.foreachBatch {


}
.start()
.awaitTermination()


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Kiran Kumar Dusi
+1

On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri 
wrote:

> +1
>
> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, 
> wrote:
>
>> Some of you may be aware that Databricks community Home | Databricks
>> have just launched a knowledge sharing hub. I thought it would be a
>> good idea for the Apache Spark user group to have the same, especially
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>> Streaming, Spark Mlib and so forth.
>>
>> Apache Spark user and dev groups have been around for a good while.
>> They are serving their purpose . We went through creating a slack
>> community that managed to create more more heat than light.. This is
>> what Databricks community came up with and I quote
>>
>> "Knowledge Sharing Hub
>> Dive into a collaborative space where members like YOU can exchange
>> knowledge, tips, and best practices. Join the conversation today and
>> unlock a wealth of collective wisdom to enhance your experience and
>> drive success."
>>
>> I don't know the logistics of setting it up.but I am sure that should
>> not be that difficult. If anyone is supportive of this proposal, let
>> the usual +1, 0, -1 decide
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Farshid Ashouri
+1

On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, 
wrote:

> Some of you may be aware that Databricks community Home | Databricks
> have just launched a knowledge sharing hub. I thought it would be a
> good idea for the Apache Spark user group to have the same, especially
> for repeat questions on Spark core, Spark SQL, Spark Structured
> Streaming, Spark MLlib and so forth.
>
> Apache Spark user and dev groups have been around for a good while.
> They are serving their purpose. We went through creating a Slack
> community that managed to create more heat than light. This is
> what Databricks community came up with and I quote
>
> "Knowledge Sharing Hub
> Dive into a collaborative space where members like YOU can exchange
> knowledge, tips, and best practices. Join the conversation today and
> unlock a wealth of collective wisdom to enhance your experience and
> drive success."
>
> I don't know the logistics of setting it up, but I am sure that should
> not be that difficult. If anyone is supportive of this proposal, let
> the usual +1, 0, -1 decide
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Announcing the Community Over Code 2024 Streaming Track

2024-03-20 Thread James Hughes
Hi all,

Community Over Code , the ASF conference,
will be held in Denver, Colorado,

October 7-10, 2024. The call for presentations

is open now through April 15, 2024.  (This is two months earlier than last
year!)

I am one of the co-chairs for the stream processing track, and we would
love to see you there and hope that you will consider submitting a talk.

About the Streaming track:

There are many top-level ASF projects which focus on and push the envelope
for stream and event processing.  ActiveMQ, Beam, Bookkeeper, Camel, Flink,
Kafka, Pulsar, RocketMQ, and Spark are all house-hold names in the stream
processing and analytics world at this point.  These projects show that
stream processing has unique characteristics requiring deep expertise.  On
the other hand, users need easy to apply solutions.  The streaming track
will host talks focused on the use cases and advances of these projects as
well as other developments in the streaming world.

Thanks and see you in October!

Jim


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
One option that comes to my mind, is that given the cyclic nature of these
types of proposals in these two forums, we should be able to use
Databricks's existing knowledge sharing hub Knowledge Sharing Hub -
Databricks

as well.

The majority of topics will be of interest to their audience as well. In
addition, they seem to invite everyone to contribute. Unless you have an
overriding concern why we should not take this approach, I can enquire from
Databricks community managers whether they can entertain this idea. They
seem to have a well defined structure for hosting topics.

Let me know your thoughts

Thanks

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 19 Mar 2024 at 08:25, Joris Billen 
wrote:

> +1
>
>
> On 18 Mar 2024, at 21:53, Mich Talebzadeh 
> wrote:
>
> Well as long as it works.
>
> Please all check this link from Databricks and let us know your thoughts.
> Will something similar work for us?. Of course Databricks have much deeper
> pockets than our ASF community. Will it require moderation in our side to
> block spams and nutcases.
>
> Knowledge Sharing Hub - Databricks
> 
>
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
> wrote:
>
>> something like this  Spark community · GitHub
>> 
>>
>>
>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud
>> :
>>
>>> Good idea. Will be useful
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From: *ashok34...@yahoo.com.INVALID 
>>> *Date: *Monday, March 18, 2024 at 6:36 AM
>>> *To: *user @spark , Spark dev list <
>>> d...@spark.apache.org>, Mich Talebzadeh 
>>> *Cc: *Matei Zaharia 
>>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>>> Apache Spark Community
>>>
>>> External message, be mindful when clicking links or attachments
>>>
>>>
>>>
>>> Good idea. Will be useful
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>>
>>>
>>>
>>> Some of you may be aware that Databricks community Home | Databricks
>>>
>>> have just launched a knowledge sharing hub. I thought it would be a
>>>
>>> good idea for the Apache Spark user group to have the same, especially
>>>
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>>
>>> Streaming, Spark Mlib and so forth.
>>>
>>>
>>>
>>> Apache Spark user and dev groups have been around for a good while.
>>>
>>> They are serving their purpose . We went through creating a slack
>>>
>>> community that managed to create more more heat than light.. This is
>>>
>>> what Databricks community came up with and I quote
>>>
>>>
>>>
>>> "Knowledge Sharing Hub
>>>
>>> Dive into a collaborative space where members like YOU can exchange
>>>
>>> knowledge, tips, and best practices. Join the conversation today and
>>>
>>> unlock a wealth of collective wisdom to enhance your experience and
>>>
>>> drive success."
>>>
>>>
>>>
>>> I don't know the logistics of setting it up.but I am sure that should
>>>
>>> not be that difficult. If anyone is supportive of this proposal, let
>>>
>>> the usual +1, 0, -1 decide
>>>
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>> Mich Talebzadeh,
>>>
>>> Dad | Technologist | Solutions Architect | Engineer
>>>
>>> London
>>>
>>> United Kingdom
>>>
>>>
>>>
>>>
>>>
>>>   view my Linkedin profile
>>>
>>>
>>>
>>>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 

Spark-UI stages and other tabs not accessible in standalone mode when reverse-proxy is enabled

2024-03-19 Thread sharad mishra
Hi Team,
We're encountering an issue with Spark UI.
I've documented the details here:
https://issues.apache.org/jira/browse/SPARK-47232
When reverse proxy is enabled in the master and worker configOptions, we're not
able to access the different tabs available in the Spark UI, e.g. stages,
environment, storage, etc.

We're deploying spark through bitnami helm chart :
https://github.com/bitnami/charts/tree/main/bitnami/spark

Name and Version

bitnami/spark - 6.0.0

What steps will reproduce the bug?

Kubernetes Version: 1.25
Spark: 3.4.2
Helm chart: 6.0.0

Steps to reproduce:
After installing the chart, the Spark cluster (master and worker) UI is available
at:


https://spark.staging.abc.com/

We are able to access the running application by clicking on the application ID
under the Running Applications link:



We can access spark UI by clicking Application Detail UI:

We are taken to jobs tab when we click on Application Detail UI


URL looks like:
https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/

When we click any of the tabs in the Spark UI, e.g. stages or environment, it
takes us back to the Spark cluster UI page.
We noticed that the endpoint changes to


https://spark.staging.abc.com/stages/
instead of
https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/



Are you using any custom parameters or values?

Configurations set in values.yaml
```
master:
  configOptions:
-Dspark.ui.reverseProxy=true
-Dspark.ui.reverseProxyUrl=https://spark.staging.abc.com

worker:
  configOptions:
-Dspark.ui.reverseProxy=true
-Dspark.ui.reverseProxyUrl=https://spark.staging.abc.com

service:
  type: ClusterIP
  ports:
http: 8080
https: 443
cluster: 7077

ingress:

  enabled: true
  pathType: ImplementationSpecific
  apiVersion: ""
  hostname: spark.staging.abc.com
  ingressClassName: "staging"
  path: /
```



What is the expected behavior?

Expected behaviour is that when I click on stages tab, instead of taking me
to
https://spark.staging.abc.com/stages/
it should take me to following URL:
https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/

What do you see instead?

Current behaviour is that it takes me to the URL
https://spark.staging.abc.com/stages/, which shows the Spark cluster UI with
master and worker details.

would appreciate any help on this, thanks.

Best,
Sharad


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Joris Billen
+1


On 18 Mar 2024, at 21:53, Mich Talebzadeh  wrote:

Well as long as it works.

Please all check this link from Databricks and let us know your thoughts. Will 
something similar work for us? Of course Databricks has much deeper pockets 
than our ASF community. Will it require moderation on our side to block spam 
and nutcases?

Knowledge Sharing Hub - 
Databricks


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom

 
   view my Linkedin 
profile

 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but 
of course cannot be guaranteed . It is essential to note that, as with any 
advice, quote "one test result is worth one-thousand expert opinions (Werner 
 Von 
Braun)".


On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
mailto:bjornjorgen...@gmail.com>> wrote:
something like this  Spark community · 
GitHub


man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud 
:
Good idea. Will be useful

+1



From: ashok34...@yahoo.com.INVALID 
Date: Monday, March 18, 2024 at 6:36 AM
To: user @spark mailto:user@spark.apache.org>>, Spark 
dev list mailto:d...@spark.apache.org>>, Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Cc: Matei Zaharia mailto:matei.zaha...@gmail.com>>
Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark 
Community
External message, be mindful when clicking links or attachments

Good idea. Will be useful

+1

On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:


Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark MLlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose. We went through creating a Slack
community that managed to create more heat than light. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up, but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


  view my Linkedin profile


https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org



--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297



Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-19 Thread Varun Shah
Hi @Mich Talebzadeh  , community,

Where can I find such insights on the Spark Architecture ?

I found few sites below which did/does cover internals :
1. https://github.com/JerryLead/SparkInternals
2. https://books.japila.pl/apache-spark-internals/overview/
3. https://stackoverflow.com/questions/30691385/how-spark-works-internally

Most of them are quite old and, hoping the basic internals have not changed,
where can we find more information on the internals? Asking in case you or
someone from the community has more articles / videos / document links to
share.

Appreciate your help.


Regards,
Varun Shah



On Fri, Mar 15, 2024, 03:10 Mich Talebzadeh 
wrote:

> Hi,
>
> When you create a DataFrame from Python objects using
> spark.createDataFrame, here it goes:
>
>
> *Initial Local Creation:*
> The DataFrame is initially created in the memory of the driver node. The
> data is not yet distributed to executors at this point.
>
> *The role of lazy Evaluation:*
>
> Spark applies lazy evaluation, *meaning transformations are not executed
> immediately*.  It constructs a logical plan describing the operations,
> but data movement does not occur yet.
>
> *Action Trigger:*
>
> When you initiate an action (things like show(), collect(), etc), Spark
> triggers the execution.
>
>
>
> *When partitioning and distribution come in:* Spark partitions the DataFrame
> into logical chunks for parallel processing. It divides the data based on a
> partitioning scheme (the default is hash partitioning). Each partition is sent
> to a different executor node for distributed execution. This stage involves
> data transfer across the cluster, but it is not the expensive shuffle you have
> heard of. Shuffles happen with repartitioning or certain join operations.
>
> *Storage on Executors:*
>
> Executors receive their assigned partitions and store them in their
> memory. If memory is limited, Spark spills partitions to disk. look at
> stages tab in UI (4040)
>
>
> *In summary:*
> No Data Transfer During Creation: --> Data transfer occurs only when an
> action is triggered.
> Distributed Processing: --> DataFrames are distributed for parallel
> execution, not stored entirely on the driver node.
> Lazy Evaluation Optimization: --> Delaying data transfer until necessary
> enhances performance.
> Shuffle vs. Partitioning: --> Data movement during partitioning is not
> considered a shuffle in Spark terminology.
> Shuffles involve more complex data rearrangement.
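>
> A tiny sketch that makes the "no data movement until an action" point
> visible (a minimal sketch, assuming a local session; the numbers are
> illustrative only):
>
> from pyspark.sql import Row, SparkSession
>
> spark = SparkSession.builder.master("local[2]").getOrCreate()
>
> # Building the DataFrame from local Python objects does not launch a job yet;
> # the rows stay with the driver-side plan until an action runs.
> df = spark.createDataFrame([Row(id=i, v=i * i) for i in range(1000)])
>
> df.explain("extended")            # logical + physical plans, still no job
> print(df.rdd.getNumPartitions())  # partitioning chosen at planning time
> df.show(5)                        # action: data is distributed and computed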
>
> *Considerations: *
> Large DataFrames: For very large DataFrames
>
>- manage memory carefully to avoid out-of-memory errors. Consider
>options like:
>- Increasing executor memory
>- Using partitioning strategies to optimize memory usage
>- Employing techniques like checkpointing to persistent storage (hard
>disks) or caching for memory efficiency
>- You can get additional info from Spark UI default port 4040 tabs
>like SQL and executors
>- Spark uses Catalyst optimiser for efficient execution plans.
>df.explain("extended") shows both logical and physical plans
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Thu, 14 Mar 2024 at 19:46, Sreyan Chakravarty 
> wrote:
>
>> I am trying to understand Spark Architecture.
>>
>> For Dataframes that are created from python objects ie. that are *created
>> in memory where are they stored ?*
>>
>> Take following example:
>>
>> from pyspark.sql import Rowimport datetime
>> courses = [
>> {
>> 'course_id': 1,
>> 'course_title': 'Mastering Python',
>> 'course_published_dt': datetime.date(2021, 1, 14),
>> 'is_active': True,
>> 'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
>> }
>>
>> ]
>>
>>
>> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>>
>>
>> Where is the dataframe stored when I invoke the call:
>>
>> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>>
>> Does it:
>>
>>1. Send the data to a random executor ?
>>
>>
>>- Does this mean this counts as a shuffle ?
>>
>>
>>1. Or does it stay on the driver node ?
>>
>>
>>- That does not make sense when the dataframe grows large.
>>
>>
>> --
>> Regards,
>> Sreyan Chakravarty
>>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Varun Shah
+1  Great initiative.

QQ: Stack Overflow has a similar feature called "Collectives", but I am
not sure of the cost of creating one for Apache Spark. With SO being widely
used (at least before ChatGPT became the norm for searching questions), it
already has a lot of questions asked and answered by the community over the
years, so, if possible, we could leverage it as the starting
point for building a community before creating a completely new website from
scratch. Any thoughts on this?

Regards,
Varun Shah


On Mon, Mar 18, 2024, 16:29 Mich Talebzadeh 
wrote:

> Some of you may be aware that Databricks community Home | Databricks
> have just launched a knowledge sharing hub. I thought it would be a
> good idea for the Apache Spark user group to have the same, especially
> for repeat questions on Spark core, Spark SQL, Spark Structured
> Streaming, Spark Mlib and so forth.
>
> Apache Spark user and dev groups have been around for a good while.
> They are serving their purpose . We went through creating a slack
> community that managed to create more more heat than light.. This is
> what Databricks community came up with and I quote
>
> "Knowledge Sharing Hub
> Dive into a collaborative space where members like YOU can exchange
> knowledge, tips, and best practices. Join the conversation today and
> unlock a wealth of collective wisdom to enhance your experience and
> drive success."
>
> I don't know the logistics of setting it up.but I am sure that should
> not be that difficult. If anyone is supportive of this proposal, let
> the usual +1, 0, -1 decide
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
+1 .
I can contribute to it as well .

On Tue, 19 Mar 2024 at 9:19 AM, Code Tutelage 
wrote:

> +1
>
> Thanks for proposing
>
> On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud
>  wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> d...@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark Mlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose . We went through creating a slack
>>
>> community that managed to create more more heat than light.. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up.but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


[ANNOUNCE] Apache Kyuubi released 1.9.0

2024-03-18 Thread Binjie Yang
Hi all,

The Apache Kyuubi community is pleased to announce that
Apache Kyuubi 1.9.0 has been released!

Apache Kyuubi is a distributed and multi-tenant gateway to provide
serverless SQL on data warehouses and lakehouses.

Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface
for end-users to manipulate large-scale data with pre-programmed and
extensible Spark SQL engines.
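
For example, since the gateway is HiveServer2-compatible, any Thrift client can
talk to it. A minimal sketch with PyHive (the host, the username, and the
default frontend port 10009 are assumptions about your deployment):

from pyhive import hive

# Kyuubi exposes a HiveServer2-compatible Thrift endpoint; 10009 is the usual
# default frontend port. Adjust host/port/username for your own deployment.
conn = hive.connect(host="kyuubi.example.com", port=10009, username="spark_user")
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
cursor.close()
conn.close()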

We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses
and lakehouses.

This "out-of-the-box" model minimizes the barriers and costs for end-users
to use Spark at the client side.

At the server-side, Kyuubi server and engine's multi-tenant architecture
provides the administrators a way to achieve computing resource isolation,
data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.9.0.html

To learn more about Apache Kyuubi, please see
https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community
who made this release possible!

Thanks,
On behalf of Apache Kyuubi community


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is SparkR releases in Conda channel (
https://github.com/conda-forge/r-sparkr-feedstock).
This is fully run by the community unofficially.

On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh 
wrote:

> +1 for me
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud 
> wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> d...@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark Mlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose . We went through creating a slack
>>
>> community that managed to create more more heat than light.. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up.but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
OK thanks for the update.

What does officially blessed signify here? Can we have and run it as a
sister site? The reason this comes to my mind is that the interested
parties should have easy access to this site (from ISUG Spark sites) as a
reference repository. I guess the advice would be that the information
(topics) is provided on a best-effort basis and cannot be guaranteed.

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 18 Mar 2024 at 21:04, Reynold Xin  wrote:

> One of the problems in the past when something like this was brought up was
> that the ASF couldn't have officially blessed venues beyond the already
> approved ones. So that's something to look into.
>
> Now of course you are welcome to run unofficial things unblessed as long
> as they follow trademark rules.
>
>
>
> On Mon, Mar 18, 2024 at 1:53 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Well as long as it works.
>>
>> Please all check this link from Databricks and let us know your thoughts.
>> Will something similar work for us?. Of course Databricks have much deeper
>> pockets than our ASF community. Will it require moderation in our side to
>> block spams and nutcases.
>>
>> Knowledge Sharing Hub - Databricks
>> 
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
>> wrote:
>>
>>> something like this  Spark community · GitHub
>>> 
>>>
>>>
>>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud <
>>> mpars...@illumina.com.invalid>:
>>>
 Good idea. Will be useful



 +1







 *From: *ashok34...@yahoo.com.INVALID 
 *Date: *Monday, March 18, 2024 at 6:36 AM
 *To: *user @spark , Spark dev list <
 d...@spark.apache.org>, Mich Talebzadeh 
 *Cc: *Matei Zaharia 
 *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
 Apache Spark Community

 External message, be mindful when clicking links or attachments



 Good idea. Will be useful



 +1



 On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:





 Some of you may be aware that Databricks community Home | Databricks

 have just launched a knowledge sharing hub. I thought it would be a

 good idea for the Apache Spark user group to have the same, especially

 for repeat questions on Spark core, Spark SQL, Spark Structured

 Streaming, Spark Mlib and so forth.



 Apache Spark user and dev groups have been around for a good while.

 They are serving their purpose . We went through creating a slack

 community that managed to create more more heat than light.. This is

 what Databricks community came up with and I quote



 "Knowledge Sharing Hub

 Dive into a collaborative space where members like YOU can exchange

 knowledge, tips, and best practices. Join the conversation today and

 unlock a wealth of collective wisdom to enhance your experience and

 drive success."



 I don't know the logistics of setting it up.but I am sure that should

 not be that difficult. If anyone is supportive of this proposal, let

 the usual +1, 0, -1 decide



 HTH



 Mich Talebzadeh,

 Dad | Technologist | Solutions Architect | Engineer

 London

 United Kingdom





   view my Linkedin profile





 https://en.everybodywiki.com/Mich_Talebzadeh
 

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
One of the problems in the past when something like this was brought up was that 
the ASF couldn't have officially blessed venues beyond the already approved 
ones. So that's something to look into.

Now of course you are welcome to run unofficial things unblessed as long as 
they follow trademark rules.

On Mon, Mar 18, 2024 at 1:53 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:

> 
> Well as long as it works.
> 
> Please all check this link from Databricks and let us know your thoughts.
> Will something similar work for us?. Of course Databricks have much deeper
> pockets than our ASF community. Will it require moderation in our side to
> block spams and nutcases.
> 
> 
> 
> Knowledge Sharing Hub - Databricks (
> https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub
> )
> 
> 
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> 
> London
> 
> United Kingdom
> 
> 
> 
> 
> 
> 
> 
> ** view my Linkedin profile (
> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/ )
> 
> 
> 
> 
> 
> 
> 
> 
> https:/ / en. everybodywiki. com/ Mich_Talebzadeh (
> https://en.everybodywiki.com/Mich_Talebzadeh )
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one - thousand
> expert opinions ( Werner ( https://en.wikipedia.org/wiki/Wernher_von_Braun
> ) Von Braun ( https://en.wikipedia.org/wiki/Wernher_von_Braun ) )".
> 
> 
> 
> 
> 
> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen < bjornjorgensen@ gmail. com
> ( bjornjorgen...@gmail.com ) > wrote:
> 
> 
>> something like this Spark community · GitHub (
>> https://github.com/Spark-community )
>> 
>> 
>> 
>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud < mparsian@ illumina. 
>> com.
>> invalid ( mpars...@illumina.com.invalid ) >:
>> 
>> 
>>> 
>>> 
>>> Good idea. Will be useful
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> +1
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> *From:* ashok34668@ yahoo. com. INVALID ( ashok34...@yahoo.com.INVALID ) <
>>> ashok34668@ yahoo. com. INVALID ( ashok34...@yahoo.com.INVALID ) >
>>> *Date:* Monday, March 18 , 2024 at 6:36 AM
>>> *To:* user @spark < user@ spark. apache. org ( user@spark.apache.org ) >,
>>> Spark dev list < dev@ spark. apache. org ( d...@spark.apache.org ) >, Mich
>>> Talebzadeh < mich. talebzadeh@ gmail. com ( mich.talebza...@gmail.com ) >
>>> *Cc:* Matei Zaharia < matei. zaharia@ gmail. com ( matei.zaha...@gmail.com
>>> ) >
>>> *Subject:* Re: A proposal for creating a Knowledge Sharing Hub for Apache
>>> Spark Community
>>> 
>>> 
>>> 
>>> 
>>> External message, be mindful when clicking links or attachments
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Good idea. Will be useful
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> +1
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh < mich. 
>>> talebzadeh@
>>> gmail. com ( mich.talebza...@gmail.com ) > wrote:
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Some of you may be aware that Databricks community Home | Databricks
>>> 
>>> 
>>> 
>>> 
>>> have just launched a knowledge sharing hub. I thought it would be a
>>> 
>>> 
>>> 
>>> 
>>> good idea for the Apache Spark user group to have the same, especially
>>> 
>>> 
>>> 
>>> 
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>> 
>>> 
>>> 
>>> 
>>> Streaming, Spark Mlib and so forth.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Apache Spark user and dev groups have been around for a good while.
>>> 
>>> 
>>> 
>>> 
>>> They are serving their purpose . We went through creating a slack
>>> 
>>> 
>>> 
>>> 
>>> community that managed to create more more heat than light.. This is
>>> 
>>> 
>>> 
>>> 
>>> what Databricks community came up with and I quote
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> "Knowledge Sharing Hub
>>> 
>>> 
>>> 
>>> 
>>> Dive into a collaborative space where members like YOU can exchange
>>> 
>>> 
>>> 
>>> 
>>> knowledge, tips, and best practices. Join the conversation today and
>>> 
>>> 
>>> 
>>> 
>>> unlock a wealth of collective wisdom to enhance your experience and
>>> 
>>> 
>>> 
>>> 
>>> drive success."
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> I don't know the logistics of setting it up.but I am sure that should
>>> 
>>> 
>>> 
>>> 
>>> not be that difficult. If anyone is supportive of this proposal, let
>>> 
>>> 
>>> 
>>> 
>>> the usual +1, 0, -1 decide
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> HTH
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Mich Talebzadeh,
>>> 
>>> 
>>> 
>>> 
>>> Dad | Technologist | Solutions Architect | Engineer
>>> 
>>> 
>>> 
>>> 
>>> London
>>> 
>>> 
>>> 
>>> 
>>> United Kingdom
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> view my Linkedin profile

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
Well as long as it works.

Please all check this link from Databricks and let us know your thoughts.
Will something similar work for us? Of course Databricks has much deeper
pockets than our ASF community. Will it require moderation on our side to
block spam and nutcases?

Knowledge Sharing Hub - Databricks



Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
wrote:

> something like this  Spark community · GitHub
> 
>
>
> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud
> :
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> d...@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark Mlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose . We went through creating a slack
>>
>> community that managed to create more more heat than light.. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up.but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Bjørn Jørgensen
something like this  Spark community · GitHub



man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud
:

> Good idea. Will be useful
>
>
>
> +1
>
>
>
>
>
>
>
> *From: *ashok34...@yahoo.com.INVALID 
> *Date: *Monday, March 18, 2024 at 6:36 AM
> *To: *user @spark , Spark dev list <
> d...@spark.apache.org>, Mich Talebzadeh 
> *Cc: *Matei Zaharia 
> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache
> Spark Community
>
> External message, be mindful when clicking links or attachments
>
>
>
> Good idea. Will be useful
>
>
>
> +1
>
>
>
> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
>
>
> Some of you may be aware that Databricks community Home | Databricks
>
> have just launched a knowledge sharing hub. I thought it would be a
>
> good idea for the Apache Spark user group to have the same, especially
>
> for repeat questions on Spark core, Spark SQL, Spark Structured
>
> Streaming, Spark Mlib and so forth.
>
>
>
> Apache Spark user and dev groups have been around for a good while.
>
> They are serving their purpose . We went through creating a slack
>
> community that managed to create more more heat than light.. This is
>
> what Databricks community came up with and I quote
>
>
>
> "Knowledge Sharing Hub
>
> Dive into a collaborative space where members like YOU can exchange
>
> knowledge, tips, and best practices. Join the conversation today and
>
> unlock a wealth of collective wisdom to enhance your experience and
>
> drive success."
>
>
>
> I don't know the logistics of setting it up.but I am sure that should
>
> not be that difficult. If anyone is supportive of this proposal, let
>
> the usual +1, 0, -1 decide
>
>
>
> HTH
>
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>
>
>   view my Linkedin profile
>
>
>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
>
>
>
>
>
>
>
> Disclaimer: The information provided is correct to the best of my
>
> knowledge but of course cannot be guaranteed . It is essential to note
>
> that, as with any advice, quote "one test result is worth one-thousand
>
> expert opinions (Werner Von Braun)".
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Code Tutelage
+1

Thanks for proposing

On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud
 wrote:

> Good idea. Will be useful
>
>
>
> +1
>
>
>
>
>
>
>
> *From: *ashok34...@yahoo.com.INVALID 
> *Date: *Monday, March 18, 2024 at 6:36 AM
> *To: *user @spark , Spark dev list <
> d...@spark.apache.org>, Mich Talebzadeh 
> *Cc: *Matei Zaharia 
> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache
> Spark Community
>
> External message, be mindful when clicking links or attachments
>
>
>
> Good idea. Will be useful
>
>
>
> +1
>
>
>
> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
>
>
> Some of you may be aware that Databricks community Home | Databricks
>
> have just launched a knowledge sharing hub. I thought it would be a
>
> good idea for the Apache Spark user group to have the same, especially
>
> for repeat questions on Spark core, Spark SQL, Spark Structured
>
> Streaming, Spark Mlib and so forth.
>
>
>
> Apache Spark user and dev groups have been around for a good while.
>
> They are serving their purpose . We went through creating a slack
>
> community that managed to create more more heat than light.. This is
>
> what Databricks community came up with and I quote
>
>
>
> "Knowledge Sharing Hub
>
> Dive into a collaborative space where members like YOU can exchange
>
> knowledge, tips, and best practices. Join the conversation today and
>
> unlock a wealth of collective wisdom to enhance your experience and
>
> drive success."
>
>
>
> I don't know the logistics of setting it up.but I am sure that should
>
> not be that difficult. If anyone is supportive of this proposal, let
>
> the usual +1, 0, -1 decide
>
>
>
> HTH
>
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>
>
>   view my Linkedin profile
>
>
>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
>
>
>
>
>
>
>
> Disclaimer: The information provided is correct to the best of my
>
> knowledge but of course cannot be guaranteed . It is essential to note
>
> that, as with any advice, quote "one test result is worth one-thousand
>
> expert opinions (Werner Von Braun)".
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh 
wrote:

>
> "I may need something like that for synthetic data for testing. Any way to
> do that ?"
>
> Have a look at this.
>
> https://github.com/joke2k/faker
>

No I was not actually referring to data that can be faked. I want data to
actually reside on the storage or executors.

Maybe this will be better tackled in a separate thread here:

https://lists.apache.org/thread/w6f7rq7m8fj6hzwpyhvvx3c42wbmkwdq

-- 
Regards,
Sreyan Chakravarty


pyspark - Use Spark to generate a large dataset on the fly

2024-03-18 Thread Sreyan Chakravarty
Hi,

I have a specific problem where I have to get the data from REST APIs and
store it, and then do some transformations on it and then write to a RDBMS
table.

I am wondering if Spark will help in this regard.

I am confused as to how I should store the data while I actually acquire it on
the driver node.

Is there any way I can partition the data "on the fly" (i.e. during the
acquisition)?

Here are two ways I think this can be done:

Approach 1:

Run a loop on the driver node to collect all the data via HTTP requests and
then create a dataframe from it.

Problem: It will result in an OOM in the driver node as the data is so
large that it needs to be spread out. How do I do that ?

Approach 2:

Make an app and push the data received from the REST APIs to a Kafka topic.

Use Spark's structured streaming to read from that topic.

Problem: How will Spark know how to partition the data from the Kafka topic
?

Basically, my problem comes down to sending each piece of data to a worker
node as I receive it. Can that be done somehow?
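
Roughly, what I have in mind for Approach 2 is something like the sketch
below (the topic name, schema, JDBC URL and credentials are just
placeholders, not real values):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("rest_to_rdbms").getOrCreate()

# Placeholder schema for the JSON payloads the REST collector pushes to Kafka
schema = StructType([
    StructField("id", StringType()),
    StructField("value", DoubleType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "rest_api_events")            # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_batch(batch_df, batch_id):
    # Each micro-batch is already spread across the executors; the JDBC write
    # runs in parallel, one connection per partition.
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://dbhost:5432/mydb")     # placeholder
     .option("dbtable", "events")                             # placeholder
     .option("user", "spark")
     .option("password", "secret")
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/checkpoints/rest_to_rdbms")
         .start())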

-- 
Regards,
Sreyan Chakravarty


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
+1 for me

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud 
wrote:

> Good idea. Will be useful
>
>
>
> +1
>
>
>
>
>
>
>
> *From: *ashok34...@yahoo.com.INVALID 
> *Date: *Monday, March 18, 2024 at 6:36 AM
> *To: *user @spark , Spark dev list <
> d...@spark.apache.org>, Mich Talebzadeh 
> *Cc: *Matei Zaharia 
> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache
> Spark Community
>
> External message, be mindful when clicking links or attachments
>
>
>
> Good idea. Will be useful
>
>
>
> +1
>
>
>
> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
>
>
> Some of you may be aware that Databricks community Home | Databricks
>
> have just launched a knowledge sharing hub. I thought it would be a
>
> good idea for the Apache Spark user group to have the same, especially
>
> for repeat questions on Spark core, Spark SQL, Spark Structured
>
> Streaming, Spark MLlib and so forth.
>
>
>
> Apache Spark user and dev groups have been around for a good while.
>
> They are serving their purpose. We went through creating a Slack
>
> community that managed to create more heat than light. This is
>
> what Databricks community came up with and I quote
>
>
>
> "Knowledge Sharing Hub
>
> Dive into a collaborative space where members like YOU can exchange
>
> knowledge, tips, and best practices. Join the conversation today and
>
> unlock a wealth of collective wisdom to enhance your experience and
>
> drive success."
>
>
>
> I don't know the logistics of setting it up, but I am sure that should
>
> not be that difficult. If anyone is supportive of this proposal, let
>
> the usual +1, 0, -1 decide
>
>
>
> HTH
>
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>
>
>   view my Linkedin profile
>
>
>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
>
>
>
>
>
>
>
> Disclaimer: The information provided is correct to the best of my
>
> knowledge but of course cannot be guaranteed . It is essential to note
>
> that, as with any advice, quote "one test result is worth one-thousand
>
> expert opinions (Werner Von Braun)".
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Parsian, Mahmoud
Good idea. Will be useful

+1



From: ashok34...@yahoo.com.INVALID 
Date: Monday, March 18, 2024 at 6:36 AM
To: user @spark , Spark dev list 
, Mich Talebzadeh 
Cc: Matei Zaharia 
Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark 
Community
External message, be mindful when clicking links or attachments

Good idea. Will be useful

+1

On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh 
 wrote:


Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark MLlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose. We went through creating a Slack
community that managed to create more heat than light. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up, but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


  view my Linkedin profile


https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org



Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread ashok34...@yahoo.com.INVALID
 Good idea. Will be useful
+1
On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh 
 wrote:  
 
 Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark MLlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose. We went through creating a Slack
community that managed to create more heat than light. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up, but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


  view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

  

Re: [GraphX]: Prevent recomputation of DAG

2024-03-18 Thread Mich Talebzadeh
Hi,

I must admit I don't know much about this Fruchterman-Reingold (call
it FR) visualization using GraphX and Kubernetes. But you are
suggesting this slowdown issue starts after the second iteration, and
caching/persisting the graph after each iteration does not help. FR
involves many computations between vertex pairs. In MapReduce (or
shuffle) steps, data might be shuffled across the network, impacting
performance for large graphs. The usual way to verify this is through
the Spark UI, in the Stages, SQL and Executors tabs, where you will
see the time taken for each step and the amount of read/write etc.
Also, repeatedly creating and destroying GraphX graphs in each
iteration may lead to garbage collection (GC) overhead. So you should
consider profiling your application to identify bottlenecks and
pinpoint which part of the code is causing the slowdown. As I
mentioned, Spark offers profiling tools like the Spark UI or
third-party libraries for this purpose.
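
Since GraphX itself is Scala only, the sketch below just uses a plain RDD in
PySpark to illustrate the general lineage-truncation idea (checkpoint every
few iterations so an iteration does not drag the whole DAG behind it); the
interval of 10 and the paths are arbitrary choices for the sketch, not a
statement about your code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage_truncation_demo").getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoints")

rdd = sc.parallelize(range(1000))

for i in range(50):
    # every iteration adds one more step to the lineage
    rdd = rdd.map(lambda x: x + 1)
    rdd.persist()
    if i % 10 == 0:
        # checkpointing cuts the lineage, so later iterations do not
        # recompute the whole chain from the beginning
        rdd.checkpoint()
    # an action materialises this iteration (and writes the checkpoint)
    rdd.count()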

HTH


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".




On Sun, 17 Mar 2024 at 18:45, Marek Berith  wrote:
>
> Dear community,
> for my diploma thesis, we are implementing a distributed version of
> Fruchterman-Reingold visualization algorithm, using GraphX and Kubernetes. Our
> solution is a backend that continuously computes new positions of vertices in a
> graph and sends them via RabbitMQ to a consumer. Fruchterman-Reingold is an
> iterative algorithm, meaning that in each iteration repulsive and attractive
> forces between vertices are computed and then new positions of vertices based
> on those forces are computed. Graph vertices and edges are stored in a GraphX
> graph structure. Forces between vertices are computed using MapReduce(between
> each pair of vertices) and aggregateMessages(for vertices connected via
> edges). After an iteration of the algorithm, the recomputed positions from the
> RDD are serialized using collect and sent to the RabbitMQ queue.
>
> Here comes the issue. The first two iterations of the algorithm seem to be
> quick, but at the third iteration, the algorithm is very slow until it reaches
> a point at which it cannot finish an iteration in real time. It seems like
> caching of the graph may be an issue, because if we serialize the graph after
> each iteration in an array and create new graph from the array in the new
> iteration, we get a constant usage of memory and each iteration takes the same
> amount of time. We had already tried to cache/persist/checkpoint the graph
> after each iteration but it didn't help, so maybe we are doing something
> wrong. We do not think that serializing the graph into an array should be the
> solution for such a complex library like Apache Spark. I'm also not very
> confident how this fix will affect performance for large graphs or in parallel
> environment. We are attaching a short example of code that shows doing an
> iteration of the algorithm, input and output example.
>
> We would appreciate if you could help us fix this issue or give us any
> meaningful ideas, as we had tried everything that came to mind.
>
> We look forward to your reply.
> Thank you, Marek Berith
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark MLlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose. We went through creating a Slack
community that managed to create more heat than light. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up, but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
Yes, transformations are indeed executed on the worker nodes, but only when
necessary, that is, when an action is called. This lazy evaluation lets Spark
build an optimised execution plan and apply optimisations such as pipelining
transformations and removing unnecessary computations.
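
A minimal sketch of that behaviour (nothing here is specific to your job, it
just shows when work actually happens):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy_eval_demo").getOrCreate()

# Transformations only: Spark records a plan, no job runs yet
df = spark.range(1_000_000).withColumn("doubled", col("id") * 2)

# Inspect the plan without executing anything
df.explain()

# The action is what actually ships the work to the executors
print(df.count())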

"I may need something like that for synthetic data for testing. Any way to
do that ?"

Have a look at this.

https://github.com/joke2k/faker

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 18 Mar 2024 at 07:16, Sreyan Chakravarty  wrote:

>
> On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh 
> wrote:
>
>>
>> No Data Transfer During Creation: --> Data transfer occurs only when an
>> action is triggered.
>> Distributed Processing: --> DataFrames are distributed for parallel
>> execution, not stored entirely on the driver node.
>> Lazy Evaluation Optimization: --> Delaying data transfer until necessary
>> enhances performance.
>> Shuffle vs. Partitioning: --> Data movement during partitioning is not
>> considered a shuffle in Spark terminology.
>> Shuffles involve more complex data rearrangement.
>>
>
> So just to be clear, the transformations are always executed on the worker
> nodes, but execution is deferred until an action on the dataframe is
> triggered.
>
> Am I correct ?
>
> If so, then how do I generate a large dataset ?
>
> I may need something like that for synthetic data for testing. Any way to
> do that ?
>
>
> --
> Regards,
> Sreyan Chakravarty
>


Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh 
wrote:

>
> No Data Transfer During Creation: --> Data transfer occurs only when an
> action is triggered.
> Distributed Processing: --> DataFrames are distributed for parallel
> execution, not stored entirely on the driver node.
> Lazy Evaluation Optimization: --> Delaying data transfer until necessary
> enhances performance.
> Shuffle vs. Partitioning: --> Data movement during partitioning is not
> considered a shuffle in Spark terminology.
> Shuffles involve more complex data rearrangement.
>

So just to be clear, the transformations are always executed on the worker
nodes, but execution is deferred until an action on the dataframe is
triggered.

Am I correct ?

If so, then how do I generate a large dataset ?

I may need something like that for synthetic data for testing. Any way to
do that ?


-- 
Regards,
Sreyan Chakravarty


[GraphX]: Prevent recomputation of DAG

2024-03-17 Thread Marek Berith

Dear community,
for my diploma thesis, we are implementing a distributed version of 
Fruchterman-Reingold visualization algorithm, using GraphX and Kubernetes. Our 
solution is a backend that continuously computes new positions of vertices in a
graph and sends them via RabbitMQ to a consumer. Fruchterman-Reingold is an 
iterative algorithm, meaning that in each iteration repulsive and attractive 
forces between vertices are computed and then new positions of vertices based 
on those forces are computed. Graph vertices and edges are stored in a GraphX 
graph structure. Forces between vertices are computed using MapReduce(between 
each pair of vertices) and aggregateMessages(for vertices connected via 
edges). After an iteration of the algorithm, the recomputed positions from the 
RDD are serialized using collect and sent to the RabbitMQ queue.


Here comes the issue. The first two iterations of the algorithm seem to be 
quick, but at the third iteration, the algorithm is very slow until it reaches 
a point at which it cannot finish an iteration in real time. It seems like 
caching of the graph may be an issue, because if we serialize the graph after 
each iteration in an array and create new graph from the array in the new 
iteration, we get a constant usage of memory and each iteration takes the same 
amount of time. We had already tried to cache/persist/checkpoint the graph 
after each iteration but it didn't help, so maybe we are doing something 
wrong. We do not think that serializing the graph into an array should be the 
solution for such a complex library like Apache Spark. I'm also not very 
confident how this fix will affect performance for large graphs or in parallel 
environment. We are attaching a short example of code that shows doing an 
iteration of the algorithm, input and output example.


We would appreciate if you could help us fix this issue or give us any 
meaningful ideas, as we had tried everything that came to mind.


We look forward to your reply.
Thank you, Marek Berith
def iterate(
    sc: SparkContext,
    graph: graphx.Graph[GeneralVertex, EdgeProperty],
    metaGraph: graphx.Graph[GeneralVertex, EdgeProperty])
  : (graphx.Graph[GeneralVertex, EdgeProperty], graphx.Graph[GeneralVertex, EdgeProperty]) = {
  val attractiveDisplacement: VertexRDD[(VertexId, Vector)] =
    this.calculateAttractiveForces(graph)
  val repulsiveDisplacement: RDD[(VertexId, Vector)] =
    this.calculateRepulsiveForces(graph)
  val metaVertexDisplacement: RDD[(VertexId, Vector)] =
    this.calculateMetaVertexForces(graph, metaGraph.vertices)
  val metaEdgeDisplacement: RDD[(VertexId, Vector)] =
    this.calculateMetaEdgeForces(metaGraph)
  val displacements: RDD[(VertexId, Vector)] = this.combineDisplacements(
    attractiveDisplacement,
    repulsiveDisplacement,
    metaVertexDisplacement,
    metaEdgeDisplacement)
  val newVertices: RDD[(VertexId, GeneralVertex)] =
    this.displaceVertices(graph, displacements)
  val newGraph = graphx.Graph(newVertices, graph.edges)
  // persist or checkpoint or cache? or something else?
  newGraph.persist()
  metaGraph.persist()
  (newGraph, metaGraph)
}

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: [External] Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-17 Thread Ofir Manor
Just to add - the latest version is 0.8.3, it seems to support 3.3:
"Support Spark 3.3 / Scala 2.12 , Spark 3.4 / Scala 2.12 and Scala 2.13, Spark 
3.5 / Scala 2.12 and Scala 2.13"
Releases · graphframes/graphframes 
(github.com)
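
For what it is worth, a minimal PySpark sketch of pulling the package in at
session start. The exact coordinate below is my assumption based on the
release notes above (the Spark 3.3 / Scala 2.12 build of 0.8.3), so please
confirm the artifact name on the releases page for your Spark/Scala build:

from pyspark.sql import SparkSession

# Coordinate is assumed from the 0.8.3 release notes; verify it before use
spark = (SparkSession.builder
         .appName("graphframes_check")
         .config("spark.jars.packages", "graphframes:graphframes:0.8.3-spark3.3-s_2.12")
         .getOrCreate())

from graphframes import GraphFrame

vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()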
   Ofir

From: Russell Jurney 
Sent: Friday, March 15, 2024 11:43 PM
To: brad.boil...@fcc-fac.ca.invalid 
Cc: user@spark.apache.org 
Subject: [External] Re: [GraphFrames Spark Package]: Why is there not a 
distribution for Spark 3.3?

There is an implementation for Spark 3, but GraphFrames isn't released often 
enough to match every point version. It supports Spark 3.4. Try it - it will 
probably work. https://spark-packages.org/package/graphframes/graphframes

Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com 
LI FB 
datasyndrome.com Book a time on 
Calendly


On Fri, Jan 12, 2024 at 7:55 AM Boileau, Brad  
wrote:

Hello,



I was hoping to use a distribution of GraphFrames for AWS Glue 4 which has 
spark 3.3, but there is no found distribution for Spark 3.3 at this location:



https://spark-packages.org/package/graphframes/graphframes



Do you have any advice on the best compatible version to use for Spark 3.3?



Sincerely,



Brad Boileau

Senior Product Architect / Architecte produit sénior
Farm Credit Canada | Financement agricole Canada
1820 Hamilton Street / 1820, rue Hamilton

Regina SK  S4P 2B8

Tel/Tél. : 306-359, C/M: 306-737-8900

fcc.ca / fac.ca

FCC social media / 
Médias sociaux FAC






This email, including attachments, is confidential. You may not share this 
email with any third party. If you are not the intended recipient, any 
redistribution or copying of this email is prohibited. If you have received 
this email in error or cannot comply with these restrictions, please delete or 
destroy it entirely and immediately without making a copy and notify us by 
return email.

Ce courriel (y compris toutes les pièces jointes qu’il comporte) est 
confidentiel. Vous ne pouvez pas partager ce courriel avec des tiers. Si vous 
n’êtes pas le destinataire prévu, toute divulgation, reproduction, copie ou 
distribution de ce courriel est strictement interdite. Si vous avez reçu ce 
courriel par erreur ou ne pouvez pas respecter ces restrictions, merci de le 
supprimer ou de le détruire complètement et immédiatement, sans le dupliquer, 
et de nous aviser par retour de courriel.


Unsubscribe from FCC marketing-related 
messages.
 (Customers will still receive messages related to business transactions.)

Se désabonner pour ne plus recevoir de messages liés au marketing de la part de 
FAC.
 (Les clients continueront de recevoir des messages concernant leurs 
transactions.)


Python library that generates fake data using Faker

2024-03-16 Thread Mich Talebzadeh
I came across this a few weeks ago. In a nutshell, you can use it for
generating test data and other scenarios where you need
realistic-looking but not necessarily real data. With so many
regulations, copyrights etc., it is a viable alternative. I used it
to generate 1000 lines of mixed genuine and fraudulent transactions to
build a machine learning model to detect fraudulent transactions using
PySpark's MLlib library. You can install it via pip install Faker.

Details from

https://github.com/joke2k/faker
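
As a rough illustration of the kind of thing described above (the column
names and the 5% fraud ratio are made up for this sketch, not taken from the
actual model):

from faker import Faker
import random
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("faker_demo").getOrCreate()
fake = Faker()
Faker.seed(42)
random.seed(42)

# 1000 synthetic transactions, a small fraction flagged as fraudulent
rows = [
    Row(transaction_id=fake.uuid4(),
        customer=fake.name(),
        amount=round(random.uniform(1.0, 5000.0), 2),
        created_at=fake.date_time_this_year(),
        is_fraud=random.random() < 0.05)
    for _ in range(1000)
]

df = spark.createDataFrame(rows)
df.show(5, truncate=False)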

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-15 Thread Russell Jurney
There is an implementation for Spark 3, but GraphFrames isn't released
often enough to match every point version. It supports Spark 3.4. Try it -
it will probably work.
https://spark-packages.org/package/graphframes/graphframes

Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com LI  FB
 datasyndrome.com Book a time on Calendly



On Fri, Jan 12, 2024 at 7:55 AM Boileau, Brad
 wrote:

> Hello,
>
>
>
> I was hoping to use a distribution of GraphFrames for AWS Glue 4 which has
> spark 3.3, but there is no found distribution for Spark 3.3 at this
> location:
>
>
>
> https://spark-packages.org/package/graphframes/graphframes
>
>
>
> Do you have any advice on the best compatible version to use for Spark 3.3?
>
>
>
> Sincerely,
>
>
>
> *Brad Boileau*
>
> *Senior Product Architect** / **Architecte produit sénior*
> *Farm Credit Canada | Financement agricole Canada*
> 1820 Hamilton Street / 1820, rue Hamilton
>
> Regina SK  S4P 2B8
>
> Tel/Tél. : 306-359, C/M: 306-737-8900
>
> fcc.ca  / fac.ca
> 
>
> FCC social media  /
>  Médias sociaux FAC
> 
>
>
>
>
>
>
> *This email, including attachments, is confidential. You may not share
> this email with any third party.* If you are not the intended recipient,
> any redistribution or copying of this email is prohibited. If you have
> received this email in error or cannot comply with these restrictions,
> please delete or destroy it entirely and immediately without making a copy
> and notify us by return email.
>
> *Ce courriel (y compris toutes les pièces jointes qu’il comporte) est
> confidentiel. Vous ne pouvez pas partager ce courriel avec des tiers.* Si
> vous n’êtes pas le destinataire prévu, toute divulgation, reproduction,
> copie ou distribution de ce courriel est strictement interdite. Si vous
> avez reçu ce courriel par erreur ou ne pouvez pas respecter ces
> restrictions, merci de le supprimer ou de le détruire complètement et
> immédiatement, sans le dupliquer, et de nous aviser par retour de courriel.
>
> Unsubscribe from FCC marketing-related messages.
> 
> (Customers will still receive messages related to business transactions.)
>
> Se désabonner pour ne plus recevoir de messages liés au marketing de la
> part de FAC.
> 
> (Les clients continueront de recevoir des messages concernant leurs
> transactions.)
>


Requesting further assistance with Spark Scala code coverage

2024-03-14 Thread 里昂
I have sent out an email regarding Spark coverage, but haven't received any 
response. I'm hoping someone could provide an answer on whether there is 
currently any code coverage statistics available for Scala code in Spark.

https://lists.apache.org/thread/hob7x42gk3q244t9b0d8phwjtxjk2plt



Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi,

When you create a DataFrame from Python objects using
spark.createDataFrame, here is what happens:


*Initial Local Creation:*
The DataFrame is initially created in the memory of the driver node. The
data is not yet distributed to executors at this point.

*The role of lazy Evaluation:*

Spark applies lazy evaluation, *meaning transformations are not executed
immediately*.  It constructs a logical plan describing the operations, but
data movement does not occur yet.

*Action Trigger:*

When you initiate an action (things like show(), collect(), etc), Spark
triggers the execution.



*When partitioning and distribution come in:*

Spark partitions the DataFrame into logical chunks for parallel processing.
It divides the data based on a partitioning scheme (default is hash
partitioning). Each partition is sent to different executor nodes for
distributed execution. This stage involves data transfer across the
cluster, but it is not the expensive shuffle you have heard of. Shuffles
happen with repartitioning or certain join operations.

*Storage on Executors:*

Executors receive their assigned partitions and store them in their
memory. If memory is limited, Spark spills partitions to disk. Look at the
Stages tab in the UI (default port 4040).


*In summary:*
No Data Transfer During Creation: --> Data transfer occurs only when an
action is triggered.
Distributed Processing: --> DataFrames are distributed for parallel
execution, not stored entirely on the driver node.
Lazy Evaluation Optimization: --> Delaying data transfer until necessary
enhances performance.
Shuffle vs. Partitioning: --> Data movement during partitioning is not
considered a shuffle in Spark terminology.
Shuffles involve more complex data rearrangement.

*Considerations: *
Large DataFrames: For very large DataFrames

   - manage memory carefully to avoid out-of-memory errors. Consider
   options like:
   - Increasing executor memory
   - Using partitioning strategies to optimize memory usage
   - Employing techniques like checkpointing to persistent storage (hard
   disks) or caching for memory efficiency
   - You can get additional info from the Spark UI (default port 4040), in
   tabs like SQL and Executors
   - Spark uses Catalyst optimiser for efficient execution plans.
   df.explain("extended") shows both logical and physical plans
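
To make this concrete, here is a small sketch using the DataFrame from your
example; creating it only builds a plan on the driver, while the action at
the end is what actually distributes and runs the work:

from pyspark.sql import Row, SparkSession
import datetime

spark = SparkSession.builder.appName("df_storage_demo").getOrCreate()

courses = [Row(course_id=1,
               course_title='Mastering Python',
               course_published_dt=datetime.date(2021, 1, 14),
               is_active=True,
               last_updated_ts=datetime.datetime(2021, 2, 18, 16, 57, 25))]

# Built on the driver; no job is submitted at this point
courses_df = spark.createDataFrame(courses)

# Inspect the logical and physical plans without triggering execution
courses_df.explain("extended")

# The action below is what ships partitions to the executors
courses_df.show()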

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Thu, 14 Mar 2024 at 19:46, Sreyan Chakravarty  wrote:

> I am trying to understand Spark Architecture.
>
> For Dataframes that are created from python objects ie. that are *created
> in memory where are they stored ?*
>
> Take following example:
>
> from pyspark.sql import Row
> import datetime
> courses = [
> {
> 'course_id': 1,
> 'course_title': 'Mastering Python',
> 'course_published_dt': datetime.date(2021, 1, 14),
> 'is_active': True,
> 'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
> }
>
> ]
>
>
> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>
>
> Where is the dataframe stored when I invoke the call:
>
> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>
> Does it:
>
>    1. Send the data to a random executor?
>       - Does this mean this counts as a shuffle?
>
>    2. Or does it stay on the driver node?
>       - That does not make sense when the dataframe grows large.
>
>
> --
> Regards,
> Sreyan Chakravarty
>


pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Sreyan Chakravarty
I am trying to understand Spark Architecture.

For Dataframes that are created from python objects ie. that are *created
in memory where are they stored ?*

Take following example:

from pyspark.sql import Row
import datetime
courses = [
{
'course_id': 1,
'course_title': 'Mastering Python',
'course_published_dt': datetime.date(2021, 1, 14),
'is_active': True,
'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
}

]


courses_df = spark.createDataFrame([Row(**course) for course in courses])


Where is the dataframe stored when I invoke the call:

courses_df = spark.createDataFrame([Row(**course) for course in courses])

Does it:

   1. Send the data to a random executor?
      - Does this mean this counts as a shuffle?

   2. Or does it stay on the driver node?
      - That does not make sense when the dataframe grows large.


-- 
Regards,
Sreyan Chakravarty


Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
Thanks for the clarification. That makes sense. In the code below, we can
see:

    def onQueryProgress(self, event):
        print("onQueryProgress")
        # Access micro-batch data
        microbatch_data = event.progress
        # print("microbatch_data received")  # Check if data is received
        # print(microbatch_data)
        print(f"Type of microbatch_data is {type(microbatch_data)}")
        # processedRowsPerSecond = microbatch_data.get("processedRowsPerSecond")  # incorrect
        processedRowsPerSecond = microbatch_data.processedRowsPerSecond
        if processedRowsPerSecond is not None:  # Check if value exists
            print("processedRowsPerSecond retrieved")
            print(f"Processed rows per second is -> {processedRowsPerSecond}")
        else:
            print("processedRowsPerSecond not retrieved!")

The output

onQueryProgress
Type of microbatch_data is <class 'pyspark.sql.streaming.listener.StreamingQueryProgress'>
processedRowsPerSecond retrieved
Processed rows per second is -> 2.570694087403599

So we are dealing with the attribute of the class and NOT the dictionary.

The line (processedRowsPerSecond =
microbatch_data.get("processedRowsPerSecond")) fails because it uses the
.get() method, while the second line (processedRowsPerSecond =
microbatch_data.processedRowsPerSecond) accesses the attribute directly.

In short, they need to ensure that event.progress returns a dictionary.
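
Until that is made consistent, a defensive accessor along these lines (just a
sketch, assuming only that the field keeps the name processedRowsPerSecond)
works with either shape:

def rows_per_second(progress):
    """Return processedRowsPerSecond whether progress is a plain dict
    (query.lastProgress) or a StreamingQueryProgress object (event.progress)."""
    if isinstance(progress, dict):
        return progress.get("processedRowsPerSecond")
    return getattr(progress, "processedRowsPerSecond", None)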

Cheers

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 12 Mar 2024 at 04:04, 刘唯  wrote:

> Oh I see why the confusion.
>
> microbatch_data = event.progress
>
> means that microbatch_data is a StreamingQueryProgress instance, it's not
> a dictionary, so you should use ` microbatch_data.processedRowsPerSecond`,
> instead of the `get` method which is used for dictionaries.
>
> But weirdly, for query.lastProgress and query.recentProgress, they should
> return StreamingQueryProgress  but instead they returned a dict. So the
> `get` method works there.
>
> I think PySpark should improve on this part.
>
> Mich Talebzadeh  于2024年3月11日周一 05:51写道:
>
>> Hi,
>>
>> Thank you for your advice
>>
>> This is the amended code
>>
>>def onQueryProgress(self, event):
>> print("onQueryProgress")
>> # Access micro-batch data
>> microbatch_data = event.progress
>> #print("microbatch_data received")  # Check if data is received
>> #print(microbatch_data)
>> #processed_rows_per_second =
>> microbatch_data.get("processed_rows_per_second")
>> processed_rows_per_second =
>> microbatch_data.get("processedRowsPerSecond")
>> print("CPC", processed_rows_per_second)
>> if processed_rows_per_second is not None:  # Check if value exists
>>print("ocessed_rows_per_second retrieved")
>>print(f"Processed rows per second:
>> {processed_rows_per_second}")
>> else:
>>print("processed_rows_per_second not retrieved!")
>>
>> This is the output
>>
>> onQueryStarted
>> 'None' [c1a910e6-41bb-493f-b15b-7863d07ff3fe] got started!
>> SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
>> SLF4J: Defaulting to no-operation MDCAdapter implementation.
>> SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for
>> further details.
>> ---
>> Batch: 0
>> ---
>> +---+-+---+---+
>> |key|doubled_value|op_type|op_time|
>> +---+-+---+---+
>> +---+-+---+---+
>>
>> onQueryProgress
>> ---
>> Batch: 1
>> ---
>> ++-+---++
>> | key|doubled_value|op_type| op_time|
>> ++-+---++
>> |a960f663-d13a-49c...|0|  1|2024-03-11 12:17:...|
>> ++-+---++
>>
>> onQueryProgress
>> ---
>> Batch: 2
>> ---
>> ++-+---++
>> | key|doubled_value|op_type| op_time|
>> ++-+---++
>> |a960f663-d13a-49c...|2|  1|2024-03-11 12:17:...|
>> ++-+---++
>>
>> I am afraid it is not 

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread 刘唯
Oh I see why the confusion.

microbatch_data = event.progress

means that microbatch_data is a StreamingQueryProgress instance, it's not a
dictionary, so you should use ` microbatch_data.processedRowsPerSecond`,
instead of the `get` method which is used for dictionaries.

But weirdly, for query.lastProgress and query.recentProgress, they should
return StreamingQueryProgress  but instead they returned a dict. So the
`get` method works there.

I think PySpark should improve on this part.
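
In code terms, the attribute-style access looks like this (a bare-bones
version of the listener in Mich's sample code, nothing more):

from pyspark.sql.streaming import StreamingQueryListener

class RateListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # event.progress is a StreamingQueryProgress object, so attribute access works
        print(event.progress.processedRowsPerSecond)

    def onQueryTerminated(self, event):
        pass

# register with: spark.streams.addListener(RateListener())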

Mich Talebzadeh  于2024年3月11日周一 05:51写道:

> Hi,
>
> Thank you for your advice
>
> This is the amended code
>
>def onQueryProgress(self, event):
> print("onQueryProgress")
> # Access micro-batch data
> microbatch_data = event.progress
> #print("microbatch_data received")  # Check if data is received
> #print(microbatch_data)
> #processed_rows_per_second =
> microbatch_data.get("processed_rows_per_second")
> processed_rows_per_second =
> microbatch_data.get("processedRowsPerSecond")
> print("CPC", processed_rows_per_second)
> if processed_rows_per_second is not None:  # Check if value exists
>print("ocessed_rows_per_second retrieved")
>print(f"Processed rows per second: {processed_rows_per_second}")
> else:
>print("processed_rows_per_second not retrieved!")
>
> This is the output
>
> onQueryStarted
> 'None' [c1a910e6-41bb-493f-b15b-7863d07ff3fe] got started!
> SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
> SLF4J: Defaulting to no-operation MDCAdapter implementation.
> SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for
> further details.
> ---
> Batch: 0
> ---
> +---+-+---+---+
> |key|doubled_value|op_type|op_time|
> +---+-+---+---+
> +---+-+---+---+
>
> onQueryProgress
> ---
> Batch: 1
> ---
> ++-+---++
> | key|doubled_value|op_type| op_time|
> ++-+---++
> |a960f663-d13a-49c...|0|  1|2024-03-11 12:17:...|
> ++-+---++
>
> onQueryProgress
> ---
> Batch: 2
> ---
> ++-+---++
> | key|doubled_value|op_type| op_time|
> ++-+---++
> |a960f663-d13a-49c...|2|  1|2024-03-11 12:17:...|
> ++-+---++
>
> I am afraid it is not working. Not even printing anything
>
> Cheers
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my kno
> wledge but of course cannot be guaranteed . It is essential to note that,
> as with any advice, quote "one test result is worth one-thousand expert op
> inions (Werner  Von Braun
> )".
>
>
> On Mon, 11 Mar 2024 at 05:07, 刘唯  wrote:
>
>> *now -> not
>>
>> 刘唯  于2024年3月10日周日 22:04写道:
>>
>>> Have you tried using microbatch_data.get("processedRowsPerSecond")?
>>> Camel case now snake case
>>>
>>> Mich Talebzadeh  于2024年3月10日周日 11:46写道:
>>>

 There is a paper from Databricks on this subject


 https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html

 But having tested it, there seems to be a bug there that I reported to
 Databricks forum as well (in answer to a user question)

 I have come to a conclusion that this is a bug. In general there is a b
 ug in obtaining individual values from the dictionary. For example, a b
 ug in the way Spark Streaming is populating the processe
 d_rows_per_second key within the microbatch_data -> microbatch_data =
 event.progres dictionary or any other key. I have explored various deb
 ugging steps, and even though the key seems to exist, the value might n
 ot be getting set. Note that the dictionary itself prints the elements
 correctly. This is with regard to method onQueryProgress(self, event)
 in class MyListener(StreamingQueryListener):

 For example with print(microbatch_data), you get all printed as below

 onQueryProgress
 microbatch_data received
 {
 "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
 "runId" : 

Data ingestion into elastic failing using pyspark

2024-03-11 Thread Karthick Nk
Hi @all,

I am using a PySpark program to write data into an Elasticsearch index using
the upsert operation (sample code snippet below).

def writeDataToES(final_df):
    write_options = {
        "es.nodes": elastic_host,
        "es.net.ssl": "false",
        "es.nodes.wan.only": "true",
        "es.net.http.auth.user": elastic_user_name,
        "es.net.http.auth.pass": elastic_password,
        "es.port": elastic_port,
        "es.net.ssl": "true",
        'es.spark.dataframe.write.null': "true",
        "es.mapping.id": mapping_id,
        "es.write.operation": "upsert"
    }
    final_df.write.format("org.elasticsearch.spark.sql") \
        .options(**write_options) \
        .mode("append") \
        .save(f"{index_name}")


While writing data from a Delta table to the Elastic index, I am getting an
error for a few records (error message below).

*Py4JJavaError: An error occurred while calling o1305.save.*
*: org.apache.spark.SparkException: Job aborted due to stage failure: Task
4 in stage 524.0 failed 4 times, most recent failure: Lost task 4.3 in
stage 524.0 (TID 12805) (192.168.128.16 executor 1):
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for
bulk operation [1/1]. Error sample (first [5] error messages):*
* org.elasticsearch.hadoop.rest.EsHadoopRemoteException:
illegal_argument_exception: Illegal group reference: group index is missing*

Could you guide me on this? Am I missing anything?

If you require more additional details, please let me know.

Thanks


Re: Bugs with joins and SQL in Structured Streaming

2024-03-11 Thread Andrzej Zera
Hi,

Do you think there is any chance for this issue to get resolved? Should I
create another bug report? As mentioned in my message, there is one open
already: https://issues.apache.org/jira/browse/SPARK-45637 but it covers
only one of the problems.

Andrzej

wt., 27 lut 2024 o 09:58 Andrzej Zera  napisał(a):

> Hi,
>
> Yes, I tested all of them on spark 3.5.
>
> Regards,
> Andrzej
>
>
> pon., 26 lut 2024 o 23:24 Mich Talebzadeh 
> napisał(a):
>
>> Hi,
>>
>> These are all on spark 3.5, correct?
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 26 Feb 2024 at 22:18, Andrzej Zera  wrote:
>>
>>> Hey all,
>>>
>>> I've been using Structured Streaming in production for almost a year
>>> already and I want to share the bugs I found in this time. I created a test
>>> for each of the issues and put them all here:
>>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala
>>>
>>> I split the issues into three groups: outer joins on event time,
>>> interval joins and Spark SQL.
>>>
>>> Issues related to outer joins:
>>>
>>>- When joining three or more input streams on event time, if two or
>>>more streams don't contain an event for a join key (which is event time),
>>>no row will be output even if other streams contain an event for this 
>>> join
>>>key. Tests that check for this:
>>>
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L86
>>>and
>>>
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L169
>>>- When joining aggregated stream with raw events with a stream with
>>>already aggregated events (aggregation made outside of Spark), then no 
>>> row
>>>will be output if that second stream don't contain a corresponding event.
>>>Test that checks for this:
>>>
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L266
>>>- When joining two aggregated streams (aggregated in Spark), no
>>>result is produced. Test that checks for this:
>>>
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L341.
>>>I've already reported this one here:
>>>https://issues.apache.org/jira/browse/SPARK-45637 but it hasn't been
>>>handled yet.
>>>
>>> Issues related to interval joins:
>>>
>>>- When joining three streams (A, B, C) using interval join on event
>>>time, in the way that B.eventTime is conditioned on A.eventTime and
>>>C.eventTime is also conditioned on A.eventTime, and then doing window
>>>aggregation based on A's event time, the result is output only after
>>>watermark crosses the window end + interval(A, B) + interval (A, C).
>>>However, I'd expect results to be output faster, i.e. when the watermark
>>>crosses window end + MAX(interval(A, B) + interval (A, C)). If our case 
>>> is
>>>that event B can happen 3 minutes after event A and event C can happen 5
>>>minutes after A, there is no point to suspend reporting output for 8
>>>minutes (3+5) after the end of the window if we know that no more event 
>>> can
>>>be matched after 5 min from the window end (assuming window end is based 
>>> on
>>>A's event time). Test that checks for this:
>>>
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/IntervalJoinTest.scala#L32
>>>
>>> SQL issues:
>>>
>>>- WITH clause (in contrast to subquery) seems to create a static
>>>DataFrame that can't be used in streaming joins. Test that checks for 
>>> this:
>>>
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/SqlSyntaxTest.scala#L31
>>>- Two subqueries, each aggregating data using window() function,
>>>breaks the output schema. Test that checks for this:
>>>
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/SqlSyntaxTest.scala#L122
>>>
>>> I'm a beginner with Scala (I'm using Structured Streaming with PySpark)
>>> so won't be able to provide fixes. But I hope the test cases I provided can
>>> be of some 

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
Hi,

Thank you for your advice

This is the amended code

    def onQueryProgress(self, event):
        print("onQueryProgress")
        # Access micro-batch data
        microbatch_data = event.progress
        # print("microbatch_data received")  # Check if data is received
        # print(microbatch_data)
        # processed_rows_per_second = microbatch_data.get("processed_rows_per_second")
        processed_rows_per_second = microbatch_data.get("processedRowsPerSecond")
        print("CPC", processed_rows_per_second)
        if processed_rows_per_second is not None:  # Check if value exists
            print("processed_rows_per_second retrieved")
            print(f"Processed rows per second: {processed_rows_per_second}")
        else:
            print("processed_rows_per_second not retrieved!")

This is the output

onQueryStarted
'None' [c1a910e6-41bb-493f-b15b-7863d07ff3fe] got started!
SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further
details.
---
Batch: 0
---
+---+-+---+---+
|key|doubled_value|op_type|op_time|
+---+-+---+---+
+---+-+---+---+

onQueryProgress
---
Batch: 1
---
++-+---++
| key|doubled_value|op_type| op_time|
++-+---++
|a960f663-d13a-49c...|0|  1|2024-03-11 12:17:...|
++-+---++

onQueryProgress
---
Batch: 2
---
++-+---++
| key|doubled_value|op_type| op_time|
++-+---++
|a960f663-d13a-49c...|2|  1|2024-03-11 12:17:...|
++-+---++

I am afraid it is not working. Not even printing anything

Cheers

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 11 Mar 2024 at 05:07, 刘唯  wrote:

> *now -> not
>
> 刘唯  于2024年3月10日周日 22:04写道:
>
>> Have you tried using microbatch_data.get("processedRowsPerSecond")?
>> Camel case now snake case
>>
>> Mich Talebzadeh  于2024年3月10日周日 11:46写道:
>>
>>>
>>> There is a paper from Databricks on this subject
>>>
>>>
>>> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
>>>
>>> But having tested it, there seems to be a bug there that I reported to
>>> Databricks forum as well (in answer to a user question)
>>>
>>> I have come to a conclusion that this is a bug. In general there is a
>>> bug in obtaining individual values from the dictionary. For example, a bug
>>> in the way Spark Streaming is populating the processed_rows_per_second key
>>> within the microbatch_data -> microbatch_data = event.progres dictionary or
>>> any other key. I have explored various debugging steps, and even though the
>>> key seems to exist, the value might not be getting set. Note that the
>>> dictionary itself prints the elements correctly. This is with regard to
>>> method onQueryProgress(self, event) in class
>>> MyListener(StreamingQueryListener):
>>>
>>> For example with print(microbatch_data), you get all printed as below
>>>
>>> onQueryProgress
>>> microbatch_data received
>>> {
>>> "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
>>> "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
>>> "name" : null,
>>> "timestamp" : "2024-03-10T09:21:27.233Z",
>>> "batchId" : 21,
>>> "numInputRows" : 1,
>>> "inputRowsPerSecond" : 100.0,
>>> "processedRowsPerSecond" : 5.347593582887701,
>>> "durationMs" : {
>>> "addBatch" : 37,
>>> "commitOffsets" : 41,
>>> "getBatch" : 0,
>>> "latestOffset" : 0,
>>> "queryPlanning" : 5,
>>> "triggerExecution" : 187,
>>> "walCommit" : 104
>>> },
>>> "stateOperators" : [ ],
>>> "sources" : [ {
>>> "description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0,
>>> numPartitions=default",
>>> "startOffset" : 20,
>>> "endOffset" : 21,
>>> "latestOffset" : 21,
>>> "numInputRows" : 1,
>>> "inputRowsPerSecond" : 100.0,
>>> "processedRowsPerSecond" : 

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
*now -> not

刘唯  于2024年3月10日周日 22:04写道:

> Have you tried using microbatch_data.get("processedRowsPerSecond")?
> Camel case now snake case
>
> Mich Talebzadeh  于2024年3月10日周日 11:46写道:
>
>>
>> There is a paper from Databricks on this subject
>>
>>
>> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
>>
>> But having tested it, there seems to be a bug there that I reported to
>> Databricks forum as well (in answer to a user question)
>>
>> I have come to a conclusion that this is a bug. In general there is a bug
>> in obtaining individual values from the dictionary. For example, a bug in
>> the way Spark Streaming is populating the processed_rows_per_second key
>> within the microbatch_data -> microbatch_data = event.progres dictionary or
>> any other key. I have explored various debugging steps, and even though the
>> key seems to exist, the value might not be getting set. Note that the
>> dictionary itself prints the elements correctly. This is with regard to
>> method onQueryProgress(self, event) in class
>> MyListener(StreamingQueryListener):
>>
>> For example with print(microbatch_data), you get all printed as below
>>
>> onQueryProgress
>> microbatch_data received
>> {
>> "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
>> "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
>> "name" : null,
>> "timestamp" : "2024-03-10T09:21:27.233Z",
>> "batchId" : 21,
>> "numInputRows" : 1,
>> "inputRowsPerSecond" : 100.0,
>> "processedRowsPerSecond" : 5.347593582887701,
>> "durationMs" : {
>> "addBatch" : 37,
>> "commitOffsets" : 41,
>> "getBatch" : 0,
>> "latestOffset" : 0,
>> "queryPlanning" : 5,
>> "triggerExecution" : 187,
>> "walCommit" : 104
>> },
>> "stateOperators" : [ ],
>> "sources" : [ {
>> "description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0,
>> numPartitions=default",
>> "startOffset" : 20,
>> "endOffset" : 21,
>> "latestOffset" : 21,
>> "numInputRows" : 1,
>> "inputRowsPerSecond" : 100.0,
>> "processedRowsPerSecond" : 5.347593582887701
>> } ],
>> "sink" : {
>> "description" :
>> "org.apache.spark.sql.execution.streaming.ConsoleTable$@430a977c",
>> "numOutputRows" : 1
>> }
>> }
>> However, the observed behaviour (i.e. processed_rows_per_second is either
>> None or not being updated correctly).
>>
>> The spark version I used for my test is 3.4
>>
>> Sample code uses format=rate for simulating a streaming process. You can
>> test the code yourself, all in one
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import col
>> from pyspark.sql.streaming import DataStreamWriter, StreamingQueryListener
>> from pyspark.sql.functions import col, round, current_timestamp, lit
>> import uuid
>>
>> def process_data(df):
>>
>> processed_df = df.withColumn("key", lit(str(uuid.uuid4()))). \
>>   withColumn("doubled_value", col("value") * 2). \
>>   withColumn("op_type", lit(1)). \
>>   withColumn("op_time", current_timestamp())
>>
>> return processed_df
>>
>> # Create a Spark session
>> appName = "testListener"
>> spark = SparkSession.builder.appName(appName).getOrCreate()
>>
>> # Define the schema for the streaming data
>> schema = "key string timestamp timestamp, value long"
>>
>> # Define my listener.
>> class MyListener(StreamingQueryListener):
>> def onQueryStarted(self, event):
>> print("onQueryStarted")
>> print(f"'{event.name}' [{event.id}] got started!")
>> def onQueryProgress(self, event):
>> print("onQueryProgress")
>> # Access micro-batch data
>> microbatch_data = event.progress
>> print("microbatch_data received")  # Check if data is received
>> print(microbatch_data)
>> processed_rows_per_second =
>> microbatch_data.get("processed_rows_per_second")
>> if processed_rows_per_second is not None:  # Check if value exists
>>print("processed_rows_per_second retrieved")
>>print(f"Processed rows per second:
>> {processed_rows_per_second}")
>> else:
>>print("processed_rows_per_second not retrieved!")
>> def onQueryTerminated(self, event):
>> print("onQueryTerminated")
>> if event.exception:
>> print(f"Query terminated with exception: {event.exception}")
>> else:
>> print("Query successfully terminated.")
>> # Add my listener.
>>
>> listener_instance = MyListener()
>> spark.streams.addListener(listener_instance)
>>
>>
>> # Create a streaming DataFrame with the rate source
>> streaming_df = (
>> spark.readStream
>> .format("rate")
>> .option("rowsPerSecond", 1)
>> .load()
>> )
>>
>> # Apply processing function to the streaming DataFrame
>> processed_streaming_df = process_data(streaming_df)
>>
>> # Define the output sink (for example, console sink)
>> query = (
>> processed_streaming_df.select( \
>>   col("key").alias("key") \
>> , 

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
Have you tried using microbatch_data.get("processedRowsPerSecond")?
Camel case now snake case

Mich Talebzadeh  于2024年3月10日周日 11:46写道:

>
> There is a paper from Databricks on this subject
>
>
> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
>
> But having tested it, there seems to be a bug there that I reported to
> Databricks forum as well (in answer to a user question)
>
> I have come to the conclusion that this is a bug. In general there is a bug
> in obtaining individual values from the dictionary. For example, there is a
> bug in the way Spark Streaming populates the processed_rows_per_second key
> within the microbatch_data dictionary (microbatch_data = event.progress), or
> any other key. I have explored various debugging steps, and even though the
> key seems to exist, the value might not be getting set. Note that the
> dictionary itself prints its elements correctly. This is with regard to the
> method onQueryProgress(self, event) in class
> MyListener(StreamingQueryListener):
>
> For example with print(microbatch_data), you get all printed as below
>
> onQueryProgress
> microbatch_data received
> {
> "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
> "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
> "name" : null,
> "timestamp" : "2024-03-10T09:21:27.233Z",
> "batchId" : 21,
> "numInputRows" : 1,
> "inputRowsPerSecond" : 100.0,
> "processedRowsPerSecond" : 5.347593582887701,
> "durationMs" : {
> "addBatch" : 37,
> "commitOffsets" : 41,
> "getBatch" : 0,
> "latestOffset" : 0,
> "queryPlanning" : 5,
> "triggerExecution" : 187,
> "walCommit" : 104
> },
> "stateOperators" : [ ],
> "sources" : [ {
> "description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0,
> numPartitions=default",
> "startOffset" : 20,
> "endOffset" : 21,
> "latestOffset" : 21,
> "numInputRows" : 1,
> "inputRowsPerSecond" : 100.0,
> "processedRowsPerSecond" : 5.347593582887701
> } ],
> "sink" : {
> "description" :
> "org.apache.spark.sql.execution.streaming.ConsoleTable$@430a977c",
> "numOutputRows" : 1
> }
> }
> However, the observed behaviour is that processed_rows_per_second is either
> None or not being updated correctly.
>
> The spark version I used for my test is 3.4
>
> Sample code uses format=rate to simulate a streaming process. You can
> test the code yourself; it is all in one script.
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
> from pyspark.sql.streaming import DataStreamWriter, StreamingQueryListener
> from pyspark.sql.functions import col, round, current_timestamp, lit
> import uuid
>
> def process_data(df):
>
> processed_df = df.withColumn("key", lit(str(uuid.uuid4()))). \
>   withColumn("doubled_value", col("value") * 2). \
>   withColumn("op_type", lit(1)). \
>   withColumn("op_time", current_timestamp())
>
> return processed_df
>
> # Create a Spark session
> appName = "testListener"
> spark = SparkSession.builder.appName(appName).getOrCreate()
>
> # Define the schema for the streaming data
> schema = "key string timestamp timestamp, value long"
>
> # Define my listener.
> class MyListener(StreamingQueryListener):
> def onQueryStarted(self, event):
> print("onQueryStarted")
> print(f"'{event.name}' [{event.id}] got started!")
> def onQueryProgress(self, event):
> print("onQueryProgress")
> # Access micro-batch data
> microbatch_data = event.progress
> print("microbatch_data received")  # Check if data is received
> print(microbatch_data)
> processed_rows_per_second = microbatch_data.get("processed_rows_per_second")
> if processed_rows_per_second is not None:  # Check if value exists
>print("processed_rows_per_second retrieved")
>print(f"Processed rows per second: {processed_rows_per_second}")
> else:
>print("processed_rows_per_second not retrieved!")
> def onQueryTerminated(self, event):
> print("onQueryTerminated")
> if event.exception:
> print(f"Query terminated with exception: {event.exception}")
> else:
> print("Query successfully terminated.")
> # Add my listener.
>
> listener_instance = MyListener()
> spark.streams.addListener(listener_instance)
>
>
> # Create a streaming DataFrame with the rate source
> streaming_df = (
> spark.readStream
> .format("rate")
> .option("rowsPerSecond", 1)
> .load()
> )
>
> # Apply processing function to the streaming DataFrame
> processed_streaming_df = process_data(streaming_df)
>
> # Define the output sink (for example, console sink)
> query = (
> processed_streaming_df.select( \
>   col("key").alias("key") \
> , col("doubled_value").alias("doubled_value") \
> , col("op_type").alias("op_type") \
> , col("op_time").alias("op_time")). \
> 

Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread Mich Talebzadeh
There is a paper from Databricks on this subject

https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html

But having tested it, there seems to be a bug there that I reported to
Databricks forum as well (in answer to a user question)

I have come to the conclusion that this is a bug. In general there is a bug
in obtaining individual values from the dictionary. For example, there is a
bug in the way Spark Streaming populates the processed_rows_per_second key
within the microbatch_data dictionary (microbatch_data = event.progress), or
any other key. I have explored various debugging steps, and even though the
key seems to exist, the value might not be getting set. Note that the
dictionary itself prints its elements correctly. This is with regard to the
method onQueryProgress(self, event) in class
MyListener(StreamingQueryListener):

For example with print(microbatch_data), you get all printed as below

onQueryProgress
microbatch_data received
{
"id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
"runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
"name" : null,
"timestamp" : "2024-03-10T09:21:27.233Z",
"batchId" : 21,
"numInputRows" : 1,
"inputRowsPerSecond" : 100.0,
"processedRowsPerSecond" : 5.347593582887701,
"durationMs" : {
"addBatch" : 37,
"commitOffsets" : 41,
"getBatch" : 0,
"latestOffset" : 0,
"queryPlanning" : 5,
"triggerExecution" : 187,
"walCommit" : 104
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0,
numPartitions=default",
"startOffset" : 20,
"endOffset" : 21,
"latestOffset" : 21,
"numInputRows" : 1,
"inputRowsPerSecond" : 100.0,
"processedRowsPerSecond" : 5.347593582887701
} ],
"sink" : {
"description" :
"org.apache.spark.sql.execution.streaming.ConsoleTable$@430a977c",
"numOutputRows" : 1
}
}
However, the observed behaviour is that processed_rows_per_second is either
None or not being updated correctly.

The spark version I used for my test is 3.4

Sample code uses format=rate to simulate a streaming process. You can
test the code yourself; it is all in one script.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.streaming import DataStreamWriter, StreamingQueryListener
from pyspark.sql.functions import col, round, current_timestamp, lit
import uuid

def process_data(df):

processed_df = df.withColumn("key", lit(str(uuid.uuid4()))). \
  withColumn("doubled_value", col("value") * 2). \
  withColumn("op_type", lit(1)). \
  withColumn("op_time", current_timestamp())

return processed_df

# Create a Spark session
appName = "testListener"
spark = SparkSession.builder.appName(appName).getOrCreate()

# Define the schema for the streaming data
schema = "key string timestamp timestamp, value long"

# Define my listener.
class MyListener(StreamingQueryListener):
def onQueryStarted(self, event):
print("onQueryStarted")
print(f"'{event.name}' [{event.id}] got started!")
def onQueryProgress(self, event):
print("onQueryProgress")
# Access micro-batch data
microbatch_data = event.progress
print("microbatch_data received")  # Check if data is received
print(microbatch_data)
processed_rows_per_second = microbatch_data.get("processed_rows_per_second")
if processed_rows_per_second is not None:  # Check if value exists
   print("processed_rows_per_second retrieved")
   print(f"Processed rows per second: {processed_rows_per_second}")
else:
   print("processed_rows_per_second not retrieved!")
def onQueryTerminated(self, event):
print("onQueryTerminated")
if event.exception:
print(f"Query terminated with exception: {event.exception}")
else:
print("Query successfully terminated.")
# Add my listener.

listener_instance = MyListener()
spark.streams.addListener(listener_instance)


# Create a streaming DataFrame with the rate source
streaming_df = (
spark.readStream
.format("rate")
.option("rowsPerSecond", 1)
.load()
)

# Apply processing function to the streaming DataFrame
processed_streaming_df = process_data(streaming_df)

# Define the output sink (for example, console sink)
query = (
processed_streaming_df.select( \
  col("key").alias("key") \
, col("doubled_value").alias("doubled_value") \
, col("op_type").alias("op_type") \
, col("op_time").alias("op_time")). \
writeStream.\
outputMode("append").\
format("console"). \
start()
)

# Wait for the streaming query to terminate
query.awaitTermination()

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile

Spark on Kubernetes, execute dataset.show raises exceptions

2024-03-09 Thread BODY NO
Hi,
I encountered a strange issue. I run spark-shell in client mode on
Kubernetes and execute the command below:
val data=spark.read.parquet("datapath")

When I run: "data.show", it may raise exceptions, the stacktrace like below:

DEBUG BlockManagerMasterEndpoint: Updating block info on master
taskresult_3 form BlockManagerId(2, 192.168.167.22, 7079, None)
INFO BlockManagerInfo: Added taskresult_3 in memory on 192.168.167.22, 7079
(size: 173 KiB, free: 12 GB)
DEBUG TaskResultGetter: Fetching indirect task result for task 0.2 in stage
1.0 (TID3)
DEBUG BlockManager: Getting remote block taskresult_3
DEBUG BlockManager: Getting remote block taskresult_3 from
BlockManagerId(2, 192.168.167.22, 7079, None)
INFO TransportClientFactory: Found inactive connection to /
192.168.167.22:7079, creating a new one
DEBUG TransportClientFactory: Creating new connection to /
192.168.167.22:7079
DEBUG TransportClientFactory: Connection to /192.168.167.22:7079
successful, running bootstraps..
INFO TransportClientFactory: Successfully created connection to /
192.168.167.22:7079 after 1 ms (0 ms spent in bootstraps)
ERROR TransportResponseHandler: Still have 1 requests outstanding when
connection from  is closed
ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from  closed
at
org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:147)
at org.apache.spark.network.client.TransportChannelHandler.channelInactive(
TransportChannelHandler .java:147)
..
ERROR TransportClient: Failed to send RPC RPC 22331333 to : io.netty.channel.StacklessClosedChannelException
..

It looks like the exceptions are related to the data: for the same data,
running "data.show" raises exceptions, but running "data.show(2)" does not.
It seems to depend on whether the task result needs to be fetched indirectly?

Also, from the exceptions, under which condition would there be no specific
IP or port in the messages, just an empty value? It is very strange.

Does anybody know how to fix this, or why it happens?

thanks.


Spark-UI stages and other tabs not accessible in standalone mode when reverse-proxy is enabled

2024-03-08 Thread sharad mishra
Hi Team,
We're encountering an issue with the Spark UI.
When reverse proxy is enabled in the master and worker configOptions, we're
not able to access the different tabs available in the Spark UI, e.g. stages,
environment, storage etc.

We're deploying spark through bitnami helm chart :
https://github.com/bitnami/charts/tree/main/bitnami/spark

Name and Version

bitnami/spark - 6.0.0

What steps will reproduce the bug?

Kubernetes Version: 1.25
Spark: 3.4.2
Helm chart: 6.0.0

Steps to reproduce:
After installing the chart, the Spark cluster (master and worker) UI is
available at:


https://spark.staging.abc.com/

We are able to access the running application by clicking on the application
ID under the Running Applications link.

We can access the Spark UI by clicking on Application Detail UI; we are taken
to the Jobs tab.


URL looks like:
https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/

When we click any of the tabs in the Spark UI, e.g. stages or environment,
it takes us back to the Spark cluster UI page.
We noticed that the endpoint changes to


https://spark.staging.abc.com/stages/
instead of
https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/



Are you using any custom parameters or values?

Configurations set in values.yaml
```
master:
  configOptions:
-Dspark.ui.reverseProxy=true
-Dspark.ui.reverseProxyUrl=https://spark.staging.abc.com

worker:
  configOptions:
-Dspark.ui.reverseProxy=true
-Dspark.ui.reverseProxyUrl=https://spark.staging.abc.com

service:
  type: ClusterIP
  ports:
http: 8080
https: 443
cluster: 7077

ingress:

  enabled: true
  pathType: ImplementationSpecific
  apiVersion: ""
  hostname: spark.staging.abc.com
  ingressClassName: "staging"
  path: /
```



What is the expected behavior?

The expected behaviour is that when I click on the Stages tab, instead of
taking me to
https://spark.staging.abc.com/stages/
it should take me to following URL:
https://spark.staging.abc.com/proxy/app-20240208103209-0030/stages/

What do you see instead?

The current behaviour is that it takes me to the URL
https://spark.staging.abc.com/stages/ , which shows the Spark cluster UI with
master and worker details.

Best,
Sharad


Re: Creating remote tables using PySpark

2024-03-08 Thread Mich Talebzadeh
The error message shows a mismatch between the configured warehouse
directory and the actual location accessible by the Spark application
running in the container.

You have configured the SparkSession with
spark.sql.warehouse.dir="file:/data/hive/warehouse". This tells Spark where
to store temporary and intermediate data during operations like saving
DataFrames as tables. When running the application remotely, the container
cannot access the directory /data/hive/warehouse on your local machine. This
directory path may exist on the container's host system, but not within the
container itself.
You can set spark.sql.warehouse.dir to a directory within the container's
file system. This directory should be accessible by the Spark application
running inside the container. For example:

spark = SparkSession.builder \
    .appName("testme") \
    .master("spark://192.168.1.245:7077") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .config("hive.metastore.uris", "thrift://192.168.1.245:9083") \
    .enableHiveSupport() \
    .getOrCreate()
# "/tmp/spark-warehouse" can be changed to anything suitable within the container

Use spark.conf.get("spark.sql.warehouse.dir") to print the configured
warehouse directory after creating the SparkSession to confirm all is OK
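
A minimal sketch of that check (it assumes the SparkSession was built as in
the snippet above, so both values were set via .config()):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the session built above
print("Warehouse dir :", spark.conf.get("spark.sql.warehouse.dir"))
print("Metastore URIs:", spark.conf.get("hive.metastore.uris", "not set"))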

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Fri, 8 Mar 2024 at 06:01, Tom Barber  wrote:

> Okay interesting, maybe my assumption was incorrect, although I'm still
> confused.
>
> I tried to mount a central mount point that would be the same on my local
> machine and the container. Same error although I moved the path to
> /tmp/hive/data/hive/ but when I rerun the test code to save a table,
> the complaint is still for
>
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> ERROR FileOutputCommitter: Mkdirs failed to create
> file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0
>
> so what is /data/hive even referring to when I print out the spark conf
> values and neither now refer to /data/hive/
>
> On Thu, Mar 7, 2024 at 9:49 PM Tom Barber  wrote:
>
>> Wonder if anyone can just sort my brain out here as to whats possible or
>> not.
>>
>> I have a container running Spark, with Hive and a ThriftServer. I want to
>> run code against it remotely.
>>
>> If I take something simple like this
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, StructField, IntegerType,
>> StringType
>>
>> # Initialize SparkSession
>> spark = SparkSession.builder \
>> .appName("ShowDatabases") \
>> .master("spark://192.168.1.245:7077") \
>> .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
>> .config("hive.metastore.uris","thrift://192.168.1.245:9083")\
>> .enableHiveSupport() \
>> .getOrCreate()
>>
>> # Define schema of the DataFrame
>> schema = StructType([
>> StructField("id", IntegerType(), True),
>> StructField("name", StringType(), True)
>> ])
>>
>> # Data to be converted into a DataFrame
>> data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]
>>
>> # Create DataFrame
>> df = spark.createDataFrame(data, schema)
>>
>> # Show the DataFrame (optional, for verification)
>> df.show()
>>
>> # Save the DataFrame to a table named "my_table"
>> df.write.mode("overwrite").saveAsTable("my_table")
>>
>> # Stop the SparkSession
>> spark.stop()
>>
>> When I run it in the container it runs fine, but when I run it remotely
>> it says:
>>
>> : java.io.FileNotFoundException: File
>> file:/data/hive/warehouse/my_table/_temporary/0 does not exist
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
>> at
>> 

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay, that was some caching issue. Now there is a shared mount point between
the place where the pyspark code is executed and the spark nodes where it
runs. Hrmph, I was hoping that wouldn't be the case. Fair enough!

On Thu, Mar 7, 2024 at 11:23 PM Tom Barber  wrote:

> Okay interesting, maybe my assumption was incorrect, although I'm still
> confused.
>
> I tried to mount a central mount point that would be the same on my local
> machine and the container. Same error although I moved the path to
> /tmp/hive/data/hive/ but when I rerun the test code to save a table,
> the complaint is still for
>
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> Warehouse Dir: file:/tmp/hive/data/hive/warehouse
> Metastore URIs: thrift://192.168.1.245:9083
> ERROR FileOutputCommitter: Mkdirs failed to create
> file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0
>
> so what is /data/hive even referring to when I print out the spark conf
> values and neither now refer to /data/hive/
>
> On Thu, Mar 7, 2024 at 9:49 PM Tom Barber  wrote:
>
>> Wonder if anyone can just sort my brain out here as to whats possible or
>> not.
>>
>> I have a container running Spark, with Hive and a ThriftServer. I want to
>> run code against it remotely.
>>
>> If I take something simple like this
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, StructField, IntegerType,
>> StringType
>>
>> # Initialize SparkSession
>> spark = SparkSession.builder \
>> .appName("ShowDatabases") \
>> .master("spark://192.168.1.245:7077") \
>> .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
>> .config("hive.metastore.uris","thrift://192.168.1.245:9083")\
>> .enableHiveSupport() \
>> .getOrCreate()
>>
>> # Define schema of the DataFrame
>> schema = StructType([
>> StructField("id", IntegerType(), True),
>> StructField("name", StringType(), True)
>> ])
>>
>> # Data to be converted into a DataFrame
>> data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]
>>
>> # Create DataFrame
>> df = spark.createDataFrame(data, schema)
>>
>> # Show the DataFrame (optional, for verification)
>> df.show()
>>
>> # Save the DataFrame to a table named "my_table"
>> df.write.mode("overwrite").saveAsTable("my_table")
>>
>> # Stop the SparkSession
>> spark.stop()
>>
>> When I run it in the container it runs fine, but when I run it remotely
>> it says:
>>
>> : java.io.FileNotFoundException: File
>> file:/data/hive/warehouse/my_table/_temporary/0 does not exist
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
>> at
>> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>> at
>> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)
>>
>> My assumption is that its trying to look on my local machine for
>> /data/hive/warehouse and failing because on the remote box I can see those
>> folders.
>>
>> So the question is, if you're not backing it with hadoop or something do
>> you have to mount the drive in the same place on the computer running the
>> pyspark? Or am I missing a config option somewhere?
>>
>> Thanks!
>>
>


Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay interesting, maybe my assumption was incorrect, although I'm still
confused.

I tried to mount a central mount point that would be the same on my local
machine and the container. Same error although I moved the path to
/tmp/hive/data/hive/ but when I rerun the test code to save a table,
the complaint is still for

Warehouse Dir: file:/tmp/hive/data/hive/warehouse
Metastore URIs: thrift://192.168.1.245:9083
Warehouse Dir: file:/tmp/hive/data/hive/warehouse
Metastore URIs: thrift://192.168.1.245:9083
ERROR FileOutputCommitter: Mkdirs failed to create
file:/data/hive/warehouse/input.db/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0

so what is /data/hive even referring to, when I print out the spark conf
values and neither of them now refers to /data/hive/?

On Thu, Mar 7, 2024 at 9:49 PM Tom Barber  wrote:

> Wonder if anyone can just sort my brain out here as to whats possible or
> not.
>
> I have a container running Spark, with Hive and a ThriftServer. I want to
> run code against it remotely.
>
> If I take something simple like this
>
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, IntegerType,
> StringType
>
> # Initialize SparkSession
> spark = SparkSession.builder \
> .appName("ShowDatabases") \
> .master("spark://192.168.1.245:7077") \
> .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
> .config("hive.metastore.uris","thrift://192.168.1.245:9083")\
> .enableHiveSupport() \
> .getOrCreate()
>
> # Define schema of the DataFrame
> schema = StructType([
> StructField("id", IntegerType(), True),
> StructField("name", StringType(), True)
> ])
>
> # Data to be converted into a DataFrame
> data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]
>
> # Create DataFrame
> df = spark.createDataFrame(data, schema)
>
> # Show the DataFrame (optional, for verification)
> df.show()
>
> # Save the DataFrame to a table named "my_table"
> df.write.mode("overwrite").saveAsTable("my_table")
>
> # Stop the SparkSession
> spark.stop()
>
> When I run it in the container it runs fine, but when I run it remotely it
> says:
>
> : java.io.FileNotFoundException: File
> file:/data/hive/warehouse/my_table/_temporary/0 does not exist
> at
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
> at
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)
>
> My assumption is that its trying to look on my local machine for
> /data/hive/warehouse and failing because on the remote box I can see those
> folders.
>
> So the question is, if you're not backing it with hadoop or something do
> you have to mount the drive in the same place on the computer running the
> pyspark? Or am I missing a config option somewhere?
>
> Thanks!
>


Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Wonder if anyone can just sort my brain out here as to what's possible or
not.

I have a container running Spark, with Hive and a ThriftServer. I want to
run code against it remotely.

If I take something simple like this

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType,
StringType

# Initialize SparkSession
spark = SparkSession.builder \
.appName("ShowDatabases") \
.master("spark://192.168.1.245:7077") \
.config("spark.sql.warehouse.dir", "file:/data/hive/warehouse") \
.config("hive.metastore.uris","thrift://192.168.1.245:9083")\
.enableHiveSupport() \
.getOrCreate()

# Define schema of the DataFrame
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True)
])

# Data to be converted into a DataFrame
data = [(1, "John Doe"), (2, "Jane Doe"), (3, "Mike Johnson")]

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Show the DataFrame (optional, for verification)
df.show()

# Save the DataFrame to a table named "my_table"
df.write.mode("overwrite").saveAsTable("my_table")

# Stop the SparkSession
spark.stop()

When I run it in the container it runs fine, but when I run it remotely it
says:

: java.io.FileNotFoundException: File
file:/data/hive/warehouse/my_table/_temporary/0 does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
at
org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192)

My assumption is that it's trying to look on my local machine for
/data/hive/warehouse and failing, because on the remote box I can see those
folders.

So the question is: if you're not backing it with Hadoop or something, do
you have to mount the drive in the same place on the computer running the
pyspark code? Or am I missing a config option somewhere?

Thanks!


Dark mode logo

2024-03-06 Thread Mike Drob
Hi Spark Community,

I see that y'all have a logo uploaded to https://www.apache.org/logos/#spark 
but it has black text. Is there an official, alternate logo with lighter text 
that would look good on a dark background?

Thanks,
Mike

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, Let me double-check it carefully.

Thank you very much for your help!



From: Jungtaek Lim 
Sent: March 5, 2024 21:56:41
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Yeah the approach seems OK to me - please double check that the doc generation 
in Spark repo won't fail after the move of the js file. Other than that, it 
would be probably just a matter of updating the release process.

On Tue, Mar 5, 2024 at 7:24 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

Okay, I see.

Perhaps we can solve this confusion by sharing the same file `version.json` 
across `all versions` in the `Spark website repo`? Make each version of the 
document display the `same` data in the dropdown menu.


From: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
Sent: March 5, 2024 17:09:07
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Let me be more specific.

We have two active release version lines, 3.4.x and 3.5.x. We just released 
Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact the last 
version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3. In the 
dropdown of Spark 3.4.3, there will be 3.5.1 and 3.4.3. But if we call this as 
done, 3.5.1 (still latest) won't show 3.4.3 in the dropdown, giving confusion 
that 3.4.3 wasn't ever released.

This is just about two active release version lines with keeping only the 
latest version of version lines. If you expand this to EOLed version lines and 
versions which aren't the latest in their version line, the problem gets much 
more complicated.

On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

Based on my understanding, we should not update versions that have already been 
released,

such as the situation you mentioned: `But what about dropdown of version D?
Should we add E in the dropdown?` We only need to record the latest
`version.json` file that has already been published at the time of each new
document release.

Of course, if we need to keep the latest in every document, I think it's also 
possible.

Only by sharing the same version.json file in each version.


From: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
Sent: March 5, 2024 16:47:30
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

But this does not answer my question about updating the dropdown for the doc of 
"already released versions", right?

Let's say we just released version D, and the dropdown has version A, B, C. We 
have another release tomorrow as version E, and it's probably easy to add A, B, 
C, D in the dropdown of E. But what about dropdown of version D? Should we add 
E in the dropdown? How do we maintain it if we will have 10 releases afterwards?

On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

According to my understanding, the original intention of this feature is that
when a user has entered the pyspark documentation and finds that the version
they are currently on is not the version they want, they can easily jump to
the desired version by clicking on the drop-down box. Additionally, the
automatic update mechanism proposed in this PR was not merged in:

https://github.com/apache/spark/pull/42881

So, we need to manually update this file. I can manually submit an update first 
to get this feature working.


From: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
Sent: March 4, 2024 6:34:42
To: Dongjoon Hyun
Cc: dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
Sorry, I forgot. The below is catered for YARN mode.

If your application code primarily consists of Python files and does not
require a separate virtual environment with specific dependencies, you can
use the --py-files argument in spark-submit:

# Adjust memory and executor counts as needed; the last argument is the path
# to your main application script.
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  --py-files ${build_directory}/source_code.zip \
  $CODE_DIRECTORY_CLOUD/my_application_entry_point.py

For application code with a separate virtual environment:

If your application code has specific dependencies that you manage in a
separate virtual environment, you can leverage the --conf
spark.yarn.dist.archives argument.
# Adjust memory and executor counts as needed; the last argument is the path
# to your main application script.
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  --conf "spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv" \
  $CODE_DIRECTORY_CLOUD/my_application_entry_point.py

Explanation:

   - --conf "spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv":
   this configures Spark to distribute your virtual environment archive
   (pyspark_venv.tar.gz) to the YARN cluster nodes. The #pyspark_venv part
   defines a symbolic link name within the container.
   - You do not need --py-files here because the virtual environment archive
   will contain all necessary dependencies (see the sketch after this list).
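
As a quick sanity check (a hedged sketch, not part of the original
instructions), the snippet below can be added to the entry-point script to
confirm that the driver and the executors really run the interpreter shipped
inside pyspark_venv.tar.gz:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("venv-check").getOrCreate()

print("driver python   :", sys.executable)

# Each executor task reports its own interpreter; with the archive unpacked
# as pyspark_venv, these paths are expected to end in pyspark_venv/bin/python.
executor_pythons = (
    spark.sparkContext
    .parallelize(range(4), 2)
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print("executor pythons:", executor_pythons)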

Choosing the best approach:

The choice depends on your project setup:

   - No Separate Virtual Environment: Use  --py-files if your application
   code consists mainly of Python files and doesn't require a separate virtual
   environment.
   - Separate Virtual Environment: Use --conf spark.yarn.dist.archives if
   you manage dependencies in a separate virtual environment archive.

HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 5 Mar 2024 at 17:28, Mich Talebzadeh 
wrote:

>
>
>
>  I use zip file personally and pass the application name (in your case
> main.py) as the last input line like below
>
> APPLICATION is your main.py. It does not need to be called main.py. It
> could be anything like  testpython.py
>
> CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes"   ## replace gs with s3
> # zip needs to be done at root directory of code
> zip -rq ${source_code}.zip ${source_code}
> gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD  ## replace gsutil with
> aws s3
> gsutil cp /${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD
>
> your spark job
>
>  spark-submit --verbose \
>--properties-file ${property_file} \
>--master k8s://https://$KUBERNETES_MASTER_IP:443 \
>--deploy-mode cluster \
>--name $APPNAME \
>    --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
>--conf spark.kubernetes.namespace=$NAMESPACE \
>--conf spark.network.timeout=300 \
>--conf spark.kubernetes.allocation.batch.size=3 \
>--conf spark.kubernetes.allocation.batch.delay=1 \
>--conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
>--conf spark.kubernetes.executor.container.image=${IMAGEDRIVER}
> \
>--conf spark.kubernetes.driver.pod.name=$APPNAME \
>--conf
> spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
>--conf
> spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>--conf
> spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"
> \
>--conf spark.dynamicAllocation.enabled=true \
>--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
>--conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
>--conf spark.dynamicAllocation.executorIdleTimeout=30s \
>--conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
>--conf spark.dynamicAllocation.minExecutors=0 \
>--conf spark.dynamicAllocation.maxExecutors=20 \
>--conf spark.driver.cores=3 \
>--conf spark.executor.cores=3 \
>--conf spark.driver.memory=1024m \
>--conf spark.executor.memory=1024m \
> *   

S3 committer for dynamic partitioning

2024-03-05 Thread Nikhil Goyal
Hi folks,
We have been following this doc

for writing data from a Spark job to S3. However, it fails when writing to
dynamic partitions. Any suggestions on what config should be used to avoid
the cost of renaming in S3?

Thanks
Nikhil


Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
I use a zip file personally and pass the application name (in your case
main.py) as the last input line, like below.

APPLICATION is your main.py. It does not need to be called main.py. It
could be anything like  testpython.py

CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes"   ## replace gs with s3
# zip needs to be done at root directory of code
zip -rq ${source_code}.zip ${source_code}
gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD  ## replace gsutil with aws s3
gsutil cp /${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD

your spark job

 spark-submit --verbose \
   --properties-file ${property_file} \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name $APPNAME \
   --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
   --conf spark.kubernetes.namespace=$NAMESPACE \
   --conf spark.network.timeout=300 \
   --conf spark.kubernetes.allocation.batch.size=3 \
   --conf spark.kubernetes.allocation.batch.delay=1 \
   --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.driver.pod.name=$APPNAME \
   --conf
spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
   --conf
spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
   --conf
spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"
\
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
   --conf spark.dynamicAllocation.executorIdleTimeout=30s \
   --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
   --conf spark.dynamicAllocation.minExecutors=0 \
   --conf spark.dynamicAllocation.maxExecutors=20 \
   --conf spark.driver.cores=3 \
   --conf spark.executor.cores=3 \
   --conf spark.driver.memory=1024m \
   --conf spark.executor.memory=1024m \
   $CODE_DIRECTORY_CLOUD/${APPLICATION}
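
To illustrate the effect of the --py-files zip, here is a hedged sketch of an
entry-point script (the package and helper names are hypothetical, and it
assumes the zipped directories contain __init__.py files): modules zipped at
the root of spark_on_gke.zip become importable at runtime.

from pyspark.sql import SparkSession

# Hypothetical module shipped inside spark_on_gke.zip
# (spark_on_gke/src/transforms.py)
from spark_on_gke.src import transforms

spark = SparkSession.builder.appName("spark_on_gke").getOrCreate()

df = transforms.build_dataframe(spark)  # hypothetical helper inside the zip
df.show()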

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 5 Mar 2024 at 16:15, Pedro, Chuck 
wrote:

> Hi all,
>
>
>
> I am working in Databricks. When I submit a spark job with the –py-files
> argument, it seems the first two are read in but the third is ignored.
>
>
>
> "--py-files",
>
> "s3://some_path/appl_src.py",
>
> "s3://some_path/main.py",
>
> "s3://a_different_path/common.py",
>
>
>
> I can see the first two acknowledged in the Log4j but not the third.
>
>
>
> 24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/appl_src.py to ...
>
> 24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/main.py to ...
>
>
>
> As a result, the job fails because appl_src.py is importing from common.py
> but can’t find it.
>
>
>
> I posted to both Databricks community here
> 
> and Stack Overflow here
> 
> but did not get a response.
>
>
>
> I’m aware that we could use a .zip file, so I tried zipping the first two
> arguments but then got a totally different error:
>
>
>
> “Exception in thread "main" org.apache.spark.SparkException: Failed to get
> main class in JAR with error 'null'.  Please specify one with --class.”
>
>
>
> Basically I just want the application code in one s3 path and a “common”
> utilities package in another path. Thanks for your help.
>
>
>
>
>
>
>
> *Kind regards,*
>
> Chuck Pedro
>
>
>
>
> --
> This message (including any attachments) may contain confidential,
> proprietary, privileged and/or private information. The information is
> intended to be for the use of the individual or entity designated above. If
> you are not the intended recipient of this message, please notify the
> sender immediately, and delete the message and any attachments. Any
> disclosure, reproduction, distribution or other use of this message or any
> attachments by an individual or entity other than the intended recipient is
> prohibited.
>
> TRVDiscDefault::1201
>


It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Pedro, Chuck
Hi all,

I am working in Databricks. When I submit a spark job with the --py-files
argument, it seems the first two are read in but the third is ignored.

"--py-files",
"s3://some_path/appl_src.py",
"s3://some_path/main.py",
"s3://a_different_path/common.py",

I can see the first two acknowledged in the Log4j but not the third.

24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/appl_src.py to ...
24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/main.py to ...

As a result, the job fails because appl_src.py is importing from common.py but 
can't find it.

I posted to both Databricks community 
here
 and Stack Overflow 
here
 but did not get a response.

I'm aware that we could use a .zip file, so I tried zipping the first two 
arguments but then got a totally different error:

"Exception in thread "main" org.apache.spark.SparkException: Failed to get main 
class in JAR with error 'null'.  Please specify one with --class."

Basically I just want the application code in one s3 path and a "common" 
utilities package in another path. Thanks for your help.



Kind regards,
Chuck Pedro



This message (including any attachments) may contain confidential, proprietary, 
privileged and/or private information. The information is intended to be for 
the use of the individual or entity designated above. If you are not the 
intended recipient of this message, please notify the sender immediately, and 
delete the message and any attachments. Any disclosure, reproduction, 
distribution or other use of this message or any attachments by an individual 
or entity other than the intended recipient is prohibited.

TRVDiscDefault::1201


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Yeah the approach seems OK to me - please double check that the doc
generation in Spark repo won't fail after the move of the js file. Other
than that, it would be probably just a matter of updating the release
process.

On Tue, Mar 5, 2024 at 7:24 PM Pan,Bingkun  wrote:

> Okay, I see.
>
> Perhaps we can solve this confusion by sharing the same file `version.json`
> across `all versions` in the `Spark website repo`? Make each version of
> the document display the `same` data in the dropdown menu.
> --
> *From:* Jungtaek Lim 
> *Sent:* March 5, 2024 17:09:07
> *To:* Pan,Bingkun
> *Cc:* Dongjoon Hyun; dev; user
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>
> Let me be more specific.
>
> We have two active release version lines, 3.4.x and 3.5.x. We just
> released Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact
> the last version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3.
> In the dropdown of Spark 3.4.3, there will be 3.5.1 and 3.4.3. But if we
> call this as done, 3.5.1 (still latest) won't show 3.4.3 in the dropdown,
> giving confusion that 3.4.3 wasn't ever released.
>
> This is just about two active release version lines with keeping only the
> latest version of version lines. If you expand this to EOLed version lines
> and versions which aren't the latest in their version line, the problem
> gets much more complicated.
>
> On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun  wrote:
>
>> Based on my understanding, we should not update versions that have
>> already been released,
>>
>> such as the situation you mentioned: `But what about dropout of version
>> D? Should we add E in the dropdown?` We only need to record the latest
>> `version. json` file that has already been published at the time of each
>> new document release.
>>
>> Of course, if we need to keep the latest in every document, I think it's
>> also possible.
>>
>> Only by sharing the same version. json file in each version.
>> --
>> *From:* Jungtaek Lim 
>> *Sent:* March 5, 2024 16:47:30
>> *To:* Pan,Bingkun
>> *Cc:* Dongjoon Hyun; dev; user
>> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>>
>> But this does not answer my question about updating the dropdown for the
>> doc of "already released versions", right?
>>
>> Let's say we just released version D, and the dropdown has version A, B,
>> C. We have another release tomorrow as version E, and it's probably easy to
>> add A, B, C, D in the dropdown of E. But what about dropdown of version D?
>> Should we add E in the dropdown? How do we maintain it if we will have 10
>> releases afterwards?
>>
>> On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun  wrote:
>>
>>> According to my understanding, the original intention of this feature is
>>> that when a user has entered the pyspark document, if he finds that the
>>> version he is currently in is not the version he wants, he can easily jump
>>> to the version he wants by clicking on the drop-down box. Additionally, in
>>> this PR, the current automatic mechanism for PRs did not merge in.
>>>
>>> https://github.com/apache/spark/pull/42881
>>> 
>>>
>>> So, we need to manually update this file. I can manually submit an
>>> update first to get this feature working.
>>> --
>>> *From:* Jungtaek Lim 
>>> *Sent:* March 4, 2024 6:34:42
>>> *To:* Dongjoon Hyun
>>> *Cc:* dev; user
>>> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>>>
>>> Shall we revisit this functionality? The API doc is built with
>>> individual versions, and for each individual version we depend on other
>>> released versions. This does not seem to be right to me. Also, the
>>> functionality is only in PySpark API doc which does not seem to be
>>> consistent as well.
>>>
>>> I don't think this is manageable with the current approach (listing
>>> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
>>> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
>>> How about the time we are going to release the new version after releasing
>>> 10 versions? What's the criteria of pruning the version?
>>>
>>> Unless we have a good answer to these questions, I think it's better to
>>> revert the functionality - it missed various considerations.
>>>
>>> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Thanks for reporting - this is odd - the dropdown did not exist in
 other recent releases.

 https://spark.apache.org/docs/3.5.0/api/python/index.html
 
 https://spark.apache.org/docs/3.4.2/api/python/index.html
 
 

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, I see.

Perhaps we can solve this confusion by sharing the same file `version.json` 
across `all versions` in the `Spark website repo`? Make each version of the 
document display the `same` data in the dropdown menu.


From: Jungtaek Lim 
Sent: March 5, 2024 17:09:07
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Let me be more specific.

We have two active release version lines, 3.4.x and 3.5.x. We just released 
Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact the last 
version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3. In the 
dropdown of Spark 3.4.3, there will be 3.5.1 and 3.4.3. But if we call this as 
done, 3.5.1 (still latest) won't show 3.4.3 in the dropdown, giving confusion 
that 3.4.3 wasn't ever released.

This is just about two active release version lines with keeping only the 
latest version of version lines. If you expand this to EOLed version lines and 
versions which aren't the latest in their version line, the problem gets much 
more complicated.

On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

Based on my understanding, we should not update versions that have already been 
released,

such as the situation you mentioned: `But what about dropdown of version D?
Should we add E in the dropdown?` We only need to record the latest
`version.json` file that has already been published at the time of each new
document release.

Of course, if we need to keep the latest in every document, I think it's also 
possible.

Only by sharing the same version.json file in each version.


From: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
Sent: March 5, 2024 16:47:30
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

But this does not answer my question about updating the dropdown for the doc of 
"already released versions", right?

Let's say we just released version D, and the dropdown has version A, B, C. We 
have another release tomorrow as version E, and it's probably easy to add A, B, 
C, D in the dropdown of E. But what about dropdown of version D? Should we add 
E in the dropdown? How do we maintain it if we will have 10 releases afterwards?

On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

According to my understanding, the original intention of this feature is that
when a user has entered the pyspark documentation and finds that the version
they are currently on is not the version they want, they can easily jump to
the desired version by clicking on the drop-down box. Additionally, the
automatic update mechanism proposed in this PR was not merged in:

https://github.com/apache/spark/pull/42881

So, we need to manually update this file. I can manually submit an update first 
to get this feature working.


From: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
Sent: March 4, 2024 6:34:42
To: Dongjoon Hyun
Cc: dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html
https://spark.apache.org/docs/3.4.2/api/python/index.html
https://spark.apache.org/docs/3.3.4/api/python/index.html

Looks like the dropdown feature was recently introduced but partially done. The 
addition of a dropdown was done, but the way how to bump the version was missed 
to be documented.
The contributor proposed the way to update the version 

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Let me be more specific.

We have two active release version lines, 3.4.x and 3.5.x. We just released
Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact the last
version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3. In the
dropdown of Spark 3.4.3, there will be 3.5.1 and 3.4.3. But if we call this
as done, 3.5.1 (still latest) won't show 3.4.3 in the dropdown, giving
confusion that 3.4.3 wasn't ever released.

This is just about two active release version lines with keeping only the
latest version of version lines. If you expand this to EOLed version lines
and versions which aren't the latest in their version line, the problem
gets much more complicated.

On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun  wrote:

> Based on my understanding, we should not update versions that have already
> been released,
>
> such as the situation you mentioned: `But what about dropout of version D?
> Should we add E in the dropdown?` We only need to record the latest
> `version. json` file that has already been published at the time of each
> new document release.
>
> Of course, if we need to keep the latest in every document, I think it's
> also possible.
>
> Only by sharing the same version. json file in each version.
> --
> *From:* Jungtaek Lim 
> *Sent:* March 5, 2024 16:47:30
> *To:* Pan,Bingkun
> *Cc:* Dongjoon Hyun; dev; user
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>
> But this does not answer my question about updating the dropdown for the
> doc of "already released versions", right?
>
> Let's say we just released version D, and the dropdown has version A, B,
> C. We have another release tomorrow as version E, and it's probably easy to
> add A, B, C, D in the dropdown of E. But what about dropdown of version D?
> Should we add E in the dropdown? How do we maintain it if we will have 10
> releases afterwards?
>
> On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun  wrote:
>
>> According to my understanding, the original intention of this feature is
>> that when a user has entered the pyspark document, if he finds that the
>> version he is currently in is not the version he wants, he can easily jump
>> to the version he wants by clicking on the drop-down box. Additionally, in
>> this PR, the current automatic mechanism for PRs did not merge in.
>>
>> https://github.com/apache/spark/pull/42881
>> 
>>
>> So, we need to manually update this file. I can manually submit an update
>> first to get this feature working.
>> --
>> *From:* Jungtaek Lim 
>> *Sent:* March 4, 2024 6:34:42
>> *To:* Dongjoon Hyun
>> *Cc:* dev; user
>> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>>
>> Shall we revisit this functionality? The API doc is built with individual
>> versions, and for each individual version we depend on other released
>> versions. This does not seem to be right to me. Also, the functionality is
>> only in PySpark API doc which does not seem to be consistent as well.
>>
>> I don't think this is manageable with the current approach (listing
>> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
>> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
>> How about the time we are going to release the new version after releasing
>> 10 versions? What's the criteria of pruning the version?
>>
>> Unless we have a good answer to these questions, I think it's better to
>> revert the functionality - it missed various considerations.
>>
>> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
>> wrote:
>>
>>> Thanks for reporting - this is odd - the dropdown did not exist in other
>>> recent releases.
>>>
>>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>>> 
>>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>>> 
>>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>> 
>>>
>>> Looks like the dropdown feature was recently introduced but partially
>>> done. The addition of a dropdown was done, but the way how to bump the
>>> version was missed to be documented.
>>> The contributor proposed the way to update the version "automatically",
>>> but the PR wasn't merged. As a result, we are neither having the
>>> instruction how to bump the version manually, nor having the automatic bump.
>>>
>>> * PR for addition of dropdown:
>>> https://github.com/apache/spark/pull/42428
>>> 
>>> * PR for 

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Based on my understanding, we should not update versions that have already been
released,

such as the situation you mentioned: `But what about the dropdown of version D?
Should we add E in the dropdown?` We only need to record the latest
`version.json` file that has already been published at the time of each new
document release.

Of course, if we need to keep every document showing the latest releases, I
think that is also possible.

That would work only by sharing the same version.json file across every version.
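
For illustration only, and assuming the mechanism is the PyData Sphinx theme's
version switcher that the PySpark docs appear to use (so the key names below
follow that theme's convention, not anything defined by Spark), "sharing the
same version.json" could reduce to a small release-time step that prepends one
entry to a single hosted file. A rough sketch, with a hypothetical URL:

import json
from urllib.request import urlopen

# Hypothetical location of the shared switcher file, e.g. hosted on spark-website.
VERSIONS_URL = "https://spark.apache.org/static/versions.json"  # assumption, not a real endpoint


def add_release(new_version: str, out_path: str = "versions.json") -> None:
    """Fetch the shared switcher file, prepend the new release, and write it back out."""
    with urlopen(VERSIONS_URL) as resp:
        versions = json.load(resp)

    entry = {
        "name": new_version,
        "version": new_version,
        "url": f"https://spark.apache.org/docs/{new_version}/api/python/index.html",
    }
    if entry not in versions:
        versions.insert(0, entry)  # newest first

    with open(out_path, "w") as f:
        json.dump(versions, f, indent=2)


add_release("3.5.1")

Because every released doc would read the same file, already-published versions
would pick up new entries without being rebuilt, which seems to be what the
sharing proposal is getting at.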


From: Jungtaek Lim 
Sent: March 5, 2024 16:47:30
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

But this does not answer my question about updating the dropdown for the doc of 
"already released versions", right?

Let's say we just released version D, and the dropdown has version A, B, C. We 
have another release tomorrow as version E, and it's probably easy to add A, B, 
C, D in the dropdown of E. But what about dropdown of version D? Should we add 
E in the dropdown? How do we maintain it if we will have 10 releases afterwards?

On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun <panbing...@baidu.com> wrote:

According to my understanding, the original intention of this feature is that
when a user has landed on the PySpark docs and finds that the version they are
currently viewing is not the one they want, they can easily jump to the desired
version via the drop-down box. Additionally, the PR below, which adds an
automatic update mechanism, has not been merged in.

https://github.com/apache/spark/pull/42881

So, we need to manually update this file. I can manually submit an update first 
to get this feature working.


From: Jungtaek Lim <kabhwan.opensou...@gmail.com>
Sent: March 4, 2024 6:34:42
To: Dongjoon Hyun
Cc: dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html
https://spark.apache.org/docs/3.4.2/api/python/index.html
https://spark.apache.org/docs/3.3.4/api/python/index.html

Looks like the dropdown feature was recently introduced but only partially
completed. The dropdown itself was added, but how to bump the version was never
documented.
The contributor proposed a way to update the version "automatically", but the
PR wasn't merged. As a result, we have neither instructions for bumping the
version manually nor an automatic bump.

* PR for addition of dropdown: 
https://github.com/apache/spark/pull/42428
* PR for automatically bumping version: 
https://github.com/apache/spark/pull/42881

We will probably need to add an instruction in the release process to update 
the version. (For automatic bumping I don't have a good idea.)
I'll look into it. Please expect some delay during the holiday weekend in S. 
Korea.

Thanks again.
Jungtaek Lim (HeartSaVioR)


On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
BTW, Jungtaek.

PySpark document seems to show a wrong branch. At this time, `master`.


https://spark.apache.org/docs/3.5.1/api/python/index.html

PySpark 

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
But this does not answer my question about updating the dropdown for the
doc of "already released versions", right?

Let's say we just released version D, and the dropdown has version A, B, C.
We have another release tomorrow as version E, and it's probably easy to
add A, B, C, D in the dropdown of E. But what about dropdown of version D?
Should we add E in the dropdown? How do we maintain it if we will have 10
releases afterwards?

On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun  wrote:

> According to my understanding, the original intention of this feature is
> that when a user has entered the pyspark document, if he finds that the
> version he is currently in is not the version he wants, he can easily jump
> to the version he wants by clicking on the drop-down box. Additionally, in
> this PR, the current automatic mechanism for PRs did not merge in.
>
> https://github.com/apache/spark/pull/42881
>
> So, we need to manually update this file. I can manually submit an update
> first to get this feature working.
> --
> *From:* Jungtaek Lim 
> *Sent:* March 4, 2024 6:34:42
> *To:* Dongjoon Hyun
> *Cc:* dev; user
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>
> Shall we revisit this functionality? The API doc is built with individual
> versions, and for each individual version we depend on other released
> versions. This does not seem to be right to me. Also, the functionality is
> only in PySpark API doc which does not seem to be consistent as well.
>
> I don't think this is manageable with the current approach (listing
> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
> How about the time we are going to release the new version after releasing
> 10 versions? What's the criteria of pruning the version?
>
> Unless we have a good answer to these questions, I think it's better to
> revert the functionality - it missed various considerations.
>
> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
> wrote:
>
>> Thanks for reporting - this is odd - the dropdown did not exist in other
>> recent releases.
>>
>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>> 
>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>> 
>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>> 
>>
>> Looks like the dropdown feature was recently introduced but partially
>> done. The addition of a dropdown was done, but the way how to bump the
>> version was missed to be documented.
>> The contributor proposed the way to update the version "automatically",
>> but the PR wasn't merged. As a result, we are neither having the
>> instruction how to bump the version manually, nor having the automatic bump.
>>
>> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
>> 
>> * PR for automatically bumping version:
>> https://github.com/apache/spark/pull/42881
>> 
>>
>> We will probably need to add an instruction in the release process to
>> update the version. (For automatic bumping I don't have a good idea.)
>> I'll look into it. Please expect some delay during the holiday weekend
>> in S. Korea.
>>
>> Thanks again.
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, Jungtaek.
>>>
>>> PySpark document seems to show a wrong branch. At this time, `master`.
>>>
>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>> 
>>>
>>> PySpark Overview
>>> 
>>>
>>>Date: Feb 24, 2024 Version: master
>>>
>>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>>
>>>
>>> Could you do the follow-up, please?
>>>
>>> Thank you in advance.
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>>
 Excellent work, congratulations!

 On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
 wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
According to my understanding, the original intention of this feature is that
when a user has landed on the PySpark docs and finds that the version they are
currently viewing is not the one they want, they can easily jump to the desired
version via the drop-down box. Additionally, the PR below, which adds an
automatic update mechanism, has not been merged in.

https://github.com/apache/spark/pull/42881

So, we need to manually update this file. I can manually submit an update first 
to get this feature working.


From: Jungtaek Lim 
Sent: March 4, 2024 6:34:42
To: Dongjoon Hyun
Cc: dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html
https://spark.apache.org/docs/3.4.2/api/python/index.html
https://spark.apache.org/docs/3.3.4/api/python/index.html

Looks like the dropdown feature was recently introduced but only partially
completed. The dropdown itself was added, but how to bump the version was never
documented.
The contributor proposed a way to update the version "automatically", but the
PR wasn't merged. As a result, we have neither instructions for bumping the
version manually nor an automatic bump.

* PR for addition of dropdown: 
https://github.com/apache/spark/pull/42428
* PR for automatically bumping version: 
https://github.com/apache/spark/pull/42881

We will probably need to add an instruction in the release process to update 
the version. (For automatic bumping I don't have a good idea.)
I'll look into it. Please expect some delay during the holiday weekend in S. 
Korea.

Thanks again.
Jungtaek Lim (HeartSaVioR)


On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
BTW, Jungtaek.

PySpark document seems to show a wrong branch. At this time, `master`.


https://spark.apache.org/docs/3.5.1/api/python/index.html

PySpark Overview

   Date: Feb 24, 2024 Version: master

[Screenshot 2024-02-29 at 21.12.24.png]


Could you do the follow-up, please?

Thank you in advance.

Dongjoon.


On Thu, Feb 29, 2024 at 2:48 PM John Zhuge <jzh...@apache.org> wrote:
Excellent work, congratulations!

On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Congratulations!

Bests,
Dongjoon.

On Wed, Feb 28, 2024 at 11:43 AM beliefer <belie...@163.com> wrote:

Congratulations!



At 2024-02-28 17:43:25, "Jungtaek Lim" <kabhwan.opensou...@gmail.com> wrote:

Hi everyone,

We are happy to announce the availability of Spark 3.5.1!

Spark 3.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.5 maintenance branch of Spark. We strongly
recommend all 3.5 users to upgrade to this stable release.

To download Spark 3.5.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread yangjie01
That sounds like a great suggestion.

From: Jungtaek Lim 
Date: Tuesday, March 5, 2024 10:46
To: Hyukjin Kwon 
Cc: yangjie01 , Dongjoon Hyun , 
dev , user 
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Yes, it's relevant to that PR. I wonder whether, if we want to expose a version
switcher, it should live in a versionless doc (spark-website) rather than in
docs pinned to a specific version.

On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon <gurwls...@apache.org> wrote:
Is this related to 
https://github.com/apache/spark/pull/42428?

cc @Yang,Jie(INF)

On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html
https://spark.apache.org/docs/3.4.2/api/python/index.html
https://spark.apache.org/docs/3.3.4/api/python/index.html

Looks like the dropdown feature was recently introduced but only partially
completed. The dropdown itself was added, but how to bump the version was never
documented.
The contributor proposed a way to update the version "automatically", but the
PR wasn't merged. As a result, we have neither instructions for bumping the
version manually nor an automatic bump.

* PR for addition of dropdown: 
https://github.com/apache/spark/pull/42428
* PR for automatically bumping version: 
https://github.com/apache/spark/pull/42881

We will probably need to add an instruction in the release process to update 
the version. (For automatic bumping I don't have a good idea.)
I'll look into it. Please expect some delay during the holiday weekend in S. 
Korea.

Thanks again.
Jungtaek Lim (HeartSaVioR)


On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
BTW, Jungtaek.

PySpark document seems to show a wrong branch. At this time, `master`.


https://spark.apache.org/docs/3.5.1/api/python/index.html

PySpark Overview

   Date: Feb 24, 2024 Version: master



Could you do the follow-up, please?

Thank you in advance.

Dongjoon.


On Thu, Feb 29, 2024 at 2:48 PM John Zhuge <jzh...@apache.org> wrote:
Excellent work, congratulations!

On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Congratulations!

Bests,
Dongjoon.

On Wed, Feb 28, 2024 at 11:43 AM beliefer <belie...@163.com> wrote:

Congratulations!





At 2024-02-28 17:43:25, "Jungtaek Lim" <kabhwan.opensou...@gmail.com> wrote:
Hi everyone,

We are happy to announce the availability of Spark 3.5.1!

Spark 3.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.5 maintenance branch of Spark. We strongly
recommend all 3.5 users to upgrade to this stable release.

To download Spark 3.5.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
Yes, it's relevant to that PR. I wonder whether, if we want to expose a version
switcher, it should live in a versionless doc (spark-website) rather than in
docs pinned to a specific version.
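
For context, a minimal conf.py sketch of what "versionless" could look like,
assuming the docs keep using the PyData Sphinx theme (its switcher options are
real, but the URL and placement here are illustrative): the built docs would
only carry a pointer to a JSON file hosted on spark-website, so already
released docs never need rebuilding when a new version ships.

# conf.py sketch - assumes pydata-sphinx-theme; the json_url is illustrative only.
version = "3.5.1"

html_theme = "pydata_sphinx_theme"
html_theme_options = {
    "switcher": {
        # Hosted on spark-website, outside any versioned doc tree, so every
        # released doc reads the same, always-current list of versions.
        "json_url": "https://spark.apache.org/static/versions.json",
        "version_match": version,
    },
    "navbar_end": ["version-switcher", "navbar-icon-links"],
}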

On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon  wrote:

> Is this related to https://github.com/apache/spark/pull/42428?
>
> cc @Yang,Jie(INF) 
>
> On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
> wrote:
>
>> Shall we revisit this functionality? The API doc is built with individual
>> versions, and for each individual version we depend on other released
>> versions. This does not seem to be right to me. Also, the functionality is
>> only in PySpark API doc which does not seem to be consistent as well.
>>
>> I don't think this is manageable with the current approach (listing
>> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
>> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
>> How about the time we are going to release the new version after releasing
>> 10 versions? What's the criteria of pruning the version?
>>
>> Unless we have a good answer to these questions, I think it's better to
>> revert the functionality - it missed various considerations.
>>
>> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
>> wrote:
>>
>>> Thanks for reporting - this is odd - the dropdown did not exist in other
>>> recent releases.
>>>
>>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>>
>>> Looks like the dropdown feature was recently introduced but partially
>>> done. The addition of a dropdown was done, but the way how to bump the
>>> version was missed to be documented.
>>> The contributor proposed the way to update the version "automatically",
>>> but the PR wasn't merged. As a result, we are neither having the
>>> instruction how to bump the version manually, nor having the automatic bump.
>>>
>>> * PR for addition of dropdown:
>>> https://github.com/apache/spark/pull/42428
>>> * PR for automatically bumping version:
>>> https://github.com/apache/spark/pull/42881
>>>
>>> We will probably need to add an instruction in the release process to
>>> update the version. (For automatic bumping I don't have a good idea.)
>>> I'll look into it. Please expect some delay during the holiday weekend
>>> in S. Korea.
>>>
>>> Thanks again.
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>>> wrote:
>>>
 BTW, Jungtaek.

 PySpark document seems to show a wrong branch. At this time, `master`.

 https://spark.apache.org/docs/3.5.1/api/python/index.html

 PySpark Overview
 

Date: Feb 24, 2024 Version: master

 [image: Screenshot 2024-02-29 at 21.12.24.png]


 Could you do the follow-up, please?

 Thank you in advance.

 Dongjoon.


 On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:

> Excellent work, congratulations!
>
> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>> Congratulations!
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>
>>> Congratulations!
>>>
>>>
>>>
>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> We are happy to announce the availability of Spark 3.5.1!
>>>
>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>> strongly
>>> recommend all 3.5 users to upgrade to this stable release.
>>>
>>> To download Spark 3.5.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>
>>> We would like to acknowledge all community members for contributing
>>> to this
>>> release. This release would not have been possible without you.
>>>
>>> Jungtaek Lim
>>>
>>> ps. Yikun is helping us through releasing the official docker image
>>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>>> available.
>>>
>>>
>
> --
> John Zhuge
>



Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428?

cc @Yang,Jie(INF) 

On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
wrote:

> Shall we revisit this functionality? The API doc is built with individual
> versions, and for each individual version we depend on other released
> versions. This does not seem to be right to me. Also, the functionality is
> only in PySpark API doc which does not seem to be consistent as well.
>
> I don't think this is manageable with the current approach (listing
> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
> How about the time we are going to release the new version after releasing
> 10 versions? What's the criteria of pruning the version?
>
> Unless we have a good answer to these questions, I think it's better to
> revert the functionality - it missed various considerations.
>
> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
> wrote:
>
>> Thanks for reporting - this is odd - the dropdown did not exist in other
>> recent releases.
>>
>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>
>> Looks like the dropdown feature was recently introduced but partially
>> done. The addition of a dropdown was done, but the way how to bump the
>> version was missed to be documented.
>> The contributor proposed the way to update the version "automatically",
>> but the PR wasn't merged. As a result, we are neither having the
>> instruction how to bump the version manually, nor having the automatic bump.
>>
>> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
>> * PR for automatically bumping version:
>> https://github.com/apache/spark/pull/42881
>>
>> We will probably need to add an instruction in the release process to
>> update the version. (For automatic bumping I don't have a good idea.)
>> I'll look into it. Please expect some delay during the holiday weekend
>> in S. Korea.
>>
>> Thanks again.
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, Jungtaek.
>>>
>>> PySpark document seems to show a wrong branch. At this time, `master`.
>>>
>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>>
>>> PySpark Overview
>>> 
>>>
>>>Date: Feb 24, 2024 Version: master
>>>
>>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>>
>>>
>>> Could you do the follow-up, please?
>>>
>>> Thank you in advance.
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>>
 Excellent work, congratulations!

 On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
 wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>> wrote:
>>
>> Hi everyone,
>>
>> We are happy to announce the availability of Spark 3.5.1!
>>
>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.5 maintenance branch of Spark. We
>> strongly
>> recommend all 3.5 users to upgrade to this stable release.
>>
>> To download Spark 3.5.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing
>> to this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> ps. Yikun is helping us through releasing the official docker image
>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>> available.
>>
>>

 --
 John Zhuge

>>>


Working with a text file that is both compressed by bz2 followed by zip in PySpark

2024-03-04 Thread Mich Talebzadeh
I have downloaded Amazon reviews for sentiment analysis from here. The file
is not particularly large (just over 500MB) but comes in the following
format

test.ft.txt.bz2.zip

So it is a text file that is compressed with bz2 and then zipped. Now I would
like to do all these operations in PySpark, but PySpark cannot read a file
that carries both .bz2 and .zip compression in one step.

The way I do it is to place the downloaded file in a local directory and then
perform a few operations that are simple but messy. I unzip the file with the
zipfile package, which expects an OS-style path as opposed to the URI-style
path "file:///..". This necessitates using two styles: an OS path for the zip
step, and a URI-style path to read the bz2 file directly into a DataFrame in
PySpark.

import os
import zipfile

from pyspark.sql.functions import expr

data_path = "file:///d4T/hduser/sentiments/"
input_file_path = os.path.join(data_path, "test.ft.txt.bz2")
output_file_path = os.path.join(data_path, "review_text_file")
dir_name = "/d4T/hduser/sentiments/"
zipped_file = os.path.join(dir_name, "test.ft.txt.bz2.zip")
bz2_file = os.path.join(dir_name, "test.ft.txt.bz2")

try:
    # Unzip the file
    with zipfile.ZipFile(zipped_file, 'r') as zip_ref:
        zip_ref.extractall(os.path.dirname(bz2_file))

    # Now bz2_file should contain the path to the unzipped file
    print(f"Unzipped file: {bz2_file}")
except Exception as e:
    print(f"Error during unzipping: {str(e)}")

# Load the bz2 file into a DataFrame (Spark reads .bz2 text natively)
df = spark.read.text(input_file_path)

# Remove the '__label__1' and '__label__2' prefixes
df = df.withColumn("review_text", expr("regexp_replace(value, '__label__[12] ', '')"))

Then the rest is just spark-ml

Once I have finished, I remove the bz2 file to clean up:

if os.path.exists(bz2_file):  # Check if bz2 file exists
    try:
        os.remove(bz2_file)
        print(f"Successfully deleted {bz2_file}")
    except OSError as e:
        print(f"Error deleting {bz2_file}: {e}")
else:
    print(f"bz2 file {bz2_file} could not be found")


My question is: can these operations be done more efficiently in PySpark
itself, ideally with a single DataFrame operation reading the original file
(.bz2.zip)?
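
For what it is worth, here is a minimal sketch of a slightly tighter variant,
assuming the same local paths as above: Spark can read bz2-compressed text
natively but has no built-in zip codec, so the only step that has to happen
outside Spark is peeling off the .zip layer, and only that one member needs to
be extracted.

import os
import zipfile

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.appName("sentiments").getOrCreate()

local_dir = "/d4T/hduser/sentiments"          # assumption: same directory as above
zipped_file = os.path.join(local_dir, "test.ft.txt.bz2.zip")

# Peel off only the .zip layer; Spark's text reader decompresses the .bz2 itself.
with zipfile.ZipFile(zipped_file) as zf:
    member = zf.namelist()[0]                 # expected to be test.ft.txt.bz2
    bz2_path = zf.extract(member, local_dir)

df = (spark.read.text(f"file://{bz2_path}")
      .withColumn("review_text",
                  regexp_replace("value", r"__label__[12] ", "")))

df.show(5, truncate=False)
os.remove(bz2_path)                           # optional cleanup of the extracted .bz2

This still materialises the intermediate .bz2 once, which seems unavoidable
without pulling the whole zip member into driver memory; a true single-pass
read of a .bz2.zip would need a custom data source.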

Thanks


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed . It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions
(Werner Von Braun)".


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
Shall we revisit this functionality? The API doc is built with individual
versions, and for each individual version we depend on other released
versions. This does not seem to be right to me. Also, the functionality is
only in PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing
versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
How about the time we are going to release the new version after releasing
10 versions? What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to
revert the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
wrote:

> Thanks for reporting - this is odd - the dropdown did not exist in other
> recent releases.
>
> https://spark.apache.org/docs/3.5.0/api/python/index.html
> https://spark.apache.org/docs/3.4.2/api/python/index.html
> https://spark.apache.org/docs/3.3.4/api/python/index.html
>
> Looks like the dropdown feature was recently introduced but partially
> done. The addition of a dropdown was done, but the way how to bump the
> version was missed to be documented.
> The contributor proposed the way to update the version "automatically",
> but the PR wasn't merged. As a result, we are neither having the
> instruction how to bump the version manually, nor having the automatic bump.
>
> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
> * PR for automatically bumping version:
> https://github.com/apache/spark/pull/42881
>
> We will probably need to add an instruction in the release process to
> update the version. (For automatic bumping I don't have a good idea.)
> I'll look into it. Please expect some delay during the holiday weekend
> in S. Korea.
>
> Thanks again.
> Jungtaek Lim (HeartSaVioR)
>
>
> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
> wrote:
>
>> BTW, Jungtaek.
>>
>> PySpark document seems to show a wrong branch. At this time, `master`.
>>
>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>
>> PySpark Overview
>> 
>>
>>Date: Feb 24, 2024 Version: master
>>
>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>
>>
>> Could you do the follow-up, please?
>>
>> Thank you in advance.
>>
>> Dongjoon.
>>
>>
>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>
>>> Excellent work, congratulations!
>>>
>>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
>>> wrote:
>>>
 Congratulations!

 Bests,
 Dongjoon.

 On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:

> Congratulations!
>
>
>
> At 2024-02-28 17:43:25, "Jungtaek Lim" 
> wrote:
>
> Hi everyone,
>
> We are happy to announce the availability of Spark 3.5.1!
>
> Spark 3.5.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.5 maintenance branch of Spark. We
> strongly
> recommend all 3.5 users to upgrade to this stable release.
>
> To download Spark 3.5.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-5-1.html
>
> We would like to acknowledge all community members for contributing to
> this
> release. This release would not have been possible without you.
>
> Jungtaek Lim
>
> ps. Yikun is helping us through releasing the official docker image
> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
> available.
>
>
>>>
>>> --
>>> John Zhuge
>>>
>>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Peter Toth
Congratulations and thanks Jungtaek for driving this!

Xinrong Meng  wrote (on Friday, March 1, 2024 at 5:24):

> Congratulations!
>
> Thanks,
> Xinrong
>
> On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun 
> wrote:
>
>> Congratulations!
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>
>>> Congratulations!
>>>
>>>
>>>
>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> We are happy to announce the availability of Spark 3.5.1!
>>>
>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>> strongly
>>> recommend all 3.5 users to upgrade to this stable release.
>>>
>>> To download Spark 3.5.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Jungtaek Lim
>>>
>>> ps. Yikun is helping us through releasing the official docker image for
>>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>>
>>>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Jungtaek Lim
Thanks for reporting - this is odd - the dropdown did not exist in other
recent releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html
https://spark.apache.org/docs/3.4.2/api/python/index.html
https://spark.apache.org/docs/3.3.4/api/python/index.html

Looks like the dropdown feature was recently introduced but only partially
completed. The dropdown itself was added, but how to bump the version was
never documented.
The contributor proposed a way to update the version "automatically", but
the PR wasn't merged. As a result, we have neither instructions for bumping
the version manually nor an automatic bump.

* PR for addition of dropdown: https://github.com/apache/spark/pull/42428
* PR for automatically bumping version:
https://github.com/apache/spark/pull/42881

We will probably need to add an instruction in the release process to
update the version. (For automatic bumping I don't have a good idea.)
I'll look into it. Please expect some delay during the holiday weekend
in S. Korea.

Thanks again.
Jungtaek Lim (HeartSaVioR)


On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
wrote:

> BTW, Jungtaek.
>
> PySpark document seems to show a wrong branch. At this time, `master`.
>
> https://spark.apache.org/docs/3.5.1/api/python/index.html
>
> PySpark Overview
> 
>
>Date: Feb 24, 2024 Version: master
>
> [image: Screenshot 2024-02-29 at 21.12.24.png]
>
>
> Could you do the follow-up, please?
>
> Thank you in advance.
>
> Dongjoon.
>
>
> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>
>> Excellent work, congratulations!
>>
>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
>> wrote:
>>
>>> Congratulations!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>>
 Congratulations!



 At 2024-02-28 17:43:25, "Jungtaek Lim" 
 wrote:

 Hi everyone,

 We are happy to announce the availability of Spark 3.5.1!

 Spark 3.5.1 is a maintenance release containing stability fixes. This
 release is based on the branch-3.5 maintenance branch of Spark. We
 strongly
 recommend all 3.5 users to upgrade to this stable release.

 To download Spark 3.5.1, head over to the download page:
 https://spark.apache.org/downloads.html

 To view the release notes:
 https://spark.apache.org/releases/spark-release-3-5-1.html

 We would like to acknowledge all community members for contributing to
 this
 release. This release would not have been possible without you.

 Jungtaek Lim

 ps. Yikun is helping us through releasing the official docker image for
 Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally 
 available.


>>
>> --
>> John Zhuge
>>
>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Dongjoon Hyun
BTW, Jungtaek.

PySpark document seems to show a wrong branch. At this time, `master`.

https://spark.apache.org/docs/3.5.1/api/python/index.html

PySpark Overview


   Date: Feb 24, 2024 Version: master

[image: Screenshot 2024-02-29 at 21.12.24.png]


Could you do the follow-up, please?

Thank you in advance.

Dongjoon.


On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:

> Excellent work, congratulations!
>
> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
> wrote:
>
>> Congratulations!
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>
>>> Congratulations!
>>>
>>>
>>>
>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> We are happy to announce the availability of Spark 3.5.1!
>>>
>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>> strongly
>>> recommend all 3.5 users to upgrade to this stable release.
>>>
>>> To download Spark 3.5.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Jungtaek Lim
>>>
>>> ps. Yikun is helping us through releasing the official docker image for
>>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>>
>>>
>
> --
> John Zhuge
>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
Excellent work, congratulations!

On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>> wrote:
>>
>> Hi everyone,
>>
>> We are happy to announce the availability of Spark 3.5.1!
>>
>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.5 maintenance branch of Spark. We
>> strongly
>> recommend all 3.5 users to upgrade to this stable release.
>>
>> To download Spark 3.5.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> ps. Yikun is helping us through releasing the official docker image for
>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>
>>

-- 
John Zhuge


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Prem Sahoo
Congratulations Sent from my iPhoneOn Feb 29, 2024, at 4:54 PM, Xinrong Meng  wrote:Congratulations!Thanks,XinrongOn Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun  wrote:Congratulations!Bests,Dongjoon.On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:Congratulations!At 2024-02-28 17:43:25, "Jungtaek Lim"  wrote:Hi everyone,We are happy to announce the availability of Spark 3.5.1!Spark 3.5.1 is a maintenance release containing stability fixes. Thisrelease is based on the branch-3.5 maintenance branch of Spark. We stronglyrecommend all 3.5 users to upgrade to this stable release.To download Spark 3.5.1, head over to the download page:https://spark.apache.org/downloads.htmlTo view the release notes:https://spark.apache.org/releases/spark-release-3-5-1.htmlWe would like to acknowledge all community members for contributing to thisrelease. This release would not have been possible without you.Jungtaek Limps. Yikun is helping us through releasing the official docker image for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.




Re: pyspark dataframe join with two different data type

2024-02-29 Thread Mich Talebzadeh
This is what you want: how to join two DFs with a plain ID column in one and
an array column in the other, keeping only the rows where the value is present
in the array (the example below uses integers, but the same approach works for
strings).

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("joins").getOrCreate()

# df1 has a column combined_id holding an array of integers
data1 = [Row(combined_id=[1, 2, 3]), Row(combined_id=[4, 5, 6])]
# df2 has a column mr_id holding single integers
data2 = [Row(mr_id=2), Row(mr_id=5)]

df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

df1.printSchema()
df2.printSchema()

# Perform the join with array_contains. It takes two arguments, an array and a
# value, and returns True if the value exists as an element of the array.
joined_df = df1.join(df2, expr("array_contains(combined_id, mr_id)"))

# Show the result
joined_df.show()

root
 |-- combined_id: array (nullable = true)
 ||-- element: long (containsNull = true)

root
 |-- mr_id: long (nullable = true)

+---+-+
|combined_id|mr_id|
+---+-+
|  [1, 2, 3]|2|
|  [4, 5, 6]|5|
+---+-+
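
An equivalent formulation, in case it is useful (a sketch against the same
df1/df2 as above): explode the array column and do a plain equi-join. The
result is the same, and the equi-join form can let Spark pick a standard join
strategy instead of the nested-loop join that an expression join with
array_contains usually implies.

from pyspark.sql.functions import col, explode

# One row per array element, then a normal equi-join on that element.
exploded = df1.withColumn("id_elem", explode(col("combined_id")))
joined_alt = exploded.join(df2, exploded.id_elem == df2.mr_id).drop("id_elem")

joined_alt.show()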

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Thu, 29 Feb 2024 at 20:50, Karthick Nk  wrote:

> Hi All,
>
> I have two dataframe with below structure, i have to join these two
> dataframe - the scenario is one column is string in one dataframe and in
> other df join column is array of string, so we have to inner join two df
> and get the data if string value is present in any of the array of string
> value in another dataframe,
>
>
> df1 = spark.sql("""
> SELECT
> mr.id as mr_id,
> pv.id as pv_id,
> array(mr.id, pv.id) as combined_id
> FROM
> table1 mr
> INNER JOIN table2 pv ON pv.id = Mr.recordid
>where
> pv.id = '35122806-4cd2-4916-a149-24ea55c2dc36'
> or pv.id = 'a5f03625-6cc5-49df-95eb-df741fe9139b'
> """)
>
> # df1.display()
>
> # Your second query
> df2 = spark.sql("""
> SELECT
> id
> FROM
> table2
> WHERE
> id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>
> """)
>
>
>
> Result data:
> 35122806-4cd2-4916-a149-24ea55c2dc36 only, because this records alone is
> common between string and array of string value.
>
> Can you share the sample snippet, how we can do the join for this two
> different datatype in the dataframe.
>
> if any clarification needed, pls feel free to ask.
>
> Thanks
>
>


pyspark dataframe join with two different data type

2024-02-29 Thread Karthick Nk
Hi All,

I have two dataframes with the structure below, and I have to join them. The
scenario is that the join column is a string in one dataframe and an array of
strings in the other, so I need to inner-join the two and keep a row only when
the string value is present in the array of strings of the other dataframe.


df1 = spark.sql("""
SELECT
mr.id as mr_id,
pv.id as pv_id,
array(mr.id, pv.id) as combined_id
FROM
table1 mr
INNER JOIN table2 pv ON pv.id = Mr.recordid
   where
pv.id = '35122806-4cd2-4916-a149-24ea55c2dc36'
or pv.id = 'a5f03625-6cc5-49df-95eb-df741fe9139b'
""")

# df1.display()

# Your second query
df2 = spark.sql("""
SELECT
id
FROM
table2
WHERE
id = '35122806-4cd2-4916-a149-24ea55c2dc36'

""")



Expected result:
35122806-4cd2-4916-a149-24ea55c2dc36 only, because this record alone is
common between the string value and the array-of-strings value.

Can you share a sample snippet showing how we can do the join across these two
different data types in the dataframes?

If any clarification is needed, please feel free to ask.

Thanks


<    1   2   3   4   5   6   7   8   9   10   >