Spark stand-alone mode

2023-09-14 Thread Ilango
Hi all,

We have 4 HPC nodes, with Spark installed individually on each node.

Spark is used in local mode (each driver/executor has 8 cores and 65
GB) through sparklyr/PySpark in RStudio/Posit Workbench. Slurm is used as
the scheduler.

Because this is local mode, we are facing performance issues (there is only
one executor) when dealing with large datasets.

Can I convert these 4 nodes into a Spark standalone cluster? We don't have
Hadoop, so YARN mode is out of scope.

Should I follow the official documentation for setting up a standalone
cluster? Will it work? Is there anything else I need to be aware of?
Can you please share your thoughts?

Thanks,
Elango
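
For reference, a minimal sketch of what pointing a PySpark session at a standalone master might look like once the master and workers are running (Spark 3.x ships `sbin/start-master.sh` and `sbin/start-worker.sh` for that). The host name `hpc-node1`, the 56g executor memory, and the helper names `standalone_conf` and `build_session` are assumptions for illustration only; 7077 is the standalone master's default port.

```python
"""Sketch: settings for pointing PySpark at a Spark standalone master."""


def standalone_conf(master_host, workers=4, cores_per_worker=8,
                    executor_memory_gb=56, port=7077):
    """Build Spark settings aiming for one 8-core executor per worker node.

    All defaults are illustrative assumptions, not tuned recommendations.
    """
    return {
        "spark.master": f"spark://{master_host}:{port}",
        "spark.executor.cores": str(cores_per_worker),
        # Assumed value: leaves some of the 65 GB per node as headroom.
        "spark.executor.memory": f"{executor_memory_gb}g",
        # Upper bound on the total cores the application may claim.
        "spark.cores.max": str(workers * cores_per_worker),
    }


def build_session(conf, app_name="standalone-check"):
    """Create a SparkSession from the settings (needs pyspark installed)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # hpc-node1 is a hypothetical host name for the node running the master.
    for key, value in standalone_conf("hpc-node1").items():
        print(f"{key}={value}")
```

The same keys could equally go into `spark-defaults.conf` or a sparklyr configuration; the point is only that the master URL switches from `local[*]` to `spark://host:port`.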


Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Russell et al,

Acknowledging receipt; we’ll get these answers back to the group.

Follow-up forthcoming.

Craig

On Sep 14, 2023, at 6:38 PM, russell.spit...@gmail.com wrote:

Exactly once should be output sink dependent, what sink was being used?

On Sep 14, 2023, at 4:52 PM, Jerry Peng wrote:

Craig,

Thanks! Please let us know the result!

Best,
Jerry

On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote:

Hi Craig,

Can you please clarify what this bug is and provide sample code causing this issue?

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

   view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  

On Thu, 14 Sept 2023 at 17:48, Craig Alfieri wrote:

Hello Spark Community-

As part of a research effort, our team here at Antithesis tests for correctness/fault tolerance of major OSS projects.
Our team was recently testing Spark’s Structured Streaming, and we came across a data duplication bug that we’d like to work with the teams to resolve.

Our intention is to use this as a future case study for our platform, but before doing so we’d like to have a resolution in place so that an announcement isn’t alarming to the user base.

Attached is a high-level PDF that reviews the High Availability set-up put under test.
This was also tested across the three latest versions, and the same behavior was observed.

We can reproduce this error readily, since our environment is fully deterministic. We are just not Spark experts and would like to work with someone in the community to resolve this.

Please let us know at your earliest convenience.

Best,

Craig Alfieri
c: 917.841.1652
craig.alfi...@antithesis.com
New York, NY.
Antithesis.com

We can't talk about most of the bugs that we've found for our customers,
but some customers like to speak about their work with us:
https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity for whom they are addressed. If you received this message in error, please notify the sender and remove it from your system.
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread russell . spitzer
Exactly once should be output sink dependent, what sink was being used?

On Sep 14, 2023, at 4:52 PM, Jerry Peng wrote:

Craig,

Thanks! Please let us know the result!

Best,
Jerry

On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote:

Hi Craig,

Can you please clarify what this bug is and provide sample code causing this issue?

HTH

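
The point that exactly-once depends on the output sink can be illustrated with a toy model. Structured Streaming may replay a micro-batch after a failure, so the end result depends on how the sink treats a batch it has already seen. The sketch below is plain Python, not Spark code; the class names and the sample deliveries are hypothetical. In Spark itself, the function passed to `foreachBatch` receives a batch id that can be used for this kind of deduplication.

```python
"""Toy model: how a sink's handling of replayed batches affects duplicates."""


class AppendOnlySink:
    """Appends every delivery, including replays of the same batch."""

    def __init__(self):
        self.rows = []

    def write_batch(self, batch_id, rows):
        self.rows.extend(rows)


class IdempotentSink:
    """Remembers committed batch ids and ignores a batch seen before."""

    def __init__(self):
        self.rows = []
        self.committed = set()

    def write_batch(self, batch_id, rows):
        if batch_id in self.committed:
            return  # replay of an already committed batch: skip it
        self.rows.extend(rows)
        self.committed.add(batch_id)


def deliver(sink, deliveries):
    """Feed (batch_id, rows) pairs to the sink and return what it stored."""
    for batch_id, rows in deliveries:
        sink.write_batch(batch_id, rows)
    return sink.rows


# Batch 1 is delivered twice, as could happen after a restart.
DELIVERIES = [(0, ["a", "b"]), (1, ["c"]), (1, ["c"]), (2, ["d"])]

if __name__ == "__main__":
    print(deliver(AppendOnlySink(), DELIVERIES))   # contains "c" twice
    print(deliver(IdempotentSink(), DELIVERIES))   # contains "c" once
```

With the same at-least-once delivery, only the sink that tracks batch ids ends up without duplicates, which is why the sink in use matters for diagnosing the report.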


Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Jerry Peng
Craig,

Thanks! Please let us know the result!

Best,

Jerry

On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote:

>
> Hi Craig,
>
> Can you please clarify what this bug is and provide sample code causing
> this issue?
>
> HTH


Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Jerry, this is exactly the type of help we're seeking. To confirm, the
FileStreamSink was not used in our test runs.

Our team is going to work towards implementing this and re-running our
experiments across the versions.

If everything comes back with similar results, we will reach back out to share
more artifacts with this thread.

Thank you Jerry.


From: Jerry Peng 
Date: Thursday, September 14, 2023 at 1:10 PM
To: Craig Alfieri 
Cc: user@spark.apache.org 
Subject: Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2
Hi Craig,

Thank you for bringing this to the community's attention! Do you have any 
example code you can share that we can use to reproduce this issue?  By the 
way, how did you determine duplicates in the output?  The FileStreamSink 
provides exactly-once writes ONLY if you read the output with the 
FileStreamSource or the FileSource (batch).  A log is used to determine what 
data is committed or not and those aforementioned sources know how to use that 
log to read the data "exactly-once".
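
The role of the log described above can be sketched with a toy model. This is plain Python over in-memory data, not Spark's actual log format; the file names and the `commit_log` structure are hypothetical. In Spark, the file sink keeps its log under the output directory's `_spark_metadata` folder, and a reader that ignores it (for example, one that lists the directory itself) may also pick up files left behind by failed attempts.

```python
"""Toy model: reading sink output with and without the commit log."""


def read_raw(files):
    """Read every file found in the output directory, committed or not."""
    rows = []
    for path in sorted(files):
        rows.extend(files[path])
    return rows


def read_with_log(files, commit_log):
    """Read only the files that the commit log lists, batch by batch."""
    rows = []
    for batch_id in sorted(commit_log):
        for path in commit_log[batch_id]:
            rows.extend(files[path])
    return rows


# Output directory contents: the first attempt for part 0 failed before
# commit, but its file was left on disk; the retry rewrote the same rows.
FILES = {
    "part-0000-attempt1": [1, 2, 3],
    "part-0000-attempt2": [1, 2, 3],
    "part-0001-attempt1": [4, 5],
}

# Commit log: batch id -> files that were actually committed.
COMMIT_LOG = {0: ["part-0000-attempt2", "part-0001-attempt1"]}

if __name__ == "__main__":
    print(read_raw(FILES))                   # rows 1-3 appear twice
    print(read_with_log(FILES, COMMIT_LOG))  # each row appears once
```

So how duplicates were detected matters: counting rows by scanning the output directory directly can show duplicates even when a log-aware reader would see each row once.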

Best,

Jerry




Re: Write Spark Connection client application in Go

2023-09-14 Thread bo yang
Thanks Holden and Martin for the nice words and feedback :)

On Wed, Sep 13, 2023 at 8:22 AM Martin Grund  wrote:

> This is absolutely awesome! Thank you so much for dedicating your time to
> this project!
>
>
> On Wed, Sep 13, 2023 at 6:04 AM Holden Karau  wrote:
>
>> That’s so cool! Great work y’all :)
>>
>> On Tue, Sep 12, 2023 at 8:14 PM bo yang  wrote:
>>
>>> Hi Spark Friends,
>>>
>>> Is anyone interested in using Golang to write Spark applications? We created
>>> a Spark Connect Go Client library. We would love to hear
>>> feedback/thoughts from the community.
>>>
>>> Please see the quick start guide about how to use it. The following is a
>>> very short Spark Connect application in Go:
>>>
>>> func main() {
>>>     // Connect to a Spark Connect server listening on localhost:15002.
>>>     spark, _ := sql.SparkSession.Builder.Remote("sc://localhost:15002").Build()
>>>     defer spark.Stop()
>>>
>>>     // Build a two-row DataFrame from a SQL query and display it.
>>>     df, _ := spark.Sql("select 'apple' as word, 123 as count union all select 'orange' as word, 456 as count")
>>>     df.Show(100, false)
>>>     df.Collect()
>>>
>>>     // Write the DataFrame out as Parquet, overwriting earlier output.
>>>     df.Write().Mode("overwrite").
>>>         Format("parquet").
>>>         Save("file:///tmp/spark-connect-write-example-output.parquet")
>>>
>>>     // Read the Parquet output back and display it.
>>>     df = spark.Read().Format("parquet").
>>>         Load("file:///tmp/spark-connect-write-example-output.parquet")
>>>     df.Show(100, false)
>>>
>>>     // Register a temporary view and query it with SQL.
>>>     df.CreateTempView("view1", true, false)
>>>     df, _ = spark.Sql("select count, word from view1 order by count")
>>> }
>>>
>>>
>>> Many thanks to Martin, Hyukjin, Ruifeng and Denny for creating and
>>> working together on this repo! More contributors are welcome :)
>>>
>>> Best,
>>> Bo
>>>
>>>