Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
Hi Craig,

Thank you for sending us more information. Can you answer my previous
question, which I don't think the document addresses? How did you determine
that there were duplicates in the output? How was the output data read? The
FileStreamSink provides exactly-once writes ONLY if you read the output with
the FileStreamSource or the FileSource (batch). A log is used to record
which data files are committed, and those sources know how to use that log
to read the data "exactly-once". So there may be duplicated data written on
disk. If you simply read the data files written to disk directly, you may
see duplicates when there are failures. However, if you read the output
location with Spark, you should get exactly-once results (unless there is a
bug), since Spark will know how to use the commit log to see which data
files are committed and which are not.
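
To illustrate the point above, here is a minimal toy sketch in plain Python
(not Spark code). It assumes a hypothetical writer that first writes a data
file and only afterwards records it in a commit log; the directory name
`_commit_log`, the file naming, and all function names are made up for
illustration. A simulated crash between the two steps leaves an orphaned
data file, so a raw directory listing shows duplicates while a reader that
follows the log does not.

```python
import json
import os
import tempfile


def write_batch(out_dir, batch_id, rows, attempt, crash_before_commit=False):
    """Write one data file for a micro-batch, then record it in the commit log."""
    data_path = os.path.join(out_dir, f"part-{batch_id}-attempt{attempt}.json")
    with open(data_path, "w") as f:
        json.dump(rows, f)
    if crash_before_commit:
        # Simulated failure: the data file is on disk but was never committed.
        return
    log_dir = os.path.join(out_dir, "_commit_log")  # hypothetical log location
    os.makedirs(log_dir, exist_ok=True)
    with open(os.path.join(log_dir, str(batch_id)), "w") as f:
        json.dump([os.path.basename(data_path)], f)


def read_all_files(out_dir):
    """Naive reader: load every data file found in the output directory."""
    rows = []
    for name in sorted(os.listdir(out_dir)):
        path = os.path.join(out_dir, name)
        if os.path.isfile(path):  # skips the log directory itself
            with open(path) as f:
                rows.extend(json.load(f))
    return rows


def read_committed(out_dir):
    """Log-aware reader: load only the files listed in the commit log."""
    rows = []
    log_dir = os.path.join(out_dir, "_commit_log")
    for batch in sorted(os.listdir(log_dir), key=int):
        with open(os.path.join(log_dir, batch)) as f:
            for name in json.load(f):
                with open(os.path.join(out_dir, name)) as data:
                    rows.extend(json.load(data))
    return rows


def demo():
    with tempfile.TemporaryDirectory() as out_dir:
        write_batch(out_dir, 0, ["a", "b"], attempt=0)
        # Batch 1 fails after writing its file, then is retried successfully.
        write_batch(out_dir, 1, ["c", "d"], attempt=0, crash_before_commit=True)
        write_batch(out_dir, 1, ["c", "d"], attempt=1)
        return read_all_files(out_dir), read_committed(out_dir)


if __name__ == "__main__":
    raw, committed = demo()
    print("raw listing:   ", raw)
    print("via commit log:", committed)
```

In this toy run the raw listing contains the rows of batch 1 twice, while
the log-aware read returns each row once.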

Best,

Jerry

On Mon, Sep 18, 2023 at 1:18 PM Craig Alfieri 
wrote:

> Hi Russell/Jerry/Mich,
>
>
>
> Appreciate your patience on this.
>
>
>
> Attached are more details on how this duplication “error” was found.
>
> Since we’re still unsure I am using “error” in quotes.
>
>
>
> We’d love the opportunity to work with any of you directly and/or the
> wider Spark community to triage this or get a better understanding of the
> nature of what we’re experiencing.
>
>
>
> Our platform provides the ability to fully reproduce this.
>
>
>
> Once you have had the chance to review the attached draft, let us know if
> there are any questions in the meantime. Again, we welcome the opportunity
> to work with the teams on this.
>
>
>
> Best-
>
> Craig
>
>
>
>
>
>
>
> *From: *Craig Alfieri 
> *Date: *Thursday, September 14, 2023 at 8:45 PM
> *To: *russell.spit...@gmail.com 
> *Cc: *Jerry Peng , Mich Talebzadeh <
> mich.talebza...@gmail.com>, user@spark.apache.org ,
> connor.mc...@antithesis.com 
> *Subject: *Re: Data Duplication Bug Found - Structured Streaming Versions
> 3.4.1, 3.2.4, and 3.3.2
>
> Hi Russell et al,
>
>
>
> Acknowledging receipt; we’ll get these answers back to the group.
>
>
>
> Follow-up forthcoming.
>
>
>
> Craig
>
>
>
>
>
>
>
> On Sep 14, 2023, at 6:38 PM, russell.spit...@gmail.com wrote:
>
> Exactly once should be output sink dependent, what sink was being used?
>
> Sent from my iPhone
>
>
>
> On Sep 14, 2023, at 4:52 PM, Jerry Peng 
> wrote:
>
> 
>
> Craig,
>
>
>
> Thanks! Please let us know the result!
>
>
>
> Best,
>
>
>
> Jerry
>
>
>
> On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
> Hi Craig,
>
>
>
> Can you please clarify what this bug is and provide sample code causing
> this issue?
>
>
>
> HTH
>
>
> Mich Talebzadeh,
>
> Distinguished Technologist, Solutions Architect & Engineer
>
> London
>
> United Kingdom
>
>
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 14 Sept 2023 at 17:48, Craig Alfieri 
> wrote:
>
> Hello Spark Community-
>
>
>
> As part of a research effort, our team here at Antithesis tests for
> correctness/fault tolerance of major OSS projects.
>
> Our team recently was testing Spark’s Structured Streaming, and we came
> across a data duplication bug we’d like to work with the teams on to
> resolve.
>
>
>
> Our intention is to utilize this as a future case study for our platform,
> but prior to doing so we like to have a resolution in place so that an
> announcement isn’t alarming to the user base.
>
>
>
> Attached is a high level .pdf that reviews the High Availability set-up
> put under test.
>
> This was also tested across the three latest versions, and the same
> behavior was observed.
>
>
>
> We can reproduce this error readily, since our environment is fully
> deterministic, we are just not Spark experts and would like to work with
> someone in the community to resolve this.
>
>
>
> Please let us know at your earliest convenience.
>
>
>
> Best

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Russell et al,

Acknowledging receipt; we’ll get these answers back to the group.

Follow-up forthcoming.

Craig

On Sep 14, 2023, at 6:38 PM, russell.spit...@gmail.com wrote:

Exactly once should be output sink dependent, what sink was being used?

Sent from my iPhone

On Sep 14, 2023, at 4:52 PM, Jerry Peng wrote:

Craig,

Thanks! Please let us know the result!

Best,

Jerry

On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote:

Hi Craig,

Can you please clarify what this bug is and provide sample code causing this issue?

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.

On Thu, 14 Sept 2023 at 17:48, Craig Alfieri wrote:

Hello Spark Community-

As part of a research effort, our team here at Antithesis tests for correctness/fault tolerance of major OSS projects.
Our team recently was testing Spark’s Structured Streaming, and we came across a data duplication bug we’d like to work with the teams on to resolve.

Our intention is to utilize this as a future case study for our platform, but prior to doing so we like to have a resolution in place so that an announcement isn’t alarming to the user base.

Attached is a high level .pdf that reviews the High Availability set-up put under test.
This was also tested across the three latest versions, and the same behavior was observed.

We can reproduce this error readily, since our environment is fully deterministic, we are just not Spark experts and would like to work with someone in the community to resolve this.

Please let us know at your earliest convenience.

Best

Craig Alfieri
c: 917.841.1652
craig.alfi...@antithesis.com
New York, NY.
Antithesis.com

We can't talk about most of the bugs that we've found for our customers,
but some customers like to speak about their work with us:
https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity for whom they are addressed. If you received this message in error, please notify the sender and remove it from your system.

To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread russell . spitzer
Exactly once should be output sink dependent, what sink was being used?

Sent from my iPhone

On Sep 14, 2023, at 4:52 PM, Jerry Peng wrote:

Craig,

Thanks! Please let us know the result!

Best,

Jerry

On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote:

Hi Craig,

Can you please clarify what this bug is and provide sample code causing this issue?

HTH


Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Jerry Peng
Craig,

Thanks! Please let us know the result!

Best,

Jerry

On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh 
wrote:

>
> Hi Craig,
>
> Can you please clarify what this bug is and provide sample code causing
> this issue?
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
> view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 14 Sept 2023 at 17:48, Craig Alfieri 
> wrote:
>
>> Hello Spark Community-
>>
>>
>>
>> As part of a research effort, our team here at Antithesis tests for
>> correctness/fault tolerance of major OSS projects.
>>
>> Our team recently was testing Spark’s Structured Streaming, and we came
>> across a data duplication bug we’d like to work with the teams on to
>> resolve.
>>
>>
>>
>> Our intention is to utilize this as a future case study for our platform,
>> but prior to doing so we like to have a resolution in place so that an
>> announcement isn’t alarming to the user base.
>>
>>
>>
>> Attached is a high level .pdf that reviews the High Availability set-up
>> put under test.
>>
>> This was also tested across the three latest versions, and the same
>> behavior was observed.
>>
>>
>>
>> We can reproduce this error readily, since our environment is fully
>> deterministic, we are just not Spark experts and would like to work with
>> someone in the community to resolve this.
>>
>>
>>
>> Please let us know at your earliest convenience.
>>
>>
>>
>> Best
>>
>>
>>
>> *Craig Alfieri*
>>
>> c: 917.841.1652
>>
>> craig.alfi...@antithesis.com
>>
>> New York, NY.
>>
>> Antithesis.com
>> 
>>
>>
>>
>> We can't talk about most of the bugs that we've found for our customers,
>>
>> but some customers like to speak about their work with us:
>>
>> https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis


Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Jerry- This is exactly the type of help we're seeking, to confirm the
FileStreamSink was not utilized on our test runs.

Our team is going to work towards implementing this and re-running our 
experiments across the versions.

If everything comes back with similar results, we will reach back out to share 
more artifacts with this thread.

Thank you Jerry.


From: Jerry Peng 
Date: Thursday, September 14, 2023 at 1:10 PM
To: Craig Alfieri 
Cc: user@spark.apache.org 
Subject: Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1,
3.2.4, and 3.3.2
Hi Craig,

Thank you for bringing this to the community's attention! Do you have any 
example code you can share that we can use to reproduce this issue?  By the 
way, how did you determine duplicates in the output?  The FileStreamSink 
provides exactly-once writes ONLY if you read the output with the 
FileStreamSource or the FileSource (batch).  A log is used to determine what 
data is committed or not and those aforementioned sources know how to use that 
log to read the data "exactly-once".
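
As a rough way to check which data files on disk are not covered by that
log, the following stdlib-only Python sketch may help. It rests on
assumptions about an internal layout that is not a public API and may differ
between Spark versions: that the file sink keeps its log under
`_spark_metadata` in the output directory, that each log file is named by a
batch id (optionally with a `.compact` suffix), that its first line is a
version marker, and that each following line is a JSON object with a `path`
field and an `action` of `add` or `delete`. The function names are
hypothetical.

```python
import json
import os
from urllib.parse import unquote, urlparse


def committed_file_names(output_dir):
    """Collect data file names recorded as added in the assumed sink log."""
    log_dir = os.path.join(output_dir, "_spark_metadata")  # assumed location
    # Keep only entries named like "12" or "12.compact"; skip hidden/temp files.
    entries = [n for n in os.listdir(log_dir) if n.split(".")[0].isdigit()]
    committed = set()
    for name in sorted(entries, key=lambda n: int(n.split(".")[0])):
        with open(os.path.join(log_dir, name)) as f:
            lines = f.read().splitlines()
        for line in lines[1:]:  # first line is assumed to be a version marker
            if not line.strip():
                continue
            entry = json.loads(line)
            file_name = os.path.basename(unquote(urlparse(entry["path"]).path))
            if entry.get("action", "add") == "add":
                committed.add(file_name)
            else:
                committed.discard(file_name)
    return committed


def uncommitted_file_names(output_dir):
    """List data files present on disk but absent from the assumed sink log."""
    committed = committed_file_names(output_dir)
    on_disk = set()
    for _root, dirs, files in os.walk(output_dir):
        # Do not descend into metadata or hidden directories.
        dirs[:] = [d for d in dirs if not d.startswith(("_", "."))]
        on_disk.update(f for f in files if not f.startswith(("_", ".")))
    return sorted(on_disk - committed)
```

Any file names returned by `uncommitted_file_names` would be candidates for
leftovers from failed attempts; a reader that ignores the log would still
pick them up.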

Best,

Jerry

On Thu, Sep 14, 2023 at 9:48 AM Craig Alfieri
<craig.alfi...@antithesis.com> wrote:
Hello Spark Community-

As part of a research effort, our team here at Antithesis tests for 
correctness/fault tolerance of major OSS projects.
Our team recently was testing Spark’s Structured Streaming, and we came across 
a data duplication bug we’d like to work with the teams on to resolve.

Our intention is to utilize this as a future case study for our platform, but 
prior to doing so we like to have a resolution in place so that an announcement 
isn’t alarming to the user base.

Attached is a high level .pdf that reviews the High Availability set-up put 
under test.
This was also tested across the three latest versions, and the same behavior 
was observed.

We can reproduce this error readily, since our environment is fully 
deterministic, we are just not Spark experts and would like to work with 
someone in the community to resolve this.

Please let us know at your earliest convenience.

Best

Craig Alfieri
c: 917.841.1652
craig.alfi...@antithesis.com
New York, NY.
Antithesis.com

We can't talk about most of the bugs that we've found for our customers,
but some customers like to speak about their work with us:
https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis


