[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-11 Thread The Facts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199848#comment-16199848
 ] 

The Facts edited comment on SPARK-22163 at 10/11/17 6:16 AM:
-

Sean's latest claim is further evidence of his pattern of disinformation and 
abuse.  A JIRA account with Apache has no value when Sean keeps closing 
tickets without understanding them or, worse yet, making blatantly false 
claims. So I disregarded their threat to disable my account and provided them 
with the facts listed below. The admins played politics and did not have the 
integrity to review this ticket and see Sean's false claims. 

For instance, even though the text description of the ticket clearly says 

"My application does not spin up its own thread. All the threads are controlled 
by Spark.", 

he still made the blatantly false, opposite claim that 

 "you imply this happens outside of Spark's threads, in an app thread you 
spawn."

There is a saying that people do not quit their jobs; they quit their bosses. 
For open-source projects without "bosses", the analogy is that people don't 
stop contributing, or quit, because of the work.  They stop contributing, or 
quit, because of abusers who are more interested in closing tickets without 
understanding them, and who abuse their role on the project to block other 
people from opening tickets.







> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
> Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
> Reporter: The Facts
>
> The application's objects can contain Lists and can be modified dynamically 
> while the application runs. However, the Spark Streaming framework 
> asynchronously serializes the application's objects as the application runs. 
> Therefore, a random run-time exception occurs on a List when the framework 
> happens to serialize the application's objects while the application is 
> modifying a List in one of its own objects. 
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the 
> Spark Streaming framework is that it does this serialization asynchronously. 
> Instead, it should either
> 1. do this serialization synchronously. This is preferred, as it eliminates 
> the issue completely.  Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger 
> Spark to do this type of serialization asynchronously, so applications can 
> work around them until the fix is provided. 
> ===
> Vadim Semenov and Steve Loughran, per your inquiries in ticket 
> https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply 
> here because this issue involves Spark's design and not necessarily its code 
> implementation.
> —
> My application does not spin up its own thread. All the threads are 
> controlled by Spark.
> Batch interval = 5 seconds
> Batch #3
> 1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
> threads are done with this batch
> 2. Slave A - Spark Thread #2. Say it takes 10 seconds to complete
> 3. Slave B - Spark Thread #3. Say it takes 1 minute to complete
> 4. Both thread #1 for the driver and thread #2 

[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-08 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195813#comment-16195813
 ] 

Michael N edited comment on SPARK-22163 at 10/8/17 4:08 PM:


Sean, my text description in the ticket clearly says "My application does not 
spin up its own thread. All the threads are controlled by Spark." So you made 
another invalid claim that "you imply this happens outside of Spark's threads, 
in an app thread you spawn."

Spark's thread #4 may not be related to checkpointing. I removed it from the 
text description. However, the asynchronous serialization of application 
objects is still done by Spark's own thread #4, in parallel with Spark's own 
thread #1.

I did not post the code before because 
- it was more important to first understand the current design and its 
rationale for Spark's asynchronous serialization of application objects, which 
is why I posted the questions about that design, and 
- I needed to work around the issue in my application in the meantime, before 
circling back to the tickets.

But you kept closing the tickets without providing the answers to those 
questions.



[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-05 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193907#comment-16193907
 ] 

Michael N edited comment on SPARK-22163 at 10/6/17 12:24 AM:
-

Vadim Semenov and Steve Loughran, per your inquiries in ticket 
https://issues.apache.org/jira/browse/SPARK-21999, I am posting the reply here 
because this issue involves Spark's design and not necessarily its code 
implementation.

---

My application does not spin up its own thread. All the threads are controlled 
by Spark.

Batch interval = 5 seconds

Batch #3
1. Driver - Spark Thread #1 - starts batch #3 and blocks until all slave 
threads are done with this batch
2. Slave A - Spark Thread #2. Say it takes 10 seconds to complete
3. Slave B - Spark Thread #3. Say it takes 1 minute to complete

4. Both thread #1 for the driver and thread #2 for Slave A do not jump ahead 
and process batch #4. Instead, they wait for thread #3 until it is done.  => 
So there is already synchronization among the threads within the same batch. 
Also, batch to batch is synchronous.

5. After Spark Thread #3 is done, the driver does other processing to finish 
the current batch.  In my case, it updates a list of objects.

The above steps repeat for the next batch #4 and subsequent batches.

Based on the exception stack trace, it looks like in step 5, Spark has another 
thread #4 that serializes application objects asynchronously. So it causes 
random occurrences of ConcurrentModificationException, because the list of 
objects is concurrently being changed by Spark's own thread #1 for the driver.
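
To make the race concrete, here is a minimal, self-contained Java sketch of 
that scenario. It is an illustration only, not Spark code: one thread stands 
in for Spark's serializer thread #4 and another for driver thread #1.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

// Not Spark code: a standalone repro of the race described above.
// One thread serializes an ArrayList while another mutates it.
public class CmeRepro {
    public static void main(String[] args) throws Exception {
        List<Integer> state = new ArrayList<>();

        // Stand-in for driver thread #1: continuously updates the list.
        Thread driver = new Thread(() -> {
            for (int i = 0; ; i++) {
                state.add(i);
                if (state.size() > 1000) state.clear();
            }
        });
        driver.setDaemon(true);
        driver.start();

        // Stand-in for serializer thread #4: snapshots the same list.
        // Sooner or later ArrayList.writeObject detects the concurrent
        // modification and throws java.util.ConcurrentModificationException.
        while (true) {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new ByteArrayOutputStream())) {
                out.writeObject(state);
            }
        }
    }
}
{code}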

So the issue is not that my application "is modifying a collection 
asynchronously w.r.t. Spark", as Sean kept claiming. Instead, it is Spark's 
asynchronous operations among its own different threads within the same batch 
that cause this issue.

I understand Spark needs to serialize objects for checkpointing purposes. 
However, since Spark controls all the threads and their synchronization, the 
lack of synchronization between threads #1 and #4 is a Spark design issue, and 
it is what triggers the ConcurrentModificationException. That is the root 
cause of this issue.

Further, even if the application does not modify its list of objects, in step 
5 the driver could be modifying multiple primitive fields, say two integers. 
In thread #1, the driver could have updated integer X but not yet integer Y 
when Spark's thread #4 asynchronously serializes the application objects. The 
persisted serialized data then does not match the actual data. This results in 
a permutation of this issue: a false-positive condition where the serialized 
checkpoint data is only partially correct.
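
The same two-thread setup also illustrates this partial-snapshot case. In the 
sketch below (again an illustration, not Spark code), the "driver" maintains 
the invariant x == y, yet an asynchronous serializer can capture a copy in 
which the invariant is broken:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Not Spark code: shows that an asynchronous serializer can persist a
// state where X is updated but Y is not ("partially correct" data).
public class TornSnapshot implements Serializable {
    int x, y; // the "driver" keeps the invariant x == y

    public static void main(String[] args) throws Exception {
        TornSnapshot s = new TornSnapshot();
        Thread driver = new Thread(() -> {
            for (int i = 0; ; i++) { s.x = i; s.y = i; } // X first, then Y
        });
        driver.setDaemon(true);
        driver.start();

        while (true) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(s); // may run between the two assignments
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(buf.toByteArray()))) {
                TornSnapshot copy = (TornSnapshot) in.readObject();
                if (copy.x != copy.y) { // invariant broken in the snapshot
                    System.out.println("torn snapshot: x=" + copy.x
                            + " y=" + copy.y);
                    return;
                }
            }
        }
    }
}
{code}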

One solution for both issues is to modify Spark's design so that the 
serialization of application objects by Spark's thread #4 is configurable per 
application to be either asynchronous or synchronous with Spark's thread #1. 
That way, it is up to each application to decide, based on the nature of its 
business requirements and needed throughput.
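
In the meantime, one possible application-side stopgap (a sketch, assuming the 
application can route every access to the list through a single wrapper 
reference) is java.util.Collections.synchronizedList. The wrapper's 
writeObject locks the same mutex as its mutator methods, so a concurrent 
serializer sees a consistent snapshot instead of throwing 
ConcurrentModificationException:

{code:java}
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A sketch of the stopgap: keep ONLY the synchronized wrapper and mutate
// through it. Serializing the wrapper locks the same mutex as add/remove,
// so the snapshot is consistent and no CME is thrown.
public class AppState implements Serializable {
    private final List<String> items =
            Collections.synchronizedList(new ArrayList<>());

    public void update(String value) {
        items.add(value); // takes the wrapper's lock
    }
    // When a serializer thread serializes AppState, it serializes the
    // wrapper, whose writeObject takes that same lock.
}
{code}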




[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192352#comment-16192352
 ] 

Michael N edited comment on SPARK-22163 at 10/5/17 2:11 AM:


It is obvious that you don't understand the difference between design flaws 
and coding bugs; in particular, you have not been able to provide the answers 
to these questions:

1. In the first place, why does Spark serialize the application objects 
*asynchronously* while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of each batch, *synchronously*?

Instead of blindly closing tickets, you need to either find the answers and 
post them here or let someone else who is capable address them. 

Btw, your response to ticket 
https://issues.apache.org/jira/browse/SPARK-21999, where you said

  "Your app is modifying a collection asynchronously w.r.t. Spark. Right"

confirmed that you do not understand the issue.  *This issue occurs on both 
the slave nodes and the driver*.  My app is *not* modifying a collection 
asynchronously w.r.t. Spark.  So you kept making the same invalid claim and 
kept closing a ticket that you do not understand.  My Spark Streaming 
application is run synchronously by the Spark Streaming framework from batch 
to batch, and it modifies its data synchronously as part of the batch 
processing. However, the Spark framework has another thread that 
*asynchronously* serializes the application's objects.
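
For context, here is a minimal sketch of the shape of such an application. It 
is not the actual code; the socket source and checkpoint path are 
placeholders. Driver-side state is updated synchronously inside each batch, 
while enabling checkpointing lets Spark serialize application objects on its 
own schedule:

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Placeholder sketch: a driver-side List mutated synchronously per batch,
// with checkpointing enabled so Spark serializes application objects
// on its own schedule.
public class BatchStateSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("sketch").setMaster("local[2]");
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(5));
        jssc.checkpoint("/tmp/checkpoint"); // placeholder path

        final List<String> seen = new ArrayList<>(); // driver-side state

        JavaDStream<String> lines =
                jssc.socketTextStream("localhost", 9999); // placeholder source
        lines.foreachRDD(rdd -> {
            // Runs on the driver, synchronously within the batch.
            for (String s : rdd.take(10)) {
                seen.add(s); // the mutation that can race with serialization
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
{code}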


[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192110#comment-16192110
 ] 

Michael N edited comment on SPARK-22163 at 10/4/17 10:33 PM:
-

Please distinguish between code bugs and design flaws.  That is why this 
ticket is separate from the other ticket.

Here is an analogy to clarify why this is a design flaw. Spark's older map 
framework had a major design flaw: it makes a function call for every single 
object. Its code implementation matched its design, but the design itself has 
massive overhead when there are millions or billions of objects, because it 
must make the same function call millions or billions of times, once per 
object. 
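
As a concrete illustration of that analogy (a sketch using the RDD Java API, 
separate from this ticket's streaming issue): map invokes its function once 
per element, while mapPartitions invokes it once per partition with an 
Iterator:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch of the analogy: one function call per element (map) versus one
// function call per partition (mapPartitions).
public class MapVsMapPartitions {
    public static void main(String[] args) {
        try (JavaSparkContext sc = new JavaSparkContext("local[2]", "analogy")) {
            JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4), 2);

            // Invoked once per element: 4 calls here.
            JavaRDD<Integer> perElement = nums.map(x -> x * 2);

            // Invoked once per partition: 2 calls here, each given an Iterator.
            JavaRDD<Integer> perPartition = nums.mapPartitions(it -> {
                List<Integer> out = new ArrayList<>();
                while (it.hasNext()) out.add(it.next() * 2);
                return out.iterator();
            });

            System.out.println(perElement.collect());   // [2, 4, 6, 8]
            System.out.println(perPartition.collect()); // [2, 4, 6, 8]
        }
    }
}
{code}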

The questions posted previously and re-posted below are intended to provide 
insight into why this issue is a design flaw of Spark's framework when it 
serializes the application objects of a streaming application that runs 
continuously.  Please make sure you understand the difference between code 
bugs and design flaws first, and provide the answers to the questions below 
and resolve them before responding further, instead of arbitrarily closing 
this ticket.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of each batch, ***synchronously***?



was (Author: michaeln_apache):
Please distinguish between code bug vs design flaws.  That is why this ticket 
is separate from the other ticket.

The analogy is the design flaw with the older Spark's map framework where it 
makes a function call for every single object. its code implementation is ok, 
but its design flaw is that it has massive overhead when there are millions and 
billions of objects.  On the other hand, the newer flatMap framework make one 
function call for a list of objects via the Iterator. 

Here are the questions to provide the insights as to why this issue is a design 
flaw of Spark's framework trying to serialize application objects of a 
Streaming application that runs continuously.  Please make sure you understand 
the differences between code bugs vs design flaws first, and provide the 
answers to the questions below and resolve them, before respond further, 
instead of arbitrarily closing this ticket.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch ?

2. If Spark needs to do this type of serialization at all, why does it not do 
at the end of the batch ***synchronously*** ?











[jira] [Comment Edited] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-10-04 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192110#comment-16192110
 ] 

Michael N edited comment on SPARK-22163 at 10/4/17 10:02 PM:
-

Please distinguish between code bugs and design flaws.  That is why this 
ticket is separate from the other one.

The analogy is the design flaw in Spark's older map operator, which makes a 
function call for every single object. Its code implementation is fine, but 
the design has massive overhead when there are millions or billions of 
objects.  By contrast, the newer flatMap operator makes one function call for 
a whole list of objects via an Iterator. 

Here are the questions that provide insight into why this issue is a design 
flaw of Spark's framework, which tries to serialize the application objects of 
a streaming application that runs continuously.  Please make sure you 
understand the difference between code bugs and design flaws first, and then 
answer and resolve the questions below before responding further, instead of 
arbitrarily closing this ticket.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch?



was (Author: michaeln_apache):
Please distinguish between code bugs and design flaws.  That is why this 
ticket is separate from the other one.

The analogy is the design flaw in Spark's older map operator, which makes a 
function call for every single object. Its code implementation is fine, but 
the design has massive overhead when there are millions or billions of 
objects.  By contrast, the newer flatMap operator makes one function call for 
a whole list of objects via an Iterator. 

Here are the questions that provide insight into why this issue is a design 
flaw of Spark's framework, which tries to serialize the application objects of 
a streaming application that runs continuously.  Until you can provide the 
answers to these questions and resolve them, please do not close this ticket.

1. In the first place, why does Spark serialize the application objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch?




