Re: A scene with unstable Spark performance

2022-05-18 Thread Chang Chen
This is a case where resources are fixed within a single SparkContext, but
the SQL queries have different priorities.

Some queries may run only when there are spare resources; once a
high-priority query comes in, the task sets of those queries are either
killed or stalled.

Would it make sense to set the high-priority pool's minShare to a
relatively high value, e.g. 50% or 60% of the total cores?
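
To make the minShare question concrete, here is a minimal sketch (not a
tested production setup) that generates a fair scheduler allocation file
with the high-priority pool's minShare set to a percentage of the total
cores. The pool names, the weights, and the figure of 40 cores are
assumptions made up for illustration; only the element names
(allocations, pool, schedulingMode, weight, minShare) follow Spark's
documented fairscheduler.xml format.

```python
import xml.etree.ElementTree as ET


def build_allocations(total_cores, high_percent=50):
    """Return fairscheduler.xml text with two hypothetical pools.

    minShare is expressed in CPU cores, so the percentage is converted
    to an integer core count here.
    """
    high_min_share = total_cores * high_percent // 100
    # (name, schedulingMode, weight, minShare); names and weights are
    # illustrative assumptions, not values taken from this thread.
    pools = [
        ("high_priority", "FAIR", 2, high_min_share),
        ("low_priority", "FAIR", 1, 0),
    ]
    root = ET.Element("allocations")
    for name, mode, weight, min_share in pools:
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = mode
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 60% of an assumed 40 cores gives minShare = 24 for the high pool.
    print(build_allocations(total_cores=40, high_percent=60))
```

The generated file would be used with spark.scheduler.mode=FAIR and
spark.scheduler.allocation.file, and each thread would select its pool
with sc.setLocalProperty("spark.scheduler.pool", "high_priority"). As far
as I know, minShare only influences which pool's tasks are launched next
as cores free up; it does not by itself kill tasks that are already
running, so long-running low-priority tasks can still delay the
high-priority pool.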


On Wed, May 18, 2022 at 13:28, Sungwoo Park wrote:

> The problem you describe is the motivation for developing Spark on MR3.
> From the blog article (
> https://www.datamonad.com/post/2021-08-18-spark-mr3/):
>
> *The main motivation for developing Spark on MR3 is to allow multiple
> Spark applications to share compute resources such as Yarn containers or
> Kubernetes Pods.*
>
> The problem is due to an architectural limitation of Spark, and I guess
> fixing the problem would require a heavy rewrite of Spark core. When we
> developed Spark on MR3, we were not aware of any attempt being made
> elsewhere (in academia and industry) to address this limitation.
>
> A potential workaround might be to implement a custom Spark application
> that manages the submission of two groups of Spark jobs and controls their
> execution (similarly to Spark Thrift Server). Not sure if this approach
> would fix your problem, though.
>
> If you are interested, see the webpage of Spark on MR3:
> https://mr3docs.datamonad.com/docs/spark/
>
> We have released Spark 3.0.1 on MR3, and Spark 3.2.1 on MR3 is under
> development. For Spark 3.0.1 on MR3, no change is made to Spark and MR3 is
> used as an add-on. The main application of MR3 is Hive on MR3, but Spark on
> MR3 is equally ready for production.
>
> Thank you,
>
> --- Sungwoo
>


Re: Is RDD thread safe?

2019-11-24 Thread Chang Chen
I need to cache the DataFrame to accelerate queries. In that case, two
queries may run the DAG simultaneously before the data is actually cached.

On Tue, Nov 19, 2019 at 9:46 PM, Sonal Goyal wrote:

> The RDD or the DataFrame is distributed and partitioned by Spark so as to
> use all your workers (CPUs) effectively. So all the DataFrame operations
> are actually happening simultaneously, each on a section of the data.
> Why do you want to use threading here?
>
> Thanks,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Tue, Nov 12, 2019 at 7:18 AM Chang Chen  wrote:
>
>>
>> Hi all
>>
>> I have a case where I need to cache a source RDD and then create different
>> DataFrames from it in different threads to accelerate queries.
>>
>> I know that SparkSession is thread safe (
>> https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
>> whether RDD is thread safe or not.
>>
>> Thanks
>> Chang
>>
>


Is RDD thread safe?

2019-11-11 Thread Chang Chen
Hi all

I have a case where I need to cache a source RDD and then create different
DataFrames from it in different threads to accelerate queries.

I know that SparkSession is thread safe (
https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
whether RDD is thread safe or not.

Thanks
Chang


Re: The Future Of DStream

2016-07-27 Thread Chang Chen
Things like Kafka and user-defined sources are not supported yet simply
because Structured Streaming is still in alpha.

Things like sort are not supported because they are difficult to implement,
and I don't think DStream can support them either.

What I want to know is the difference between the APIs (or abstractions).
For example, it is quite easy to reuse the same code for processing batch
data because of the unbounded table abstraction (which comes from Google's
Dataflow paper); that is why the internal engine is based on the logical
plan, the Spark plan, and RDDs. In contrast, DStream can't do the same thing
easily.

Actually, Dataset supports map, flatMap, and reduce, so in theory I can do
any user-defined work. That is why I ask what kind of low-level control
DStream offers that Structured Streaming does not.

Thanks
Chang





On Wed, Jul 27, 2016 at 6:03 PM, Ofir Manor  wrote:

> For the 2.0 release, look for "Unsupported Operations" here:
>
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
> Also, there are bigger gaps - like no Kafka support, no way to plug
> user-defined sources or sinks etc
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Wed, Jul 27, 2016 at 11:24 AM, Chang Chen  wrote:
>
>>
>> I don't understand what kind of low-level control DStream offers that
>> Structured Streaming does not.
>>
>> Thanks
>> Chang
>>
>> On Wednesday, July 27, 2016, Matei Zaharia 
>> wrote:
>>
>>> Yup, they will definitely coexist. Structured Streaming is currently
>>> alpha and will probably be complete in the next few releases, but Spark
>>> Streaming will continue to exist, because it gives the user more low-level
>>> control. It's similar to DataFrames vs RDDs (RDDs are the lower-level API
>>> for when you want control, while DataFrames do more optimizations
>>> automatically by restricting the computation model).
>>>
>>> Matei
>>>
>>> On Jul 27, 2016, at 12:03 AM, Ofir Manor  wrote:
>>>
>>> Structured Streaming in 2.0 is declared as alpha - plenty of bits still
>>> missing:
>>>
>>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>>> I assume that it will be declared stable / GA in a future 2.x release,
>>> and then it will co-exist with DStream for quite a while before someone
>>> suggests starting a deprecation process that will eventually lead to
>>> its removal...
>>> As users, I guess we will need to apply judgement about when to switch
>>> to Structured Streaming - each of us has a different risk/value tradeoff,
>>> based on our specific situation...
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen 
>>> wrote:
>>>
>>>> Hi guys
>>>>
>>>> Structured Streaming is coming with Spark 2.0, but I noticed that
>>>> DStream is still here.
>>>>
>>>> What is the future of DStream? Will it be deprecated and eventually
>>>> removed, or will it coexist with Structured Streaming forever?
>>>>
>>>> Thanks
>>>> Chang
>>>>
>>>>
>>>
>>>
>


Re: The Future Of DStream

2016-07-27 Thread Chang Chen
I don't understand what kind of low-level control DStream offers that
Structured Streaming does not.

Thanks
Chang

On Wednesday, July 27, 2016, Matei Zaharia  wrote:

> Yup, they will definitely coexist. Structured Streaming is currently alpha
> and will probably be complete in the next few releases, but Spark Streaming
> will continue to exist, because it gives the user more low-level control.
> It's similar to DataFrames vs RDDs (RDDs are the lower-level API for when
> you want control, while DataFrames do more optimizations automatically by
> restricting the computation model).
>
> Matei
>
> On Jul 27, 2016, at 12:03 AM, Ofir Manor wrote:
>
> Structured Streaming in 2.0 is declared as alpha - plenty of bits still
> missing:
>
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
> I assume that it will be declared stable / GA in a future 2.x release, and
> then it will co-exist with DStream for quite a while before someone
> suggests starting a deprecation process that will eventually lead to its
> removal...
> As users, I guess we will need to apply judgement about when to switch to
> Structured Streaming - each of us has a different risk/value tradeoff,
> based on our specific situation...
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
> 
>
> On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen wrote:
>
>> Hi guys
>>
>> Structured Streaming is coming with Spark 2.0, but I noticed that DStream
>> is still here.
>>
>> What is the future of DStream? Will it be deprecated and eventually
>> removed, or will it coexist with Structured Streaming forever?
>>
>> Thanks
>> Chang
>>
>>
>
>


The Future Of DStream

2016-07-26 Thread Chang Chen
Hi guys

Structured Streaming is coming with Spark 2.0, but I noticed that DStream is
still here.

What is the future of DStream? Will it be deprecated and eventually
removed, or will it coexist with Structured Streaming forever?

Thanks
Chang