Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-22 Thread Shiqi Sun
Hi all,

Sorry for being late to the party. I went through the SPIP doc and I think
this is a great proposal! I left a comment in the SPIP doc a couple days
ago, but I don't see much activity there and no one replied, so I wanted to
cross-post it here to get some feedback.

I'm Shiqi Sun, and I work for Big Data Platform in Salesforce. My team has
been running the Spark on k8s operator (OSS from Google) in my company to
serve Spark users in production for 4+ years, and we've been actively
contributing to the Spark on k8s operator OSS and also, occasionally, the
Spark OSS. In our experience, Google's Spark Operator has its own problems,
like its close coupling with the Spark version, as well as the JVM overhead
during job submission. On the other hand, it's been a great component in our
team's service in the company. Being written in golang, it's really easy to
have it interact with k8s, and its CRD covers a lot of different use cases,
as it has been built up over time thanks to many users' contributions over
the years. There were also a handful of sessions on Google's Spark Operator
at Spark Summit that made it widely adopted.

For this SPIP, I really love the idea of an official k8s operator for the
Spark project, as well as the separate submission worker layer and being
Spark version agnostic. I think we can get the best of the two:
1. I would advocate that the new project still use golang for the
implementation, as golang is the go-to cloud native language that works the
best with k8s.
2. We should make sure the functionality of Google's current Spark operator
CRD is preserved in the new official Spark Operator; if we can make it
compatible, or even merge the two projects to form the new official
operator in the Spark project, that would be best.
3. The new Spark Operator should continue being Spark version agnostic and
continue having this lightweight/separate submission worker layer. We've
seen scalability issues caused by the heavy JVM during spark-submit in
Google's Spark Operator, and we implemented an internal fix for it within
our company.

We can continue the discussion in more detail, but generally I love this
move toward an official Spark operator, and I really appreciate the effort!
In the SPIP doc, I see my comment has gained several upvotes from people I
don't know, so I believe there are other Spark/Spark operator users who
agree with some of my points. Let me know what you all think and let's
continue the discussion, so that we can make this operator a great new
component of the Open Source Spark Project!

Thanks!

Shiqi

On Mon, Nov 13, 2023 at 11:50 PM L. C. Hsieh wrote:

> Thanks for all the support from the community for the SPIP proposal.
>
> Since all questions/discussions have been settled (if I didn't miss any
> major ones), and if there are no more questions or concerns, I'll be the
> shepherd for this SPIP proposal and call for a vote tomorrow.
>
> Thank you all!
>
> On Mon, Nov 13, 2023 at 6:43 PM Zhou Jiang wrote:
> >
> > Hi Holden,
> >
> > Thanks a lot for your feedback!
> > Yes, this proposal attempts to integrate existing solutions, especially
> > from the CRD perspective. The proposed schema retains similarity with
> > current designs, while reducing duplicates and maintaining a single
> > source of truth from conf properties. It also tends to be close to native
> > integration with k8s to minimize schema changes for new features.
> > For dependencies, packing everything is the easiest way to get started.
> > It would be straightforward to add --packages and --repositories support
> > for Maven dependencies. It's technically possible to pull dependencies in
> > cloud storage from init containers (if defined by user). It could be
> > tricky to design a general solution that supports different cloud
> > providers from the operator layer. An enhancement that I can think of is
> > to add support for profile scripts that can enable additional
> > user-defined actions in application containers.
> > Operator does not have to build everything for k8s version
> > compatibility. Similar to Spark, operator can be built on the Fabric8
> > client (https://github.com/fabric8io/kubernetes-client) for support
> > across versions, given that it makes similar API calls for resource
> > management as Spark. For tests, in addition to the fabric8 mock server,
> > we may also borrow the idea from the Flink operator to start a minikube
> > cluster for integration tests.
> > This operator is not starting from scratch as it is derived from an
> > internal project which has been working at prod scale for a few years. It
> > aims to include a few new features / enhancements, and some
> > re-architecture, mostly to incorporate lessons learnt in designing the
> > CRD / API.
> > Benchmarking operator performance alone can be nuanced, often tied to
> > the underlying cluster. There's a testing strategy that Aaruna & I
> > discussed in a previous Data AI

Re: [DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2023-11-22 Thread Jungtaek Lim
Thanks Anish for proposing the SPIP and initiating this thread! I believe
this SPIP will help a bunch of complex use cases on streaming.

dev@: We happen to be initiating this discussion during the Thanksgiving
holidays. We understand people in the US may not have time to review the
SPIP, and we plan to bump this thread early next week. We are open to any
feedback from outside the US during the holiday. We can either address
feedback all together after the holiday (Anish is in the US), or I can
answer if the feedback is more of a question. Thanks!

On Thu, Nov 23, 2023 at 5:27 AM Anish Shrigondekar <
anish.shrigonde...@databricks.com> wrote:

> Hi dev,
>
> I would like to start a discussion on "Structured Streaming - Arbitrary
> State API v2". This proposal aims to address a bunch of limitations we see
> today using the mapGroupsWithState/flatMapGroupsWithState operators. The
> detailed set of limitations is described in the SPIP doc.
>
> We propose to support various features such as multiple state variables
> (flexible data modeling), composite types, enhanced timer functionality,
> support for chaining operators after the new operator, handling initial
> state along with state data source, schema evolution, etc. This will allow
> users to write more powerful streaming state management logic primarily
> used in operational use-cases. Other built-in stateful operators could also
> benefit from such changes in the future.
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-45939
> SPIP:
> https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing
> Design Doc:
> https://docs.google.com/document/d/1QjZmNZ-fHBeeCYKninySDIoOEWfX6EmqXs2lK097u9o/edit?usp=sharing
>
> cc - @Jungtaek Lim, who has graciously
> agreed to be the shepherd for this project
>
> Looking forward to your feedback!
>
> Thanks,
> Anish
>


[DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2023-11-22 Thread Anish Shrigondekar
Hi dev,

I would like to start a discussion on "Structured Streaming - Arbitrary
State API v2". This proposal aims to address a bunch of limitations we see
today using the mapGroupsWithState/flatMapGroupsWithState operators. The
detailed set of limitations is described in the SPIP doc.

We propose to support various features such as multiple state variables
(flexible data modeling), composite types, enhanced timer functionality,
support for chaining operators after the new operator, handling initial
state along with state data source, schema evolution, etc. This will allow
users to write more powerful streaming state management logic primarily
used in operational use-cases. Other built-in stateful operators could also
benefit from such changes in the future.

JIRA: https://issues.apache.org/jira/browse/SPARK-45939
SPIP:
https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing
Design Doc:
https://docs.google.com/document/d/1QjZmNZ-fHBeeCYKninySDIoOEWfX6EmqXs2lK097u9o/edit?usp=sharing

cc - @Jungtaek Lim, who has graciously
agreed to be the shepherd for this project

Looking forward to your feedback!

Thanks,
Anish