Re: Structured Streaming and Spark Connect

Mich Talebzadeh Mon, 23 Sep 2024 11:48:33 -0700

Hi Anastasia,

My take is that in its current form, Spark Connect is not suitable for
running long-lived Structured Streaming queries in Standalone mode,
especially with long trigger intervals. The lack of support for detached
streaming queries makes it problematic for this particular use case. To
make Structured Streaming work in Standalone mode, you could:

   1. Use spark-submit in cluster mode instead of Spark Connect.
   2. Consider alternative cluster managers like YARN or k8s for better
   driver management.
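
As a rough illustration of option 1, here is a minimal sketch of a
self-contained driver application that could be handed to spark-submit, so
the query lives in its own driver rather than in a Connect client session.
The "rate" source, the console sink, the application name and the checkpoint
path argument are placeholders I have assumed for illustration, not details
from your setup. Also note that, as far as I am aware, Standalone accepts
Python applications only in client deploy mode, so a PySpark version of this
would need client mode there, or cluster mode on YARN or k8s.

```python
import sys


def build_query(spark, checkpoint_dir, trigger_interval="2 hours"):
    """Start a streaming query that fires once per trigger_interval."""
    # Assumption: the built-in "rate" test source stands in for the real input.
    events = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    return (
        events.writeStream
        .format("console")  # assumption: placeholder sink
        .option("checkpointLocation", checkpoint_dir)
        .trigger(processingTime=trigger_interval)
        .start()
    )


def main(checkpoint_dir):
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("long-trigger-stream").getOrCreate()
    query = build_query(spark, checkpoint_dir)
    # Block the driver for as long as the query is active.
    query.awaitTermination()


# Run only when a checkpoint directory is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

The point of the design is that awaitTermination() keeps the driver process
alive for the lifetime of the query, so nothing depends on a client session
staying open.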

Specific answers

q) Is Spark Connect intended to support “detached” Streaming Queries?

No, currently Spark Connect ties queries to the client session. Streaming
queries stop when the session ends. Detached queries are not yet supported.

q) Could Streaming Queries be detached from the session, as they are
continuous?

This is a valid request. Detaching streaming queries would allow them to
run independently, ensuring long-running jobs don’t stop when the session
ends. This would require changes in Spark’s session management.

q) Would you extend control options in Spark Connect UI (start, stop, reset
checkpoints)?

Yes, adding controls to start, stop, or reset streaming queries would
improve usability, especially for production systems. This feature would
give users more dynamic management of long-running streaming jobs.

Have a look at this article of mine:

Building an Event-Driven Real-Time Data Processor with Spark Structured
Streaming and API Integration
<https://www.linkedin.com/pulse/building-event-driven-real-time-data-processor-spark-mich-zy3ef/?trackingId=RIwY%2FePi0jslLiXqOP8mxQ%3D%3D>

HTH,

Mich Talebzadeh

Architect | Data Engineer | Data Science | Financial Crime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

 https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Mon, 23 Sept 2024 at 17:41, Anastasiia Sokhova
<anastasiia.sokh...@honic.eu.invalid> wrote:

> Dear Spark Team,
>
>
>
> I am working with a standalone cluster, and I am using Spark Connect to
> submit my applications.
>
> My current version is 3.5.1.
>
>
>
> I am trying to run Structured Streaming Queries with relatively long
> trigger intervals (2 hours, 1 day).
>
> The first issue I encountered was “Streaming query has been idle and
> waiting for new data more than 10000ms”. I solved it by increasing the
> value in the internal config property
>  ‘spark.sql.streaming.noDataProgressEventInterval’.
>
> Now my query is not considered idle anymore but Connect expires the
> session after ~1 hour, and the query is killed with it.
>
>
>
> I believe, I have studied everything I could find online, but I could not
> find the answers.
>
> I would really appreciate if you provided some 😊
>
>
>
> Is it not intended for Spark Connect to support “detached” Streaming
> Queries?
>
> Would you consider detaching StreamingQueries from the sessions that start
> them, as they are meant to run continuously?
>
> Would you consider extending control options in Spark Connect UI (start,
> stop, reset checkpoints)?
>
> It will help the users like me, who want to use Spark’s Structured
> Streaming and Connect without running additional applications just to keep
> the session alive.
>
>
>
> I will be happy to answer any question from your side or provide more
> details.
>
>
>
> Best regards,
>
> Anastasiia
>
