Re: Clarification with Spark Structured Streaming

Mich Talebzadeh Mon, 09 Oct 2023 12:47:59 -0700

Your mileage will vary. Often a cloud data warehouse (CDW) such as
BigQuery, Redshift or Snowflake is already in place. They can all do a
good job, to varying degrees.


   - Use efficient data types. Choose data types that are efficient for
   Spark to process. For example, use integer data types for columns that
   store integer values.
   - Avoid using complex data types. Complex data types, such as nested
   structs and arrays, can be less efficient for Spark to process.
   - Opt for a columnar storage format like Parquet for your sink table.
   Columnar storage is highly efficient for analytical workloads as it allows
   for column-level compression and predicate pushdown.
   - These CDWs come with partitioning options. Date or time columns are
   popular partitioning keys. Partitioning reduces the amount of data
   scanned during queries.
   - Some of these CDWs come with native streaming capabilities, like
   BigQuery Streaming. I believe Snowflake has the Snowpipe Streaming API as
   well (I don't know much about it). These options enable real-time data
   ingestion and processing, with no need for manual batch processing.
   - You can batch incoming data for efficient processing, which can
   improve performance and simplify data handling. Start by configuring your
   streaming query with an appropriate batch interval (in Structured
   Streaming, the trigger interval). The batch interval defines how often
   Spark will process a batch of data. Choose a batch interval that balances
   latency and throughput based on the application's needs. Spark can
   process batches of data more efficiently than it can process individual
   records.
   - Snowflake says it is serverless and so is BigQuery. They are designed
   to provide uniform performance regardless of workload. Serverless CDWs
   can efficiently handle both batch and streaming workloads without the need
   for manual resource provisioning.
   - Use materialized views to pre-compute query results, which can improve
   the performance of frequently executed queries. This technique has been
   around since classic RDBMSs.
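
To make the batching point above concrete, here is a minimal plain-Python
sketch (not Spark API code) of grouping records into fixed-size batches so
that the sink sees fewer, larger writes. The names micro_batches and
CountingSink are hypothetical, made up for illustration only. In Structured
Streaming itself the engine forms the micro-batches for you.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


class CountingSink:
    """Hypothetical stand-in for a warehouse sink; it only counts writes."""

    def __init__(self) -> None:
        self.write_calls = 0
        self.rows = 0

    def write(self, rows: List[int]) -> None:
        # One call here stands for one round trip to the sink.
        self.write_calls += 1
        self.rows += len(rows)


sink = CountingSink()
for batch in micro_batches(range(10), batch_size=4):
    sink.write(batch)

print(sink.write_calls, sink.rows)
```

With ten records and a batch size of four, the sink receives three writes
instead of ten. A larger batch size lowers the number of round trips at the
cost of higher latency per record.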

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 9 Oct 2023 at 17:50, [email protected] <[email protected]>
wrote:

> Thank you for your feedback Mich.
>
> In general, how can one optimise the cloud data warehouses (the sink part)
> to handle streaming Spark data efficiently, avoiding the bottlenecks that
> were discussed?
>
>
> AK
> On Monday, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh <
> [email protected]> wrote:
>
>
> Hi,
>
> Please see my responses below:
>
> 1) In Spark Structured Streaming does commit mean streaming data has been
> delivered to the sink like Snowflake?
>
> No. A commit does not refer to data being delivered to a sink like
> Snowflake or BigQuery. The term commit refers to Spark Structured Streaming
> (SSS) internals. Specifically, it means that a micro-batch of data has been
> processed by SSS. In the checkpoint directory there is a
> subdirectory called commits that marks the micro-batch process as completed.
>
> 2) If sinks like Snowflake cannot absorb or digest streaming data in a
> timely manner, will there be an impact on Spark streaming itself?
>
> Yes, it can potentially impact SSS. If the sink cannot absorb data in a
> timely manner, the batches will start to back up in SSS. This can cause
> Spark to run out of memory and the streaming job to fail. As I understand it,
> Spark will use a combination of memory and disk storage (checkpointing).
> This can also happen if the network interface between Spark and the sink is
> disrupted. Alternatively, Spark may slow down as it tries to process
> the backed-up batches of data. You want to avoid these scenarios.
>
> HTH
>
> On Sun, 8 Oct 2023 at 19:50, [email protected]
> <[email protected]> wrote:
>
> Hello team
>
> 1) In Spark Structured Streaming does commit mean streaming data has been
> delivered to the sink like Snowflake?
>
> 2) If sinks like Snowflake cannot absorb or digest streaming data in a
> timely manner, will there be an impact on Spark streaming itself?
>
> Thanks
>
> AK
>
>
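
On the commit question in the quoted thread, a rough way to see which
micro-batches Spark has marked as completed is to list the commits
subdirectory of the checkpoint. The sketch below assumes a checkpoint on a
local filesystem, and assumes the usual layout in which an offsets
subdirectory sits next to commits, each holding one file per batch id. The
function names are hypothetical, and the demo builds a fake checkpoint
directory rather than reading a real one.

```python
import tempfile
from pathlib import Path
from typing import Optional, Set


def batch_ids(log_dir: Path) -> Set[int]:
    """Return the batch ids found in one checkpoint log directory.

    Only purely numeric file names are counted, which skips hidden
    checksum files and temporary files.
    """
    if not log_dir.is_dir():
        return set()
    return {
        int(p.name)
        for p in log_dir.iterdir()
        if p.is_file() and p.name.isdecimal()
    }


def latest_committed(checkpoint: Path) -> Optional[int]:
    """Highest batch id present under commits, or None if there is none."""
    ids = batch_ids(checkpoint / "commits")
    return max(ids) if ids else None


def uncommitted(checkpoint: Path) -> Set[int]:
    """Batch ids that have an offsets entry but no commits entry yet."""
    return batch_ids(checkpoint / "offsets") - batch_ids(checkpoint / "commits")


# Demo on a fake checkpoint: batches 0-2 were planned, only 0-1 completed.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp)
    (ckpt / "offsets").mkdir()
    (ckpt / "commits").mkdir()
    for i in (0, 1, 2):
        (ckpt / "offsets" / str(i)).write_text("fake offsets entry")
    for i in (0, 1):
        (ckpt / "commits" / str(i)).write_text("fake commit entry")
    (ckpt / "commits" / ".1.crc").write_text("ignored")  # hidden file, skipped

    last_done = latest_committed(ckpt)
    pending = uncommitted(ckpt)

print(last_done, pending)
```

A batch id that shows up under offsets but not under commits is one that
was started and has not yet been marked as completed. For checkpoints on
HDFS or object storage the directory listing would need that store's own
client instead of pathlib.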
