Hi!
Thanks for your reply!
For several reasons we don't want to "pipe" the real data through Kafka.
What problems could arise from this approach?
Best,
Rico.
On 05.03.2021 at 09:18, Roland Johann wrote:
Hi Rico,
there is no way to defer records from one micro-batch to the next
one. So it's guaranteed that the data and the trigger event will be
processed within the same batch.
I assume that one trigger event leads to an unknown number of
actual events pulled via HTTP. This bypasses the throughput properties of
Spark streaming. Depending on the amount of resulting HTTP
records, you might consider splitting the pipeline into two parts, as
sketched below:
- process trigger event, pull data from HTTP, write to kafka
- perform structured streaming ingestion
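For illustration, here is a minimal sketch of what that two-stage split could look like. The topic names, the fetchFromRest() helper, the sink format and the checkpoint paths are assumptions made for the example, not details from this thread:

```scala
// Stage 1: consume trigger events, expand them via the REST endpoint and
// publish the resulting records to an intermediate Kafka topic.
// Stage 2: plain structured streaming ingestion of that topic.
import org.apache.spark.sql.SparkSession

case class Fetched(key: String, payload: String)

// Hypothetical helper: calls the REST endpoint for one trigger event and
// returns the expanded rows.
def fetchFromRest(event: String): Seq[Fetched] = ???

val spark = SparkSession.builder.appName("two-stage-ingestion").getOrCreate()
import spark.implicits._

// Stage 1: trigger topic -> HTTP -> intermediate topic
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "trigger-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS event")
  .as[String]
  .flatMap(fetchFromRest _)
  .selectExpr("key", "payload AS value")          // Kafka sink expects key/value
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "resolved-records")
  .option("checkpointLocation", "/tmp/checkpoints/stage1")
  .start()

// Stage 2: intermediate topic -> sink. The Kafka source's rate-limiting
// options (e.g. maxOffsetsPerTrigger) now control the micro-batch size.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "resolved-records")
  .option("maxOffsetsPerTrigger", "10000")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload")
  .writeStream
  .format("parquet")                              // placeholder sink
  .option("path", "/data/ingested")
  .option("checkpointLocation", "/tmp/checkpoints/stage2")
  .start()
```

The second query then sees nothing but ordinary Kafka records, so the usual throughput controls and checkpointing of the Kafka source apply again.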
Kind regards
Dipl.-Inf. Rico Bergmann <i...@ricobergmann.de> wrote on Fri, 5 March 2021 at 09:06:
Hi all!
I'm using Spark Structured Streaming for a data ingestion pipeline.
Basically the pipeline reads events (notifications of newly available
data) from a Kafka topic and then queries a REST endpoint to get the
real data (within a flatMap).
For one single event the pipeline creates a few thousand records (rows)
that have to be stored. To write the data I use foreachBatch().
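For reference, a minimal sketch of such a pipeline, where the topic name, the fetchFromRest() helper and the target table are illustrative assumptions rather than details from this thread:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Fetched(eventId: String, payload: String)

// Hypothetical helper: one REST call per trigger event, returning the
// expanded rows for that event.
def fetchFromRest(event: String): Seq[Fetched] = ???

val spark = SparkSession.builder.appName("ingestion").getOrCreate()
import spark.implicits._

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "notifications")
  .load()
  .selectExpr("CAST(value AS STRING) AS event")
  .as[String]
  .flatMap(fetchFromRest _)                    // one event -> a few thousand rows
  .writeStream
  .foreachBatch { (batch: Dataset[Fetched], batchId: Long) =>
    // Everything flatMap produced for the trigger events of this micro-batch
    // arrives here together as one Dataset.
    batch.write.mode("append").saveAsTable("ingested_data")
  }
  .option("checkpointLocation", "/tmp/checkpoints/ingestion")
  .start()
```

Whatever flatMap emits for the events of one trigger interval is handed to foreachBatch as a single Dataset; there is no mechanism to carry part of it over to the next micro-batch.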
My question now is: is it guaranteed by Spark that all output records of
one event are always contained in a single batch, or can the records also
be split into multiple batches?
Best,
Rico.
--
Roland Johann
Data Architect/Data Engineer
phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany
Mobil: +49 172 365 26 46
Mail: roland.joh...@phenetic.io
Web: phenetic.io
Handelsregister: Amtsgericht Köln (HRB 92595)
Geschäftsführer: Roland Johann, Uwe Reimann