Hi!
Thanks for your reply!
For several reasons we don't want to "pipe" the real data through Kafka.
What problems could arise from this approach?
Best,
Rico.
On 05.03.2021 at 09:18, Roland Johann wrote:
Hi Rico,
there is no way to defer records from one micro-batch to the next
one. So it's guaranteed that the data and the trigger event will be
processed within the same batch.
I assume that one trigger event leads to an unknown number of
actual events pulled via HTTP. This bypasses the throughput properties of
Spark streaming. Depending on the amount of resulting HTTP
records, you might consider splitting the pipeline into two parts, as
sketched below:
- process trigger event, pull data from HTTP, write to kafka
- perform structured streaming ingestion
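For illustration, here is a minimal sketch of what that two-stage split could look like. The topic names, the fetchFromRest() helper, the sink format and the checkpoint paths are assumptions made for the example, not details from this thread:

```scala
// Stage 1: consume trigger events, expand them via the REST endpoint and
// publish the resulting records to an intermediate Kafka topic.
// Stage 2: plain structured streaming ingestion of that topic.
import org.apache.spark.sql.SparkSession

case class Fetched(key: String, payload: String)

// Hypothetical helper: calls the REST endpoint for one trigger event and
// returns the expanded rows.
def fetchFromRest(event: String): Seq[Fetched] = ???

val spark = SparkSession.builder.appName("two-stage-ingestion").getOrCreate()
import spark.implicits._

// Stage 1: trigger topic -> HTTP -> intermediate topic
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "trigger-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS event")
  .as[String]
  .flatMap(fetchFromRest _)
  .selectExpr("key", "payload AS value")          // Kafka sink expects key/value
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "resolved-records")
  .option("checkpointLocation", "/tmp/checkpoints/stage1")
  .start()

// Stage 2: intermediate topic -> sink. The Kafka source's rate-limiting
// options (e.g. maxOffsetsPerTrigger) now control the micro-batch size.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "resolved-records")
  .option("maxOffsetsPerTrigger", "10000")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload")
  .writeStream
  .format("parquet")                              // placeholder sink
  .option("path", "/data/ingested")
  .option("checkpointLocation", "/tmp/checkpoints/stage2")
  .start()
```

The second query then sees nothing but ordinary Kafka records, so the usual throughput controls and checkpointing of the Kafka source apply again.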
Kind regards
Dipl.-Inf. Rico Bergmann <i...@ricobergmann.de> wrote on Fri, 5 March 2021 at 09:06:
Hi all!
I'm using Spark Structured Streaming for a data ingestion pipeline.
Basically the pipeline reads events (notifications of newly available
data) from a Kafka topic and then queries a REST endpoint to get the
real data (within a flatMap).
For one single event the pipeline creates a few thousand records (rows)
that have to be stored. To write the data I use foreachBatch().
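For reference, a minimal sketch of such a pipeline, where the topic name, the fetchFromRest() helper and the target table are illustrative assumptions rather than details from this thread:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Fetched(eventId: String, payload: String)

// Hypothetical helper: one REST call per trigger event, returning the
// expanded rows for that event.
def fetchFromRest(event: String): Seq[Fetched] = ???

val spark = SparkSession.builder.appName("ingestion").getOrCreate()
import spark.implicits._

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "notifications")
  .load()
  .selectExpr("CAST(value AS STRING) AS event")
  .as[String]
  .flatMap(fetchFromRest _)                    // one event -> a few thousand rows
  .writeStream
  .foreachBatch { (batch: Dataset[Fetched], batchId: Long) =>
    // Everything flatMap produced for the trigger events of this micro-batch
    // arrives here together as one Dataset.
    batch.write.mode("append").saveAsTable("ingested_data")
  }
  .option("checkpointLocation", "/tmp/checkpoints/ingestion")
  .start()
```

Whatever flatMap emits for the events of one trigger interval is handed to foreachBatch as a single Dataset; there is no mechanism to carry part of it over to the next micro-batch.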
My question now is: is it guaranteed by Spark that all output records of
one event are always contained in a single batch, or can the records also
be split into multiple batches?
Best,
Rico.
--
Roland Johann
Data Architect/Data Engineer
phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany
Mobil: +49 172 365 26 46
Mail: roland.joh...@phenetic.io
Web: phenetic.io
Handelsregister: Amtsgericht Köln (HRB 92595)
Geschäftsführer: Roland Johann, Uwe Reimann