Hi Rion,

It sounds like ReadFromKafkaDoFn
<https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java>
could be one of the solutions. It takes a KafkaSourceDescriptor
<https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaSourceDescriptor.java>
(basically a topic + partition) as input and emits KafkaRecords. Your
pipeline could then look like:

testPipeline
    .apply(your source that generates KafkaSourceDescriptors)
    .apply(ParDo.of(ReadFromKafkaDoFn))
    .apply(other parts)
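If it helps, here is a rough sketch of the same idea using
KafkaIO.readSourceDescriptors(), the public transform that wraps
ReadFromKafkaDoFn (ReadFromKafkaDoFn itself is an internal class, so it is
easier to go through KafkaIO). The topic name, partition numbers, bootstrap
server address, and String deserializers below are placeholders, and the
exact KafkaSourceDescriptor.of(...) signature can vary between Beam
versions, so check the javadoc for your release:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.KafkaSourceDescriptor;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PerPartitionKafkaRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // One descriptor per topic/partition you want to read independently.
    // "events" and partitions 0/1 are placeholders for your own layout; the
    // nulls leave start/stop offsets, start/stop read times, and
    // per-descriptor bootstrap servers unset (an unbounded read from the
    // consumer's default position).
    PCollection<KafkaSourceDescriptor> descriptors =
        p.apply(
            "Generate KafkaSourceDescriptors",
            Create.of(
                KafkaSourceDescriptor.of(
                    new TopicPartition("events", 0), null, null, null, null, null),
                KafkaSourceDescriptor.of(
                    new TopicPartition("events", 1), null, null, null, null, null)));

    // readSourceDescriptors() runs ReadFromKafkaDoFn under the hood; each
    // descriptor becomes its own restriction with its own offset tracking.
    PCollection<KafkaRecord<String, String>> records =
        descriptors.apply(
            "Read from Kafka",
            KafkaIO.<String, String>readSourceDescriptors()
                .withBootstrapServers("localhost:9092")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class));

    // ... remaining pipeline on `records` ...
    p.run().waitUntilFinish();
  }
}

Because each KafkaSourceDescriptor is tracked (and watermarked)
independently, this should give you separate progress per descriptor
without splitting the data across separate topics, provided your sources
map onto partitions.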
On Mon, Feb 1, 2021 at 8:06 AM Rion Williams <[email protected]> wrote:

> Hey all,
>
> I'm currently in a situation where I have a single Kafka topic with data
> across multiple partitions that covers data from multiple sources. I'm
> trying to see if there's a way to read from these different sources as
> different pipelines, and whether a Splittable DoFn can do this.
>
> Basically, what I'd like to do is, for a given key on a record, treat this
> as a separate pipeline from Kafka:
>
> testPipeline
>     .apply(
>         /*
>         Apply some function here to tell Kafka how to describe how to
>         split up the sources that I want to read from
>         */
>     )
>     .apply("Read from Kafka", KafkaIO.read(...))
>     .apply("Remaining Pipeline Omitted for Brevity")
>
> Is it possible to do this? I'm trying to avoid a major architectural
> change that would require multiple separate topics by source; however, if
> I can guarantee that a given key (and its associated watermark) is treated
> separately, that would be ideal.
>
> Any advice or recommendations for a strategy that might work would be
> helpful!
>
> Thanks,
>
> Rion
