Re: Partition unbounded collection like Kafka source

Chamikara Jayalath Mon, 27 Jul 2020 10:21:07 -0700

Probably you should apply the Partition[1] transform on the output
PCollection of your read. Note though that the exact parallelization is
runner dependent (for example, runner might autoscale up resulting in more
writers).
Did you run into issues when just reading from Kafka and writing to Cassadra
(without manually controlling the parallelization) ?


Thanks,
Cham

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Partition.java

On Thu, Jul 23, 2020 at 5:01 AM wang Wu <[email protected]> wrote:

> Hi,
> Supposed that a Kafka topic has 3 partitions only. Now we want to
> partition it into 20 partition, each one will produce an output collection.
> The purpose is to write to the sink in parallel from all 20 output
> collections.
>
> Will this code achieve that purpose?
>
>  KafkaIO.Read<byte[], Long> reader =
>       KafkaIO.<byte[], Long>read()
>           .withConsumerFactoryFn(
>               new ConsumerFactoryFn(
>                   topic, 10, numElements, OffsetResetStrategy.EARLIEST))
> // 10 partitions
>
>   PCollection<Long> input =
> p.apply(reader.withoutMetadata()).apply(KafkaToCassandraRow).apply(CassadraIO.write);
>
> Regards
> Dinh
>
>

Re: Partition unbounded collection like Kafka source

Reply via email to