+1 for option 4 that Amol suggested. However, if you definitely need the
shuffle, one thing you can check is whether key-based partitioning is
causing skew in the throughput to individual parquet writer partitions.
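One quick offline way to check for that kind of skew, given a sample of keys pulled from the topic, is to simulate the hash partitioning and compare per-partition load. This is just a sketch (not Apex code; the helper names are hypothetical), and it assumes the writers are partitioned by a hash of the key:

```python
from collections import Counter

def partition_loads(keys, num_partitions):
    """Count how many records each writer partition would receive
    under hash-based key partitioning."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

def skew_ratio(loads):
    """Ratio of the busiest partition to the average load.
    A value near 1.0 means the key space is well balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0
```

For example, a sample dominated by a single hot key will show a skew ratio far above 1, which would explain one writer partition lagging behind the others.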
Regards,
Ashwin.
On Thu, Dec 22, 2016 at 1:22 PM, Amol Kekre wrote:
Arvindan,
Based on what you have, it looks like a shuffle is not needed between
Kafka->ParquetWriter. The decision to use parallel partitioning should
ideally be based on the need to shuffle. If so, option [1] should not be
used per se. Why bother to shuffle if you do not need to?
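The trade-off here can be illustrated with a toy model (the helper below is hypothetical, not Apex API; in Apex itself, parallel partitioning is typically enabled via the PARTITION_PARALLEL port attribute). With 1:1 parallel partitioning each writer is co-located with its consumer, so no record has to leave its container, while a key-based M:N shuffle sends nearly every record to a different container:

```python
def network_hops(records, num_writers, parallel):
    """Count records that must leave their consumer's container.

    records:  list of (consumer_id, key) pairs.
    parallel: True models 1:1 parallel partitioning (writer co-located
              with its consumer); False models a key-based M:N shuffle.
    The "same id => same container" rule is a crude locality model.
    """
    hops = 0
    for consumer_id, key in records:
        if parallel:
            writer = consumer_id              # co-located, no hop
        else:
            writer = hash(key) % num_writers  # key-based shuffle
        if writer != consumer_id:
            hops += 1
    return hops
```

Note that parallel partitioning forces the writer count to match the consumer count (e.g. 100 writers for 100 consumers), which is exactly why it only makes sense when you do not need the key-based redistribution.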
Assuming the ask is
Arvindan,
When you had the MxN case with 100 Kafka consumers sending to 120 parquet
writers, what was the CPU utilization of the parquet containers? Was it
close to 100%, or did you have spare cycles? I am trying to determine
whether it is an IO bottleneck or a processing bottleneck.
Thanks
On Thu, Dec 22, 2016 at
Hi,
We have an Apex Application which has a DAG structure like this:
KafkaConsumer --> ParquetWriter
The KafkaConsumer is running at a scale where we have 100 containers
consuming from a Kafka cluster with an incoming rate of 300K msg/sec,
and each message is about 1KB (Each