Re: PARTITION_PARALLEL Vs Regular mode

2016-12-22 Thread Ashwin Chandra Putta
+1 for option 4 that Amol suggested. However, if you definitely need the shuffle, one thing to check is whether the key-based partitioning is skewing throughput across the individual Parquet writer partitions. Regards, Ashwin. On Thu, Dec 22, 2016 at 1:22 PM, Amol Kekre

Re: PARTITION_PARALLEL Vs Regular mode

2016-12-22 Thread Amol Kekre
Arvindan, Based on what you have, it looks like a shuffle is not needed between Kafka->ParquetWriter. The decision to use parallel partition should ideally be based on the need to shuffle. If so, option [1] should not be used per se. Why even bother to shuffle if you do not need to? Assuming the ask is

Re: PARTITION_PARALLEL Vs Regular mode

2016-12-22 Thread Pramod Immaneni
Arvindan, When you had the MxN case with 100 Kafka consumers sending to 120 Parquet writers, what was the CPU utilization of the Parquet containers? Was it close to 100%, or did you have spare cycles? I am trying to determine whether the bottleneck is I/O or processing. Thanks On Thu, Dec 22, 2016 at

PARTITION_PARALLEL Vs Regular mode

2016-12-22 Thread Arvindan Thulasinathan
Hi, We have an Apex application with a DAG structure like this: KafkaConsumer -> ParquetWriter. The KafkaConsumer runs at a scale of 100 consumer containers reading from a Kafka cluster with an incoming rate of 300K msg/sec, and each message is about 1KB (Each