I'm attempting to implement a Spark Streaming application that consumes application log messages from a message broker and stores them in HDFS. During ingestion we apply a custom schema to the logs, partition by application name and log date, and save the result as Parquet files.
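To give a concrete picture, here is a minimal sketch of the per-batch flow (not our actual code: I've stood in a socket stream for the broker source, and the schema fields, batch interval, and HDFS path are placeholders; assume Spark 2.x with the DStream API):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LogIngest {

  // Placeholder schema; the real custom schema has more fields.
  val logSchema = StructType(Seq(
    StructField("appName", StringType),
    StructField("logDate", StringType),
    StructField("level",   StringType),
    StructField("message", StringType)
  ))

  // Placeholder parser: tab-delimited line -> Row matching logSchema.
  def parseLine(line: String): Row = {
    val f = line.split("\t", -1)
    Row(f(0), f(1), f(2), f(3))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("log-ingest").getOrCreate()
    val ssc   = new StreamingContext(spark.sparkContext, Seconds(30)) // assumed batch interval

    // Stand-in for the real message-broker receiver.
    val lines = ssc.socketTextStream("broker-host", 9999)

    lines.foreachRDD { rdd =>
      val df = spark.createDataFrame(rdd.map(parseLine), logSchema)

      // Every micro-batch appends its own part files under each
      // appName/logDate partition it touches -- this is where the
      // large number of files comes from.
      df.write
        .mode("append")
        .partitionBy("appName", "logDate")
        .parquet("hdfs:///data/app_logs") // placeholder path
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```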
All of this works well except that we end up with a very large number of Parquet files. My understanding is that Spark Streaming cannot control the number of files generated in each partition; can anybody confirm whether that is true? Also, has anybody else run into a similar situation ingesting data with Spark Streaming, and do you have any tips to share? Our end goal is to store the data in a way that is efficient to query with a tool like Hive or Impala. Thanks, Kevin
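P.S. To make "control the number of files" concrete: the closest lever I can see is coalescing each micro-batch before the write, as in the hypothetical drop-in below for the write step in the sketch above. Even with it, every batch interval still appends at least one new file to each partition it touches, so the files keep piling up over time.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical per-batch writer: coalescing caps the number of write tasks,
// and therefore the number of part files this batch can add to any one
// appName/logDate directory -- but it does nothing about the files that
// previous batches already wrote there.
def writeBatch(df: DataFrame, maxTasks: Int = 4): Unit = {
  df.coalesce(maxTasks)
    .write
    .mode("append")
    .partitionBy("appName", "logDate")
    .parquet("hdfs:///data/app_logs") // placeholder path, as above
}
```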