I'm attempting to implement a Spark Streaming application that will consume
application log messages from a message broker and store them in HDFS.
During ingestion we apply a custom schema to the logs, partition by
application name and log date, and then write the data out as Parquet files.
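
Roughly, the per-batch logic looks like the sketch below (the socket source,
schema fields, parser, and HDFS path are placeholders standing in for our
real broker connector and log format):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder.appName("LogIngest").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

// Custom schema applied to each log record
val logSchema = StructType(Seq(
  StructField("app_name", StringType),
  StructField("log_date", StringType),
  StructField("message",  StringType)))

// Placeholder parser: assumes "appName|date|message" lines
def parseLogLine(line: String): Row = {
  val parts = line.split("\\|", 3)
  Row(parts(0), parts(1), parts(2))
}

// Stand-in for the DStream coming from our message broker
val logStream = ssc.socketTextStream("broker-host", 9999)

logStream.foreachRDD { rdd =>
  val df = spark.createDataFrame(rdd.map(parseLogLine), logSchema)
  df.write
    .mode("append")
    .partitionBy("app_name", "log_date")  // partition by application and date
    .parquet("hdfs:///data/app_logs")     // placeholder output path
}

ssc.start()
ssc.awaitTermination()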

All of this works great, except that we end up with a large number of small
Parquet files being created. It's my understanding that Spark Streaming is
unable to control the number of files that get generated in each partition;
can anybody confirm whether that is true?
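
The only partial workaround I can think of is coalescing each micro-batch
before the write (a sketch reusing the placeholder names above), which should
cap the files produced per batch but does nothing about the files that keep
accumulating across batches:

logStream.foreachRDD { rdd =>
  val df = spark.createDataFrame(rdd.map(parseLogLine), logSchema)
  df.coalesce(1)                          // at most one task writes, so at most
    .write                                // one new file per partition per batch
    .mode("append")
    .partitionBy("app_name", "log_date")
    .parquet("hdfs:///data/app_logs")
}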

Also, has anybody else run into a similar situation with data ingestion in
Spark Streaming, and do you have any tips to share? Our end goal is to store
the data in a way that makes it efficient to query with a tool like Hive or
Impala.

Thanks,
Kevin
