I have a Spark structured streaming job that reads from Kafka and writes
parquet files to Hive/HDFS. The files are not very large, but the Kafka
source is noisy so each spark job takes a long time to complete. There is a
significant window during which the parquet files are incomplete and other
tools, including PrestoDB, encounter errors while trying to read them.
I wrote this list and stackoverflow about the problem last summer:
After hesitating for a while, I wrote a custom commit protocol to solve the
problem. It combines HadoopMapReduceCommitProtocol's behavior of writing to
a temp file first, with ManifestFileCommitProtocol. From what I can tell
ManifestFileCommitProtocol is required for the normal Structured Streaming
behavior of being able to resume streams from a known point.
I think this commit protocol could be generally useful. Writing to a temp
file and moving it to the final location is low cost on HDFS and is the
standard behavior for non-streaming jobs, as implemented in
HadoopMapReduceCommitProtocol. At the same time ManifestFileCommitProtocol
is needed for structured streaming jobs. We have been running this for a
few months in production without problems.
Here is the code (at the moment not up to Spark standards, admittedly):
Did I miss a better approach? Does anyone else think this would be useful?
Thanks for reading,