Re: [Structured Streaming] Commit protocol to move temp files to dest path only when complete, with code

2018-03-20 Thread dcam
I'm just circling back to this now. Is the commit protocol an acceptable way
of making this configurable? I could make the temp path (currently
"_temporary") configurable if that is what you are referring to.
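Roughly what I have in mind, as a minimal sketch (the option key name here is hypothetical, not an existing Spark setting; the real protocol would read it from the writer's options):

```scala
// Sketch only: resolve the temp directory name from writer options, falling
// back to the currently hard-coded "_temporary". The option key is made up
// for illustration and is not a real Spark configuration key.
object TempPathConfig {
  val DefaultTempDir = "_temporary"

  def tempDir(options: Map[String, String]): String =
    options.getOrElse("streaming.commitProtocol.tempDir", DefaultTempDir)
}
```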


Michael Armbrust wrote
> We didn't go this way initially because it doesn't work on storage systems
> that have weaker guarantees than HDFS with respect to rename.  That said,
> I'm happy to look at other options if we want to make this configurable.
> 
>> After hesitating for a while, I wrote a custom commit protocol to solve
>> the problem. It combines HadoopMapReduceCommitProtocol's behavior of
>> writing to a temp file first, with ManifestFileCommitProtocol. From what
>> I can tell, ManifestFileCommitProtocol is required for the normal Structured
>> Streaming behavior of being able to resume streams from a known point.







Re: [Structured Streaming] Commit protocol to move temp files to dest path only when complete, with code

2018-02-09 Thread Michael Armbrust
We didn't go this way initially because it doesn't work on storage systems
that have weaker guarantees than HDFS with respect to rename.  That said,
I'm happy to look at other options if we want to make this configurable.
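
For reference, the pattern under discussion, sketched minimally: write the full file under a temp path, then rename it into place on commit. This is illustrative java.nio code only; a real Spark commit protocol targets Hadoop's FileSystem API, where the rename step is atomic on HDFS but is typically an expensive copy on object stores, which is exactly the weaker guarantee mentioned above. All names here are made up for the sketch.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardCopyOption}

// Illustrative "write to temp, rename on commit" sketch using java.nio.
// Real commit protocols work against Hadoop Path/FileSystem instead.
object TempFileCommit {
  def writeCommitted(dir: String, fileName: String, data: String): java.nio.file.Path = {
    val tempDir = Paths.get(dir, "_temporary")
    Files.createDirectories(tempDir)
    val tempFile = tempDir.resolve(fileName)
    // 1. Write the complete file under _temporary; readers never list this path.
    Files.write(tempFile, data.getBytes(StandardCharsets.UTF_8))
    // 2. On commit, move the finished file to its final destination.
    //    Atomic on HDFS-like filesystems; a copy + delete on most object stores.
    val dest = Paths.get(dir, fileName)
    Files.move(tempFile, dest, StandardCopyOption.ATOMIC_MOVE)
    dest
  }
}
```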



On Fri, Feb 9, 2018 at 2:53 PM, Dave Cameron wrote:

> Hi
>
>
> I have a Spark structured streaming job that reads from Kafka and writes
> parquet files to Hive/HDFS. The files are not very large, but the Kafka
> source is noisy, so each Spark job takes a long time to complete. There is a
> significant window during which the parquet files are incomplete, and other
> tools, including PrestoDB, encounter errors while trying to read them.
>
> I wrote to this list and Stack Overflow about the problem last summer:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Structured-Streaming-truncated-Parquet-after-driver-crash-or-kill-tt29043.html
> https://stackoverflow.com/questions/47337030/not-a-parquet-file-too-small-from-presto-during-spark-structured-streaming-r/47339805#47339805
>
> After hesitating for a while, I wrote a custom commit protocol to solve
> the problem. It combines HadoopMapReduceCommitProtocol's behavior of
> writing to a temp file first with ManifestFileCommitProtocol. From what
> I can tell, ManifestFileCommitProtocol is required for the normal Structured
> Streaming behavior of being able to resume streams from a known point.
>
> I think this commit protocol could be generally useful. Writing to a temp
> file and moving it to the final location is low-cost on HDFS and is the
> standard behavior for non-streaming jobs, as implemented in
> HadoopMapReduceCommitProtocol. At the same time, ManifestFileCommitProtocol
> is needed for structured streaming jobs. We have been running this for a
> few months in production without problems.
>
> Here is the code (at the moment not up to Spark standards, admittedly):
> https://github.com/davcamer/spark/commit/361f1c69851f0f94cfd974ce720c694407f9340b
>
> Did I miss a better approach? Does anyone else think this would be useful?
>
> Thanks for reading,
> Dave
>