Tathagata Das created SPARK-23966:
-------------------------------------

             Summary: Refactoring all checkpoint file writing logic in a common 
interface
                 Key: SPARK-23966
                 URL: https://issues.apache.org/jira/browse/SPARK-23966
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: Tathagata Das
            Assignee: Tathagata Das


Checkpoint files (offset log files, state store files) in Structured Streaming 
must be written atomically such that no partial files are generated (would 
break fault-tolerance guarantees). Currently, there are 3 locations which try 
to do this individually, and in some cases, incorrectly.
 # HDFSOffsetMetadataLog - This uses a FileManager interface to use any 
implementation of `FileSystem` or `FileContext` APIs. It preferably loads 
`FileContext` implementation as FileContext of HDFS has atomic renames.
 # HDFSBackedStateStore (aka in-memory state store)
 ## Writing a version.delta file - This uses FileSystem APIs only to perform a 
rename. This is incorrect as rename is not atomic in HDFS FileSystem 
implementation.
 ## Writing a snapshot file - Same as above.

Current problems:
 # State Store behavior is incorrect - 
 # Inflexible - Some file systems provide mechanisms other than 
write-to-temp-file-and-rename for writing atomically and more efficiently. For 
example, with S3 you can write directly to the final file and it will be made 
visible only when the entire file is written and closed correctly. Any failure 
can be made to terminate the writing without making any partial files visible 
in S3. The current code does not abstract out this mechanism enough that it can 
be customized. 

Solution:
 # Introduce a common interface that all 3 cases above can use to write 
checkpoint files atomically. 
 # This interface must provide the necessary interfaces that allow 
customization of the write-and-rename mechanism.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to