Kafka Connect (https://docs.confluent.io/current/connect/index.html) may
be an easier solution for this use case of simply dumping Kafka topics.
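For illustration, a minimal S3 sink connector config could look roughly
like this (a sketch only; the connector and format class names are from
the Confluent S3 sink connector, and the topic, bucket, and region
values are placeholders):

  # Sketch of a Kafka Connect S3 sink configuration (placeholder values)
  name=kafka-to-s3
  connector.class=io.confluent.connect.s3.S3SinkConnector
  tasks.max=1
  topics=events
  s3.bucket.name=my-bucket
  s3.region=us-east-1
  storage.class=io.confluent.connect.s3.storage.S3Storage
  format.class=io.confluent.connect.s3.format.json.JsonFormat
  # Roughly how many records go into each S3 object
  flush.size=1000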
On 17/06/2020 18:02, Jungtaek Lim wrote:
In case anyone prefers ASF projects, there are alternatives in the ASF
as well: alphabetically, Apache Hudi [1] and Apache Iceberg [2]. Both
recently graduated as top-level projects. (DISCLAIMER: I'm not involved
in either.)
BTW, it would be nice if we made the metadata implementation of the
file stream source/sink pluggable. From what I've seen, a plugin
approach has been chosen whenever some part becomes complicated and it
is arguable whether it should be handled inside the Spark project or
outside of it, e.g. the checkpoint file manager, the state store
provider, etc. That would give the ecosystem a chance to tackle the
challenge "without completely rewriting the file stream source and
sink", focusing on metadata scalability for long-running queries. The
alternative projects described above still provide higher-level
features and look attractive, but sometimes that may be just "using a
sledgehammer to crack a nut".
1. https://hudi.apache.org/
2. https://iceberg.apache.org/
On Thu, Jun 18, 2020 at 2:34 AM Tathagata Das
<tathagata.das1...@gmail.com> wrote:
Hello Rachana,
Getting exactly-once semantics on files and making it scale to a very
large number of files are very hard problems to solve. While
Structured Streaming + the built-in file sink provides the
exactly-once guarantee that DStreams could not, it is definitely
limited in other ways (scaling in terms of the number of files,
combining batch and streaming writes in the same place, etc.). And
solving this problem requires a holistic solution that is arguably
beyond the scope of the Spark project.
There are other projects that are trying to solve this file
management issue. For example, Delta Lake <https://delta.io/> (full
disclosure, I am involved in it) was built to solve exactly this
problem: get exactly-once and ACID guarantees on files, but also
scale to handling millions of files. Please consider it as part of
your solution.
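For instance, pointing such a query at a Delta table instead of the
plain file sink is roughly the following (a sketch, assuming the
io.delta:delta-core package is on the classpath; the broker, topic,
and S3 paths are placeholders):

import org.apache.spark.sql.SparkSession

object KafkaToDeltaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-delta").getOrCreate()

    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      // Delta maintains its own transaction log, which it checkpoints and
      // compacts, instead of an ever-growing _spark_metadata folder.
      .format("delta")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events")
      .start("s3a://my-bucket/tables/events")
      .awaitTermination()
  }
}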
On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava
<rachanasrivas...@yahoo.com.invalid> wrote:
I have written a simple Spark Structured Streaming app to move data
from Kafka to S3. I found that, in order to support the exactly-once
guarantee, Spark creates a _spark_metadata folder, which ends up
growing too large as the streaming app is SUPPOSED TO run FOREVER.
When the streaming app runs for a long time, the metadata folder
grows so big that we start getting OOM errors. The only way to
resolve the OOM is to delete the checkpoint and metadata folders and
lose VALUABLE customer data.
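The app is essentially the following (a sketch with placeholder
broker, topic, and bucket names); the file sink keeps its commit log
under the output path in _spark_metadata:

import org.apache.spark.sql.SparkSession

object KafkaToS3Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      // File sink: every committed batch is recorded under
      // s3a://my-bucket/output/_spark_metadata, which grows over time.
      .format("parquet")
      .option("path", "s3a://my-bucket/output")
      .option("checkpointLocation", "s3a://my-bucket/checkpoint")
      .start()
      .awaitTermination()
  }
}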
Open Spark JIRAs: SPARK-24295, SPARK-29995, and SPARK-30462.
Since Spark Streaming was NOT broken like this, is Spark Streaming a
BETTER choice?