In case anyone prefers ASF projects, there are alternative projects in the
ASF as well: alphabetically, Apache Hudi [1] and Apache Iceberg [2]. Both
recently graduated to top-level projects. (DISCLAIMER: I'm not involved in
either.)

BTW it would be nice if we made the metadata implementation of the file
stream source/sink pluggable. From what I've seen, the plugin approach has
been the way to go whenever some part gets complicated and it becomes
arguable whether that part should live inside the Spark project or outside
of it, e.g. the checkpoint manager, the state store provider, etc. It would
open up chances for the ecosystem to tackle the challenge "without
completely rewriting the file stream source and sink", focusing on metadata
scalability for long-running queries. The alternative projects described
above still provide higher-level features and look attractive, but
sometimes that may just be "using a sledgehammer to crack a nut".
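
As a rough illustration of the pluggability idea (not an existing Spark
API), a metadata log SPI for the file stream source/sink could be shaped
something like the sketch below; the trait name and methods are
hypothetical:

// Hypothetical SPI, just to illustrate the idea -- none of these names
// exist in Spark today.
trait FileStreamMetadataLog[T] {
  // Atomically record metadata for a batch; return false if the batch was
  // already committed, which is what retries rely on for exactly-once.
  def add(batchId: Long, metadata: T): Boolean

  // Look up metadata for a single committed batch.
  def get(batchId: Long): Option[T]

  // Fetch committed batches in a range, e.g. when the sink is read back
  // or when old entries are compacted.
  def get(startId: Option[Long], endId: Option[Long]): Array[(Long, T)]

  // Drop entries older than the given batch so the log doesn't grow
  // unbounded over the lifetime of a long-running query.
  def purge(thresholdBatchId: Long): Unit
}

An implementation could then keep the log wherever it scales for them (an
external DB, a compacted file format, etc.), selected via a config key -
say a hypothetical spark.sql.streaming.fileMetadataLog.providerClass -
similar to how spark.sql.streaming.stateStore.providerClass already works
for the state store provider.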

1. https://hudi.apache.org/
2. https://iceberg.apache.org/


On Thu, Jun 18, 2020 at 2:34 AM Tathagata Das <tathagata.das1...@gmail.com>
wrote:

> Hello Rachana,
>
> Getting exactly-once semantics on files and making it scale to a very
> large number of files are very hard problems to solve. While Structured
> Streaming + built-in file sink solves the exactly-once guarantee that
> DStreams could not, it is definitely limited in other ways (scaling in
> terms of files, combining batch and streaming writes in the same place,
> etc). And solving this problem requires a holistic solution that is
> arguably beyond the scope of the Spark project.
>
> There are other projects that are trying to solve this file management
> issue. For example, Delta Lake <https://delta.io/> (full disclosure, I am
> involved in it) was built to exactly solve this problem - get exactly-once
> and ACID guarantees on files, but also scale to handling millions of files.
> Please consider it as part of your solution.
>
>
>
>
> On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava
> <rachanasrivas...@yahoo.com.invalid> wrote:
>
>> I have written a simple Spark Structured Streaming app to move data from
>> Kafka to S3. Found that in order to support the exactly-once guarantee
>> Spark creates a _spark_metadata folder, which ends up growing too large
>> as the streaming app is SUPPOSED TO run FOREVER. But when the streaming
>> app runs for a long time the metadata folder grows so big that we start
>> getting OOM errors. The only way to resolve the OOM is to delete the
>> Checkpoint and Metadata folders and lose VALUABLE customer data.
>>
>> Open Spark JIRAs: SPARK-24295, SPARK-29995, and SPARK-30462.
>> Since Spark Streaming was NOT broken like this, is Spark Streaming a
>> BETTER choice?
>>
>
