Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-18 Thread Jacek Laskowski
Hi Rachana,

> Should I go backward and use Spark Streaming DStream based.

No. Never. It's no longer supported (and should really be removed from the
codebase once and for all - dreaming...).

Spark focuses on Spark SQL and Spark Structured Streaming as user-facing
modules for batch and streaming queries, respectively.

Please note that I'm not a PMC member or even a committer so I'm speaking
for myself only (not representing the project in an official way).

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski






Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Structured Streaming vs. Spark Streaming (DStream)?
Which is recommended for system stability? Exactly-once is NOT the first priority. The first priority is a STABLE system.
I need to make a decision soon, and I need help. Here is the question again: should I go back to the DStream-based Spark Streaming, write our own checkpointing, and go from there? At least we never encountered these metadata issues there.
Thanks,
Rachana

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Frankly speaking, I do not care about EXACTLY ONCE... I am OK with AT LEAST ONCE as long as the system does not fail every 5 to 7 days with no recovery option.
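
(To illustrate the at-least-once trade-off mentioned above, here is a minimal sketch, assuming Spark 2.4+, of writing each micro-batch with the plain batch writer via foreachBatch, which does not maintain the file sink's _spark_metadata log. The paths are placeholders, flatDf stands for the flattened events Dataset from the application discussed in this thread, and this is not code from the original application.)

   import org.apache.spark.api.java.function.VoidFunction2;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.streaming.StreamingQuery;

   // Sketch (Spark 2.4+, placeholder paths): foreachBatch writes each micro-batch with
   // the ordinary batch writer, so no _spark_metadata log is maintained for the output.
   // Offsets are still checkpointed, and a retried batch can produce duplicates
   // (at-least-once rather than exactly-once).
   VoidFunction2<Dataset<Row>, Long> writeBatch = (batch, batchId) ->
       batch.write().mode("append").parquet("s3a://my-bucket/events/");   // placeholder output path

   StreamingQuery query = flatDf.writeStream()
       .foreachBatch(writeBatch)
       .option("checkpointLocation", "s3a://my-bucket/chk/")              // placeholder checkpoint path
       .start();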



Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Thanks so much, TD. Thanks for forwarding your data lake project, but at this time we have budget constraints and can only use an open source project.
I just want the Structured Streaming application or the Spark Streaming DStream application to run without an issue for a long time. I do not want the size of the metadata to grow so large that we start getting these OOM issues. The only way to resolve these OOM issues is by deleting the checkpoint and metadata folders, and that means losing customer data. We have a 60-second batch where we coalesce so that we write only one file per partition, so we do not have a small-files issue either.
What do you suggest? How should we resolve this issue?
We have a very simple program (five steps) that reads data from Kafka and outputs the data to S3.

1. Read records from the Kafka topic:

   Dataset<Row> inputDf = spark
       .readStream()
       .format("kafka")
       .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
       .option("subscribe", "topic1")
       .option("startingOffsets", "earliest")
       .load();

2. Use Spark's from_json API to extract the data for further transformation in a Dataset:

   Dataset<Row> dataDf = inputDf
       .select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
       .withColumn("oem_id", col("metadata.oem_id"));

3. Register the above Dataset as a temp table (createOrReplaceTempView is a Dataset method, so the deprecated SQLContext is not needed):

   dataDf.createOrReplaceTempView("event");

4. Flatten the events, since Parquet does not support hierarchical data.

5. Store the output in Parquet format on S3:

   StreamingQuery query = flatDf.writeStream().format("parquet")
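
(For completeness, a hedged sketch of how the truncated step 5 is typically finished; the output path, checkpoint location, and trigger interval below are assumptions, not values from the original application.)

   import org.apache.spark.sql.streaming.StreamingQuery;
   import org.apache.spark.sql.streaming.Trigger;

   // Sketch only: the paths and the trigger interval are placeholders.
   StreamingQuery query = flatDf.writeStream()
       .format("parquet")
       .option("path", "s3a://my-bucket/events/")              // assumed output location on S3
       .option("checkpointLocation", "s3a://my-bucket/chk/")   // assumed checkpoint location
       .trigger(Trigger.ProcessingTime("60 seconds"))          // matches the 60-second batch above
       .start();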


  

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Breno Arosa
Kafka Connect (https://docs.confluent.io/current/connect/index.html) may 
be an easier solution for this use case of just dumping Kafka topics.







Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Jungtaek Lim
In case anyone prefers ASF projects, there are alternative projects in the ASF as well: alphabetically, Apache Hudi [1] and Apache Iceberg [2]. Both recently graduated as top-level projects. (DISCLAIMER: I'm not involved in either.)

BTW, it would be nice if we made the metadata implementation of the file stream source/sink pluggable. From what I've seen, a plugin approach has been the way to go whenever some part becomes complicated and it is arguable whether it should be handled inside the Spark project or outside (e.g. the checkpoint manager, the state store provider, etc.). It would give the ecosystem a chance to tackle the challenge without completely rewriting the file stream source and sink, focusing on metadata scalability for long-running queries. The alternative projects described above still provide higher-level features and look attractive, but sometimes that may just be "using a sledgehammer to crack a nut".

1. https://hudi.apache.org/
2. https://iceberg.apache.org/
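
(To make the pluggability point above concrete, here is a minimal sketch of how Spark's existing streaming plugin points are selected through configuration. The com.example class names are hypothetical placeholders; spark.sql.streaming.stateStore.providerClass is a documented setting, while spark.sql.streaming.checkpointFileManagerClass is an internal one, so treat this as an assumption-laden illustration rather than a recipe.)

   import org.apache.spark.sql.SparkSession;

   // Sketch: wiring Spark's existing streaming plugin points to custom classes.
   // The com.example implementations below are hypothetical placeholders.
   SparkSession spark = SparkSession.builder()
       .appName("pluggable-streaming-internals")
       // The state store provider is a pluggable extension point.
       .config("spark.sql.streaming.stateStore.providerClass",
               "com.example.MyStateStoreProvider")
       // The checkpoint file manager can also be swapped via an (internal) config.
       .config("spark.sql.streaming.checkpointFileManagerClass",
               "com.example.MyCheckpointFileManager")
       .getOrCreate();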




Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Tathagata Das
Hello Rachana,

Getting exactly-once semantics on files and making it scale to a very large
number of files are very hard problems to solve. While Structured Streaming
+ the built-in file sink provides the exactly-once guarantee that DStreams could
not, it is definitely limited in other ways (scaling in terms of files,
combining batch and streaming writes in the same place, etc.). And solving
this problem requires a holistic solution that is arguably beyond the scope
of the Spark project.

There are other projects that are trying to solve this file management
issue. For example, Delta Lake (full disclosure, I am involved in it) was
built exactly to solve this problem: to get exactly-once and ACID guarantees
on files, but also to scale to handling millions of files.
Please consider it as part of your solution.
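
(As a concrete illustration of the suggestion above, a minimal sketch of streaming from Kafka into a Delta table. It assumes an existing SparkSession named spark and the Delta Lake package (e.g. the delta-core artifact) on the classpath; the bootstrap servers, topic, and S3 paths are placeholders, not a tested setup.)

   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.streaming.StreamingQuery;

   // Sketch only: placeholder servers, topic, and paths; requires the Delta Lake jar.
   Dataset<Row> events = spark.readStream()
       .format("kafka")
       .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
       .option("subscribe", "topic1")
       .load();

   StreamingQuery query = events.writeStream()
       .format("delta")                                        // Delta sink instead of the plain file sink
       .option("checkpointLocation", "s3a://my-bucket/chk/")   // placeholder checkpoint path
       .start("s3a://my-bucket/delta/events/");                // placeholder table path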






Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
I have written a simple Spark Structured Streaming app to move data from Kafka 
to S3. I found that, in order to support the exactly-once guarantee, Spark creates a 
_spark_metadata folder, which ends up growing too large as the streaming app is 
SUPPOSED TO run FOREVER. When the streaming app runs for a long time, the 
metadata folder grows so big that we start getting OOM errors. The only way to 
resolve the OOM is to delete the checkpoint and metadata folders and lose VALUABLE 
customer data.

Open Spark JIRAs: SPARK-24295, SPARK-29995, and SPARK-30462.
Since Spark Streaming was NOT broken like this, is Spark Streaming a BETTER 
choice?
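
(For context on where the metadata growth can be tuned, a minimal sketch of the file sink log settings. These are internal configurations, the values shown are illustrative assumptions rather than recommendations, and, as the JIRAs above discuss, compaction alone does not stop the compacted log itself from growing.)

   import org.apache.spark.sql.SparkSession;

   // Sketch: knobs that affect the _spark_metadata (file sink) log; values are illustrative.
   SparkSession spark = SparkSession.builder()
       .appName("kafka-to-s3")
       // Allow expired sink log entries to be deleted (default true).
       .config("spark.sql.streaming.fileSink.log.deletion", "true")
       // Compact the sink log every N batches (default 10).
       .config("spark.sql.streaming.fileSink.log.compactInterval", "10")
       // How long to keep obsolete log files around after compaction.
       .config("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
       .getOrCreate();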