Hi,

I would like to create a "zero" value for a Structured Streaming DataFrame
and, unfortunately, I couldn't find any leads. With batch Spark, I can use
"emptyDataFrame" or "createDataFrame" with an "emptyRDD", but with
Structured Streaming, I am lost.

If I use "emptyDataFrame" as the zero value, I wouldn't be able to join it
with any other DataFrames in the program, because Spark doesn't let you mix
batch and streaming DataFrames (isStreaming is false for the batch ones).
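To make the mismatch concrete, here is a minimal sketch (the session setup and the "rate" source are illustrative assumptions, not from my actual program) showing the isStreaming flag on each side:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("zero-stream").getOrCreate()

// A batch "zero" value: isStreaming is false.
val batchEmpty = spark.emptyDataFrame
println(batchEmpty.isStreaming)  // false

// A streaming DataFrame from any source (here the built-in "rate" source): isStreaming is true.
val streamDF = spark.readStream.format("rate").load()
println(streamDF.isStreaming)  // true

// Combining the two is rejected at analysis time, e.g.:
// streamDF.union(batchEmpty)
// => org.apache.spark.sql.AnalysisException:
//    Union between streaming and batch DataFrames/Datasets is not supported
```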

Any clue is greatly appreciated. Here are the alternatives that I have at
the moment.

*1. Reading from an empty file*
*Disadvantages: the poll is expensive because it involves IO, and it's
error prone in the sense that someone might accidentally update the file.*

val emptyErrorStream = (spark: SparkSession) => {
  spark
    .readStream
    .format("csv")
    .schema(DataErrorSchema)
    .load("/Users/arunma/IdeaProjects/OSS/SparkDatalakeKitchenSink/src/test/resources/dummy1.txt")
    .as[DataError]
}

*2. Use MemoryStream*

*Disadvantages: MemoryStream itself is not recommended for production use
because it can be mutated, but I am converting it to a Dataset immediately,
so I am leaning towards this at the moment.*


val emptyErrorStream = (spark: SparkSession) => {
  implicit val sqlC = spark.sqlContext
  import spark.implicits._  // provides the Encoder[DataError] that MemoryStream needs
  MemoryStream[DataError].toDS()
}

Cheers,
Arun
