Hi, I would like to create a "zero" value for a Structured Streaming DataFrame and unfortunately, I couldn't find any leads. With Spark batch, I can use "emptyDataFrame" or "createDataFrame" with an "emptyRDD", but with Structured Streaming, I am lost.
If I use "emptyDataFrame" as the zero value, I can't join it with any other DataFrames in the program because Spark doesn't allow you to mix batch and streaming DataFrames (isStreaming=false for the batch ones). Any clue is greatly appreciated.

Here are the alternatives that I have at the moment:

*1. Reading from an empty file*

*Disadvantages: polling is expensive because it involves IO, and it's error prone in the sense that someone might accidentally update the file.*

val emptyErrorStream = (spark: SparkSession) => {
  spark
    .readStream
    .format("csv")
    .schema(DataErrorSchema)
    .load("/Users/arunma/IdeaProjects/OSS/SparkDatalakeKitchenSink/src/test/resources/dummy1.txt")
    .as[DataError]
}

*2. Use MemoryStream*

*Disadvantages: MemoryStream itself is not recommended for production use because it can be mutated, but I am converting it to a Dataset immediately, so I am leaning towards this at the moment.*

val emptyErrorStream = (spark: SparkSession) => {
  implicit val sqlC = spark.sqlContext
  MemoryStream[DataError].toDS()
}

Cheers,
Arun
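[Editor's note] To illustrate the second alternative end to end, here is a minimal, self-contained sketch of the MemoryStream approach used as a fold identity. The `DataError` case class and the `unionAll` helper are hypothetical stand-ins (the original post's `DataError` definition is not shown); the sketch assumes Spark 2.x or later, where `MemoryStream` lives in `org.apache.spark.sql.execution.streaming` and needs an implicit `Encoder` and `SQLContext` in scope.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

// Hypothetical record type standing in for the post's DataError.
case class DataError(id: String, message: String)

object EmptyStreamZero {
  // An empty streaming Dataset: a MemoryStream that is never fed any data.
  // Its isStreaming flag is true, so it can be combined with other streams.
  def emptyErrorStream(spark: SparkSession): Dataset[DataError] = {
    import spark.implicits._            // Encoder[DataError]
    implicit val sqlCtx = spark.sqlContext
    MemoryStream[DataError].toDS()
  }

  // The "zero" value lets an arbitrary collection of streaming Datasets
  // be reduced with union, even when the collection is empty.
  def unionAll(spark: SparkSession,
               streams: Seq[Dataset[DataError]]): Dataset[DataError] =
    streams.foldLeft(emptyErrorStream(spark))(_ union _)
}
```

Note that `union` requires both sides to have the same schema and the same `isStreaming` value, which is exactly why a batch `emptyDataFrame` cannot serve as the zero here.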