I am using Structured Streaming with Spark 2.1 and have some basic questions.
* Is there a way to automatically refresh the Hive partitions when using a Parquet sink with partitioning? My query looks like this:

    val queryCount = windowedCount
      .withColumn("hive_partition_persist_date", $"persist_date_window_start".cast("date"))
      .writeStream
      .format("parquet")
      .partitionBy("hive_partition_persist_date")
      .option("path", StatsDestination)
      .option("checkpointLocation", CheckPointLocationStats)
      .trigger(ProcessingTime(WindowDurationStats))
      .outputMode("append")
      .start()

  I have an external Parquet table built on top of the destination dir. The query above creates the partition dirs, but the Hive partition metadata is not refreshed, and I have to execute ALTER TABLE ... RECOVER PARTITIONS before I can query the Hive table. With legacy streaming it was possible to use spark.sql(hiveQL), where hiveQL can be any Hive statement, config setting, etc. Will this kind of functionality be available in Structured Streaming? (A rough workaround I am considering is sketched at the end of this mail.)

* The query creates an additional dir "_spark_metadata" under the destination dir, and this causes select statements against the Parquet table to fail, since only Parquet files are expected under the destination location. Is there a config to avoid the creation of this dir?

* Our use case does not need to account for late-arriving records, so I have set the watermark to 0 seconds. Is the watermark needed at all to flush out the data, is 0 seconds effectively the default, or is this setting inappropriate?

* In append mode I have to push at least 3 batches before I actually see records written to the destination, even with the watermark set to 0 seconds. I understand that the state store has to wait for the watermark before outputting records, but here the watermark is 0 seconds. Curious to know what exactly is happening behind the scenes.

* The trigger duration and the window duration are the same in our case, since we only need the count for every batch. Is a window really needed in this scenario, given that I could logically get the batch count with a plain count? When I tried to just count the batch, it said aggregation cannot be done on a streaming DataFrame in append mode.

* In our current code we use Databricks' spark-redshift library to write output to Redshift. Will this library be available in Structured Streaming? Is there a way to do this using foreach? (A rough ForeachWriter sketch is also at the end of this mail.)

* With legacy streaming we checkpoint the Kafka offsets in ZooKeeper. Is Structured Streaming's checkpointing resilient enough to handle all the failure-restart scenarios?

* When will Spark 2.2 be available for use? I see that the programming guide still says 2.1.1.

Thanks,
Revin
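
P.S. For the partition-refresh question, this is the rough workaround I am considering: register a StreamingQueryListener and re-run the RECOVER PARTITIONS statement after every completed trigger. The table name "stats_table" is just a placeholder for our external table, and I have not verified that this is a recommended approach:

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = {}

      // After each micro-batch completes, pick up any partition dirs the sink just created.
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        spark.sql("ALTER TABLE stats_table RECOVER PARTITIONS")
      }

      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
    })

One concern is that the listener fires for every query registered on the session, so with multiple streams it would run the statement more often than needed.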
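
And for the Redshift/foreach question, this is roughly what I had in mind: a ForeachWriter that pushes each output row over plain JDBC. The JDBC URL, target table, and Redshift column names are made up for illustration, and the "count" column name is my guess at what the windowed aggregation produces. I realize per-row INSERTs are far from Redshift's preferred bulk COPY path, so I am not sure this is a sensible substitute for spark-redshift:

    import java.sql.{Connection, DriverManager, PreparedStatement}
    import org.apache.spark.sql.{ForeachWriter, Row}

    class RedshiftWriter(jdbcUrl: String, table: String) extends ForeachWriter[Row] {
      private var conn: Connection = _
      private var stmt: PreparedStatement = _

      // Called once per partition per trigger; return true to process the partition.
      override def open(partitionId: Long, version: Long): Boolean = {
        conn = DriverManager.getConnection(jdbcUrl)
        stmt = conn.prepareStatement(s"INSERT INTO $table (persist_date, cnt) VALUES (?, ?)")
        true
      }

      // Called for every output row in the partition.
      override def process(row: Row): Unit = {
        stmt.setDate(1, row.getAs[java.sql.Date]("hive_partition_persist_date"))
        stmt.setLong(2, row.getAs[Long]("count"))
        stmt.executeUpdate()
      }

      override def close(errorOrNull: Throwable): Unit = {
        if (conn != null) conn.close()
      }
    }

    // Usage:
    // windowedCount
    //   .writeStream
    //   .foreach(new RedshiftWriter(jdbcUrl, "stats"))
    //   .outputMode("append")
    //   .start()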