I am using Structured Streaming with Spark 2.1 and have some basic questions.


*         Is there a way to automatically refresh the Hive partitions when 
using a partitioned Parquet sink? My query looks like this:


import spark.implicits._                                // for the $"..." column syntax
import org.apache.spark.sql.streaming.ProcessingTime

// Windowed counts written out as date-partitioned Parquet.
val queryCount = windowedCount
  .withColumn("hive_partition_persist_date", $"persist_date_window_start".cast("date"))
  .writeStream
  .format("parquet")
  .partitionBy("hive_partition_persist_date")
  .option("path", StatsDestination)
  .option("checkpointLocation", CheckPointLocationStats)
  .trigger(ProcessingTime(WindowDurationStats))
  .outputMode("append")
  .start()


I have an external Parquet table built on top of the destination dir. The above 
query creates the partition dirs, but the Hive partition metadata is not 
refreshed, so I have to execute ALTER TABLE ... RECOVER PARTITIONS before 
querying the Hive table. With legacy streaming it was possible to call 
spark.sql(hiveQL), where hiveQL could be any Hive statement, config setting, 
etc. Will this kind of functionality be available in Structured Streaming? A 
rough sketch of what I have in mind follows this paragraph.
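
For context, this is the kind of thing I have in mind, assuming a 
StreamingQueryListener is an acceptable place to run such a statement (a 
hypothetical sketch; stats_table is a made-up name for our external Parquet 
table):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical: re-run the partition-recovery DDL after every micro-batch.
class PartitionRefreshListener(spark: SparkSession) extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    spark.sql("ALTER TABLE stats_table RECOVER PARTITIONS")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Registered once, before starting the query.
spark.streams.addListener(new PartitionRefreshListener(spark))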


*         The query creates an additional "_spark_metadata" dir under the 
destination dir, and this causes SELECT statements against the Parquet table to 
fail, since Hive expects only Parquet files under that location. Is there a 
config to avoid the creation of this dir?




*         Our use case does not need to account for late-arriving records, so I 
have set the watermark to 0 seconds, as in the sketch below. Is the watermark 
needed to flush out the data, is 0 seconds effectively the default, or is this 
setting inappropriate?
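
This is roughly where the watermark is applied (a simplified sketch; eventDF 
and persist_date are placeholders for our actual input stream and event-time 
column):

// Zero-second watermark on the event-time column, i.e. no allowance for
// late-arriving records. eventDF and persist_date are placeholders.
val watermarked = eventDF.withWatermark("persist_date", "0 seconds")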


*         In the "Append" mode, I have to push at least 3 batches to actually 
see the records written to the Destination, even with the watermark = 0 seconds 
setting. I understand that the statestore has to wait for watermark to output 
records but here watermark is 0 seconds. Curious to know what exactly is 
happening behind the scenes...



*         The "Trigger" duration and "Window" duration in our case are the same 
as we need to just get the count for every batch. Is a "Window" really needed 
in this scenario as I can logically get the batch count by just using count? I 
tried to just get the count from the batch and it said, aggregation cannot be 
done on streaming dataframe in the Append Mode.
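
To make that concrete, this is roughly the contrast I ran into (a simplified 
sketch; eventDF, persist_date and WindowDurationStats are placeholders for our 
actual stream, event-time column and window/trigger duration):

import spark.implicits._
import org.apache.spark.sql.functions.window

// A plain aggregation such as eventDF.groupBy().count() is rejected with an
// AnalysisException in append mode, as described above.

// The windowed count is what we ended up with instead:
val windowedCount = eventDF
  .withWatermark("persist_date", "0 seconds")
  .groupBy(window($"persist_date", WindowDurationStats))
  .count()
  .withColumn("persist_date_window_start", $"window.start")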


*         In our current code, we use Databricks' spark-redshift library to 
write output to Redshift. Will this library be usable from Structured 
Streaming? Is there a way to do this with a ForeachWriter, along the lines of 
the sketch below?
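
If the library is not supported, this is the kind of workaround I am 
considering (a hypothetical sketch; the JDBC URL, credentials, table and INSERT 
statement are made up, it assumes the Redshift JDBC driver is on the classpath, 
and it writes row by row rather than through the usual S3 COPY path):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical sink: write each output row to Redshift over JDBC.
class RedshiftForeachWriter extends ForeachWriter[Row] {
  var conn: Connection = _
  var stmt: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(
      "jdbc:redshift://example-cluster:5439/dev", "user", "password")
    stmt = conn.prepareStatement(
      "INSERT INTO stats_counts (window_start, cnt) VALUES (?, ?)")
    true
  }

  override def process(row: Row): Unit = {
    stmt.setTimestamp(1, row.getAs[java.sql.Timestamp]("persist_date_window_start"))
    stmt.setLong(2, row.getAs[Long]("count"))
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}

// Used in place of the Parquet sink:
// windowedCount.writeStream.foreach(new RedshiftForeachWriter).outputMode("append").start()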


*         With legacy streaming, we checkpoint the Kafka offsets in ZooKeeper. 
Is Structured Streaming's checkpointing resilient enough to handle all the 
failure-and-restart scenarios? (The Kafka side of our setup is sketched below.)
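
For reference, the Kafka side would look roughly like this (a simplified 
sketch; the brokers and topic are placeholders, and it assumes the 
spark-sql-kafka-0-10 package is on the classpath), with offsets tracked via the 
sink's checkpointLocation rather than ZooKeeper:

// Placeholder brokers/topic; on restart the query resumes from the offsets
// recorded under the checkpointLocation set on the sink (as in the query above).
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribe", "stats_topic")
  .load()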



*         When will Spark 2.2 be available for use? I see that the programming 
guide still says 2.1.1.


Thanks,
Revin
