Hi,

I have a setup where data arrives in Kafka and is stored to HDFS from there (maybe using Camus or Flume). I want to write a Spark Streaming app where
- first, all files in that HDFS directory are processed,
- then, the stream from Kafka is processed, starting with the first item that is not yet in HDFS.
The order of the data matters, so I really must *first* do the HDFS processing (which might take a while, by the way) and only *then* start stream processing.
Does anyone have suggestions on how to implement this? Should I write a custom receiver or a custom input stream, or can I just use the built-in mechanisms? I would be happy to hear any ideas.

Thanks,
Tobias
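For what it's worth, here is roughly what I have in mind, as a sketch only. It assumes Spark 1.3+'s direct Kafka API (KafkaUtils.createDirectStream with a fromOffsets map), and all paths, topic names, and the process/lastHdfsOffset helpers are hypothetical placeholders -- in particular, how to learn the last offset that made it to HDFS depends on the ingest tool (Camus records offsets, for example):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder

object BatchThenStream {
  // Hypothetical per-record processing function.
  def process(line: String): Unit = ???

  // Hypothetical: last Kafka offset (per partition) already landed in HDFS,
  // e.g. read from Camus's offset files.
  def lastHdfsOffset(partition: Int): Long = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("batch-then-stream")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sc = ssc.sparkContext

    // 1. Batch phase: process everything already in HDFS. This is a normal
    //    (blocking) Spark job, which is what I want -- order matters.
    sc.textFile("hdfs:///data/from-kafka/*").foreach(process)

    // 2. Start the stream exactly after the batch data, one offset per
    //    topic/partition (a single partition 0 shown here).
    val fromOffsets = Map(
      TopicAndPartition("mytopic", 0) -> (lastHdfsOffset(0) + 1))

    // 3. Streaming phase: direct stream beginning at those offsets.
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, String](
      ssc,
      Map("metadata.broker.list" -> "broker:9092"),
      fromOffsets,
      (m: MessageAndMetadata[String, String]) => m.message())
    stream.foreachRDD(rdd => rdd.foreach(process))

    ssc.start()
    ssc.awaitTermination()
  }
}
```

But I am not sure whether starting the StreamingContext only after a long batch job is considered a supported pattern, or whether a custom input stream would be cleaner.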