I guess it goes through those 500k files <https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193> for the first time and then uses a filter from the next time onward.
Thanks
Best Regards

On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das <t...@databricks.com> wrote:
> For the first time it needs to list them. After that the list should be
> cached by the file stream implementation (as far as I remember).
>
> On Thu, Jul 30, 2015 at 3:55 PM, Brandon White <bwwintheho...@gmail.com>
> wrote:
>
>> Is this a known bottleneck for Spark Streaming textFileStream? Does it
>> need to list all the current files in a directory before it gets the new
>> files? Say I have 500k files in a directory, does it list them all in
>> order to get the new files?
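As a rough sketch of the mechanism discussed above (list the directory, then keep only files newer than the last batch time), the filtering step could look something like the snippet below. This is an illustration using plain `java.io.File`, not Spark's actual `FileInputDStream` code; the object and method names are made up for the example:

```scala
import java.io.File
import java.nio.file.Files

object NewFileFilterSketch {
  // Hypothetical helper: select only files modified after the previous
  // batch time, mimicking the mod-time filter a file stream would apply
  // on each batch so already-processed files are skipped.
  def newFilesSince(dir: File, lastBatchTimeMs: Long): Seq[File] =
    Option(dir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.lastModified() > lastBatchTimeMs)
      .toSeq

  def main(args: Array[String]): Unit = {
    // Set up a temp directory with one "old" and one "fresh" file.
    val dir = Files.createTempDirectory("stream-src").toFile
    val old = new File(dir, "old.txt")
    old.createNewFile()
    old.setLastModified(1000L) // pretend it predates the last batch
    val fresh = new File(dir, "fresh.txt")
    fresh.createNewFile() // modification time is "now"

    // Only the fresh file should be picked up for the next batch.
    val picked = newFilesSince(dir, lastBatchTimeMs = 2000L)
    println(picked.map(_.getName).mkString(","))
  }
}
```

Even with a filter like this, the directory still has to be listed on every batch, which is why a very large directory (500k files) can be slow the first time and remains sensitive to listing cost afterward.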