No, I don't think that part is a bug, since newFilesOnly=false removes a constraint that otherwise exists, and that is what you are seeing.
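To make that concrete, here is a minimal sketch of the threshold logic under discussion. It is not the actual Spark source: the object name, the helper names, and the millisecond parameters are hypothetical, and it assumes that the initial threshold is the stream's start time when newFilesOnly is true and 0 when it is false (the latter is what the original message describes).

```scala
// Simplified sketch of the threshold logic discussed in this thread.
// NOT the Spark implementation; all names here are hypothetical.
object ModTimeThresholdSketch {

  // Assumption: newFilesOnly = true pins the initial threshold to the
  // stream's start time; newFilesOnly = false sets it to 0.
  def initialThreshold(newFilesOnly: Boolean, startTimeMs: Long): Long =
    if (newFilesOnly) startTimeMs else 0L

  // Files with a modification time below the returned value are ignored.
  def modTimeIgnoreThreshold(
      newFilesOnly: Boolean,
      startTimeMs: Long,
      currentTimeMs: Long,
      rememberMs: Long): Long =
    math.max(
      initialThreshold(newFilesOnly, startTimeMs), // constraint from newFilesOnly
      currentTimeMs - rememberMs                   // trailing end of the remember window
    )
}
```

With a start time of 1000000 ms and a 60000 ms remember window, the first batch gets a threshold of 1000000 when newFilesOnly is true and 940000 when it is false. Setting the flag to false therefore only drops the start-time constraint; the remember window still applies, which matches the behaviour reported.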
However, see the closely related issue https://issues.apache.org/jira/browse/SPARK-6061. @tdas, there is an open question for you there.

On Sat, Mar 14, 2015 at 8:18 PM, Justin Pihony <justin.pih...@gmail.com> wrote:
> All,
> Looking into this StackOverflow question
> <https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469>
> it appears that there is a bug when utilizing the newFilesOnly parameter in
> FileInputDStream. Before creating a ticket, I wanted to verify it here. The
> gist is that this code is wrong:
>
>     val modTimeIgnoreThreshold = math.max(
>       initialModTimeIgnoreThreshold,   // initial threshold based on newFilesOnly setting
>       currentTime - durationToRemember.milliseconds  // trailing end of the remember window
>     )
>
> The problem is that if you set newFilesOnly to false, then the
> initialModTimeIgnoreThreshold is always 0. This makes it always dropped out
> of the max operation. So, the best you get is files that were put in the
> directory (duration) from the start.
>
> Is this a bug or expected behavior; it seems like a bug to me.
>
> If I am correct, this appears to be a bigger fix than just using min as it
> would break other functionality.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Streaming-files-tp22051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org