No, I don't think that much is a bug: newFilesOnly=false removes
a constraint that otherwise exists, and that's what you're seeing.
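
To make that concrete, here is a rough standalone sketch of the threshold
computation quoted below. It is not the Spark source: the object and method
names are made up for illustration, and I'm assuming that with
newFilesOnly=true the initial threshold is the stream's start time (the
thread only states that it is 0 when newFilesOnly=false).

```scala
// Standalone sketch, not the actual FileInputDStream code.
// ThresholdSketch, initialThreshold and startTimeMs are hypothetical names.
object ThresholdSketch {
  // Assumption: newFilesOnly = true pins the initial threshold to the start
  // time; newFilesOnly = false sets it to 0, as described in the thread.
  def initialThreshold(newFilesOnly: Boolean, startTimeMs: Long): Long =
    if (newFilesOnly) startTimeMs else 0L

  // Same shape as the quoted snippet: files modified before this instant
  // are ignored.
  def modTimeIgnoreThreshold(initial: Long, currentTimeMs: Long, rememberMs: Long): Long =
    math.max(initial, currentTimeMs - rememberMs)

  def main(args: Array[String]): Unit = {
    val start = 1000000L       // stream start, in ms
    val rememberMs = 60000L    // remember window of 60 s
    val now = start + 10000L   // 10 s after start

    // newFilesOnly = true: the start time dominates the max.
    println(modTimeIgnoreThreshold(initialThreshold(true, start), now, rememberMs))
    // newFilesOnly = false: the 0 drops out, leaving only the remember window.
    println(modTimeIgnoreThreshold(initialThreshold(false, start), now, rememberMs))
  }
}
```

With newFilesOnly=false the only remaining bound is the trailing end of the
remember window, which is the removed constraint I mean above.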

However, do read the closely related issue:
https://issues.apache.org/jira/browse/SPARK-6061

@tdas, there's an open question for you there.

On Sat, Mar 14, 2015 at 8:18 PM, Justin Pihony <justin.pih...@gmail.com> wrote:
> All,
>     Looking into this StackOverflow question
> <https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469>,
> it appears that there is a bug when utilizing the newFilesOnly parameter in
> FileInputDStream. Before creating a ticket, I wanted to verify it here. The
> gist is that this code is wrong:
>
> val modTimeIgnoreThreshold = math.max(
>   initialModTimeIgnoreThreshold,                 // initial threshold based on newFilesOnly setting
>   currentTime - durationToRemember.milliseconds  // trailing end of the remember window
> )
>
> The problem is that if you set newFilesOnly to false, then the
> initialModTimeIgnoreThreshold is always 0, so it always drops out of the
> max operation. The best you get, then, is files that were put in the
> directory within (duration) of the start.
>
> Is this a bug or expected behavior? It seems like a bug to me.
>
> If I am correct, the fix appears to be bigger than just switching to min,
> since that would break other functionality.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Streaming-files-tp22051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
