[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861869#comment-15861869 ] Egor Pahomov commented on SPARK-19524:
--
[~sowen], probably yes, I don't know. Rereading "Should process only new files and ignore existing files in the directory", I agree that setting this field to false does not actually promise processing of old files. IMHO, everything around this field is poorly documented or architected. Since spark.streaming.minRememberDuration is not documented in http://spark.apache.org/docs/2.0.2/configuration.html#spark-streaming, I do not feel comfortable changing it. Moreover, it would be strange to change it just to process old files, when the purpose of that setting is quite different. Nevertheless, I was given an API with newFilesOnly, about which I made a false but not totally unreasonable assumption, based on all accessible documentation. I was wrong, but it still feels like a trap I walked into, one that could easily not be there.

> newFilesOnly does not work according to docs.
> --
>
> Key: SPARK-19524
> URL: https://issues.apache.org/jira/browse/SPARK-19524
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.0.2
> Reporter: Egor Pahomov
>
> The docs say:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It does not work that way.
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
> says that it should not work as expected.
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
> is not clear at all about what the code tries to do

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861822#comment-15861822 ] Sean Owen commented on SPARK-19524:
--
Ah, so you _don't_ want to only read new files. The behavior of newFilesOnly=false is _not_ to read _all_ old files. The default behavior is as explained in the SO post: it reprocesses some window of recent data, about a minute or so. You can control the size of this lookback with spark.streaming.minRememberDuration, which is minRememberDurationS in the code (this is what I meant above). I think this is the same question as was answered on SO, so this should be closed.
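The lookback Sean describes can be sketched as plain arithmetic. This is a simplified, hedged model of how FileInputDStream appears to round the remember duration up to a whole number of batches; the object and method names below are invented for illustration and are not Spark API:

{code}
// Simplified model of the lookback window in FileInputDStream.
// Not Spark code: names are illustrative only.
object LookbackSketch {
  // The remember duration is rounded up to a whole number of batches.
  def numBatchesToRemember(minRememberDurationMs: Long, slideDurationMs: Long): Long =
    math.ceil(minRememberDurationMs.toDouble / slideDurationMs).toLong

  // The effective window is that batch count times the batch interval.
  def rememberWindowMs(minRememberDurationMs: Long, slideDurationMs: Long): Long =
    numBatchesToRemember(minRememberDurationMs, slideDurationMs) * slideDurationMs
}
{code}

With the default minRememberDuration of 60 seconds and a 10-second batch interval, this gives a 6-batch, 60 000 ms window; raising spark.streaming.minRememberDuration widens it accordingly.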
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861600#comment-15861600 ] Egor Pahomov commented on SPARK-19524:
--
[~sowen], the folder I connect my streaming to:
{code}
[egor@hadoop2 test]$ date
Fri Feb 10 09:51:16 PST 2017
[egor@hadoop2 test]$ ls -al
total 445746
drwxr-xr-x 13 egor egor              4096 Feb  8 14:27 .
drwxr-xr-x 43 egor egor              4096 Feb  9 01:38 ..
-rw-r--r--  1 root jobexecutors    241661 Dec  1 18:03 clog.1480636981858.fl.log.gz
-rw-r--r--  1 egor egor            387024 Feb  1 17:26 clog.1485986399693.fl.log.gz
-rw-r--r--  1 egor egor         128983477 Feb  8 12:43 clog.2017-01-03.1483431170180.9861.log.gz
-rw-r--r--  1 root jobexecutors  67422481 Dec  1 00:01 clog.new-1.1480579205495.fl.log.gz
-rw-r--r--  1 egor egor            287279 Feb  8 13:21 data2.log.gz
-rw-r--r--  1 egor egor         128983477 Feb  8 14:10 data300.log.gz
-rw-r--r--  1 egor egor         128983477 Feb  8 14:20 data365.log.gz
-rw-r--r--  1 egor egor            287279 Feb  8 13:23 data3.log.gz
-rw-r--r--  1 egor egor            287279 Feb  8 13:45 data4.log.gz
-rwxrwxr-x  1 egor egor            287279 Feb  8 14:04 data5.log.gz
-rwxrwxr-x  1 egor egor            287279 Feb  8 14:08 data6.log.gz
{code}
The way I connect:
{code}
def f(path: Path): Boolean = !path.getName.contains("tmp")

val client_log_d_stream =
  ssc.fileStream[LongWritable, Text, TextInputFormat](input_folder, f _, newFilesOnly = false)
{code}
Nothing is processed. Then I add a file to the directory and it processes that file, but not the old ones.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861319#comment-15861319 ] Sean Owen commented on SPARK-19524:
--
It's based on modification time, by the way. Are the files' mod dates after the system clock time?
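One quick way to check that is to compare each file's modification time against the system clock. This is a hedged sketch; the helper name is invented:

{code}
import java.io.File

object ModTimeCheck {
  // True if the file's modification time is ahead of the given clock reading,
  // i.e. the file looks like it comes "from the future" relative to the clock.
  def isAheadOfClock(f: File, nowMs: Long = System.currentTimeMillis()): Boolean =
    f.lastModified() > nowMs
}
{code}

Running this over each file in the directory listing above would show whether any mod date is ahead of the clock.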
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860023#comment-15860023 ] Egor Pahomov commented on SPARK-19524:
--
I'm really confused. I expected "new" to mean files created after the start of the streaming job, and "old" to be everything else in the folder. If we change the definition of "new", then I believe everything is consistent; it's just that I'm not sure this definition of "new" is very intuitive. I want to process everything in the folder, existing and upcoming, so I use this flag. And now it turns out that this flag has its own definition of "new". Maybe I'm not correct to call it a bug, but isn't it all very confusing for a person who does not know how everything works inside?
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859303#comment-15859303 ] Sean Owen commented on SPARK-19524:
--
Well, that is its definition of 'new' -- what do you expect instead? initialModTimeIgnoreThreshold controls some slack in the time it considers 'old'.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858912#comment-15858912 ] Egor Pahomov commented on SPARK-19524:
--
[~uncleGen], sorry, I don't understand.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858910#comment-15858910 ] Egor Pahomov commented on SPARK-19524:
--
[~srowen], based on the documentation, which says "newFilesOnly - Should process only new files and ignore existing files in the directory", I expected that with newFilesOnly = false, files which already exist in the folder I connect the stream to would be processed. That is not true. In reality, only files created after time X are processed. Time X:
{code}
private val durationToRemember = slideDuration * numBatchesToRemember

val modTimeIgnoreThreshold = math.max(
  initialModTimeIgnoreThreshold,                 // initial threshold based on newFilesOnly setting
  currentTime - durationToRemember.milliseconds  // trailing end of the remember window
)
{code}
First, this code contradicts the documentation as far as I understand it. Second, it contradicts the name "newFilesOnly" itself. There is probably motivation behind this code, but I think the only way to find it is git history and going over all the tickets. Sorry that I wasn't more specific the first time.
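The effect of that computation can be checked with concrete, made-up timestamps. ThresholdSketch below is an illustrative stand-in, not Spark code:

{code}
// Stand-in for the quoted Spark snippet, with invented names and numbers.
object ThresholdSketch {
  // The effective ignore threshold is the later of the newFilesOnly-derived
  // initial threshold and the trailing edge of the remember window.
  def modTimeIgnoreThreshold(initialThresholdMs: Long,
                             currentTimeMs: Long,
                             durationToRememberMs: Long): Long =
    math.max(initialThresholdMs, currentTimeMs - durationToRememberMs)

  // A file counts as "new" only if it was modified after the threshold.
  def isSelectable(fileModTimeMs: Long, thresholdMs: Long): Boolean =
    fileModTimeMs > thresholdMs
}
{code}

Even with newFilesOnly = false (an initial threshold of 0), any file modified more than durationToRemember before the current time falls below the max and is never picked up, which matches the behavior reported in this ticket.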
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858819#comment-15858819 ] Genmao Yu commented on SPARK-19524:
--
The current implementation clears the old time-to-files mappings based on {{minRememberDurationS}}.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858667#comment-15858667 ] Sean Owen commented on SPARK-19524:
--
I don't understand what you're reporting. Can you summarize the expected and actual behavior here? This isn't a useful JIRA otherwise.