[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-10 Thread Egor Pahomov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861869#comment-15861869 ]

Egor Pahomov commented on SPARK-19524:
--

[~sowen], probably yes. I don't know. If you really think about "Should process 
only new files and ignore existing files in the directory", then I agree that 
setting this field to false does not mean old files get processed. IMHO, 
everything around this field seems poorly documented or architected. Since 
there is no documentation for spark.streaming.minRememberDuration in 
http://spark.apache.org/docs/2.0.2/configuration.html#spark-streaming, I do not 
feel very comfortable changing it. More than that, it would feel strange to 
change it just to process old files, when the purpose of that setting is quite 
different. Nevertheless, I was given an API with newFilesOnly, about which I 
made a false, but not totally unreasonable, assumption based on all the 
accessible documentation. I was wrong, but it still feels like a trap that I 
walked into, and one that could easily not be there. 

> newFilesOnly does not work according to docs. 
> --
>
> Key: SPARK-19524
> URL: https://issues.apache.org/jira/browse/SPARK-19524
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> Docs say:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It's not working. 
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
>  says that it shouldn't work as expected. 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
>  is not clear at all about what the code tries to do






[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-10 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861822#comment-15861822 ]

Sean Owen commented on SPARK-19524:
---

Ah, so you _don't_ want to only read new files. The behavior of 
newFilesOnly=false is _not_ to read _all_ old files. The default behavior 
otherwise is as explained in the SO post: it reprocesses some window of recent 
data, about a minute or so. You can control the size of this lookback with 
spark.streaming.minRememberDuration, which is minRememberDurationS in the code 
(this is what I meant above).

I think this is just the same question as was answered on SO, so this should 
be closed.
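
For illustration, a minimal sketch of widening that lookback so files already sitting in the directory fall inside the remember window on the first batch. The property name is the one mentioned in this thread; the path, filter, and duration values are placeholders, and the exact default and precedence may differ between versions, so treat this as an illustration rather than the definitive fix:

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("file-stream-old-files")
  // Widen the "remember" window well past the modification times of the
  // existing files so they are not dropped by the mod-time threshold.
  .set("spark.streaming.minRememberDuration", "86400s") // e.g. one day

val ssc = new StreamingContext(conf, Seconds(30))

val stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/path/to/input_folder",                        // hypothetical directory
  (path: Path) => !path.getName.contains("tmp"),  // same filter as below
  newFilesOnly = false)
{code}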







[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-10 Thread Egor Pahomov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861600#comment-15861600 ]

Egor Pahomov commented on SPARK-19524:
--

[~sowen], 
The folder I connect my streaming to:
{code}
[egor@hadoop2 test]$ date
Fri Feb 10 09:51:16 PST 2017
[egor@hadoop2 test]$ ls -al
total 445746
drwxr-xr-x 13 egor egor              4096 Feb  8 14:27 .
drwxr-xr-x 43 egor egor              4096 Feb  9 01:38 ..
-rw-r--r--  1 root jobexecutors    241661 Dec  1 18:03 clog.1480636981858.fl.log.gz
-rw-r--r--  1 egor egor            387024 Feb  1 17:26 clog.1485986399693.fl.log.gz
-rw-r--r--  1 egor egor         128983477 Feb  8 12:43 clog.2017-01-03.1483431170180.9861.log.gz
-rw-r--r--  1 root jobexecutors  67422481 Dec  1 00:01 clog.new-1.1480579205495.fl.log.gz
-rw-r--r--  1 egor egor            287279 Feb  8 13:21 data2.log.gz
-rw-r--r--  1 egor egor         128983477 Feb  8 14:10 data300.log.gz
-rw-r--r--  1 egor egor         128983477 Feb  8 14:20 data365.log.gz
-rw-r--r--  1 egor egor            287279 Feb  8 13:23 data3.log.gz
-rw-r--r--  1 egor egor            287279 Feb  8 13:45 data4.log.gz
-rwxrwxr-x  1 egor egor            287279 Feb  8 14:04 data5.log.gz
-rwxrwxr-x  1 egor egor            287279 Feb  8 14:08 data6.log.gz
{code}
The way I connect:
{code}
def f(path: Path): Boolean = {
  !path.getName.contains("tmp")
}

val client_log_d_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](input_folder, f _, newFilesOnly = false)
{code}

Nothing is processed. Then I add a file to the directory and it processes that 
file, but not the old ones.







[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-10 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861319#comment-15861319 ]

Sean Owen commented on SPARK-19524:
---

It's based on modification time, by the way. Are the files' modification dates 
after the system clock time?
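
A quick way to check that, sketched with the standard Hadoop FileSystem API (the directory path is a placeholder; this only compares mod times against the driver's clock, which is roughly what the file stream's threshold is based on):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List each file's modification time and whether it is ahead of the local clock.
val fs = FileSystem.get(new Configuration())
val now = System.currentTimeMillis()
fs.listStatus(new Path("/path/to/input_folder")).foreach { status =>
  println(s"${status.getPath.getName} modTime=${status.getModificationTime} " +
    s"aheadOfClock=${status.getModificationTime > now}")
}
{code}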







[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-09 Thread Egor Pahomov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860023#comment-15860023 ]

Egor Pahomov commented on SPARK-19524:
--

I'm really confused. I expected "new" to mean files created after the start of 
the streaming job, and "old" to mean everything else already in the folder. If 
we change the definition of "new", then I believe everything is consistent; 
it's just that I'm not sure this definition of "new" is very intuitive. I want 
to process everything in the folder, existing and upcoming, so I use this flag, 
and now it turns out that the flag has its own definition of "new". Maybe I'm 
not correct to call it a bug, but isn't it all very confusing for a person who 
does not really know how everything works inside? 







[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-09 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859303#comment-15859303 ]

Sean Owen commented on SPARK-19524:
---

Well, that is its definition of 'new' -- what do you expect instead? 
initialModTimeIgnoreThreshold controls some slack in the time it considers 
'old'.
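
For context, the initial threshold is derived from newFilesOnly roughly as sketched below. This is a paraphrase of the FileInputDStream logic, not the actual source, and the exact code may differ between versions:

{code}
// Paraphrase of how FileInputDStream picks its initial mod-time threshold:
// with newFilesOnly = true, anything modified before stream start-up is ignored;
// with newFilesOnly = false, the initial threshold is effectively 0 and only the
// "remember window" (minRememberDurationS) limits how far back it will look.
def initialModTimeIgnoreThreshold(newFilesOnly: Boolean, startTimeMs: Long): Long =
  if (newFilesOnly) startTimeMs else 0L
{code}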







[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-08 Thread Egor Pahomov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858912#comment-15858912 ]

Egor Pahomov commented on SPARK-19524:
--

[~uncleGen], sorry, I didn't understand that. 







[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-08 Thread Egor Pahomov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858910#comment-15858910 ]

Egor Pahomov commented on SPARK-19524:
--

[~srowen], based on the documentation, which says "newFilesOnly - Should 
process only new files and ignore existing files in the directory", I expected 
that files which already exist in the folder I connect the stream to would be 
processed. That's not true. In reality, only files created after time X are 
processed, where X is:

{code}
private val durationToRemember = slideDuration * numBatchesToRemember

val modTimeIgnoreThreshold = math.max(
  initialModTimeIgnoreThreshold,                 // initial threshold based on newFilesOnly setting
  currentTime - durationToRemember.milliseconds  // trailing end of the remember window
)
{code}

First, this code contradicts the documentation as far as I understand it. 
Second, it contradicts the name "newFilesOnly" itself. There is probably a 
motivation behind this code, but I think the only way to find it is to go 
through the git history and all the related tickets. 

Sorry that I wasn't more specific the first time. 
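
To make the effect concrete, a small illustration with assumed numbers: a 30-second batch interval and the 60-second default remember duration mentioned elsewhere in this thread. The derivation of numBatchesToRemember is a simplification of the internal logic, so treat the values as illustrative only:

{code}
// Illustration only, with assumed values: batch interval 30 s, remember duration 60 s.
val slideDurationMs       = 30 * 1000L
val minRememberDurationMs = 60 * 1000L
val numBatchesToRemember  = math.ceil(minRememberDurationMs.toDouble / slideDurationMs).toInt // = 2
val durationToRememberMs  = slideDurationMs * numBatchesToRemember                            // = 60000 ms

val currentTime = System.currentTimeMillis()
val initialModTimeIgnoreThreshold = 0L // newFilesOnly = false

// Anything modified more than ~60 s before the current batch falls below the
// threshold and is ignored, which is why pre-existing files never get picked up.
val modTimeIgnoreThreshold =
  math.max(initialModTimeIgnoreThreshold, currentTime - durationToRememberMs)
{code}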








[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-08 Thread Genmao Yu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858819#comment-15858819 ]

Genmao Yu commented on SPARK-19524:
---

The current implementation clears old time-to-files mappings based on 
{{minRememberDurationS}}.
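
Roughly how that duration is resolved from configuration, sketched as a standalone helper. This is a paraphrase; the newer spark.streaming.fileStream.minRememberDuration key is assumed to take precedence over the older one named in this thread, with a 60-second default, and the exact precedence may differ between versions:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Duration, Seconds}

// Resolve the remember window from configuration (assumed property precedence).
def minRememberDuration(conf: SparkConf): Duration =
  Seconds(conf.getTimeAsSeconds("spark.streaming.fileStream.minRememberDuration",
    conf.get("spark.streaming.minRememberDuration", "60s")))
{code}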







[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-08 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858667#comment-15858667 ]

Sean Owen commented on SPARK-19524:
---

I don't understand what you're reporting. Can you summarize the expected and 
actual behavior here? This isn't a useful JIRA otherwise.



