Hi!
I am testing the new `watchForNewFiles` feature but am bumping into some
problems.
I know it is not released yet, but I think the problem is more related to
streaming pipelines in general than to the source itself.
What I want to do is reprocess old data and then continuously process newly
arriving data after that.
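Concretely, my read looks roughly like this (the bucket path and poll
interval are placeholders):
```
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Read the existing files and keep watching the glob for new ones.
PCollection<String> lines = pipeline.apply(
    TextIO.read()
        .from("gs://my-bucket/data/*")        // placeholder path
        .watchForNewFiles(
            Duration.standardMinutes(1),      // poll for new files every minute
            Watch.Growth.never()));           // keep watching indefinitely
```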
The problem I run into is that when I try to timestamp my data points, I get
the following error:
```
Sep 26, 2017 6:32:33 PM
org.apache.beam.runners.dataflow.util.MonitoringUtil$LoggingHandler process
SEVERE: 2017-09-26T16:32:31.304Z: (788c8fce71b8d3da):
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException:
java.lang.IllegalArgumentException: Cannot output with timestamp
2017-06-03T00:00:00.000Z. Output timestamps must be no earlier than the
timestamp of the current input (2017-09-26T16:32:10.204Z) minus the allowed
skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for
details on changing the allowed skew.
com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:104)
com.google.cloud.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:55)
com.google.cloud.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:37)
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:117)
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:74)
com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:133)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:48)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:187)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:148)
com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:68)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1049)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:135)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$8.run(StreamingDataflowWorker.java:823)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
```
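For reference, my timestamping step looks roughly like this (simplified;
`MyRecord` and `getEventTimeMillis()` are placeholder names):
```
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Instant;

static class AssignEventTimestamps extends DoFn<MyRecord, MyRecord> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    MyRecord record = c.element();
    // Re-emit the element with its historical event time. That time lies
    // far behind the current input timestamp, so this is the call that
    // triggers the IllegalArgumentException above.
    c.outputWithTimestamp(record, new Instant(record.getEventTimeMillis()));
  }
}
```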
I have encountered this error before when streaming (from PubSub) and do not
really understand why it happens. Is it because the window for that time
should already have been closed?
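From the Javadoc mentioned in the error it sounds like I could raise the
allowed skew by overriding `getAllowedTimestampSkew()` on my DoFn, something
like this (untested sketch):
```
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Duration;

static class AssignEventTimestamps extends DoFn<MyRecord, MyRecord> {
  // Allow output timestamps arbitrarily far behind the input timestamp,
  // so that the historical event times are accepted.
  @Override
  public Duration getAllowedTimestampSkew() {
    return Duration.millis(Long.MAX_VALUE);
  }

  @ProcessElement
  public void processElement(ProcessContext c) { /* as above */ }
}
```
But that feels more like a workaround than a proper fix.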
Is there any way I can timestamp my old data accordingly, use windowing
etc., and then let the stream continue to process new data as it comes in,
e.g. without resorting to separate batch jobs?
Thanks!
Vilhelm von Ehrenheim