Re: Strange errors running on DataFlow

2017-08-03 Thread Ben Chambers
These errors are often seen at the end of a pipeline -- they indicate that, due to the failure, the backend has been torn down and the attempts to report the current status have failed. If you look in the "Stack Traces" tab in the UI [1] or earlier in the Stackdriver logs, you should (hopefully) be

Strange errors running on DataFlow

2017-08-03 Thread Randal Moore
I have a batch pipeline that runs well with small inputs but fails with a larger dataset. Looking at Stackdriver I find a fair number of the following: Request failed with code 400, will NOT retry:

Re: Ordering in PCollection

2017-08-03 Thread Eric Fang
Thanks. On Thu, Aug 3, 2017 at 2:11 PM Lukasz Cwik wrote: > But the windows can still be processed out of order. > > On Thu, Aug 3, 2017 at 2:10 PM, Lukasz Cwik wrote: > >> Yes. >> >> On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang >>

Re: PubSubIO withTimestampAttribute - what are the implications?

2017-08-03 Thread Lukasz Cwik
To my knowledge, autoscaling is dependent on how many messages are backlogged within Pubsub and independent of the second subscription (the second subscription is really to compute a better watermark). On Thu, Aug 3, 2017 at 1:34 PM, wrote: > Thanks Lukasz that's good to know!

Re: Ordering in PCollection

2017-08-03 Thread Lukasz Cwik
But the windows can still be processed out of order. On Thu, Aug 3, 2017 at 2:10 PM, Lukasz Cwik wrote: > Yes. > > On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang wrote: > >> Thanks Lukasz. If the stream is infinite, I am assuming you mean by >> window the

Re: Ordering in PCollection

2017-08-03 Thread Lukasz Cwik
Yes. On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang wrote: > Thanks Lukasz. If the stream is infinite, I am assuming you mean by window > the stream into panes and sort the bundle for each trigger to an order? > > Eric > > On Thu, Aug 3, 2017 at 12:49 PM Lukasz Cwik

Re: Ordering in PCollection

2017-08-03 Thread Eric Fang
Thanks Lukasz. If the stream is infinite, I am assuming you mean by window the stream into panes and sort the bundle for each trigger to an order? Eric On Thu, Aug 3, 2017 at 12:49 PM Lukasz Cwik wrote: > There is currently no strict ordering which is supported within Apache >

Re: PubSubIO withTimestampAttribute - what are the implications?

2017-08-03 Thread jofo90
Thanks Lukasz that's good to know! It sounds like we can halve our PubSub costs then! Just to clarify, if I were to remove withTimestampAttribute from a job, this would cause the watermark to always be up to date (processing time) even if the job starts to lag behind (build up of

Re: Ordering in PCollection

2017-08-03 Thread Lukasz Cwik
There is currently no strict ordering which is supported within Apache Beam (timestamp or not) and any ordering which may be occurring is just a side effect and not guaranteed in any way. Since the smallest unit of work is a bundle containing 1 element, the only way to get ordering is to make one
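
The suggestion above (buffer a window's elements, then sort before emitting) can be sketched in plain Java. This is illustrative only, not Beam API: inside a stateful DoFn you would buffer elements in state and run the sort step below when the window fires. The `Event` type and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WindowSorter {
    // A timestamped element, as it would arrive in a window's buffer.
    static final class Event {
        final long timestampMillis;
        final String payload;
        Event(long timestampMillis, String payload) {
            this.timestampMillis = timestampMillis;
            this.payload = payload;
        }
    }

    // The core step: sort the buffered elements of one window by
    // event timestamp before emitting them downstream.
    static List<Event> sortByEventTime(List<Event> buffered) {
        List<Event> sorted = new ArrayList<>(buffered);
        sorted.sort(Comparator.comparingLong((Event e) -> e.timestampMillis));
        return sorted;
    }
}
```

The trade-off is that ordering is then only guaranteed within each window, not across windows (which, as noted later in this thread, can still be processed out of order).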

Ordering in PCollection

2017-08-03 Thread Eric Fang
Hi all, We have a stream of data that's ordered by a timestamp and our use case requires us to process the data in order with respect to the previous element. For example, we have a stream of true/false ingested from PubSub and we want to make sure for each key, a true is always followed by a false.

PubSubIO withTimestampAttribute - what are the implications?

2017-08-03 Thread Josh
Hi all, We've been running a few streaming Beam jobs on Dataflow, where each job is consuming from PubSub via PubSubIO. Each job does something like this: PubsubIO.readMessagesWithAttributes() .withIdAttribute("unique_id") .withTimestampAttribute("timestamp"); My
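
For context, a read like the snippet above typically sits at the head of the pipeline roughly as follows. This is a sketch, not Josh's actual job: the class name and subscription path are placeholders, and it assumes the Beam Java SDK with the Google Cloud Platform IO module on the classpath. Only the attribute names come from the snippet itself.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PubsubReadSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply("ReadFromPubsub",
        PubsubIO.readMessagesWithAttributes()
            // Placeholder subscription path.
            .fromSubscription("projects/my-project/subscriptions/my-sub")
            .withIdAttribute("unique_id")         // attribute used to deduplicate messages
            .withTimestampAttribute("timestamp")  // attribute used as the event timestamp
    );
    pipeline.run();
  }
}
```

With `withTimestampAttribute`, the watermark is derived from that attribute's values rather than from Pubsub publish times, which is what the follow-up discussion about the second subscription is about.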

Re: Spark job hangs up at Evaluating ParMultiDo(ParseInput)

2017-08-03 Thread Sathish Jayaraman
Hi, Thanks for trying it out. I was running the job in a local single-node setup. I also spawned an HDInsight cluster on the Azure platform just to test the WordCount program. It's the same result there too, stuck at the Evaluating ParMultiDo step. It runs fine in mvn compile exec, but when bundled

Re: Spark job hangs up at Evaluating ParMultiDo(ParseInput)

2017-08-03 Thread Sathish Jayaraman
Hi, Was anyone able to run a Beam application on Spark at all? I tried all possible options and still no luck. No executors get assigned for the job submitted by the command below, even though explicitly specified: $ ~/spark/bin/spark-submit --class org.apache.beam.examples.WordCount --master
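
For reference, a Beam job bundled as a shaded jar is usually submitted to Spark along these lines. This is a sketch only: the jar name, master URL, and paths are placeholders (the command in the message above is truncated, so this is not a reconstruction of it). The key detail is that `--runner=SparkRunner` must be passed after the jar, as a Beam pipeline option, not as a spark-submit flag.

```shell
# Sketch: submit a shaded Beam jar to a standalone Spark master.
# Jar name, master URL, and input/output paths are placeholders.
~/spark/bin/spark-submit \
  --class org.apache.beam.examples.WordCount \
  --master spark://spark-master-host:7077 \
  target/word-count-beam-bundled-0.1.jar \
  --runner=SparkRunner \
  --inputFile=hdfs:///tmp/input.txt \
  --output=hdfs:///tmp/counts
```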

Re: Un-parallelized TextIO to HDFS on Beam+Flink+YARN

2017-08-03 Thread Aljoscha Krettek
Hi, I think this might not be a problem. The reason we have this DataSink(DiscardingOutputFormat) at the "end" of Flink Batch pipelines is that Flink Batch will not execute a chain of operations when they're not terminated by a sink. In Beam, it's fine to just have a DoFn and no sink after