Re: Emitting results saved using state api without input (cleanup)
Thanks for confirming a hunch that I had. I was considering doing that, but the Javadoc saying "this feature is not implemented by any runner" sort of put me off. Is there a more up-to-date list of similar in-progress features? If not, it may be helpful to keep one.

Thanks!
Ankur Chauhan

Sent from my iPhone

> On Apr 11, 2017, at 07:18, Kenneth Knowles wrote:
>
> Hi Ankur,
>
> If I understand your desire, then what you need for such a use case is an
> event time timer, to flush when you are ready. You might choose the end of
> the window, the window GC time or, in the global window, just some later
> watermark.
>
> new DoFn<...>() {
>
>   @StateId("buffer")
>   private final StateSpec
Re: How to skip processing on failure at BigQueryIO sink?
Have you thought of fetching the schema upfront from BigQuery and prefiltering out any records in a preceding DoFn, instead of relying on BigQuery telling you that the schema doesn't match? Otherwise you are correct in believing that you will need to update BigQueryIO to have the retry/error semantics that you want.

On Tue, Apr 11, 2017 at 1:12 AM, Josh wrote:
> What I really want to do is configure BigQueryIO to log an error and skip
> the write if it receives a 4xx response from BigQuery (e.g. element does
> not match table schema). And for other errors (e.g. 5xx) I want it to retry
> n times with exponential backoff.
>
> Is there any way to do this at the moment? Will I need to make some custom
> changes to BigQueryIO?
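For concreteness, here is a minimal, library-free Python sketch of that prefiltering idea: fetch the table schema once (via the BigQuery API in a real pipeline) and split records into matching and non-matching sets in a DoFn ahead of BigQueryIO. The field names, the `type_map`, and the `prefilter` helper below are hypothetical illustrations, not part of the BigQuery client API.

```python
# Sketch only: `schema` stands in for a schema fetched from BigQuery,
# represented here as {field_name: BigQuery type name}.

def matches_schema(record, schema):
    """Return True if every field in `record` is declared in `schema`
    with a compatible Python type."""
    type_map = {"STRING": str, "INTEGER": int, "FLOAT": float, "BOOLEAN": bool}
    for name, value in record.items():
        if name not in schema:
            return False          # undeclared field would 4xx at insert time
        expected = type_map.get(schema[name])
        if expected is not None and not isinstance(value, expected):
            return False          # wrong type would 4xx at insert time
    return True

def prefilter(records, schema):
    """Split records into (good, bad) so the bad ones can be logged or
    sent to a dead-letter output instead of stalling the write."""
    good, bad = [], []
    for r in records:
        (good if matches_schema(r, schema) else bad).append(r)
    return good, bad
```

In a real pipeline the `bad` side would become a separate output (side output / dead-letter table) rather than a list.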
Re: Emitting results saved using state api without input (cleanup)
Hi Ankur,

If I understand your desire, then what you need for such a use case is an event time timer, to flush when you are ready. You might choose the end of the window, the window GC time or, in the global window, just some later watermark.

new DoFn<...>() {

  @StateId("buffer")
  private final StateSpec
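The Java snippet above is truncated in the archive, but the pattern it describes can be sketched without any Beam dependency. The classes below are hypothetical stand-ins for Beam's BagState and event-time Timer, not the real API; they only illustrate the control flow: buffer elements as they arrive, and rely on an on-timer callback to drain the buffer even when no further input shows up.

```python
class BagState:
    """Stand-in for Beam's BagState: an append-only bag per key."""
    def __init__(self):
        self._items = []
    def add(self, item):
        self._items.append(item)
    def read(self):
        return list(self._items)
    def clear(self):
        self._items.clear()

class BufferingFn:
    """Buffers elements; on_timer drains whatever is left at flush time."""
    def __init__(self, flush_at):
        self.buffer = BagState()
        self.flush_at = flush_at  # e.g. end of window, or window GC time

    def process(self, element, emit):
        # In real Beam, this would also set an event-time timer for
        # self.flush_at via a @TimerId parameter.
        self.buffer.add(element)

    def on_timer(self, emit):
        # Fires when the watermark passes flush_at, even with no new input,
        # which is exactly what guarantees the buffer is eventually emitted.
        for item in self.buffer.read():
            emit(item)
        self.buffer.clear()
```

The key point is that `on_timer` is driven by the watermark, not by incoming elements, so a session consisting only of "error" inputs still gets flushed.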
Emitting results saved using state api without input (cleanup)
Hi,

I am attempting to do a seemingly simple task using the new state API. I have created a DoFn over KV pairs that accepts events keyed by a particular id (session id) and intends to emit the same events partitioned as sessionID/eventType. In the simple case this would be a normal DoFn, but there is always a case where some events are not as clean as we would like, and we need to save some state for the session and then emit those events later when cleanup is complete.

For example: let's say that the first few events are missing the eventType (or any other field). We would like to buffer those events until we get the first event with the eventType field set, and then use this information to emit the contents of the buffer with (last observed eventType + original contents of the buffered events).

For this, my initial approach involved creating a BagState which would contain any buffered events. As more events came in, I would either emit the input with modification, add the input to the buffer, or emit the events in the buffer along with the input.

While running my test, I found that if I never get a "good" input, i.e. the session is only filled with error inputs, I would keep on buffering the input and never emit anything. My question is: how do I emit these buffered events when there is no more input?

-- Ankur Chauhan
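The buffering logic described above can be sketched in a few lines of plain Python (a stand-in for the real DoFn/BagState code; the `eventType` field name comes from the mail, everything else is hypothetical):

```python
def route_event(event, buffer, emit):
    """Route one event: `event` is a dict, `buffer` is a list standing in
    for BagState, `emit` is the output callback."""
    etype = event.get("eventType")
    if etype is None:
        buffer.append(event)      # not clean yet: hold it back
        return
    for buffered in buffer:       # backfill buffered events with the
        emit({**buffered, "eventType": etype})  # last observed eventType
    buffer.clear()
    emit(event)
```

As the mail observes, this alone never emits if no typed event ever arrives, which is why an event-time timer (as in the reply above) is needed to flush the remainder.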
Re: How to skip processing on failure at BigQueryIO sink?
What I really want to do is configure BigQueryIO to log an error and skip the write if it receives a 4xx response from BigQuery (e.g. element does not match table schema). And for other errors (e.g. 5xx) I want it to retry n times with exponential backoff.

Is there any way to do this at the moment? Will I need to make some custom changes to BigQueryIO?

On Mon, Apr 10, 2017 at 7:11 PM, Josh wrote:
> Hi,
>
> I'm using BigQueryIO to write the output of an unbounded streaming job to
> BigQuery.
>
> In the case that an element in the stream cannot be written to BigQuery,
> the BigQueryIO seems to have some default retry logic which retries the
> write a few times. However, if the write fails repeatedly, it seems to
> cause the whole pipeline to halt.
>
> How can I configure Beam so that if writing an element fails a few times,
> it simply gives up on writing that element and moves on without affecting
> the pipeline?
>
> Thanks for any advice,
> Josh
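As a sketch of the desired semantics outside of BigQueryIO (the `RequestError` type and the `write` callable are hypothetical stand-ins, not BigQueryIO internals): skip immediately on a 4xx, retry with exponential backoff on everything else.

```python
import time

class RequestError(Exception):
    """Stand-in for an HTTP error from the BigQuery service."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def write_with_retries(write, element, max_retries=3, base_delay=1.0,
                       sleep=time.sleep):
    """Attempt `write(element)`; skip on 4xx, retry with exponential
    backoff on other errors, re-raising once retries are exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return write(element)
        except RequestError as e:
            if 400 <= e.status < 500:
                return None       # client error: log and skip, do not stall
            if attempt == max_retries:
                raise             # exhausted retries on a server-side error
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injected only so the backoff can be tested without waiting; a real pipeline would use the default.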
Using R in Data Flow
Hi,

We are using BigQuery for our querying needs, and we are also looking to use Dataflow with some of the statistical libraries. We use R libraries to build these statistical models, and we are looking to run our data through models such as ELM, GAM, ARIMA, etc. We see that Python doesn't have all these libraries, which we get as CRAN packages in R.

We have seen this example where there is a possibility to run R on Dataflow:

https://medium.com/google-cloud/cloud-dataflow-can-autoscale-r-programs-for-massively-parallel-data-processing-492b57bd732d
https://github.com/gregmcinnes/incubator-beam/blob/python-sdk/sdks/python/apache_beam/examples/complete/wordlength/wordlength_R/wordlength_R.py

If we are able to use the parallelization provided by Dataflow along with R libraries, this would be great for us as a team, and also for the whole data science community which relies on R packages. We would need some help from the Beam community to achieve this. I see that it would be a very good use case for the whole data science community, enabling usage of both Python and R on Beam and Dataflow.

Regards,
Anant
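One way this is commonly done, and roughly what the linked wordlength_R example does, is to shell out to Rscript from a Python DoFn and parse its stdout. A minimal sketch, assuming a hypothetical R script (`model.R`) that reads an argument and prints a result:

```python
import subprocess

def run_script(cmd_and_args):
    """Run an external command and return its stdout, raising on failure.
    This is the generic piece a DoFn's process() would call per element."""
    result = subprocess.run(
        cmd_and_args, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

def score_with_r(element, rscript="model.R"):
    # Hypothetical: pass the element to an R script that prints a score.
    # Dataflow workers would need R installed (e.g. via a custom container).
    return run_script(["Rscript", rscript, str(element)])
```

The parallelism then comes from Beam fanning `score_with_r` out across workers; R itself stays single-threaded per element.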