Re: Emitting results saved using state api without input (cleanup)

2017-04-11 Thread Ankur Chauhan
Thanks for confirming a hunch that I had. I was considering doing that, but the
javadoc saying "this feature is not implemented by any runner" sort of put me
off.

Is there a more up-to-date list of similar in-progress features? If not, it may
be helpful to keep one.

Thanks!
Ankur Chauhan. 

Sent from my iPhone

> On Apr 11, 2017, at 07:18, Kenneth Knowles  wrote:
> 
> Hi Ankur,
> 
> If I understand your desire, then what you need for such a use case is an 
> event time timer, to flush when you are ready. You might choose the end of 
> the window, the window GC time or, in the global window, just some later 
> watermark.
> 
> new DoFn<...>() {
> 
>   @StateId("buffer")
>   private final StateSpec bufferSpec = StateSpecs.bag(...);
> 
>   @TimerId("finallyCleanup")
>   private final TimerSpec finallySpec =
>       TimerSpecs.timer(TimeDomain.EVENT_TIME);
> 
>   @ProcessElement
>   public void process(@TimerId("finallyCleanup") Timer cleanupTimer) {
>     cleanupTimer.set(...);
>   }
> 
>   @OnTimer("finallyCleanup")
>   public void onFinallyCleanup(@StateId("buffer") BagState buffered) {
>     ...
>   }
> }
> 
> This feature hasn't been blogged about or documented thoroughly, except for a
> couple of examples in the DoFn Javadoc, but it has been available since 0.6.0.
> 
> Kenn
> 
>> On Tue, Apr 11, 2017 at 3:03 AM, Ankur Chauhan  wrote:
>> Hi,
>> 
>> I am attempting to do a seemingly simple task using the new state API. I
>> have created a DoFn<KV<String, Event>, KV<String, Event>> that accepts events
>> keyed by a particular id (session id) and intends to emit the same events
>> partitioned as sessionID/eventType. In the simple case this would be a
>> normal DoFn, but there is always a case where some events are not as clean as
>> we would like, and we need to save some state for the session and then emit
>> those events later when cleanup is complete. For example:
>> 
>> Let’s say that the first few events are missing the eventType (or any other 
>> field), so we would like to buffer those events till we get the first event 
>> with the eventType field set and then use this information to emit the 
>> contents of the buffer with (last observed eventType + original contents of 
>> the buffered events).
>> 
>> For this, my initial approach involved creating a BagState<Event> which would
>> contain any buffered events; as more events came in, I would either emit the
>> input with modification, add the input to the buffer, or emit the events in
>> the buffer with the input.
>> 
>> While running my test, I found that if I never get a “good” input, i.e. the
>> session is only filled with error inputs, I would keep on buffering the
>> input and never emit anything. My question is: how do I emit this buffered
>> data when there is no more input?
>> 
>> — Ankur Chauhan
> 


Re: How to skip processing on failure at BigQueryIO sink?

2017-04-11 Thread Lukasz Cwik
Have you thought of fetching the schema upfront from BigQuery and
prefiltering out any records in a preceding DoFn, instead of relying on
BigQuery to tell you that the schema doesn't match?

Otherwise you are correct in believing that you will need to update
BigQueryIO to have the retry/error semantics that you want.
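
A minimal sketch of that prefiltering idea, assuming the schema's field names
were fetched from BigQuery once at pipeline construction (the fetch itself is
not shown), that records arrive as TableRows, and that mismatches go to a
hypothetical dead-letter output instead of the main write:

import com.google.api.services.bigquery.model.TableRow;
import java.util.Set;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;

/** Drops rows whose fields are not all present in the table schema. */
class SchemaFilterFn extends DoFn<TableRow, TableRow> {
  private final Set<String> allowedFields;      // from the fetched schema
  private final TupleTag<TableRow> rejectedTag; // hypothetical dead-letter tag

  SchemaFilterFn(Set<String> allowedFields, TupleTag<TableRow> rejectedTag) {
    this.allowedFields = allowedFields;
    this.rejectedTag = rejectedTag;
  }

  @ProcessElement
  public void process(ProcessContext c) {
    TableRow row = c.element();
    if (allowedFields.containsAll(row.keySet())) {
      c.output(row);
    } else {
      // Route the mismatch to a dead-letter output so the BigQuery write
      // never sees a row it would reject.
      c.output(rejectedTag, row);
    }
  }
}

Wiring it in would use ParDo.of(new SchemaFilterFn(...)).withOutputTags(mainTag,
TupleTagList.of(rejectedTag)); note that c.output(tag, value) is the
multi-output spelling in recent SDKs, while older releases called it
c.sideOutput(tag, value).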

On Tue, Apr 11, 2017 at 1:12 AM, Josh  wrote:

> What I really want to do is configure BigQueryIO to log an error and skip
> the write if it receives a 4xx response from BigQuery (e.g. element does
> not match table schema). And for other errors (e.g. 5xx) I want it to retry
> n times with exponential backoff.
>
> Is there any way to do this at the moment? Will I need to make some custom
> changes to BigQueryIO?
>
>
>
> On Mon, Apr 10, 2017 at 7:11 PM, Josh  wrote:
>
>> Hi,
>>
>> I'm using BigQueryIO to write the output of an unbounded streaming job to
>> BigQuery.
>>
>> In the case that an element in the stream cannot be written to BigQuery,
>> the BigQueryIO seems to have some default retry logic which retries the
>> write a few times. However, if the write fails repeatedly, it seems to
>> cause the whole pipeline to halt.
>>
>> How can I configure Beam so that if writing an element fails a few times,
>> it simply gives up on writing that element and moves on without affecting
>> the pipeline?
>>
>> Thanks for any advice,
>> Josh
>>
>
>


Re: Emitting results saved using state api without input (cleanup)

2017-04-11 Thread Kenneth Knowles
Hi Ankur,

If I understand your desire, then what you need for such a use case is an
event time timer, to flush when you are ready. You might choose the end of
the window, the window GC time or, in the global window, just some later
watermark.

new DoFn<...>() {

  @StateId("buffer")
  private final StateSpec bufferSpec = StateSpecs.bag(...);

  @TimerId("finallyCleanup")
  private final TimerSpec finallySpec =
      TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(@TimerId("finallyCleanup") Timer cleanupTimer) {
    cleanupTimer.set(...);
  }

  @OnTimer("finallyCleanup")
  public void onFinallyCleanup(@StateId("buffer") BagState buffered) {
    ...
  }
}
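
To make that concrete, here is one hedged way to fill in the elided pieces: it
assumes String events keyed by session id, treats an empty value as the "still
dirty" case (a stand-in for Ankur's missing-eventType check), and uses the
org.apache.beam.sdk.state package names of later SDKs — package locations and
StateSpec signatures shifted between early releases:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

class BufferUntilCleanFn extends DoFn<KV<String, String>, String> {

  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec =
      StateSpecs.bag(StringUtf8Coder.of());

  @TimerId("finallyCleanup")
  private final TimerSpec finallySpec =
      TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      BoundedWindow window,
      @StateId("buffer") BagState<String> buffer,
      @TimerId("finallyCleanup") Timer cleanupTimer) {
    String event = c.element().getValue();
    if (event.isEmpty()) {
      // Stand-in for "missing eventType": hold the event back.
      buffer.add(event);
    } else {
      // A clean event arrived: flush everything buffered, then this event.
      for (String buffered : buffer.read()) {
        c.output(buffered);
      }
      buffer.clear();
      c.output(event);
    }
    // Arrange a final flush at the end of the window so sessions that never
    // see a clean event still emit their buffered elements.
    cleanupTimer.set(window.maxTimestamp());
  }

  @OnTimer("finallyCleanup")
  public void onFinallyCleanup(
      OnTimerContext c, @StateId("buffer") BagState<String> buffer) {
    for (String buffered : buffer.read()) {
      c.output(buffered);
    }
    buffer.clear();
  }
}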

This feature hasn't been blogged about or documented thoroughly, except for
a couple of examples in the DoFn Javadoc, but it has been available since 0.6.0.

Kenn

On Tue, Apr 11, 2017 at 3:03 AM, Ankur Chauhan  wrote:

> Hi,
>
> I am attempting to do a seemingly simple task using the new state API. I
> have created a DoFn<KV<String, Event>, KV<String, Event>> that accepts
> events keyed by a particular id (session id) and intends to emit the same
> events partitioned as sessionID/eventType. In the simple case this would
> be a normal DoFn, but there is always a case where some events are not as
> clean as we would like, and we need to save some state for the session and
> then emit those events later when cleanup is complete. For example:
>
> Let’s say that the first few events are missing the eventType (or any
> other field), so we would like to buffer those events till we get the first
> event with the eventType field set and then use this information to emit
> the contents of the buffer with (last observed eventType + original
> contents of the buffered events).
>
> For this, my initial approach involved creating a BagState<Event> which
> would contain any buffered events; as more events came in, I would
> either emit the input with modification, add the input to the buffer, or
> emit the events in the buffer with the input.
>
> While running my test, I found that if I never get a “good” input, i.e.
> the session is only filled with error inputs, I would keep on buffering the
> input and never emit anything. My question is: how do I emit this buffered
> data when there is no more input?
>
> — Ankur Chauhan


Emitting results saved using state api without input (cleanup)

2017-04-11 Thread Ankur Chauhan
Hi,

I am attempting to do a seemingly simple task using the new state API. I have
created a DoFn<KV<String, Event>, KV<String, Event>> that accepts events keyed
by a particular id (session id) and intends to emit the same events partitioned
as sessionID/eventType. In the simple case this would be a normal DoFn, but
there is always a case where some events are not as clean as we would like, and
we need to save some state for the session and then emit those events later
when cleanup is complete. For example:

Let’s say that the first few events are missing the eventType (or any other 
field), so we would like to buffer those events till we get the first event 
with the eventType field set and then use this information to emit the contents 
of the buffer with (last observed eventType + original contents of the buffered 
events).

For this, my initial approach involved creating a BagState<Event> which would
contain any buffered events; as more events came in, I would either emit the
input with modification, add the input to the buffer, or emit the events in
the buffer with the input.

While running my test, I found that if I never get a “good” input, i.e. the
session is only filled with error inputs, I would keep on buffering the input
and never emit anything. My question is: how do I emit this buffered data when
there is no more input?

— Ankur Chauhan

Re: How to skip processing on failure at BigQueryIO sink?

2017-04-11 Thread Josh
What I really want to do is configure BigQueryIO to log an error and skip
the write if it receives a 4xx response from BigQuery (e.g. element does
not match table schema). And for other errors (e.g. 5xx) I want it to retry
n times with exponential backoff.

Is there any way to do this at the moment? Will I need to make some custom
changes to BigQueryIO?
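
Until BigQueryIO supports those semantics directly, one hedged sketch of the
desired behavior around a hand-rolled insert call (the insert itself is elided;
HttpResponseException is the google-http-client type such errors surface as):

import com.google.api.client.http.HttpResponseException;
import java.util.concurrent.Callable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class BigQueryRetries {
  private static final Logger LOG =
      LoggerFactory.getLogger(BigQueryRetries.class);

  /** Returns the call's result, or null if the element should be skipped. */
  public static <T> T insertWithRetries(Callable<T> insert, int maxAttempts)
      throws Exception {
    long backoffMillis = 1000;
    for (int attempt = 1; ; attempt++) {
      try {
        return insert.call();
      } catch (HttpResponseException e) {
        if (e.getStatusCode() >= 400 && e.getStatusCode() < 500) {
          // 4xx, e.g. element does not match the table schema:
          // log and give up on this element without failing the pipeline.
          LOG.error("Skipping element after 4xx from BigQuery", e);
          return null;
        }
        if (attempt >= maxAttempts) {
          throw e; // 5xx and out of retries: surface the failure
        }
        Thread.sleep(backoffMillis);
        backoffMillis *= 2; // exponential backoff between retries
      }
    }
  }
}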



On Mon, Apr 10, 2017 at 7:11 PM, Josh  wrote:

> Hi,
>
> I'm using BigQueryIO to write the output of an unbounded streaming job to
> BigQuery.
>
> In the case that an element in the stream cannot be written to BigQuery,
> the BigQueryIO seems to have some default retry logic which retries the
> write a few times. However, if the write fails repeatedly, it seems to
> cause the whole pipeline to halt.
>
> How can I configure Beam so that if writing an element fails a few times,
> it simply gives up on writing that element and moves on without affecting
> the pipeline?
>
> Thanks for any advice,
> Josh
>


Using R in Data Flow

2017-04-11 Thread Anant Bhandarkar
Hi,
We are using BigQuery for our querying needs.
We are also looking to use Dataflow with some of the statistical libraries.
We are using R libraries to build these statistical models.

We are looking to run our data through statistical models such as ELM,
GAM, ARIMA, etc. We see that Python doesn't have all these libraries, which
we get as CRAN packages in R.

We have seen this example where there is a possibility to run R on Dataflow:

https://medium.com/google-cloud/cloud-dataflow-can-autoscale-r-programs-for-massively-parallel-data-processing-492b57bd732d
https://github.com/gregmcinnes/incubator-beam/blob/python-sdk/sdks/python/apache_beam/examples/complete/wordlength/wordlength_R/wordlength_R.py
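
The linked example wraps R in the Python SDK via a subprocess; a minimal Java
sketch of the same shell-out pattern is below. The script name my_model.R and
its one-line-per-result I/O contract are invented for illustration:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;

/** Invokes an R script per element; Rscript must exist on the workers. */
class RunRScriptFn extends DoFn<String, String> {
  @ProcessElement
  public void process(ProcessContext c) throws Exception {
    // Hypothetical contract: the script takes the element as an argument
    // and prints one result line per output.
    Process p = new ProcessBuilder("Rscript", "my_model.R", c.element())
        .redirectErrorStream(true)
        .start();
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        c.output(line);
      }
    }
    if (p.waitFor() != 0) {
      throw new RuntimeException("Rscript failed for: " + c.element());
    }
  }
}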

If we are able to use the parallelization provided by Dataflow along with R
libraries, this would be great for us as a team, and also for the whole data
science community that relies on R packages.

We would need some help from the Beam community to achieve this.

I see that this would be a very good use case for the whole data science
community, enabling usage of both Python and R on Beam and Dataflow.

Regards,
Anant