Re: BiqQueryIO.write and Wait.on

2018-07-25 Thread Carlos Alonso
rn a PCollection of a type that's invariant to >> which load method is used (streaming writes, load jobs, multiple load jobs >> etc.). If it's unclear what type that should be, you could introduce an >> empty type e.g. "class BigQueryWriteResult {}" just for the sa

Re: BiqQueryIO.write and Wait.on

2018-07-17 Thread Carlos Alonso
Kirpichov wrote: > Hi Carlos, > > Any updates / roadblocks you hit? > > > On Tue, Jul 3, 2018 at 7:13 AM Eugene Kirpichov > wrote: > >> Awesome!! Thanks for the heads up, very exciting, this is going to make a >> lot of people happy :) >> >> O

Re: BiqQueryIO.write and Wait.on

2018-07-03 Thread Carlos Alonso
BatchLoads codepath, > because it's typically used in batch-mode pipelines that don't get updated. > I would recommend to start with this, perhaps even with only the > untriggered codepath (it is much more commonly used) - that will pave the > way for future work. > > Hope this helps, please

Re: Multimap PCollectionViews' values udpated rather than appended

2018-06-11 Thread Carlos Alonso
owledge, there is no guarantee about the order in which the >> values are combined. You will need to use some piece of information about >> the element to figure out which is the latest (or encode some additional >> information along with each element to make this easy). >> >&

Re: Multimap PCollectionViews' values udpated rather than appended

2018-05-31 Thread Carlos Alonso
, May 31, 2018 at 4:44 PM Lukasz Cwik wrote: > The trigger definition in the sample code you have is using discarding > firing mode. Try swapping to using accumulating mode. > > > On Thu, May 31, 2018 at 1:42 AM Carlos Alonso > wrote: > >> But I think what I'm expe

Re: Multimap PCollectionViews' values udpated rather than appended

2018-05-31 Thread Carlos Alonso
re's a prior discussion: >> https://lists.apache.org/thread.html/e9518f5d5f4bcf7bab02de2cb9fe1bd5293d87aa12d46de1eac4600b@%3Cuser.beam.apache.org%3E >> >> It is actually long-standing and the solution is known but hard. >> >> >> >> On Wed, May 30, 2018 at 9:48 AM Carlos Alon

Multimap PCollectionViews' values udpated rather than appended

2018-05-30 Thread Carlos Alonso
Hi everyone!! Working with multimap based side inputs on the global window I'm experiencing something unexpected (at least to me) that I'd like to share with you to clarify. The way I understand multimaps is that when one emits two values for the same key for the same window (obvious thing here

Re: Testing an updating side input on global window

2018-05-29 Thread Carlos Alonso
sting/TestStream.java > 2: > https://github.com/apache/beam/blob/0cbcf4ad1db7d820c5476d636f3a3d69062021a5/sdks/java/core/src/test/java/org/apache/beam/sdk/testing/TestStreamTest.java#L69 > > > On Tue, May 29, 2018 at 1:05 PM Carlos Alonso > wrote: > >> Hi all!! >>

Testing an updating side input on global window

2018-05-29 Thread Carlos Alonso
Hi all!! Basically that's what I'm trying to do. I'm building a pipeline that has a refreshing, multimap, side input (BQ schemas) that then I apply to the main stream of data (records that are ultimately saved to the corresponding BQ table). My job, although being of streaming nature, runs on

Re: Understanding GenerateSequence and SideInputs

2018-05-25 Thread Carlos Alonso
resting read ... > > > https://cloud.google.com/blog/big-data/2018/02/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix > > Cheers > > Reza > > > > On Fri, May 25, 2018, 5:50 AM Raghu Angadi <rang...@google.com> wrote: > >> >> On Thu

Understanding GenerateSequence and SideInputs

2018-05-24 Thread Carlos Alonso
Hi everyone!! I'm building a pipeline to store streaming data into BQ and I'm using the pattern: Slowly changing lookup cache described here: https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1 to hold and refresh the table schemas (as they may

Re: BigQuery streaming insert errors

2018-05-09 Thread Carlos Alonso
rts. I agree that this can be useful addition. Feel free to >> create a JIRA. Also, any contributions related to this are welcome. >> >> Thanks, >> Cham >> >> >> On Fri, Apr 6, 2018 at 12:29 AM Carlos Alonso <car...@mrcalonso.com> >> wrote:

Re: Chasing "Cannot output with timestamp" errors

2018-05-01 Thread Carlos Alonso
mm. Since they are different by 1ms I wonder if it is rounding / > truncation combined with very slight skew between Pubsub & DF. Just a > random guess. Your code does seem reasonable at first glance, from a Beam > perspective (it is Scio code, yes?) > > Kenn > > On Tue, May 1,

Re: Chasing "Cannot output with timestamp" errors

2018-05-01 Thread Carlos Alonso
d actually be trying to set them backwards... Does it make any sense? (I'll try to dive into the read from PubSub transform to double check...) Thanks! On Tue, May 1, 2018 at 4:44 PM Carlos Alonso <car...@mrcalonso.com> wrote: > Hi everyone!! > > I have a job that reads

Re: BiqQueryIO.write and Wait.on

2018-04-20 Thread Carlos Alonso
Is any of you interested in > helping out? > > Thanks. > > On Fri, Apr 6, 2018 at 12:36 AM Carlos Alonso <car...@mrcalonso.com> > wrote: > >> Hi Simon, I think your explanation was very accurate, at least to my >> understanding. I'd also be interested in getting batch

Re: BiqQueryIO.write and Wait.on

2018-04-06 Thread Carlos Alonso
Hi Simon, I think your explanation was very accurate, at least to my understanding. I'd also be interested in getting batch load result's feedback on the pipeline... hopefully someone may suggest something, otherwise we could propose submitting a Jira, or even better, a PR!! :) Thanks! On Thu,

Re: BigQuery streaming insert errors

2018-04-06 Thread Carlos Alonso
6, 2018 at 8:13 AM, Pablo Estrada <pabl...@google.com> wrote: > >> Im adding Cham as he might be knowledgeable about BQ IO, or he might be >> able to redirect to someone else. >> Cham, do you have guidance for Carlos here? >> Thanks >> -P. >> >&

Re: BigQuery streaming insert errors

2018-04-02 Thread Carlos Alonso
gt; throw new IOException("Insert failed: " + allErrors); > > Cheers > > On Mon, Apr 2, 2018 at 7:16 AM, Carlos Alonso <car...@mrcalonso.com> > wrote: > >> Hi everyone!! >> >> I was wondering if there's any way to get the error why an in

BigQuery streaming insert errors

2018-04-02 Thread Carlos Alonso
Hi everyone!! I was wondering if there's any way to get the error why an insert (streaming) failed. Looking at the code I think there's currently no way to do that, as the BigQueryServicesImpl insertAll seems to discard the errors and just add the failed TableRow instances into the failedInserts

BigQuery FILE_LOADS recover from errors

2018-03-27 Thread Carlos Alonso
Hi all!! On my pipeline I want to dump some data into BQ using FILE_LOADS write method and I can't see how would I recover from errors (i.e. on the pipeline detect which records couldn't be inserted and store it somewhere else for further inspection) as the WriteTables transform throws an

Best way to repeatedly update a side input

2018-03-26 Thread Carlos Alonso
Hi all!! I want to create a job that reads data from pubsub and then joins with another collection (side input) read from a BigQuery table (actually the schemas). The thing I'm not 100% sure is what would be the best way to update that side input (as those schemas may change at any given point in

Re: Table with field based partitioning must have a schema

2018-03-25 Thread Carlos Alonso
partitioning doesn't have to be associated with a schema. > > Cheers > > On Sat, Mar 24, 2018 at 2:30 PM, Carlos Alonso <car...@mrcalonso.com> > wrote: > >> Otherwise the BQ load job fails with the above error as well (Table with >> field based partitioning must

Re: Table with field based partitioning must have a schema

2018-03-24 Thread Carlos Alonso
hy implement getSchema at all? > > On Sat, Mar 24, 2018, 7:01 AM Carlos Alonso <car...@mrcalonso.com> wrote: > >> The thing is that the previous log "Returning schema for ..." never >> appears, so I don't think anything will appear on the log if I log what you &g

Re: Table with field based partitioning must have a schema

2018-03-24 Thread Carlos Alonso
seSchema and > confirm that it is always non-empty? What does the result look like for the > table that's failing to load? > > On Fri, Mar 23, 2018 at 6:01 PM Carlos Alonso <car...@mrcalonso.com> > wrote: > >> Hi everyone!! >> >> When trying to insert into BigQuery u

Table with field based partitioning must have a schema

2018-03-23 Thread Carlos Alonso
Hi everyone!! When trying to insert into BigQuery using dynamic destinations I get this error: "Tabie with field based partitioning must have a schema" that suggests that I'm not providing such a schema and I don't understand why as I think I am. Here: https://pastebin.com/Q1jF024B you can find

Re: BigQueryIO streaming inserts - poor performance with multiple tables

2018-03-06 Thread Carlos Alonso
Could you please keep writing here the findings you make? I'm very interested in this issue as well. Thanks! On Thu, Mar 1, 2018 at 9:45 AM Josh wrote: > Hi Cham, > > Thanks, I have emailed the dataflow-feedback email address with the > details. > > Best regards, > Josh > >

Re: Partitioning a stream randomly and writing to files with TextIO

2018-02-23 Thread Carlos Alonso
Hi Lukasz, could you please elaborate a bit more around the 2nd part? What's important to know, from the developers perspective, about Dataflow's memory management? How big can partitions grow? And what are the performance considerations? As this sounds like if the workers will "swap" into disk if

Re: Unable to decode tag list after update

2018-02-14 Thread Carlos Alonso
version, but during that time, as new elements were streamed in, the old version wasn't able to decode de new ones. Does it make any sense? Thanks! On Wed, Feb 14, 2018 at 12:50 PM Carlos Alonso <car...@mrcalonso.com> wrote: > Another thing I've realised is that the stacktrace

Re: Unable to decode tag list after update

2018-02-14 Thread Carlos Alonso
out. Thanks! On Wed, Feb 14, 2018 at 11:33 AM Carlos Alonso <car...@mrcalonso.com> wrote: > I've added a couple of methods to a case class and updated the job on > Dataflow and started getting > > java.lang.IllegalStateException: Unable to dec

Unable to decode tag list after update

2018-02-14 Thread Carlos Alonso
I've added a couple of methods to a case class and updated the job on Dataflow and started getting java.lang.IllegalStateException: Unable to decode tag list using org.apache.beam.sdk.coders.SerializableCoder@4ad81832 Caused by java.io.InvalidClassException: my.package.MessageWithAttributes;

Re: How does TextIO decides when to finalise a file?

2018-02-13 Thread Carlos Alonso
ate names are produced > within one pane. Otherwise, it will overwrite. > > On Tue, Feb 13, 2018 at 11:58 AM Carlos Alonso <car...@mrcalonso.com> > wrote: > >> Cool, thanks. >> >> What if the destination is not properly coded and the File naming policy >> then

Re: How does TextIO decides when to finalise a file?

2018-02-13 Thread Carlos Alonso
ation after something for that destination has already > been written will simply be in the next pane, or in a different window. > > On Tue, Feb 13, 2018, 6:33 AM Carlos Alonso <car...@mrcalonso.com> wrote: > >> Hi everyone!! >> >> I'm wondering how a TextIO with

Re: ParDo requires its input to use KvCoder in order to use state and timers

2018-02-13 Thread Carlos Alonso
before the point where the windowing is applied. Thanks! On Mon, Feb 12, 2018 at 6:41 PM Carlos Alonso <car...@mrcalonso.com> wrote: > Ok, that makes a lot of sense. Thanks Kenneth! > > On Mon, Feb 12, 2018 at 5:41 PM Kenneth Knowles <k...@google.com> wrote: > >>

How does TextIO decides when to finalise a file?

2018-02-13 Thread Carlos Alonso
Hi everyone!! I'm wondering how a TextIO with dynamic routing knows/decides when to finalise a file and what happens if after it is finalised, another element routed for the same file appears. Thanks!

Re: Triggers based on size

2018-02-12 Thread Carlos Alonso
n 10, 2018 at 9:09 AM, Carlos Alonso <car...@mrcalonso.com> > wrote: > > Thanks Robert!! > > > > After reading this and the former post about stateful processing > Kenneth's > > suggestions sounds sensible. I'll probably give them a try!! Is there &g

Re: ParDo requires its input to use KvCoder in order to use state and timers

2018-02-12 Thread Carlos Alonso
at it cannot interpret. KvCoder is a special > case where a runner knows the binary layout of encoded data so it can pull > out the keys in order to shuffle data of the same key to the same place, so > that is why it has to be a KvCoder. > > Kenn > > On Mon, Feb 12, 2018

Re: London Apache Beam meetup 2: 11/01

2018-02-12 Thread Carlos Alonso
t; > On Tue, Jan 9, 2018 at 12:30 PM, Carlos Alonso <car...@mrcalonso.com> > wrote: > >> Cool, thanks!! >> >> >> On Mon, Jan 8, 2018 at 1:38 PM Matthias Baetens < >> matthias.baet...@datatonic.com> wrote: >> >>> Yes, we put everythi

Re: Trying to understand Unable to encode element exceptions

2018-02-10 Thread Carlos Alonso
io/pull/1032 > > On Sun, Jan 21, 2018 at 4:36 PM Neville Li <neville@gmail.com> wrote: > >> Awesome! >> We have't wrapped any stateful processing API in scala but if you have >> working snippet or ideas it'd be great to share in that ticket. >> &g

Re: Stateful processing with session window

2018-02-10 Thread Carlos Alonso
Hi Maurizio, I'm not a very experienced user here, I'm actually getting started into all this, but I'm going to give this one a try and see if I can help. What I think is happening here is that the third 'a' you see is actually on a different window of the other 3 a's. Stateful being per key and

Re: IllegalStateException when changing allowed lateness?

2018-02-09 Thread Carlos Alonso
Cool, let me know if you need anything else to nail down this issue. On Thu, Feb 8, 2018 at 3:45 PM Kenneth Knowles <k...@google.com> wrote: > Hi Carlos, > > You are surely correct. Good diagnosis. Filing a bug. > > Kenn > > On Thu, Feb 8, 2018 at 6:23 AM, Carlos A

Re: Lateness droppings debugging

2018-02-08 Thread Carlos Alonso
tic data from > a set of files, the live data from pubsub) and then flatten them for > further processing. > > On Thu, Feb 8, 2018 at 1:23 PM, Carlos Alonso <car...@mrcalonso.com> > wrote: > > Yes, the data is finite (although it comes through PubSub, so I guess is >

Re: Lateness droppings debugging

2018-02-08 Thread Carlos Alonso
I thought of loading all of it into the PubSub subscription before starting the job. That should work, right? Any better suggestion? On Thu, 8 Feb 2018 at 22:23, Carlos Alonso <car...@mrcalonso.com> wrote: > Yes, the data is finite (although it comes through PubSub, so I guess is >

Re: Lateness droppings debugging

2018-02-08 Thread Carlos Alonso
ming from? Rather than > messing with allowed lateness, would it be possible to hold the > watermark back appropriately during the time you're injecting old data > (assuming there's only a finite amount of it)? > > On Thu, Feb 8, 2018 at 12:56 PM, Carlos Alonso <car...@mrcalon

IllegalStateException when changing allowed lateness?

2018-02-08 Thread Carlos Alonso
Hi everyone!! I've just seen a new IllegalStateException: received state cleanup timer for window... that is before the appropriate cleanup time... The full stack trace is here: https://pastebin.com/J8Vuq9xz And I think it could be because I updated a running job with an increased allowed

Re: Lateness droppings debugging

2018-02-07 Thread Carlos Alonso
ter than 1 day. > - Place the elements into another window with larger allowed lateness > and log very late elements in another parallel aggregation (see > TriggerExample.java in Beam repo). > > On Wed, Feb 7, 2018 at 2:55 PM, Carlos Alonso <car...@mrcalonso.com> > wrote: >

Re: Custom metrics not showing on Stackdriver

2018-01-30 Thread Carlos Alonso
st places to > request Dataflow-specific support. > > This could be an issue coordinating between your Stackdriver Account(s) > and your Cloud project(s). We can continue to discuss / investigate > through one of the above forums. > > On Tue, Jan 30, 2018 at 12:30 PM, Carlos Alonso

Re: Custom metrics not showing on Stackdriver

2018-01-30 Thread Carlos Alonso
e.com/dataflow/support? Feel free > to mention my name in your contact. > > Cheers > Andrea > > > > On Tue, Jan 30, 2018 at 10:27 AM, Carlos Alonso <car...@mrcalonso.com> > wrote: > >> Hi Andrea, thank you very much for your response. >> >> I've

Re: Google Dataflow jobs stuck analysing the graph

2018-01-24 Thread Carlos Alonso
Finally the jobs were errored because I ran another job that took all the available quota for that region and the stuck jobs failed as they could not acquire the required resources. On Tue, Jan 23, 2018 at 10:20 AM Carlos Alonso <car...@mrcalonso.com> wrote: > Many thanks for y

Google Dataflow not distributing load across workers

2018-01-24 Thread Carlos Alonso
Hello everyone!! I'm experiencing a weird issue I'd like to understand. I have a workload that basically reads data from PubSub and stores it, organised by types and windows, in GCS. When I run it on low load it works fine, it actually only needs one worker, but if I try to increase the load by

Custom metrics not showing on Stackdriver

2018-01-23 Thread Carlos Alonso
Hi everyone!! I'm trying to get a deeper view on my dataflow jobs by measuring parts of it using `Metrics.counter|gauge` but I cannot find how to see them on Stackdriver. I have a premium Stackdriver account and I can see those counters under the Custom Counters section on the Dataflow UI. I

Google Dataflow jobs stuck analysing the graph

2018-01-22 Thread Carlos Alonso
We have submitted a couple of jobs that seem to have stuck on the graph analysing step. An weird A job with ID "2018-01-19_03_27_48-15138951989738918594" doesn't exist error appears on top of the Google Dataflow jobs list page and trying to list it using gcloud tool shows them as in Unknown

Re: Trying to understand Unable to encode element exceptions

2018-01-20 Thread Carlos Alonso
!! On Fri, Jan 19, 2018 at 6:21 PM Neville Li <neville@gmail.com> wrote: > Welcome. > > Added an issue so we may improve this in the future: > https://github.com/spotify/scio/issues/1020 > > > On Fri, Jan 19, 2018 at 11:14 AM Carlos Alonso <car...@mrcalonso.com>

Re: Trying to understand Unable to encode element exceptions

2018-01-19 Thread Carlos Alonso
stom Coder (you can compose a map coder for > the map field) for the entire case class and set it as part of the KvCoder. > > > On Fri, Jan 19, 2018 at 11:22 AM Carlos Alonso <car...@mrcalonso.com> > wrote: > >> You mean replacing the Map[String, String] from the case

Re: Trying to understand Unable to encode element exceptions

2018-01-19 Thread Carlos Alonso
>> >>> On Fri, Jan 19, 2018 at 4:35 PM Neville Li <neville@gmail.com> >>> wrote: >>> >>>> You shouldn't manually set coder in most cases. It defaults to >>>> KryoAtomicCoder for most Scala types. >>>> Mo

Re: Trying to understand Unable to encode element exceptions

2018-01-19 Thread Carlos Alonso
p the values into something beam-serializable first or > rewrite the transform with a scio built-in which takes care of KvCoder. > > On Fri, Jan 19, 2018, 10:56 AM Carlos Alonso <car...@mrcalonso.com> wrote: > >> I'm following this example: >> https://github.com/apache/beam/bl

Re: Trying to understand Unable to encode element exceptions

2018-01-19 Thread Carlos Alonso
/wiki/Scio%2C-Beam-and-Dataflow#coders > > On Fri, Jan 19, 2018, 10:27 AM Carlos Alonso <car...@mrcalonso.com> wrote: > >> May it be because I’m using >> .setCoder(KvCoder.of(StringUtf8Coder.of(), >> CoderRegistry.createDefault().getCoder(classOf[Message

Re: Trying to understand Unable to encode element exceptions

2018-01-19 Thread Carlos Alonso
gt; On Fri, Jan 19, 2018, 7:43 AM Carlos Alonso <car...@mrcalonso.com> wrote: > >> Hi everyone!! >> >> I'm building a pipeline to store items from a Google PubSub subscription >> into GCS buckets. In order to do it I'm using both stateful and timely >> proces

Re: Triggers based on size

2018-01-10 Thread Carlos Alonso
com> wrote: > Unfortunately, the metadata driven trigger is still just an idea, not > yet implemented. > > A good introduction to state and timers can be found at > https://beam.apache.org/blog/2017/08/28/timely-processing.html > > On Wed, Jan 10, 2018 at 1:08 AM, Carlos A

Re: Triggers based on size

2018-01-10 Thread Carlos Alonso
antee that the trigger fire as soon >> as possible; due to runtime characteristics a significant amount of >> data may be buffered (or come in at once) before the trigger is >> queried. One possibility would be to follow your triggering with a >> DoFn that breaks up large value

Triggers based on size

2018-01-09 Thread Carlos Alonso
Hi everyone!! I was wondering if there is an option to trigger window panes based on the size of the pane itself (rather than the number of elements). To provide a little bit more of context we're backing up a PubSub topic into GCS with the "special" feature that, depending on the "type" of the

Re: London Apache Beam meetup 2: 11/01

2018-01-09 Thread Carlos Alonso
Cool, thanks!! On Mon, Jan 8, 2018 at 1:38 PM Matthias Baetens < matthias.baet...@datatonic.com> wrote: > Yes, we put everything in place to record this time and hope to share the > recordings soon after the meetup. Stay tuned! > > On 8 Jan 2018 10:32, "Carlos Alonso

Re: London Apache Beam meetup 2: 11/01

2018-01-08 Thread Carlos Alonso
Will it be recorded? On Fri, Jan 5, 2018 at 5:11 PM Matthias Baetens < matthias.baet...@datatonic.com> wrote: > Hi all, > > Excited to announce the second Beam meet up located in the *Qubit offices > *next *Thursday 11/01.* > > We are very excited to have JB