Jenkins build is back to normal : beam_Release_Gradle_NightlySnapshot #27

2018-05-03 Thread Apache Jenkins Server
See

Re: Pubsub to Beam SQL

2018-05-03 Thread Reuven Lax
Are you planning on integrating this directly into PubSubIO, or adding a follow-on transform? On Wed, May 2, 2018 at 10:30 AM Anton Kedin wrote: > Hi > > I am working on adding functionality to support querying Pubsub messages > directly from Beam SQL. > > *Goal* > Provide Beam

Re: ValidatesRunner test cleanup

2018-05-03 Thread Robert Burke
I am curious as to how long the suite takes with the changes you've made. How long does a full ValidatesRunner suite take with your recategorizing? On Thu, May 3, 2018, 9:24 AM Eugene Kirpichov wrote: > Thanks Scott, this is awesome! > However, we should be careful when

Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

2018-05-03 Thread Andrew Pilloud
Ok, I've finished with this change. Didn't get reviews on the early cleanup PRs, so I've pushed all these changes into the first cleanup PR: https://github.com/apache/beam/pull/5224 Andrew On Tue, May 1, 2018 at 10:35 AM Andrew Pilloud wrote: > I'm just starting to move

Re: ValidatesRunner test cleanup

2018-05-03 Thread Kenneth Knowles
Since I went over the PR and dropped a lot of random opinions about what should be VR versus NR, I'll answer too: VR - all primitives: ParDo, GroupByKey, Flatten.pCollections (Flatten.iterables is an unrelated composite), Metrics VR - critical special composites: Combine VR - test infrastructure

Re: Pubsub to Beam SQL

2018-05-03 Thread Andrew Pilloud
This sounds awesome! Is event timestamp something that we need to specify for every source? If so, I would suggest we add this as a first-class option on CREATE TABLE rather than something hidden in TBLPROPERTIES. Andrew On Wed, May 2, 2018 at 10:30 AM Anton Kedin wrote: >

Re: ValidatesRunner test cleanup

2018-05-03 Thread Eugene Kirpichov
Thanks Kenn! Note though that we should have VR tests for transforms that have a runner specific override, such as TextIO.write() and Create that you mentioned. Agreed that it'd be good to have a more clear packaging separation between the two. On Thu, May 3, 2018, 10:35 AM Kenneth Knowles

Re: Google Summer of Code Project Intro

2018-05-03 Thread Andrew Pilloud
Hi Kai, Glad to hear someone is putting more work into benchmarking Beam SQL! It would be really cool if we had some of these running as nightly performance test jobs so we would know when there is a performance regression. This might be out of scope of your project, but keep it in mind. I am

Re: I want to allow a user-specified QuerySplitter for DatastoreIO

2018-05-03 Thread Frank Yellin
I actually tried (1), and ran precisely into the size limit that you mentioned. Because of the size of the database, I needed to split it into a few hundred shards, and that was more than the request limit. I was also considering a slightly different alternative to (2), such as adding

Re: Pubsub to Beam SQL

2018-05-03 Thread Anton Kedin
I think it makes sense for the case when timestamp is provided in the payload (including pubsub message attributes). We can mark the field as an event timestamp. But if the timestamp is internally defined by the source (pubsub message publish time) and not exposed in the event body, then we need

Re: Pubsub to Beam SQL

2018-05-03 Thread Anton Kedin
A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We will probably need a way to expose a message publish timestamp if we want to use it as an event timestamp, but that will be consumed by the same wrapper/transform without adding anything schema- or SQL-specific to PubsubIO

Re: I want to allow a user-specified QuerySplitter for DatastoreIO

2018-05-03 Thread Lukasz Cwik
I also like the idea of doing the splitting when the pipeline is running and not during pipeline construction. This works a lot better with things like templates. Do you know what Maven package contains com.google.rpc classes and what is the transitive dependency tree of the package? If those

Re: Pubsub to Beam SQL

2018-05-03 Thread Reuven Lax
I believe PubSubIO already exposes the publish timestamp if no timestamp attribute is set. On Thu, May 3, 2018 at 12:52 PM Anton Kedin wrote: > A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We > will probably need a way to expose a message publish

Re: Pubsub to Beam SQL

2018-05-03 Thread Andrew Pilloud
I like to avoid magic too. I might not have been entirely clear in what I was asking. Here is an example of what I had in mind, replacing the TBLPROPERTIES with a more generic TIMESTAMP option: CREATE TABLE table_name ( publishTimestamp TIMESTAMP, attributes MAP(VARCHAR, VARCHAR), payload

Re: ValidatesRunner test cleanup

2018-05-03 Thread Reuven Lax
I suspect that at least some of these are because people copy/pasted other tests, not realizing the overhead of ValidatesRunner. Is this something we should document in the contributors guide? On Thu, May 3, 2018 at 8:54 AM Scott Wegner wrote: > Note: if you don't care about

Re: ValidatesRunner test cleanup

2018-05-03 Thread Eugene Kirpichov
Thanks Scott, this is awesome! However, we should be careful when choosing what should be ValidatesRunner and what should be NeedsRunner. Could you briefly describe how you made the call and roughly what are the statistics before/after your PR (number of tests in both categories)? On Thu, May 3,

ValidatesRunner test cleanup

2018-05-03 Thread Scott Wegner
Note: if you don't care about Java runner tests, you can stop reading now. tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1] and converted many to @NeedsRunner in order to reduce post-commit runtime. This is work that was long overdue and finally got my attention due to the

Re: ValidatesRunner test cleanup

2018-05-03 Thread Jean-Baptiste Onofré
Thanks for the update, Scott. That's really a great job. I will ping you on Slack about some points as I'm preparing the build for the release (and I have some issues). Thanks again. Regards, JB On May 3, 2018, at 17:54, Scott Wegner wrote: >Note: if you don't care

Re: ValidatesRunner test cleanup

2018-05-03 Thread Kenneth Knowles
I think actually that the runner should have such an IT, not the core SDK. On Thu, May 3, 2018 at 11:20 AM Eugene Kirpichov wrote: > Thanks Kenn! Note though that we should have VR tests for transforms that > have a runner specific override, such as TextIO.write() and

Re: Pubsub to Beam SQL

2018-05-03 Thread Kenneth Knowles
It is an interesting question for Beam DDL - since timestamps are fundamental to Beam's data model, should we have a DDL extension that makes it very explicit? Seems nice, but perhaps TBLPROPERTIES is a way to stage the work, getting the functionality in place first and the parsing second. What

Re: ValidatesRunner test cleanup

2018-05-03 Thread Scott Wegner
Thanks for the feedback. For methodology, I crudely went through existing tests and looked at whether they exercise runner behavior or not. When I wasn't sure, I opted to leave them as-is. And then I leaned on Kenn's expertise to help categorize further :) For the current state: here's a run of

How to create a runtime ValueProvider

2018-05-03 Thread Frank Yellin
I'm attempting to create a dataflow template, and within the template have a variable ValueProvider now such that now is the time the dataflow is started, not the time that the template was created. My first attempt was ValueProvider now = StaticValueProvider.of(

Re: Pubsub to Beam SQL

2018-05-03 Thread Ankur Goenka
I like the idea of exposing source timestamp in TBLPROPERTIES which is closely tied to source (KafkaIO, KinesisIO, MqttIO, AmqpIO, unbounded FileIO, PubSubIO). Exposing timestamp as a top level keyword will break the symmetry between streaming and batch pipelines. TBLPROPERTIES gives us

Re: How to create a runtime ValueProvider

2018-05-03 Thread Frank Yellin
[Sorry, I accidentally hit send before I had finished typing . .] Is there any way to achieve what I'm looking for? Or is this just beyond the scope of ValueProvider and templates? On Thu, May 3, 2018 at 5:36 PM, Frank Yellin wrote: > I'm attempting to create a dataflow

Re: How to create a runtime ValueProvider

2018-05-03 Thread Eugene Kirpichov
There is no way to achieve this using ValueProvider. Its value is either fixed at construction time (StaticValueProvider), or completely dynamic (evaluated every time you call .get()). You'll need to implement this using a side input. E.g. take a look at implementation of BigQueryIO, how it
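The distinction drawn in this reply (a value fixed at construction time versus one re-evaluated on every `.get()`) can be illustrated with a small standalone sketch. It does not use Beam: `StaticProvider`, `DynamicProvider`, and `fake_now` are hypothetical stand-ins invented for illustration, and the fake clock is an assumption made so the behavior is deterministic.

```python
import itertools
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class StaticProvider(Generic[T]):
    """Hypothetical stand-in: the value is fixed when the object is constructed."""

    def __init__(self, value: T) -> None:
        self._value = value

    def get(self) -> T:
        return self._value


class DynamicProvider(Generic[T]):
    """Hypothetical stand-in: the value is recomputed on every get() call."""

    def __init__(self, supplier: Callable[[], T]) -> None:
        self._supplier = supplier

    def get(self) -> T:
        return self._supplier()


# Fake clock (an assumption for this sketch): each reading advances by one tick.
_ticks = itertools.count(100)


def fake_now() -> int:
    return next(_ticks)


# The clock is read once, here, at construction time.
static_now = StaticProvider(fake_now())
# The clock is read again on each get().
dynamic_now = DynamicProvider(fake_now)

static_reads = [static_now.get(), static_now.get()]
dynamic_reads = [dynamic_now.get(), dynamic_now.get()]
```

Under these assumptions the static provider keeps returning the construction-time reading, while the dynamic one returns a new reading on each call. Neither gives a single value captured once when the job starts, which is the gap the reply describes.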

Google Summer of Code Project Intro

2018-05-03 Thread Kai Jiang
Hi Beam Dev, I am Kai. GSoC announced the selected projects last week. During the community bonding period, I want to share some basics about this year's project with Apache Beam. Project abstract: https://summerofcode.withgoogle.com/projects/#6460770829729792 Issue Tracker: BEAM-3783

Re: How to create a runtime ValueProvider

2018-05-03 Thread Frank Yellin
Unfortunately, I can't use a side input, because I need to use the result as an input to a DatastoreV1.Read object. This is a PTransform rather than DoFn, so there is no place for a side input. But you did give me the hint I needed to get something that is close enough. It's a bit dirty, but

Graal instead of docker?

2018-05-03 Thread Romain Manni-Bucau
Hi guys, For some time there have been efforts to add language-portable support to Beam, but I can't really find a case where it "works" when based on docker, except for some vendor-specific infra. Current solution: 1. Is runner intrusive (which is bad for beam and prevents adoption of big data vendors)