JDBC support for Beam SQL

2018-05-16 Thread Andrew Pilloud
I'm currently adding JDBC support to Beam SQL! Unfortunately Calcite has
two distinct entry points, one for JDBC and one for everything else (see
CALCITE-1525). Eventually that will change, but I'd like to avoid having
two versions of Beam SQL until Calcite converges on a single path for
parsing SQL. Here are the options I am looking at:

1. Make JDBC the source of truth for Calcite config and state. Generate a
FrameworkConfig based on the JDBC connection and continue to use the
non-JDBC interface to Calcite. This option comes with the risk that the two
paths into Calcite will diverge (as there is a bunch of code copied from
Calcite to generate the config), but is the easiest to implement and
understand.

2. Make JDBC the only path into Calcite. Use prepareStatement and unwrap to
extract a BeamRelNode out of the JDBC interface. This eliminates a
significant amount of code in Beam, but the unwrap path is a little
convoluted.
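To make option 2 concrete, here is a hedged sketch of the unwrap pattern. BeamRelNode and PreparedStatementLike are illustrative stand-ins, not the real Beam/Calcite classes; real code would call java.sql.PreparedStatement#unwrap (inherited from java.sql.Wrapper) in the same way.

```java
// Sketch of option 2's "unwrap path": prepare a statement through the JDBC
// surface, then ask the statement for its implementation-specific plan node.
public class UnwrapSketch {
    interface BeamRelNode { String plan(); }

    static class PreparedStatementLike {
        private final BeamRelNode node = () -> "BeamCalcRel(...)";

        // Mimics java.sql.Wrapper#unwrap: hand back an
        // implementation-specific interface on request.
        <T> T unwrap(Class<T> iface) {
            if (iface.isInstance(node)) {
                return iface.cast(node);
            }
            throw new IllegalStateException("not a wrapper for " + iface);
        }
    }

    static String planOf(PreparedStatementLike stmt) {
        // The convoluted part: callers must know which internal interface
        // to request from the otherwise generic JDBC object.
        return stmt.unwrap(BeamRelNode.class).plan();
    }

    public static void main(String[] args) {
        System.out.println(planOf(new PreparedStatementLike()));
    }
}
```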

Both options leave the user-facing non-JDBC interface to Beam SQL
unchanged; these changes are internal.

Andrew


Re: [SQL] Cross Join Operation

2018-05-15 Thread Andrew Pilloud
Calcite does not have the concept of a "CROSS JOIN". It shows up in the
plan as a LogicalJoin with condition=[true]. We could try rejecting cross
joins at the planning stage by returning null for them
in BeamJoinRule.convert(), which might result in a different plan. But
looking at your query, you have a cross join unless the where clause of the
inner select references a row from the outer select.
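A condition=[true] join keeps every pair of rows, which is why its output grows with the product of the input sizes; a minimal illustration in plain Java:

```java
import java.util.ArrayList;
import java.util.List;

// Cross join semantics: every row of the left input pairs with every row of
// the right input, so the output has size |left| * |right|.
public class CrossJoinSketch {
    static <A, B> List<String> crossJoin(List<A> left, List<B> right) {
        List<String> out = new ArrayList<>();
        for (A a : left) {
            for (B b : right) {
                out.add(a + "," + b); // condition=[true]: keep every pair
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> result = crossJoin(List.of("a", "b", "c"), List.of("1", "2"));
        System.out.println(result.size()); // 3 * 2 = 6
    }
}
```

This is also why the small-side case Kenn mentions is tractable: when one input fits in memory, the nested loop stays cheap.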

Andrew

On Tue, May 15, 2018 at 9:15 AM Kenneth Knowles  wrote:

> The logical plan should show you where the cross join is needed. Here is
> where it is logged:
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamQueryPlanner.java#L150
>
> (It should probably be put to DEBUG level)
>
> If I look at the original template, like
> https://github.com/gregrahn/tpcds-kit/blob/master/query_templates/query9.tpl
> I see conditions "[RC.1]". Are those templates expected to be filled with
> references to the `reason` table, perhaps? How does that change things?
>
> I still think it would be good to support CROSS JOIN if we can - the
> problem of course is huge data size, but when one side is small it would be
> good for it to work simply.
>
> Kenn
>
> On Tue, May 15, 2018 at 7:41 AM Kai Jiang  wrote:
>
>> Hi everyone,
>>
>> To prove the idea of the GSoC project, I was working on some simple TPC-DS
>> queries running with generated data on the direct runner. query example
>> 
>>
>> The example is executed with TPC-DS query 9
>> .
>> Briefly, query 9 uses CASE WHEN clauses to select 5 counts
>> from store_sales (table 1). Each of those CASE WHEN clauses sits inside
>> one outer SELECT clause. In short, it looks like:
>> SELECT
>>
>>   CASE WHEN ( SELECT count(*) FROM table 1 WHERE ... )
>>     THEN condition 1
>>     ELSE condition 2,
>>   ...
>>   CASE WHEN ...
>>
>> FROM table 2
>>
>> IIUC, this query doesn't need a join operation on table 1 and table 2 since
>> the outer select clause doesn't need to touch table 1.
>> But the program shows it does, and throws an error message saying
>> "java.lang.UnsupportedOperationException: CROSS JOIN is not supported".
>> (error
>> message detail
>> )
>>
>> To make the query work, I am wondering where I can start:
>> 1. See the logical plan?
>> Will the logical plan explain why the query needs a CROSS JOIN?
>>
>> 2. Cross join support?
>> I checked all queries in the TPC-DS benchmark. Almost every query uses cross
>> join, so it is an important feature to implement. Unlike other joins, it
>> consumes a lot of computing resources. But I think we need cross join in
>> the future, and to support it in the join library. I noticed James has opened
>> BEAM-2194  for
>> supporting cross join.
>>
>> Looking forward to comments!
>>
>> Best,
>> Kai
>>
>>
>


Re: Fwd: Closing (automatically?) inactive pull requests

2018-05-14 Thread Andrew Pilloud
Warnings are really helpful; I've forgotten about PRs on projects I rarely
contribute to before. Also, authors can reopen their closed pull requests if
they decide they want to work on them again. This seems to be already
covered in the Stale pull requests section of the contributor guide. Seems
like you should just make it happen.
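The warn-then-close policy proposed downthread (mark stale after 3 months, close 1 week after the warning unless there is new activity) is a small state machine; a hedged sketch, where the thresholds come from the proposal and everything else is an assumption:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the stale-PR policy: warn after 3 months of inactivity, then
// close 1 week after the warning unless activity resumes.
public class StalePrPolicy {
    static final Duration STALE_AFTER = Duration.ofDays(90);
    static final Duration CLOSE_AFTER_WARN = Duration.ofDays(7);

    enum Action { NONE, WARN, CLOSE }

    // warnedAt is null if no stale warning has been posted yet.
    static Action nextAction(Instant lastActivity, Instant warnedAt, Instant now) {
        if (warnedAt != null) {
            // Any activity after the warning cancels the pending close.
            if (lastActivity.isAfter(warnedAt)) {
                return Action.NONE;
            }
            return now.isAfter(warnedAt.plus(CLOSE_AFTER_WARN)) ? Action.CLOSE : Action.NONE;
        }
        return now.isAfter(lastActivity.plus(STALE_AFTER)) ? Action.WARN : Action.NONE;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2018-05-14T00:00:00Z");
        Instant old = now.minus(Duration.ofDays(120));
        System.out.println(nextAction(old, null, now)); // WARN
    }
}
```

A bot (or a cron script listing PRs) would evaluate this per PR and post the warning comment or close accordingly.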

Andrew

On Mon, May 14, 2018 at 1:26 PM Kenneth Knowles  wrote:

> Yea, the bot they linked to sends a warning comment first.
>
> Kenn
>
> On Mon, May 14, 2018 at 7:40 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> Do you know if the bot can send a first "warn" comment before closing
>> the PR ?
>>
>> I think that would be great: if the contributor is not active after the
>> warn message, then, it's fine to close the PR (the contributor can
>> always open a new one later if it makes sense).
>>
>> Regards
>> JB
>>
>> On 14/05/2018 16:20, Kenneth Knowles wrote:
>> > Hi all,
>> >
>> > Spotted this thread on d...@flink.apache.org
>> > . I didn't make a combined thread because
>> > each project should discuss on our own.
>> >
>> > I think it would be great to share "stale PR closer bot" infrastructure
>> > (and this might naturally be a hook where we put other things / combine
>> > with merge-bot / etc).
>> >
>> > The downside to automation is being less empathetic - but hopefully for
>> > very stale PRs no one is really listening anyhow.
>> >
>> > Kenn
>> >
>> > -- Forwarded message -
>> > From: Ufuk Celebi >
>> > Date: Mon, May 14, 2018 at 5:58 AM
>> > Subject: Re: Closing (automatically?) inactive pull requests
>> > To: >
>> >
>> >
>> > Hey Piotr,
>> >
>> > thanks for bringing this up. I really like this proposal and also saw
>> > it work successfully at other projects. So +1 from my side.
>> >
>> > - I like the approach with a notification one week before
>> > automatically closing the PR
>> > - I think a bot will be the best option, as these kinds of things are
>> > usually followed enthusiastically in the beginning but eventually
>> > lose traction
>> >
>> > We can enable better integration with GitHub by using ASF GitBox
>> > (https://gitbox.apache.org/setup/) but we should discuss that in a
>> > separate thread.
>> >
>> > – Ufuk
>> >
>> > On Mon, May 14, 2018 at 12:04 PM, Piotr Nowojski
>> > > wrote:
>> >  > Hey,
>> >  >
>> >  > We have lots of open pull requests and quite some of them are
>> > stale/abandoned/inactive. Often such old PRs are impossible to merge
>> due
>> > to conflicts and it’s easier to just abandon and rewrite them.
>> > Especially there are some PRs which original contributor created long
>> > time ago, someone else wrote some comments/review and… that’s about it.
>> > The original contributor never showed up again to respond to the comments.
>> > Regardless of the reason, such PRs are clogging GitHub, making it
>> > difficult to keep track of things and making it almost impossible to
>> > find a little bit old (for example 3+ months) PRs that are still valid
>> > and waiting for reviews. To do something like that, one would have to
>> > dig through tens or hundreds of abandoned PRs.
>> >  >
>> >  > What I would like to propose is to agree on some inactivity
>> > deadline, let's say 3 months. After crossing such a deadline, PRs should be
>> > marked/commented as “stale”, with information like:
>> >  >
>> >  > “This pull request has been marked as stale due to 3 months of
>> > inactivity. It will be closed in 1 week if no further activity occurs.
>> > If you think that’s incorrect or this pull request requires a review,
>> > please simply write any comment.”
>> >  >
>> >  > Either we could just agree on such policy and enforce it manually
>> > (maybe with some simple tooling, like a simple script to list inactive
>> > PRs - seems like couple of lines in python by using PyGithub) or we
>> > could think about automating this action. There are some bots that do
>> > exactly this (like this one: https://github.com/probot/stale
>> >  ), but probably they would need to
>> be
>> > adapted to the limitations of our Apache repository (we cannot add labels
>> > and we cannot close PRs via GitHub).
>> >  >
>> >  > What do you think about it?
>> >  >
>> >  > Piotrek
>>
>


Re: Graal instead of docker?

2018-05-11 Thread Andrew Pilloud
JSON and Protobuf aren't the same thing. JSON is for exchanging
unstructured data; Protobuf is for exchanging structured data. The point of
Portability is to define a protocol for exchanging structured messages
across languages. What do you propose using on top of JSON to define
message structure?

I'd like to see the generic runner rewritten in Go so we can eliminate
the significant overhead imposed by the JVM. I would argue that Go is the
best language for low-overhead infrastructure, and it is already widely used
by projects in this space such as Docker, Kubernetes, and InfluxDB. Even SQL
can take advantage of this. For example, several runners could be passed
raw SQL and use their own SQL engines to implement more efficient
transforms than generic Beam can. Users will save significant $$$ on
infrastructure by not having to involve the JVM at all.

Andrew

On Fri, May 11, 2018 at 8:53 AM Romain Manni-Bucau 
wrote:

>
>
> Le mer. 9 mai 2018 17:41, Eugene Kirpichov  a
> écrit :
>
>>
>>
>> On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau 
>> wrote:
>>
>>>
>>>
>>> Le mer. 9 mai 2018 00:57, Henning Rohde  a écrit :
>>>
 There are indeed lots of possibilities for interesting docker
 alternatives with different tradeoffs and capabilities, but in general
 both the runner and the SDK must support them for it to work. As
 mentioned, docker (as used in the container contract) is meant as a
 flexible main option but not necessarily the only option. I see no problem
 with certain pipeline-SDK-runner combinations additionally supporting a
 specialized setup. Pipeline can be a factor, because some transforms
 might depend on aspects of the runtime environment -- such as system
 libraries or shelling out to a /bin/foo.

 The worker boot code is tied to the current container contract, so
 pre-launched workers would presumably not use that code path and are not
 bound by its assumptions. In particular, such a setup might want to invert
 who initiates the connection from the SDK worker to the runner. Pipeline
 options and global state in the SDK and user functions process might make
 it difficult to safely reuse worker processes across pipelines, but also
 doable in certain scenarios.

>>>
>>> This is not that hard actually, and most Java environments do it.
>>>
>>> The main concerns are 1. being tied to an implementation detail and 2. a bad
>>> architecture which doesn't embrace the community
>>>
>> Could you please be more specific? Concerns about Docker dependency have
>> already been repeatedly addressed in this thread.
>>
>
> My concern is that beam is being driven by an implementation instead of a
> clear and scalable architecture.
>
> The best demonstration is the protobuf usage, which is far from the best
> choice for portability these days due to the implication of its stack in
> several languages (nobody wants it in their classpath in Java/Scala these
> days, for instance, because of conflicts or the security care it requires).
> JSON is well tooled and trivial to use with whatever lib you want to rely on, in
> any language or environment, to cite just one alternative.
>
> Being portable (language) is a good goal but IMHO requires:
>
> 1. Runners in each language (otherwise fall back on JSR-223 and you are
> good with just a JSON facade)
> 2. A generic runner able to route each task to the right native runner
> 3. A way to run in a single runner when relevant (keep in mind most
> Java users don't even want to see Python or portable code or APIs in their
> classpath and runner)
>
>
>
>
>>
>>>
>>>
>>>
 Henning

 On Tue, May 8, 2018 at 3:51 PM Thomas Weise  wrote:

>
>
> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw 
> wrote:
>
>>
>> I would welcome changes to
>>
>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>> that would provide alternatives to docker (one of which comes to mind
>> is "I
>> already brought up a worker(s) for you (which could be the same
>> process
>> that handled pipeline construction in testing scenarios), here's how
>> to
>> connect to it/them.") Another option, which would seem to appeal to
>> you in
>> particular, would be "the worker code is linked into the runner's
>> binary,
>> use this process as the worker" (though note even for java-on-java,
>> it can
>> be advantageous to shield the worker and runner code from each others
>> environments, dependencies, and version requirements.) This latter
>> should
>> still likely use the FnApi to talk to itself (either over GRPC on
>> local
>> ports, or possibly better via direct function calls eliminating the
>> RPC
>> overhead altogether--this is how the fast local runner in Python

Re: Documenting Github PR jenkins trigger phrases

2018-05-10 Thread Andrew Pilloud
It would be great to have the set of "Run {Java,Python,Go} PreCommit"
phrases documented in the contributor guide as well. Those match up to the jobs
auto-run on every PR and are the ones I use most. There is no security;
anyone can run them, including 'Run Seed Job'. That one seems like a good one to
document in the testing guide, because it is the one that loads changes to the rest
of the jobs.

Andrew

On Thu, May 10, 2018 at 12:27 PM Huygaa Batsaikhan 
wrote:

> Hi devs,
>
> We can run various jenkins commands (precommit, postcommit, performance
> tests) directly from Github Pull Request UI by commenting phrases such as
> "retest this please". Unfortunately, this tool is not documented. I am
> adding a brief documentation in
> https://beam.apache.org/contribute/testing/ and I need some help.
>
>1. What are the most common phrases used?
>2. Can anyone run these commands? Are there any permission issues?
>3. Does it make sense to categorize the commands as Performance tests,
>Precommit, Postcommit, and Release Validation?
>
> Let me know what you think,
>
> Thanks,
> Huygaa
>


Re: Merging our two SQL parser configs

2018-05-09 Thread Andrew Pilloud
Haven't heard anything, so I wrote up the change:
https://github.com/apache/beam/pull/5325

Andrew

On Mon, May 7, 2018 at 3:16 PM Andrew Pilloud <apill...@google.com> wrote:

> So we have two incompatible SQL parser configs in Beam. One is in
> BeamQueryPlanner
> <https://github.com/apache/beam/blob/8ef71b6eb1d2d5c63974ec506a01faf3813efe74/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamQueryPlanner.java#L95>
>  which is used by default and a second in BeamSqlParser
> <https://github.com/apache/beam/blob/598774738e7a1236cf30f70a584311cee52d1818/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/parser/BeamSqlParser.java#L33>
> which is used only in the BeamSqlCli execute path (not the explain path).
> There are also a bunch of 'toLowerCase()' calls scattered around our code.
> I'd like to get us on one parser config and remove the need for toLowerCase
> calls.
>
> To do this, I am proposing we standardize these all to go through
> BeamQueryPlanner, use the Calcite Lex.JAVA config, and drop the
> 'toLowerCase()' calls. This will result in the parser preserving case,
> being case sensitive, and using backticks for quoted identifiers. This is
> the same as the default config in Apache Flink and is roughly compatible
> with BigQuery. It effectively leaves the default path unchanged, except
> case will now be preserved and checked consistently. The BeamSqlCli execute
> path will remain unchanged for unquoted queries with all lower case names,
> which is what we have tested. Comments? Objections?
>
> Andrew
>


Re: Jenkins Post Commit Status to Github

2018-05-09 Thread Andrew Pilloud
The seed job with the revert was run on May 9, 2018 at 6:12:14 PM:
https://builds.apache.org/job/beam_SeedJob/1657/ Your broken job was run
before that.

Andrew

On Wed, May 9, 2018 at 1:11 PM Pablo Estrada <pabl...@google.com> wrote:

> There is a problem that is still not fixed in Python:
> https://builds.apache.org/job/beam_PostCommit_Python_Verify/4909/console
>
> On Wed, May 9, 2018 at 12:07 PM Andrew Pilloud <apill...@google.com>
> wrote:
>
>> Post commits are no longer failing on status pushes. Now that I know
>> about the seed job, I'll figure out how to test my changes in the future.
>> Sorry for all the trouble!
>>
>> Andrew
>>
>> On Wed, May 9, 2018 at 11:36 AM Pablo Estrada <pabl...@google.com> wrote:
>>
>>> I was able to trigger a build again just now.
>>>
>>> On Wed, May 9, 2018 at 11:27 AM Andrew Pilloud <apill...@google.com>
>>> wrote:
>>>
>>>> The manual launch button doesn't exist for me. I am not a committer,
>>>> so I don't have a login to Jenkins.
>>>>
>>>> On Wed, May 9, 2018 at 11:16 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>> You can also just manually launch a seed job from within Jenkins
>>>>> pointing at apache/master as its source.
>>>>>
>>>>> On Wed, May 9, 2018 at 11:10 AM Andrew Pilloud <apill...@google.com>
>>>>> wrote:
>>>>>
>>>>>> I've not heard of seed jobs before, but from what I've been told I
>>>>>> need to create a PR with an empty '.test-infra/jenkins' folder, then type
>>>>>> 'Run Seed Job' in a comment to apply the revert? Doing that now.
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Wed, May 9, 2018 at 11:08 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>>>
>>>>>>> I am also seeing the delayed launch of the tests.
>>>>>>>
>>>>>>> On Wed, May 9, 2018 at 10:56 AM Pablo Estrada <pabl...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Does this require a Seed Job rerun? I'm seeing Python postcommits
>>>>>>>> breaking as well
>>>>>>>>
>>>>>>>> On Wed, May 9, 2018 at 10:49 AM Alan Myrvold <amyrv...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I saw ~1-2 hour latency for the trigger happening yesterday. Not
>>>>>>>>> sure what would cause it to stop or be slow.
>>>>>>>>>
>>>>>>>>> On Wed, May 9, 2018 at 10:45 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> It seems as though precommits are no longer triggering and
>>>>>>>>>> trigger requests like 'Run Java PreCommit' are no longer honored.
>>>>>>>>>>
>>>>>>>>>> On Wed, May 9, 2018 at 10:22 AM Andrew Pilloud <
>>>>>>>>>> apill...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I broke all the post commits with this. Sorry! It has been
>>>>>>>>>>> reverted. I'm going to follow up with Apache Infra about getting 
>>>>>>>>>>> the right
>>>>>>>>>>> credentials configured on the Jenkins plugin.
>>>>>>>>>>>
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 8, 2018 at 1:38 PM Andrew Pilloud <
>>>>>>>>>>> apill...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yep, mess with the groovy scripts appears to be the answer. We
>>>>>>>>>>>> use different jenkins libraries to handle PRs vs pushes to master 
>>>>>>>>>>>> but I
>>>>>>>>>>>> think I finally figured it out. Change is here:
>>>>>>>>>>>> https://github.com/apache/beam/pull/5305
>>>>>>>>>>>>
>>>>>>>>>>>> Andrew
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 7, 2018 at 12:40 PM Kenneth Knowles <k...@google.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think you want to mess with the groovy scripts in
>>>>>>>>>>>>> .test-infra/jenkins
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, May 7, 2018 at 11:12 AM Andrew Pilloud <
>>>>>>>>>>>>> apill...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The Github branches page shows the status of the latest
>>>>>>>>>>>>>> commit on each branch and provides a set of links to the jobs 
>>>>>>>>>>>>>> run on that
>>>>>>>>>>>>>> commit. But it doesn't appear Jenkins is publishing status from 
>>>>>>>>>>>>>> post commit
>>>>>>>>>>>>>> jobs. This seems like a simple oversight that should be easy to 
>>>>>>>>>>>>>> fix. Could
>>>>>>>>>>>>>> someone point me in the right direction to fix this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>>>
>>>>>>> --
>>> Got feedback? go/pabloem-feedback
>>> <https://goto.google.com/pabloem-feedback>
>>>
>> --
> Got feedback? go/pabloem-feedback
> <https://goto.google.com/pabloem-feedback>
>


Re: Jenkins Post Commit Status to Github

2018-05-09 Thread Andrew Pilloud
The manual launch button doesn't exist for me. I am not a committer, so I
don't have a login to Jenkins.

On Wed, May 9, 2018 at 11:16 AM Lukasz Cwik <lc...@google.com> wrote:

> You can also just manually launch a seed job from within Jenkins pointing
> at apache/master as its source.
>
> On Wed, May 9, 2018 at 11:10 AM Andrew Pilloud <apill...@google.com>
> wrote:
>
>> I've not heard of seed jobs before, but from what I've been told I need
>> to create a PR with an empty '.test-infra/jenkins' folder, then type 'Run
>> Seed Job' in a comment to apply the revert? Doing that now.
>>
>> Andrew
>>
>> On Wed, May 9, 2018 at 11:08 AM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> I am also seeing the delayed launch of the tests.
>>>
>>> On Wed, May 9, 2018 at 10:56 AM Pablo Estrada <pabl...@google.com>
>>> wrote:
>>>
>>>> Does this require a Seed Job rerun? I'm seeing Python postcommits
>>>> breaking as well
>>>>
>>>> On Wed, May 9, 2018 at 10:49 AM Alan Myrvold <amyrv...@google.com>
>>>> wrote:
>>>>
>>>>> I saw ~1-2 hour latency for the trigger happening yesterday. Not sure
>>>>> what would cause it to stop or be slow.
>>>>>
>>>>> On Wed, May 9, 2018 at 10:45 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> It seems as though precommits are no longer triggering and trigger
>>>>>> requests like 'Run Java PreCommit' are no longer honored.
>>>>>>
>>>>>> On Wed, May 9, 2018 at 10:22 AM Andrew Pilloud <apill...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I broke all the post commits with this. Sorry! It has been reverted.
>>>>>>> I'm going to follow up with Apache Infra about getting the right
>>>>>>> credentials configured on the Jenkins plugin.
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Tue, May 8, 2018 at 1:38 PM Andrew Pilloud <apill...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yep, mess with the groovy scripts appears to be the answer. We use
>>>>>>>> different jenkins libraries to handle PRs vs pushes to master but I 
>>>>>>>> think I
>>>>>>>> finally figured it out. Change is here:
>>>>>>>> https://github.com/apache/beam/pull/5305
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> On Mon, May 7, 2018 at 12:40 PM Kenneth Knowles <k...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think you want to mess with the groovy scripts in
>>>>>>>>> .test-infra/jenkins
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Mon, May 7, 2018 at 11:12 AM Andrew Pilloud <
>>>>>>>>> apill...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> The Github branches page shows the status of the latest commit on
>>>>>>>>>> each branch and provides a set of links to the jobs run on that 
>>>>>>>>>> commit. But
>>>>>>>>>> it doesn't appear Jenkins is publishing status from post commit 
>>>>>>>>>> jobs. This
>>>>>>>>>> seems like a simple oversight that should be easy to fix. Could 
>>>>>>>>>> someone
>>>>>>>>>> point me in the right direction to fix this?
>>>>>>>>>>
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>> --
>>>> Got feedback? go/pabloem-feedback
>>>> <https://goto.google.com/pabloem-feedback>
>>>>
>>>


Re: Jenkins Post Commit Status to Github

2018-05-09 Thread Andrew Pilloud
Kenn tells me there is a button he can push to run it. He clicked it.
Hopefully that fixes the postcommits. I don't know why Jenkins itself is
having high latency but I've seen the same thing over the last few days.

On Wed, May 9, 2018 at 11:09 AM Andrew Pilloud <apill...@google.com> wrote:

> I've not heard of seed jobs before, but from what I've been told I need to
> create a PR with an empty '.test-infra/jenkins' folder, then type 'Run Seed
> Job' in a comment to apply the revert? Doing that now.
>
> Andrew
>
> On Wed, May 9, 2018 at 11:08 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> I am also seeing the delayed launch of the tests.
>>
>> On Wed, May 9, 2018 at 10:56 AM Pablo Estrada <pabl...@google.com> wrote:
>>
>>> Does this require a Seed Job rerun? I'm seeing Python postcommits
>>> breaking as well
>>>
>>> On Wed, May 9, 2018 at 10:49 AM Alan Myrvold <amyrv...@google.com>
>>> wrote:
>>>
>>>> I saw ~1-2 hour latency for the trigger happening yesterday. Not sure
>>>> what would cause it to stop or be slow.
>>>>
>>>> On Wed, May 9, 2018 at 10:45 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>> It seems as though precommits are no longer triggering and trigger
>>>>> requests like 'Run Java PreCommit' are no longer honored.
>>>>>
>>>>> On Wed, May 9, 2018 at 10:22 AM Andrew Pilloud <apill...@google.com>
>>>>> wrote:
>>>>>
>>>>>> I broke all the post commits with this. Sorry! It has been reverted.
>>>>>> I'm going to follow up with Apache Infra about getting the right
>>>>>> credentials configured on the Jenkins plugin.
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Tue, May 8, 2018 at 1:38 PM Andrew Pilloud <apill...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yep, mess with the groovy scripts appears to be the answer. We use
>>>>>>> different jenkins libraries to handle PRs vs pushes to master but I 
>>>>>>> think I
>>>>>>> finally figured it out. Change is here:
>>>>>>> https://github.com/apache/beam/pull/5305
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Mon, May 7, 2018 at 12:40 PM Kenneth Knowles <k...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think you want to mess with the groovy scripts in
>>>>>>>> .test-infra/jenkins
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Mon, May 7, 2018 at 11:12 AM Andrew Pilloud <apill...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The Github branches page shows the status of the latest commit on
>>>>>>>>> each branch and provides a set of links to the jobs run on that 
>>>>>>>>> commit. But
>>>>>>>>> it doesn't appear Jenkins is publishing status from post commit jobs. 
>>>>>>>>> This
>>>>>>>>> seems like a simple oversight that should be easy to fix. Could 
>>>>>>>>> someone
>>>>>>>>> point me in the right direction to fix this?
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>> --
>>> Got feedback? go/pabloem-feedback
>>> <https://goto.google.com/pabloem-feedback>
>>>
>>


Re: Jenkins Post Commit Status to Github

2018-05-09 Thread Andrew Pilloud
I've not heard of seed jobs before, but from what I've been told I need to
create a PR with an empty '.test-infra/jenkins' folder, then type 'Run Seed
Job' in a comment to apply the revert? Doing that now.

Andrew

On Wed, May 9, 2018 at 11:08 AM Lukasz Cwik <lc...@google.com> wrote:

> I am also seeing the delayed launch of the tests.
>
> On Wed, May 9, 2018 at 10:56 AM Pablo Estrada <pabl...@google.com> wrote:
>
>> Does this require a Seed Job rerun? I'm seeing Python postcommits
>> breaking as well
>>
>> On Wed, May 9, 2018 at 10:49 AM Alan Myrvold <amyrv...@google.com> wrote:
>>
>>> I saw ~1-2 hour latency for the trigger happening yesterday. Not sure
>>> what would cause it to stop or be slow.
>>>
>>> On Wed, May 9, 2018 at 10:45 AM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> It seems as though precommits are no longer triggering and trigger
>>>> requests like 'Run Java PreCommit' are no longer honored.
>>>>
>>>> On Wed, May 9, 2018 at 10:22 AM Andrew Pilloud <apill...@google.com>
>>>> wrote:
>>>>
>>>>> I broke all the post commits with this. Sorry! It has been reverted.
>>>>> I'm going to follow up with Apache Infra about getting the right
>>>>> credentials configured on the Jenkins plugin.
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Tue, May 8, 2018 at 1:38 PM Andrew Pilloud <apill...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Yep, mess with the groovy scripts appears to be the answer. We use
>>>>>> different jenkins libraries to handle PRs vs pushes to master but I 
>>>>>> think I
>>>>>> finally figured it out. Change is here:
>>>>>> https://github.com/apache/beam/pull/5305
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Mon, May 7, 2018 at 12:40 PM Kenneth Knowles <k...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think you want to mess with the groovy scripts in
>>>>>>> .test-infra/jenkins
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Mon, May 7, 2018 at 11:12 AM Andrew Pilloud <apill...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The Github branches page shows the status of the latest commit on
>>>>>>>> each branch and provides a set of links to the jobs run on that 
>>>>>>>> commit. But
>>>>>>>> it doesn't appear Jenkins is publishing status from post commit jobs. 
>>>>>>>> This
>>>>>>>> seems like a simple oversight that should be easy to fix. Could someone
>>>>>>>> point me in the right direction to fix this?
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>> --
>> Got feedback? go/pabloem-feedback
>> <https://goto.google.com/pabloem-feedback>
>>
>


Re: Jenkins Post Commit Status to Github

2018-05-09 Thread Andrew Pilloud
I broke all the post commits with this. Sorry! It has been reverted. I'm
going to follow up with Apache Infra about getting the right credentials
configured on the Jenkins plugin.

Andrew

On Tue, May 8, 2018 at 1:38 PM Andrew Pilloud <apill...@google.com> wrote:

> Yep, mess with the groovy scripts appears to be the answer. We use
> different jenkins libraries to handle PRs vs pushes to master but I think I
> finally figured it out. Change is here:
> https://github.com/apache/beam/pull/5305
>
> Andrew
>
> On Mon, May 7, 2018 at 12:40 PM Kenneth Knowles <k...@google.com> wrote:
>
>> I think you want to mess with the groovy scripts in .test-infra/jenkins
>>
>> Kenn
>>
>> On Mon, May 7, 2018 at 11:12 AM Andrew Pilloud <apill...@google.com>
>> wrote:
>>
>>> The Github branches page shows the status of the latest commit on each
>>> branch and provides a set of links to the jobs run on that commit. But it
>>> doesn't appear Jenkins is publishing status from post commit jobs. This
>>> seems like a simple oversight that should be easy to fix. Could someone
>>> point me in the right direction to fix this?
>>>
>>> Andrew
>>>
>>


Re: Jenkins Post Commit Status to Github

2018-05-08 Thread Andrew Pilloud
Yep, mess with the groovy scripts appears to be the answer. We use
different jenkins libraries to handle PRs vs pushes to master but I think I
finally figured it out. Change is here:
https://github.com/apache/beam/pull/5305

Andrew

On Mon, May 7, 2018 at 12:40 PM Kenneth Knowles <k...@google.com> wrote:

> I think you want to mess with the groovy scripts in .test-infra/jenkins
>
> Kenn
>
> On Mon, May 7, 2018 at 11:12 AM Andrew Pilloud <apill...@google.com>
> wrote:
>
>> The Github branches page shows the status of the latest commit on each
>> branch and provides a set of links to the jobs run on that commit. But it
>> doesn't appear Jenkins is publishing status from post commit jobs. This
>> seems like a simple oversight that should be easy to fix. Could someone
>> point me in the right direction to fix this?
>>
>> Andrew
>>
>


Merging our two SQL parser configs

2018-05-07 Thread Andrew Pilloud
So we have two incompatible SQL parser configs in Beam. One is in
BeamQueryPlanner

 which is used by default and a second in BeamSqlParser

which is used only in the BeamSqlCli execute path (not the explain path).
There are also a bunch of 'toLowerCase()' calls scattered around our code.
I'd like to get us on one parser config and remove the need for toLowerCase
calls.

To do this, I am proposing we standardize these all to go through
BeamQueryPlanner, use the Calcite Lex.JAVA config, and drop the
'toLowerCase()' calls. This will result in the parser preserving case,
being case sensitive, and using backticks for quoted identifiers. This is
the same as the default config in Apache Flink and is roughly compatible
with BigQuery. It effectively leaves the default path unchanged, except
case will now be preserved and checked consistently. The BeamSqlCli execute
path will remain unchanged for unquoted queries with all lower case names,
which is what we have tested. Comments? Objections?

Andrew
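To make the proposal concrete, here is a purely illustrative sketch (not actual Calcite or Beam code; the helper names are hypothetical) contrasting the two identifier policies discussed above: the current path folds unquoted identifiers to lower case, while a Lex.JAVA-style policy preserves case and treats backticks as identifier quotes.

```java
// Illustrative sketch only -- not actual Calcite or Beam code.
public class IdentifierPolicySketch {

  // Current path: identifiers are folded to lower case before lookup.
  static String oldPath(String identifier) {
    return identifier.toLowerCase();
  }

  // Proposed Lex.JAVA-style path: case is preserved; backticks delimit
  // quoted identifiers and are stripped, leaving the name untouched.
  static String javaLexPath(String identifier) {
    if (identifier.length() >= 2
        && identifier.startsWith("`")
        && identifier.endsWith("`")) {
      return identifier.substring(1, identifier.length() - 1);
    }
    return identifier;
  }

  public static void main(String[] args) {
    System.out.println(oldPath("MyTable"));        // mytable
    System.out.println(javaLexPath("MyTable"));    // MyTable
    System.out.println(javaLexPath("`My Table`")); // My Table
  }
}
```

Under the old policy a table created as "MyTable" can only be found as "mytable"; under the proposed one the name round-trips unchanged, which is why the scattered toLowerCase() calls become unnecessary.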


Jenkins Post Commit Status to Github

2018-05-07 Thread Andrew Pilloud
The Github branches page shows the status of the latest commit on each
branch and provides a set of links to the jobs run on that commit. But it
doesn't appear Jenkins is publishing status from post commit jobs. This
seems like a simple oversight that should be easy to fix. Could someone
point me in the right direction to fix this?

Andrew


Re: Graal instead of docker?

2018-05-05 Thread Andrew Pilloud
Thanks for the examples earlier, I think Hazelcast is a great example of
something portability might make more difficult. I'm not working on
portability, but my understanding is that the data sent to the runner is a
blob of code and the name of the container to run it in. A runner with a
native language (java on Hazelcast for example) could run the code directly
without the container if it is in a language it supports. So when Hazelcast
sees a known java container specified, it just loads the java blob and runs
it. When it sees another container it rejects the pipeline. You could use
Graal in the Hazelcast runner to do this for a number of languages. I would
expect that this could also be done in the direct runner, which similarly
provides a native java environment, so portable Java pipelines can be
tested without docker?

For another way to frame this: if Beam was originally written in Go, we
would be having a different discussion. A pipeline written entirely in java
wouldn't be possible, so instead to enable Hazelcast, we would have to be
able to run the java from portability without running the container.

Andrew
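As a hypothetical sketch of the dispatch idea above (none of these names or container strings come from Beam or Hazelcast; they are assumptions for illustration), a Java-native runner could inspect the environment declared for a pipeline and either run the payload in-process or reject it:

```java
// Hypothetical sketch -- the container names and dispatch logic are
// illustrative assumptions, not actual Beam portability code.
public class NativeRunnerDispatch {

  // A Java-native runner recognizes a known Java SDK container and runs
  // the code blob in-process; anything else is rejected (or, with Graal,
  // could be handed to a polyglot engine instead).
  static String dispatch(String environmentContainer) {
    if (environmentContainer.startsWith("apache/beam_java")) {
      return "run-in-process";
    }
    return "reject";
  }

  public static void main(String[] args) {
    System.out.println(dispatch("apache/beam_java8_sdk"));   // run-in-process
    System.out.println(dispatch("apache/beam_python3_sdk")); // reject
  }
}
```

The same check could let the direct runner execute portable Java pipelines without starting any container.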

On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía <ieme...@gmail.com>:
>
>> Graal would not be a viable solution for the reasons Henning and Andrew
>> mentioned, or put in other words, when users choose a programming language
>> they don’t choose only a ‘friendly’ syntax or programming model, they
>> choose also the ecosystem that comes with it, and the libraries that make
>> their life easier. However isolating these user libraries/dependencies is
>> a
>> hard problem and so far the standard solution to this problem is to use
>> operating systems containers via docker.
>>
>
> Graal solves that Ismael. Same kind of experience than running npm libs on
> nashorn but with a more unified API to run any language soft.
>
>
>>
>> The Beam vision from day zero is to run pipelines written in multiple
>> languages in runners in multiple systems, and so far we are not doing this
>> in particular in the Apache runners. The portability work is the cleanest
>> way to achieve this vision given the constraints.
>>
>
> Hmm, did I read it wrong and we don't have specific integration of the
> portable API in runners? This is what is messing up the runners and
> limiting beam adoption on existing runners.
> Portable API is a feature buildable on top of runner, not in runners.
> Same as a runner implementing the 5-6 primitives can run anything, the
> portable API should just rely on that and not require more integration.
> It doesn't prevent more deep integrations as for some higher level
> primitives existing in runners but it is not the case today for runners so
> shouldn't exist IMHO.
>
>
>>
>> I agree however that for the Java SDK to Java runner case this can
>> represent additional pain, docker ideally should not be a requirement for
>> Java users with the Direct runner and debugging a pipeline should be as
>> easy as it is today. I think the Univerrsal Local Runner exists to cover
>> the Portable case, but after looking at this JIRA I am not sure if
>> unification is coming (and by consequence if docker would be mandatory).
>> https://issues.apache.org/jira/browse/BEAM-4239
>>
>> I suppose for the distributed runners that they must implement the full
>> Portability APIs to be considered Beam multi language compliant but they
>> can prefer for performance reasons to translate without the portability
>> APIs the Java to Java case.
>>
>
>
> This is my issue, language portability must NOT impact runners at all, it
> is just a way to forward primitives to a runner.
> See it as a layer rewriting the pipeline and submitting it. No need to
> modify any runner.
>
>
>> On Sat, May 5, 2018 at 9:11 AM Reuven Lax <re...@google.com> wrote:
>>
>> > A beam cluster with the spark runner would include a spark cluster, plus
>> what's needed for portability, plus the beam sdk.
>>
>> > On Fri, May 4, 2018, 11:55 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>
>>
>>
>> >> On 5 May 2018 08:43, "Reuven Lax" <re...@google.com> wrote:
>>
>> >> I don't believe we enforce docker anywhere. In fact if someone wanted
>> to
>> run an all-windows beam cluster, they would probably not use docker for
>> their runner (docker runs on Windows, but not efficiently).
>>
>>
>>
>> >> Or doesn't run sometimes - a colleague hit that yesterday :(.
>>
>> >> What is a "beam cluster" - o

Re: Pubsub to Beam SQL

2018-05-04 Thread Andrew Pilloud
I don't think we should jump to adding an extension, but TBLPROPERTIES is
already a DDL extension and it isn't user friendly. We should strive for a
world where no one needs to use it. SQL needs the timestamp to be exposed
as a column, we can't hide it without changing the definition of GROUP BY.
I like Anton's proposal of adding it as an annotation in the column
definition. That seems even simpler and more user friendly. We might even
be able to get away with using the PRIMARY KEY keyword.

Andrew

On Fri, May 4, 2018 at 12:11 PM Anton Kedin <ke...@google.com> wrote:

> There are few aspects of the event timestamp definition in SQL, which we
> are talking about here:
>
> - configuring the source. E.g. for PubsubIO you can choose whether to
>   extract event timestamp from the message attributes or the message publish
>   time:
>   - this is source-specific and cannot be part of the common DDL;
>   - TBLPROPERTIES, on the other hand, is an opaque json blob which
>     exists specifically for source configuration;
>   - as Kenn is saying, some sources might not even have such
>     configuration;
>   - at processing time, event timestamp is available in
>     ProcessContext.timestamp() regardless of the specifics of the source
>     configuration, so it can be extracted the same way for all sources, as
>     Raghu said;
> - designating one of the table columns as an event timestamp:
>   - query needs to be able to reference the event timestamp so we
>     have to declare which column to populate with the event timestamp;
>   - this is common for all sources and we can create a special
>     syntax, e.g. "columnName EVENT_TIMESTAMP". It must not contain
>     source-specific configuration at this point, in my opinion;
>   - when SQL knows which column is supposed to be the timestamp, then
>     it can get it from the ProcessContext.timestamp() and put it into the
>     designated field the same way regardless of the source configuration;
> - pubsub-specific message formatting:
>   - on top of the above we want to be able to expose pubsub message
>     attributes, payload, and timestamp to the user queries, and do it without
>     magic or user schema modifications. To do this we can enforce some
>     pubsub-specific schema limitations, e.g. by exposing attributes and
>     timestamp fields at a top-level schema, with payload going into the
>     second level in its own field;
>   - this aspect is not fully implementable until we have support for
>     complex types. Until then we cannot map full JSON to the payload field;
>
> I will update the doc and the implementation to reflect these comments
> where possible.
>
> Thank you,
> Anton
>
>
> On Fri, May 4, 2018 at 9:48 AM Raghu Angadi <rang...@google.com> wrote:
>
>> On Thu, May 3, 2018 at 12:47 PM Anton Kedin <ke...@google.com> wrote:
>>
>>> I think it makes sense for the case when timestamp is provided in the
>>> payload (including pubsub message attributes).  We can mark the field as an
>>> event timestamp. But if the timestamp is internally defined by the source
>>> (pubsub message publish time) and not exposed in the event body, then we
>>> need a source-specific mechanism to extract and map the event timestamp to
>>> the schema. This is, of course, if we don't automatically add a magic
>>> timestamp field which Beam SQL can populate behind the scenes and add to
>>> the schema. I want to avoid this magic path for now.
>>>
>>
>> Commented on the PR. As Kenn mentioned, every element in Beam has an
>> event timestamp, there is no requirement to extract the timestamp by the
>> SQL transform. Using the element timestamp takes care of Pubsub publish
>> timestamp as well (in fact, this is the default when timestamp attribute is
>> not specified in PubsubIO).
>>
>> How timestamp are customized is specific to each source. That way custom
>> timestamp option seem like they belong in TBLPROPERTIES. E.g. for KafkaIO,
>> it could specify "logAppendTime", "createTime", or "processingTime" etc
>> (though I am not sure how user can provide their own custom extractor in
>> Beam SQL, may be it could support a timestamp field in json records).
>>
>> Raghu.
>>
>>>
>>> On Thu, May 3, 2018 at 11:10 AM Andrew Pilloud <apill...@google.com>
>>> wrote:
>>>
>>>> This sounds awesome!
>>>>
>>>> Is event timestamp something that we need to specify for every source?
>>>> If so, I would suggest we add this as a first class option on CREATE

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-04 Thread Andrew Pilloud
Spanner is also broken, and post commits are failing. I've added the issue
as a blocker. https://issues.apache.org/jira/browse/BEAM-4229

Andrew

On Fri, May 4, 2018 at 1:24 PM Charles Chen  wrote:

> I have added https://issues.apache.org/jira/browse/BEAM-4236 as a blocker.
>
> On Fri, May 4, 2018 at 1:19 PM Ahmet Altay  wrote:
>
>> Hi JB,
>>
>> We found an issue related to using side inputs in streaming mode using
>> python SDK. Charles is currently trying to find the root cause. Would you
>> be able to give him some additional time to investigate the issue?
>>
>> Charles, do you have a JIRA issue on the blocker list?
>>
>> Thank you everyone for understanding.
>>
>> Ahmet
>>
>> On Fri, May 4, 2018 at 8:52 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi
>>>
>>> I have couple of PRs I would like to include. I would like also to take
>>> the weekend for new builds and tests.
>>>
>>> If it works for everyone I propose to start the release process Tuesday.
>>>
>>> Thoughts ?
>>>
>>> Regards
>>> JB
>>> On 4 May 2018, at 17:49, Scott Wegner  wrote:

 Hi JB, any idea when you will begin the release? Boyuan has a couple
 Python PRs [1] [2] that are ready to merge, but we'd like to wait until
 after the release branch is cut in case there is some performance
 regression.

 [1] https://github.com/apache/beam/pull/4741
 [2] https://github.com/apache/beam/pull/4925

 On Tue, May 1, 2018 at 9:25 AM Scott Wegner  wrote:

> Sounds good, thanks J.B. Feel free to ping if you need anything.
>
> On Mon, Apr 30, 2018 at 10:12 PM Jean-Baptiste Onofré 
> wrote:
>
>> That's a good idea ! I think using Slack to ping/ask is a good way as
>> it's async.
>>
>> Regards
>> JB
>>
>> On 05/01/2018 06:51 AM, Reuven Lax wrote:
>> > I think it makes sense to have someone who hadn't done the Gradle
>> migration to
>> > run the release. However would it make sense for someone who did
>> work on the
>> > migration to partner with you JB? There may be issues that are
>> simply due to
>> > things that were not documented well. In that case the partner can
>> quickly help
>> > resolve, and can then be the one who makes sure that the
>> documentation is updated.
>> >
>> > Reuven
>> >
>> > On Mon, Apr 30, 2018 at 9:36 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >
>> > Hi Scott,
>> >
>> > Thanks for the update. The Gradle build crashed on my machine
>> (not related to
>> > Gradle). I launched a new one.
>> >
>> > I'm volunteer to cut the release: I think I know Gradle
>> decently, and even if I
>> > didn't work on the gradle "migration" during the last two
>> weeks, I think it's
>> > actually better: I have an "external" view on the latest
>> changes.
>> >
>> > Thoughts ?
>> >
>> > Regards
>> > JB
>> >
>> > On 05/01/2018 02:05 AM, Scott Wegner wrote:
>> > > Welcome back JB!
>> > >
>> > > I just sent a separate update about Gradle [1]-- the build
>> migration is
>> > complete
>> > > and the release documentation has been updated.
>> > >
>> > > I recommend we produce the 2.5.0 release using Gradle. Having
>> a successful
>> > > release should be the final validation before declaring the
>> Gradle migration
>> > > complete. So the sooner we can have a Gradle release, the
>> sooner we can
>> > get back
>> > > to a single build system :)
>> > >
>> > > If it would be helpful, I suggest that somebody who's been
>> working on the
>> > Gradle
>> > > migration to manage the 2.5.0 release. That way if we
>> encounter any issues
>> > from
>> > > the build system, they should have sufficient expertise to
>> fix it.
>> > >
>> > >
>> > [1]
>> https://lists.apache.org/thread.html/e543b3850bfc4950d57bc18624e1d4131324c6cf691fd10034947cad@%3Cdev.beam.apache.org%3E
>>
>> > >
>> > > On Mon, Apr 30, 2018 at 11:38 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>> > >
>> > >
>> > >
>> > > On 30 Apr 2018 19:39, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>> > >
>> > > Hi guys,
>> > >
>> > > now that I'm back from vacations, I bring back 2.5.0
>> release on
>> > the 

Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

2018-05-04 Thread Andrew Pilloud
Reviews are wrapping up, this will probably merge Monday if I don't hear
from anyone else. One more TableProvider API change after review feedback:
getTables now returns Map<String, Table> instead of Set.

Andrew
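A minimal sketch of what the merged interface could look like, with an in-memory implementation. The operation names (createTable, dropTable, getTables, buildBeamSqlTable) follow this thread, but the exact signatures are assumptions, and Table/BeamSqlTable here are stand-ins for the real Beam classes:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in classes -- the real Beam types carry much more information.
class Table {
  final String name;
  Table(String name) { this.name = name; }
}

class BeamSqlTable {}

// The operations shared by MetaStore and TableProvider. Per the proposal,
// buildBeamSqlTable takes the full Table object in both cases, and
// getTables returns Map<String, Table> rather than a Set.
interface TableProvider {
  void createTable(Table table);
  void dropTable(String tableName);
  Map<String, Table> getTables();
  BeamSqlTable buildBeamSqlTable(Table table);
}

// MetaStore simply inherits the TableProvider interface.
interface MetaStore extends TableProvider {}

class InMemoryMetaStore implements MetaStore {
  private final Map<String, Table> tables = new HashMap<>();

  public void createTable(Table table) { tables.put(table.name, table); }
  public void dropTable(String tableName) { tables.remove(tableName); }
  public Map<String, Table> getTables() { return new HashMap<>(tables); }
  public BeamSqlTable buildBeamSqlTable(Table table) { return new BeamSqlTable(); }
}

public class MetaStoreSketch {
  public static void main(String[] args) {
    MetaStore store = new InMemoryMetaStore();
    store.createTable(new Table("orders"));
    System.out.println(store.getTables().containsKey("orders")); // true
    store.dropTable("orders");
    System.out.println(store.getTables().isEmpty()); // true
  }
}
```

Since the CRUD operations are identical between the two interfaces, inheritance like this leaves existing custom MetaStore implementations untouched; only buildBeamSqlTable callers change from passing a name to passing the Table object they already hold.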

On Thu, May 3, 2018 at 10:41 AM Andrew Pilloud <apill...@google.com> wrote:

> Ok, I've finished with this change. Didn't get reviews on the early
> cleanup PRs, so I've pushed all these changes into the first cleanup PR:
> https://github.com/apache/beam/pull/5224
>
> Andrew
>
> On Tue, May 1, 2018 at 10:35 AM Andrew Pilloud <apill...@google.com>
> wrote:
>
>> I'm just starting to move forward on this. Looking at my team's short
>> term needs for SQL, option one would be good enough, however I agree with
>> Kenn that we want something like option two eventually. I also don't want
>> to break existing users and it sounds like there is at least one custom
>> MetaStore not in beam. So my plan is to go with option two and simplify the
>> interface where functionality loss will not result.
>>
>> There is a common set of operations between the MetaStore and the
>> TableProvider. I'd like to make MetaStore inherit the interface of
>> TableProvider. Most operations we need (createTable, dropTable, listTables)
>> are already identical between the two, and so this will have no impact on
>> custom implementations. The buildBeamSqlTable operation does differ: the
>> MetaStore takes a table name, the TableProvider takes a table object.
>> However everything calling this API already has the full table object, so I
>> would like to simplify this interface by passing the table object in both
>> cases. Objections?
>>
>> Andrew
>>
>> On Tue, Apr 24, 2018 at 9:27 AM James <xumingmi...@gmail.com> wrote:
>>
>>> Kenn: yes, MetaStore is user-facing, Users can choose to implement their
>>> own MetaStore, currently only an InMemory implementation in Beam CodeBase.
>>>
>>> Andrew: I like the second option, since it "retains the ability for DDL
>>> operations to be processed by a custom MetaStore.", IMO we should have the
>>> DDL ability as a fully functional SQL.
>>>
>>> On Tue, Apr 24, 2018 at 10:28 PM Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> Can you say more about how the metastore is used? I presume it is or
>>>> will be user-facing, so are Beam SQL users already providing their own?
>>>>
>>>> I'm sure we want something like that eventually to support things like
>>>> Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
>>>> Beam SQL to create a derived data set. But I don't think we should build
>>>> out those code paths until we have at least one non-in-memory
>>>> implementation.
>>>>
>>>> Just a really high level $0.02.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Apr 23, 2018 at 4:56 PM Andrew Pilloud <apill...@google.com>
>>>> wrote:
>>>>
>>>>> I'm working on updating our Beam DDL code to use the DDL execution
>>>>> functionality that recently merged into core calcite. This enables us to
>>>>> take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
>>>>> need to reconcile the Beam SQL Environments with the Calcite Schema (which
>>>>> is calcite's environment). We currently have copies of our tables in the
>>>>> Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have 
>>>>> a
>>>>> pending PR which merges the latter two to just use the Calcite Schema copy.
>>>>> Merging the Beam MetaStore and Calcite Schema isn't as simple. I have
>>>>> two options I'm looking for feedback on:
>>>>>
>>>>> 1. Make Calcite Schema authoritative and demote MetaStore to be
>>>>> something more like a Calcite TableFactory. Calcite Schema already
>>>>> implements the semantics of our InMemoryMetaStore. If the Store interface
>>>>> is just over built, this approach would result in a significant reduction
>>>>> in code. This would however eliminate the CRUD part of the interface
>>>>> leaving just the buildBeamSqlTable function.
>>>>>
>>>>> 2. Pass the Beam MetaStore into Calcite wrapped with a class
>>>>> translating to Calcite Schema (like we do already with tables). Instead of
>>>>> copying tables into the Calcite Schema we would pass in Beam meta/store as
>>>>> the source of truth and Calcite would manipulate tables directly in the
>>>>> Beam meta/store. This is a bit more complicated but retains the ability 
>>>>> for
>>>>> DDL operations to be processed by a custom MetaStore.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Andrew
>>>>>
>>>>


Re: Pubsub to Beam SQL

2018-05-03 Thread Andrew Pilloud
I like to avoid magic too. I might not have been entirely clear in what I
was asking. Here is an example of what I had in mind, replacing the
TBLPROPERTIES
with a more generic TIMESTAMP option:

CREATE TABLE  table_name (
  publishTimestamp TIMESTAMP,
  attributes MAP(VARCHAR, VARCHAR),
  payload ROW (
   name VARCHAR,
   age INTEGER,
   isSWE BOOLEAN,
   tags ARRAY(VARCHAR)))
TIMESTAMP attributes["createTime"];

Andrew

On Thu, May 3, 2018 at 12:47 PM Anton Kedin <ke...@google.com> wrote:

> I think it makes sense for the case when timestamp is provided in the
> payload (including pubsub message attributes).  We can mark the field as an
> event timestamp. But if the timestamp is internally defined by the source
> (pubsub message publish time) and not exposed in the event body, then we
> need a source-specific mechanism to extract and map the event timestamp to
> the schema. This is, of course, if we don't automatically add a magic
> timestamp field which Beam SQL can populate behind the scenes and add to
> the schema. I want to avoid this magic path for now.
>
> On Thu, May 3, 2018 at 11:10 AM Andrew Pilloud <apill...@google.com>
> wrote:
>
>> This sounds awesome!
>>
>> Is event timestamp something that we need to specify for every source? If
>> so, I would suggest we add this as a first class option on CREATE TABLE
>> rather than something hidden in TBLPROPERTIES.
>>
>> Andrew
>>
>> On Wed, May 2, 2018 at 10:30 AM Anton Kedin <ke...@google.com> wrote:
>>
>>> Hi
>>>
>>> I am working on adding functionality to support querying Pubsub messages
>>> directly from Beam SQL.
>>>
>>> *Goal*
>>>   Provide Beam users a pure  SQL solution to create the pipelines with
>>> Pubsub as a data source, without the need to set up the pipelines in
>>> Java before applying the query.
>>>
>>> *High level approach*
>>>
>>>- Build on top of PubsubIO;
>>>- Pubsub source will be declared using CREATE TABLE DDL statement:
>>>   - Beam SQL already supports declaring sources like Kafka and Text
>>>   using CREATE TABLE DDL;
>>>   - it supports additional configuration using TBLPROPERTIES
>>>   clause. Currently it takes a text blob, where we can put a JSON
>>>   configuration;
>>>   - wrapping PubsubIO into a similar source looks feasible;
>>>- The plan is to initially support messages only with JSON payload:
>>>   - more payload formats can be added later;
>>>- Messages will be fully described in the CREATE TABLE statements:
>>>   - event timestamps. Source of the timestamp is configurable. It
>>>   is required by Beam SQL to have an explicit timestamp column for 
>>> windowing
>>>   support;
>>>   - messages attributes map;
>>>   - JSON payload schema;
>>>- Event timestamps will be taken either from publish time or
>>>user-specified message attribute (configurable);
>>>
>>> Thoughts, ideas, comments?
>>>
>>> More details are in the doc here:
>>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>>
>>>
>>> Thank you,
>>> Anton
>>>
>>


Re: Google Summer of Code Project Intro

2018-05-03 Thread Andrew Pilloud
Hi Kai,

Glad to hear someone is putting more work into benchmarking Beam SQL! It
would be really cool if we had some of these running as nightly performance
test jobs so we would know when there is a performance regression. This
might be out of scope of your project, but keep it in mind.

I am working on SQL and ported some of the Nexmark benchmarks there. Feel
free to email me questions. I can also poke Kenn for you whenever he's not
responsive.

Andrew

On Thu, May 3, 2018 at 4:43 AM Kai Jiang  wrote:

> Hi Beam Dev,
>
> I am Kai. GSoC has announced selected projects last week. During community
> bonding period, I want to share some basics about this year's project with
> Apache Beam.
>
> Project abstract:
> https://summerofcode.withgoogle.com/projects/#6460770829729792
> Issue Tracker: BEAM-3783 
>
> This project will be mentored by Kenneth Knowles. Many thanks to Kenn's
> mentorship in next three months. Also, Welcome any ideas and comments from
> you!
>
> The project will mainly focus on implementing a TPC-DS benchmark on Beam
> SQL. We've seen many works have been tested on Spark, Hive and Pig, etc.
> It's interesting to see what happened if it builds onto Beam SQL.
> Presumably, the benchmark will test against on different runners (like,
> spark or flink). Based on the benchmark, a performance report will be
> generated eventually.
>
> Proposal doc is here:(more details will be updated)
>
> https://docs.google.com/document/d/15oYd_jFVbkiSPGT8-XnSh7Q-R3CtZwHaizyQfmrShfo/edit?usp=sharing
>
> Once coding period starts on May 14, I will keep updating the status and
> progress of this project.
>
> Best,
> Kai
>


Re: Pubsub to Beam SQL

2018-05-03 Thread Andrew Pilloud
This sounds awesome!

Is event timestamp something that we need to specify for every source? If
so, I would suggest we add this as a first class option on CREATE TABLE
rather than something hidden in TBLPROPERTIES.

Andrew

On Wed, May 2, 2018 at 10:30 AM Anton Kedin  wrote:

> Hi
>
> I am working on adding functionality to support querying Pubsub messages
> directly from Beam SQL.
>
> *Goal*
>   Provide Beam users a pure  SQL solution to create the pipelines with
> Pubsub as a data source, without the need to set up the pipelines in Java
> before applying the query.
>
> *High level approach*
>
>- Build on top of PubsubIO;
>- Pubsub source will be declared using CREATE TABLE DDL statement:
>   - Beam SQL already supports declaring sources like Kafka and Text
>   using CREATE TABLE DDL;
>   - it supports additional configuration using TBLPROPERTIES clause.
>   Currently it takes a text blob, where we can put a JSON configuration;
>   - wrapping PubsubIO into a similar source looks feasible;
>- The plan is to initially support messages only with JSON payload:
>   - more payload formats can be added later;
>- Messages will be fully described in the CREATE TABLE statements:
>   - event timestamps. Source of the timestamp is configurable. It is
>   required by Beam SQL to have an explicit timestamp column for windowing
>   support;
>   - messages attributes map;
>   - JSON payload schema;
>- Event timestamps will be taken either from publish time or
>user-specified message attribute (configurable);
>
> Thoughts, ideas, comments?
>
> More details are in the doc here:
> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>
>
> Thank you,
> Anton
>


Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

2018-05-03 Thread Andrew Pilloud
Ok, I've finished with this change. Didn't get reviews on the early cleanup
PRs, so I've pushed all these changes into the first cleanup PR:
https://github.com/apache/beam/pull/5224

Andrew

On Tue, May 1, 2018 at 10:35 AM Andrew Pilloud <apill...@google.com> wrote:

> I'm just starting to move forward on this. Looking at my team's short term
> needs for SQL, option one would be good enough, however I agree with Kenn
> that we want something like option two eventually. I also don't want to
> break existing users and it sounds like there is at least one custom
> MetaStore not in beam. So my plan is to go with option two and simplify the
> interface where functionality loss will not result.
>
> There is a common set of operations between the MetaStore and the
> TableProvider. I'd like to make MetaStore inherit the interface of
> TableProvider. Most operations we need (createTable, dropTable, listTables)
> are already identical between the two, and so this will have no impact on
> custom implementations. The buildBeamSqlTable operation does differ: the
> MetaStore takes a table name, the TableProvider takes a table object.
> However everything calling this API already has the full table object, so I
> would like to simplify this interface by passing the table object in both
> cases. Objections?
>
> Andrew
>
> On Tue, Apr 24, 2018 at 9:27 AM James <xumingmi...@gmail.com> wrote:
>
>> Kenn: yes, MetaStore is user-facing, Users can choose to implement their
>> own MetaStore, currently only an InMemory implementation in Beam CodeBase.
>>
>> Andrew: I like the second option, since it "retains the ability for DDL
>> operations to be processed by a custom MetaStore.", IMO we should have the
>> DDL ability as a fully functional SQL.
>>
>> On Tue, Apr 24, 2018 at 10:28 PM Kenneth Knowles <k...@google.com> wrote:
>>
>>> Can you say more about how the metastore is used? I presume it is or
>>> will be user-facing, so are Beam SQL users already providing their own?
>>>
>>> I'm sure we want something like that eventually to support things like
>>> Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
>>> Beam SQL to create a derived data set. But I don't think we should build
>>> out those code paths until we have at least one non-in-memory
>>> implementation.
>>>
>>> Just a really high level $0.02.
>>>
>>> Kenn
>>>
>>> On Mon, Apr 23, 2018 at 4:56 PM Andrew Pilloud <apill...@google.com>
>>> wrote:
>>>
>>>> I'm working on updating our Beam DDL code to use the DDL execution
>>>> functionality that recently merged into core calcite. This enables us to
>>>> take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
>>>> need to reconcile the Beam SQL Environments with the Calcite Schema (which
>>>> is calcite's environment). We currently have copies of our tables in the
>>>> Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
>>>> pending PR which merges the latter two to just use the Calcite Schema copy.
>>>> Merging the Beam MetaStore and Calcite Schema isn't as simple. I have
>>>> two options I'm looking for feedback on:
>>>>
>>>> 1. Make Calcite Schema authoritative and demote MetaStore to be
>>>> something more like a Calcite TableFactory. Calcite Schema already
>>>> implements the semantics of our InMemoryMetaStore. If the Store interface
>>>> is just over built, this approach would result in a significant reduction
>>>> in code. This would however eliminate the CRUD part of the interface
>>>> leaving just the buildBeamSqlTable function.
>>>>
>>>> 2. Pass the Beam MetaStore into Calcite wrapped with a class
>>>> translating to Calcite Schema (like we do already with tables). Instead of
>>>> copying tables into the Calcite Schema we would pass in Beam meta/store as
>>>> the source of truth and Calcite would manipulate tables directly in the
>>>> Beam meta/store. This is a bit more complicated but retains the ability for
>>>> DDL operations to be processed by a custom MetaStore.
>>>>
>>>> Thoughts?
>>>>
>>>> Andrew
>>>>
>>>


Re: Jenkins: can a job execute concurrently on multiple nodes?

2018-05-02 Thread Andrew Pilloud
These jobs also require Dataflow which has various quotas on resource
usage. I hit these while working on the Dataflow Nexmark tests for SQL. I'm
not sure what the quota is on the account that Jenkins uses, but the
default quota will max out at around 2 concurrent jobs.

Andrew

On Wed, May 2, 2018 at 1:16 PM Scott Wegner  wrote:

> While working on tuning our Gradle build for Jenkins, I noticed that our
> Jenkins jobs often get queued up, even though we have low utilization of
> our Jenkins executor pool [2]. For example, right now I see 4 instances of
> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle queued [1], while the
> 'beam' label currently has 29 out of 32 executors available [2].
>
> Is it possible to configure our Jenkins jobs to execute concurrently on
> separate nodes? I'm not sure whether it's safe for the same job to execute
> concurrently on the same node, for example if runner resources are shared.
> But I believe executing on separate nodes should be safe.
>
> Who would know more about this or have access to the Jenkins config if we
> did want to make changes?
>
> [1]
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/
> [2]
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/
>
>


Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

2018-05-01 Thread Andrew Pilloud
I'm just starting to move forward on this. Looking at my team's short term
needs for SQL, option one would be good enough, however I agree with Kenn
that we want something like option two eventually. I also don't want to
break existing users and it sounds like there is at least one custom
MetaStore not in beam. So my plan is to go with option two and simplify the
interface where functionality loss will not result.

There is a common set of operations between the MetaStore and the
TableProvider. I'd like to make MetaStore inherit the interface of
TableProvider. Most operations we need (createTable, dropTable, listTables)
are already identical between the two, and so this will have no impact on
custom implementations. The buildBeamSqlTable operation does differ: the
MetaStore takes a table name, the TableProvider takes a table object.
However everything calling this API already has the full table object, so I
would like to simplify this interface by passing the table object in both
cases. Objections?

Andrew

On Tue, Apr 24, 2018 at 9:27 AM James <xumingmi...@gmail.com> wrote:

> Kenn: yes, MetaStore is user-facing, Users can choose to implement their
> own MetaStore, currently only an InMemory implementation in Beam CodeBase.
>
> Andrew: I like the second option, since it "retains the ability for DDL
> operations to be processed by a custom MetaStore.", IMO we should have the
> DDL ability as a fully functional SQL.
>
> On Tue, Apr 24, 2018 at 10:28 PM Kenneth Knowles <k...@google.com> wrote:
>
>> Can you say more about how the metastore is used? I presume it is or will
>> be user-facing, so are Beam SQL users already providing their own?
>>
>> I'm sure we want something like that eventually to support things like
>> Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
>> Beam SQL to create a derived data set. But I don't think we should build
>> out those code paths until we have at least one non-in-memory
>> implementation.
>>
>> Just a really high level $0.02.
>>
>> Kenn
>>
>> On Mon, Apr 23, 2018 at 4:56 PM Andrew Pilloud <apill...@google.com>
>> wrote:
>>
>>> I'm working on updating our Beam DDL code to use the DDL execution
>>> functionality that recently merged into core Calcite. This enables us to
>>> take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
>>> need to reconcile the Beam SQL Environments with the Calcite Schema (which
>>> is Calcite's environment). We currently have copies of our tables in the
>>> Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
>>> pending PR which merges the latter two to just use the Calcite Schema copy.
>>> Merging the Beam MetaStore and Calcite Schema isn't as simple. I have
>>> two options I'm looking for feedback on:
>>>
>>> 1. Make Calcite Schema authoritative and demote MetaStore to be
>>> something more like a Calcite TableFactory. Calcite Schema already
>>> implements the semantics of our InMemoryMetaStore. If the Store interface
>>> is just overbuilt, this approach would result in a significant reduction
>>> in code. This would however eliminate the CRUD part of the interface
>>> leaving just the buildBeamSqlTable function.
>>>
>>> 2. Pass the Beam MetaStore into Calcite wrapped with a class translating
>>> to Calcite Schema (like we do already with tables). Instead of copying
>>> tables into the Calcite Schema we would pass in Beam meta/store as the
>>> source of truth and Calcite would manipulate tables directly in the Beam
>>> meta/store. This is a bit more complicated but retains the ability for DDL
>>> operations to be processed by a custom MetaStore.
>>>
>>> Thoughts?
>>>
>>> Andrew
>>>
>>


Re: Merge options in Github UI are confusing

2018-04-24 Thread Andrew Pilloud
Thanks for the feedback. Sounds like there are a few takeaways from the
discussion:

1. Not everyone is a git power user; squash and merge is commonly used and
can't be disabled.
2. As a non-committer I should expect my commits to be squashed, rewritten,
or otherwise changed at the discretion of the committer.
3. I should be breaking up separate commits into separate PRs.
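
For anyone unfamiliar with the distinction under discussion, the difference
between the two merge modes can be demonstrated in a scratch repository (a
sketch: the paths and branch names here are illustrative, and the squash is
emulated with a single new commit, which is effectively what GitHub's
"Squash and merge" produces):

```shell
# Scratch-repo demo: merge commit vs. squash merge.
set -e
rm -rf /tmp/merge-demo
git init -q /tmp/merge-demo && cd /tmp/merge-demo
git config user.email demo@example.com && git config user.name demo
git commit -q --allow-empty -m "initial"
git checkout -q -b feature
git commit -q --allow-empty -m "feature work"
git commit -q --allow-empty -m "fixup from review"
git checkout -q -                          # back to the base branch
# "Create a merge commit": both PR commits survive, plus a merge commit.
git merge -q --no-ff -m "Merge branch 'feature'" feature
git rev-list --count HEAD                  # 4 = initial + 2 PR commits + merge
git rev-list --count --merges HEAD         # 1
# "Squash and merge" instead collapses the PR into one brand-new commit,
# so any other PR based on the original feature commits no longer applies:
git checkout -q -b squashed HEAD^1         # base branch as it was pre-merge
git commit -q --allow-empty -m "feature, squashed into one commit"
git rev-list --count HEAD                  # 2 = initial + squashed commit
```

This is exactly why the squash merge in the PR above broke the two PRs
stacked on top of it: their parent commits no longer existed on the base
branch.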

Andrew

On Wed, Apr 18, 2018 at 12:00 PM Kenneth Knowles <k...@google.com> wrote:

> One thing that is available is the "allow maintainers to edit" which can
> let any committer push to the PR and do all of this. It is either on by
> default, or I have simply checked it in the past, since I have recently had
> a maintainer push to my PR on accident :-)
>
> On Tue, Apr 17, 2018 at 5:57 PM Ahmet Altay <al...@google.com> wrote:
>
>> I agree with Robert. In this case one size does not fit all. There are
>> times, another round trip with a contributor would be frustrating to the
>> author. Especially for new contributors. Having the option to squash and
>> merge is useful in those cases. (For reference in the past we even
>> helped new contributors by doing small fixes at merge time.)
>>
>> On Tue, Apr 17, 2018 at 2:28 PM, Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> I think the two options are useful, because we have different kinds of
>>> contributors. Sophisticated users curate their own history, create
>>> logically useful commits, build atop it, etc. and merge is by far the
>>> better option. Others have a single commit followed by any number of
>>> "lint," "fixup," and "reviewer comments" ones that should clearly be
>>> squashed, and given that it takes a round trip to ask them to squash it,
>>> it's nice to be able to do it once there's an LGTM as part of the merge.
>>> At least making this fact explicit and pointing it out in the docs may
>>> be useful.
>>> On Tue, Apr 17, 2018 at 1:43 PM Mingmin Xu <mingm...@gmail.com> wrote:
>>>
>>> > Not strongly against `Create a merge commit`, but I use `squash and
>>> > merge` by default. I understand the potential impact mentioned by
>>> > Andrew; it's still a better option IMO:
>>> > 1. If a PR contains several parts, it can be documented in the commit
>>> > message instead of several commits; if it's a big task, let's split it
>>> > into several PRs if possible.
>>> > 2. When several PRs are changing the same file, I would ask the
>>> > contributor to fix it.
>>> > 3. Most commits are introduced by the reviewer's asks, so it's not
>>> > necessary to do another squash (by contributors) before merge.
>>>
>>> > On Tue, Apr 17, 2018 at 1:09 PM, Robert Burke <rob...@frantil.com>
>>> wrote:
>>>
>>> >> +1 Having made a few web commits and been frustrated by the options,
>>> anything to standardize on a single option seems good to me.
>>>
>>> >> On Tue, 17 Apr 2018 at 01:49 Etienne Chauchot <echauc...@apache.org>
>>> wrote:
>>>
>>> >>> +1 to enforce the behavior recommended in the committer guide. I
>>> usually ask the author to manually squash before committing.
>>>
>>> >>> Etienne
>>>
>>> >>> Le lundi 16 avril 2018 à 22:19 +, Robert Bradshaw a écrit :
>>>
>>> >>> +1, though I'll admit I've been an occasional user of the "squash and
>>> merge" button when a small PR has a huge number of small, fixup changes
>>> piled on it.
>>>
>>> >>> On Mon, Apr 16, 2018 at 3:07 PM Kenneth Knowles <k...@google.com>
>>> wrote:
>>>
>>> >>> It is no secret that I agree with this. When you don't rewrite
>>> history,
>>> distributed git "just works". I didn't realize we could mechanically
>>> enforce it.
>>>
>>> >>> Kenn
>>>
>>> >>> On Mon, Apr 16, 2018 at 2:55 PM Andrew Pilloud <apill...@google.com>
>>> wrote:
>>>
>>> >>> The Github UI provides several options for merging a PR hidden
>>> >>> behind the “Merge pull request” button. Only the “Create a merge
>>> >>> commit” option does what most users expect, which is to merge by
>>> >>> creating a new merge commit. This is the option recommended in the
>>> >>> Beam committer’s guide, but it is not necessarily the default
>>> >>> behavior of the merge button.
>>> >>>
>>> >>> A small cleanup PR I made was recently merged via the merge button
>>> >>> which generated a squash merge instead of a merge commit, breaking
>>> >>> two other PRs which were based on it. See
>>> >>> https://github.com/apache/beam/pull/4991
>>> >>>
>>> >>> I would propose that we disable the options for both rebase and
>>> >>> squash merging via the Github UI. This will make the behavior of the
>>> >>> merge button unambiguous and consistent with our documentation, but
>>> >>> will not prevent a committer from performing these operations from
>>> >>> the git cli if they desire.
>>>
>>>
>>> >>> Andrew
>>>
>>>
>>>
>>>
>>>
>>>
>>> > --
>>> > 
>>> > Mingmin
>>>
>>
>>


[SQL] Reconciling Beam SQL Environments with Calcite Schema

2018-04-23 Thread Andrew Pilloud
I'm working on updating our Beam DDL code to use the DDL execution
functionality that recently merged into core Calcite. This enables us to
take advantage of Calcite JDBC as a way to use Beam SQL. As part of that I
need to reconcile the Beam SQL Environments with the Calcite Schema (which
is Calcite's environment). We currently have copies of our tables in the
Beam meta/store, Calcite Schema, BeamSqlEnv, and BeamQueryPlanner. I have a
pending PR which merges the latter two to just use the Calcite Schema copy.
Merging the Beam MetaStore and Calcite Schema isn't as simple. I have two
options I'm looking for feedback on:

1. Make Calcite Schema authoritative and demote MetaStore to be something
more like a Calcite TableFactory. Calcite Schema already implements the
semantics of our InMemoryMetaStore. If the Store interface is just
overbuilt, this approach would result in a significant reduction in code. This
would however eliminate the CRUD part of the interface leaving just the
buildBeamSqlTable function.

2. Pass the Beam MetaStore into Calcite wrapped with a class translating to
Calcite Schema (like we do already with tables). Instead of copying tables
into the Calcite Schema we would pass in Beam meta/store as the source of
truth and Calcite would manipulate tables directly in the Beam meta/store.
This is a bit more complicated but retains the ability for DDL operations
to be processed by a custom MetaStore.
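
Option 2 can be sketched like this, with simplified stand-in types (an
assumption: the real Calcite Schema interface is much richer and Beam's
actual MetaStore API differs; this only illustrates the wrapping idea):

```java
// Sketch: wrap a Beam MetaStore so the Calcite-facing schema manipulates
// tables directly in the store, which remains the source of truth.
import java.util.Set;
import java.util.TreeSet;

public class Main {
    // Stand-in for what Calcite sees.
    interface Schema {
        Set<String> getTableNames();
        void add(String name);    // DDL: CREATE TABLE lands here
        void remove(String name); // DDL: DROP TABLE lands here
    }

    // Stand-in for Beam's MetaStore (possibly a custom implementation).
    interface MetaStore {
        void createTable(String name);
        void dropTable(String name);
        Set<String> listTables();
    }

    // The translating wrapper: no table copies, every operation delegates.
    static class MetaStoreSchema implements Schema {
        private final MetaStore store;
        MetaStoreSchema(MetaStore store) { this.store = store; }
        public Set<String> getTableNames() { return store.listTables(); }
        public void add(String name) { store.createTable(name); }
        public void remove(String name) { store.dropTable(name); }
    }

    public static void main(String[] args) {
        Set<String> tables = new TreeSet<>();
        MetaStore custom = new MetaStore() { // e.g. a user-provided store
            public void createTable(String n) { tables.add(n); }
            public void dropTable(String n) { tables.remove(n); }
            public Set<String> listTables() { return tables; }
        };
        Schema schema = new MetaStoreSchema(custom);
        schema.add("orders");   // as if Calcite executed CREATE TABLE
        schema.add("users");
        schema.remove("users"); // as if Calcite executed DROP TABLE
        System.out.println(custom.listTables()); // prints [orders]
    }
}
```

The point of the extra indirection is visible in main: the DDL operations
issued against the schema land in the custom store, so a non-in-memory
MetaStore would still see them.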

Thoughts?

Andrew


Merge options in Github UI are confusing

2018-04-16 Thread Andrew Pilloud
The Github UI provides several options for merging a PR hidden behind the
“Merge pull request” button. Only the “Create a merge commit” option does
what most users expect, which is to merge by creating a new merge commit.
This is the option recommended in the Beam committer’s guide, but it is not
necessarily the default behavior of the merge button.

A small cleanup PR I made was recently merged via the merge button which
generated a squash merge instead of a merge commit, breaking two other PRs
which were based on it. See https://github.com/apache/beam/pull/4991

I would propose that we disable the options for both rebase and squash
merging via the Github UI. This will make the behavior of the merge button
unambiguous and consistent with our documentation, but will not prevent a
committer from performing these operations from the git cli if they desire.

Andrew


Re: SQL in Python SDK

2018-04-13 Thread Andrew Pilloud
Hi Gabor,

Are Python UDFs (User-defined functions) something that might work for you?
If all you really need to write in Python is your DoFn, this is probably
your best option. It is still a bit of work, but we support Java UDFs today,
so all you would need to do is write a Java wrapper to call your Python
function.
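
A crude sketch of the "Java wrapper around a Python function" idea, with
everything except the wrapper elided (assumptions: `python3` is on the
PATH, and a real Beam SQL UDF would implement the `BeamSqlUdf` interface
and use a far more efficient bridge than one process per call):

```java
// Sketch: a Java method whose body delegates to a Python expression by
// shelling out to python3. Not production-grade; illustration only.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Main {
    // Evaluates a Python expression and returns its printed result.
    static String callPython(String pyExpr) throws Exception {
        Process p = new ProcessBuilder("python3", "-c", "print(" + pyExpr + ")")
                .redirectErrorStream(true)
                .start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();
            p.waitFor();
            return line;
        }
    }

    public static void main(String[] args) throws Exception {
        // A hypothetical UDF body delegating string uppercasing to Python.
        System.out.println(callPython("'beam sql'.upper()")); // prints BEAM SQL
    }
}
```

A realistic wrapper would keep a persistent Python worker process (or lean
on the portability framework) rather than paying process-startup cost per
element.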

Andrew


On Fri, Apr 13, 2018, 7:58 AM Kenneth Knowles  wrote:

> The most recent work on cross-language pipeline authoring is the design
> brainstorming at https://s.apache.org/beam-mixed-language-pipelines so it
> is still in the preliminary stages. There's no basic mystery, but there are
> a lot of practical considerations about what is easy to run on a pipeline
> author's machine.
>
> Regarding Apache Calcite - it is a Java library. It doesn't really make
> sense to bind it to Python. Today we don't use most of its capabilities. We
> just use it as a parser mostly. It would be easy to find an existing parser
> in Python or write your own (with ply, the basics could be done within a
> day). But still I don't think it makes sense to reimplement and maintain
> the SQL-to-Beam translation in multiple languages.
>
> Kenn
>
> On Fri, Apr 13, 2018 at 2:43 AM Reuven Lax  wrote:
>
>> If someone implemented it directly in Python then it would be supported
>> directly in Python. I don't know if anyone is actively working on that -
>> the current implementation uses Apache Calcite, and I don't know whether
>> they have a Python API.
>>
>> On Fri, Apr 13, 2018 at 9:40 AM Prabeesh K.  wrote:
>>
>>> What about supporting SQL in Python SDK?
>>>
>>> On 13 April 2018 at 13:32, Reuven Lax  wrote:
>>>
 The portability work will allow the Python and Java SDKs to be used in
 the same pipeline, though this work is not yet complete.


>>> This would be an interesting feature.
>>>
>>> On Fri, Apr 13, 2018 at 9:15 AM Gabor Hermann 
 wrote:

> Hey all,
>
> Are there any efforts towards supporting SQL from the Python SDK, not
> just from Java? I couldn't find any info about this in JIRA or mailing
> lists.
>
> How much effort do you think it would take to implement this? Are there
> some dependencies like supporting more features in Python? I know that
> the Python SDK is experimental.
>
> As an alternative, is there a way to combine Python and Java SDKs in the
> same pipeline?
>
> Thanks for your answers in advance!
>
> Cheers,
> Gabor
>
>
>>>
>>>


Re: Beam7 Outage

2018-04-12 Thread Andrew Pilloud
They all seem flaky over the past few days. I just hit one on beam1:

java.io.IOException: Backing channel 'beam1' is disconnected.

https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4068/console

Could there be some load issue from the Gradle changes?

Andrew

On Thu, Apr 12, 2018 at 12:21 PM Jason Kuster 
wrote:

> I can ask Infra if there's an issue with that machine as well, but if it's
> still accessible at all it's probably not the same issue; jenkins couldn't
> get to beam7 at all.
>
> On Thu, Apr 12, 2018 at 12:11 PM Ismaël Mejía  wrote:
>
>> beam5 has been failing in the last week. Almost all builds there break.
>>
>> On Thu, Apr 12, 2018, 6:41 PM Jason Kuster 
>> wrote:
>>
>>> Hi all,
>>>
>>> The Jenkins Beam7 executor gave up the ghost some time in the last
>>> couple of days. I've been on the line with Infra yesterday and today
>>> getting it fixed, and it looks like it should be back up in a few hours.
>>> I'll ping this thread again when I have confirmation. Thanks for your
>>> patience.
>>>
>>> Best,
>>>
>>> Jason
>>>
>>> --
>>> ---
>>> Jason Kuster
>>> Apache Beam / Google Cloud Dataflow
>>>
>>> See something? Say something. go/jasonkuster-feedback
>>>
>>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>
> See something? Say something. go/jasonkuster-feedback
> 
>


Re: New beam contributor experience?

2018-03-14 Thread Andrew Pilloud
To add to what Anton said, the 'mvn clean verify' step takes hours and
fails frequently due to flaky tests. I spent my first few days working with
Beam trying to figure out what was wrong with my system when I was just
hitting test flakes. If we're moving to Gradle, that would be a great place
to update the guide. If not, adding directions on how to skip the tests in
Maven and just run the build would be good too.

Andrew


On Wed, Mar 14, 2018 at 11:54 AM Anton Kedin  wrote:

> Not sure if it was mentioned in other threads, but it probably makes sense
> to add gradle instructions there.
>
>
> On Wed, Mar 14, 2018 at 11:48 AM Alan Myrvold  wrote:
>
>> There is a contribution guide at
>> https://beam.apache.org/contribute/contribution-guide/
>> Has anyone had challenges / pain points when getting started with new
>> contributions?
>> Any suggestions for making this better?
>>
>> Alan
>>
>

