Re: Celebrating Pride... in the Apache Beam Logo

2018-06-15 Thread Eugene Kirpichov
Very cool!

On Fri, Jun 15, 2018 at 10:56 AM OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> 
>
> On Fri, Jun 15, 2018 at 1:50 PM, Griselda Cuevas  wrote:
>
>> Someone in my team edited some Open-Source-Projects' logos to celebrate
>> pride and Apache Beam was included!
>>
>>
>> I'm attaching what she did... sprinkling some fun in the mailing list,
>> because it's Friday!
>>
>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
> 
>
>
>


Re: Celebrating Pride... in the Apache Beam Logo

2018-06-15 Thread OrielResearch Eila Arich-Landkof


On Fri, Jun 15, 2018 at 1:50 PM, Griselda Cuevas  wrote:

> Someone in my team edited some Open-Source-Projects' logos to celebrate
> pride and Apache Beam was included!
>
>
> I'm attaching what she did... sprinkling some fun in the mailing list,
> because it's Friday!
>



-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/



Celebrating Pride... in the Apache Beam Logo

2018-06-15 Thread Griselda Cuevas
Someone in my team edited some Open-Source-Projects' logos to celebrate
pride and Apache Beam was included!


I'm attaching what she did... sprinkling some fun in the mailing list,
because it's Friday!


Re: SQL Filter Pushdowns in Apache Beam SQL

2018-06-15 Thread Kenneth Knowles
I think I understand your use case better. Comments on those methods:

1) I think to make this work you would have to apply the filter before
converting it to a side input. So in that case pushdown is the question of
whether you use a Filter transform or do it in the JDBC query. Either way,
you will have to write the logic to figure out the keys you want. That
could be a moderately complex correlated subquery in SQL.

2) If you choose this route, you could use a stateful ParDo(DoFn) to batch
requests to the external data source.

Have you also considered this?

1a) Use CoGroupByKey and/or the Join library, passing the unbounded and
bounded data sets.

Again, none of these rely on having the entire JDBC data set in memory.
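
To make (2) concrete, here is a rough plain-Java sketch of the batching idea. It is not Beam code: `BatchedLookup`, `batchSize` and the `fetch` function are hypothetical names, and in a real pipeline the buffering would sit inside a stateful DoFn while `fetch` would be something like a single `WHERE key IN (...)` query. The sketch de-duplicates the keys and makes one request per batch instead of one per element:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: groups distinct keys into batches, one external request per batch. */
class BatchedLookup<K, V> {
  private final int batchSize;
  // Stand-in for the external data source (assumption: it can answer many keys at once).
  private final Function<List<K>, Map<K, V>> fetch;
  private int requests = 0;

  BatchedLookup(int batchSize, Function<List<K>, Map<K, V>> fetch) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    this.batchSize = batchSize;
    this.fetch = fetch;
  }

  /** Looks up every distinct key, calling the source once per batch of at most batchSize keys. */
  Map<K, V> lookupAll(List<K> keys) {
    // Drop duplicate keys while keeping their first-seen order.
    List<K> unique = new ArrayList<>(new LinkedHashSet<>(keys));
    Map<K, V> result = new HashMap<>();
    for (int start = 0; start < unique.size(); start += batchSize) {
      int end = Math.min(start + batchSize, unique.size());
      result.putAll(fetch.apply(unique.subList(start, end)));
      requests++;
    }
    return result;
  }

  /** Number of requests sent to the external source so far. */
  int requestCount() {
    return requests;
  }
}
```

With a batch size of 2, the five distinct keys in [1, 2, 2, 3, 4, 4, 5] cost three requests rather than seven.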

Kenn



On Wed, Jun 13, 2018 at 5:54 PM Harshvardhan Agrawal <
harshvardhan.ag...@gmail.com> wrote:

> I would assume that in the case where we don’t go the SQL route we would
> have 2 options:
>
> 1) Store the reference data and supply it as side input. This solution
> would not be feasible in cases where I have to join against, say, 10
> different datasets, since I don’t want to have that much data in memory.
>
> 2) Perform lookups for each value of the field I am joining on. This could
> make my pipeline really chatty with the external source. It is possible
> that the external source might not be able to handle the volume of requests
> and network could end up being a bottleneck.
>
>
> On Wed, Jun 13, 2018 at 19:47 Kenneth Knowles  wrote:
>
>> This has come up in a couple of in-person conversations. Pushing
>> filtering and projection into connectors is something we intend to do.
>> Calcite's optimizer is designed to support this; we just don't have it set
>> up.
>>
>> Your use case sounds like one that might test the limits of that, since
>> the JDBC read would occur before windowing or setting it up as a side
>> input. I'd be curious what a Beam pipeline to do this without SQL would
>> look like.
>>
>> Kenn
>>
>> On Wed, Jun 13, 2018 at 8:47 AM Lukasz Cwik  wrote:
>>
>>> It is currently the latter: all the data is read and then filtered
>>> within the pipeline. Note that this doesn't mean that all the data is
>>> loaded into memory, since how the join is done depends on the
>>> Runner that is powering the pipeline.
>>>
>>> Kenn had shared this doc[1] which is starting to look at integrating
>>> Runners and IO into the SQL shell and attempting to start defining a way to
>>> map properties from SQL onto the IO connector but it seems natural that the
>>> filter would get pushed down to the IO connector as well. Please take a
>>> look and feel free to comment.
>>>
>>> 1:
>>> https://docs.google.com/document/d/1ZFVlnldrIYhUgOfxIT2JcmTFFSWTl4HwAnQsnwiNL1g/edit#heading=h.4zubkdp87wok
>>>
>>> On Wed, Jun 13, 2018 at 7:39 AM Harshvardhan Agrawal <
>>> harshvardhan.ag...@gmail.com> wrote:
>>>
 Hi,

 We are currently playing with Apache Beam’s SQL extension on top of
 Flink. One of the features that we are interested in is the SQL Predicate
 Pushdown feature that Spark provides. Does Beam support that?

 For example:
 I have an unbounded dataset that I want to join with some static
 reference data stored in a database. Will Beam figure out all the
 unique keys in the window and push them down to the JDBC
 source, or will it bring all the data from the JDBC source into memory and
 then perform the join?

 Thanks,
 Harsh
 --
 Regards,
 Harshvardhan

>>> --
> Regards,
> Harshvardhan
>


Re: Building and visualizing the Beam SQL graph

2018-06-15 Thread Kenneth Knowles
@Reuven: I think DSLs are better served by having their own wrappers than
by putting their data into generic attributes. They would need attributes
if they needed to put them in and have them come back out, but usually the
DSL has a higher-level view and no need for Beam to propagate data on its
behalf, in fact it is simpler to do it directly at the DSL level. That is
the case for SQL and LIMIT.
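
As a rough illustration of keeping LIMIT at the DSL level, here is a plain-Java sketch. It uses no Beam types, and `LimitedRows` and `knownLimit` are hypothetical names. The wrapper carries the limit alongside the rows, so the caller can stop reading a possibly unbounded source without the underlying collection knowing anything about it:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical DSL-level wrapper: a row source plus metadata that only the DSL tracks. */
class LimitedRows<T> {
  private final Iterator<T> rows;        // stand-in for a possibly unbounded collection
  private final OptionalLong knownLimit; // present when the query had a LIMIT clause

  LimitedRows(Iterator<T> rows, OptionalLong knownLimit) {
    this.rows = rows;
    this.knownLimit = knownLimit;
  }

  /** Reads rows until the limit is reached or the source ends, then returns control. */
  List<T> collect() {
    List<T> out = new ArrayList<>();
    // Check the limit first so we never wait on the source for a row we do not need.
    while (!limitReached(out.size()) && rows.hasNext()) {
      out.add(rows.next());
    }
    return out;
  }

  private boolean limitReached(int seen) {
    return knownLimit.isPresent() && seen >= knownLimit.getAsLong();
  }
}
```

Without a limit, `collect()` only terminates if the source itself ends, which is exactly why the shell needs this extra piece of metadata.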

@Mingmin: Agree. The name on each node is the portable way to describe
transforms. It should be locally unique and the composite structure makes
them globally unique. Do all our runners use it to make their UIs pretty? I
don't know. It would be great to check on that and improve it.
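
A small sketch of what "locally unique, globally unique through the composite structure" means, in plain Java with a hypothetical `NamedNode` class. This is a model of the idea, not Beam's transform hierarchy. Sibling names must differ, and the full name joins the path from the root with "/":

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of a transform node whose name is unique among its siblings only. */
class NamedNode {
  private final String name;
  private final NamedNode parent;
  private final Set<String> childNames = new HashSet<>();

  private NamedNode(String name, NamedNode parent) {
    this.name = name;
    this.parent = parent;
  }

  /** The unnamed root of the hierarchy. */
  static NamedNode root() {
    return new NamedNode("", null);
  }

  /** Adds a child; its name only has to be unique within this composite. */
  NamedNode child(String childName) {
    if (!childNames.add(childName)) {
      throw new IllegalArgumentException(
          "duplicate name under '" + fullName() + "': " + childName);
    }
    return new NamedNode(childName, this);
  }

  /** Joins the names along the path from the root, which makes the result globally unique. */
  String fullName() {
    if (parent == null) {
      return name;
    }
    String prefix = parent.fullName();
    return prefix.isEmpty() ? name : prefix + "/" + name;
  }
}
```

Two composites can each contain a step called "Apply"; their full names ("Join/Apply" and "Filter/Apply" in this model) still differ.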

@Andrew: Do we really want getStageName()? Can it just be a constant
string, with the composite structure giving context?

Kenn

On Thu, Jun 14, 2018 at 11:12 AM Mingmin Xu  wrote:

> Is there a guideline about how the name provided in `PCollection.apply(
> String name, PTransform, PCollection> t)`
> is adopted in different runners? I suppose that should be the way to get a
> readable graph for all runners, instead of adjusting it to work for the
> Dataflow runner only.
>
> On Thu, Jun 14, 2018 at 8:53 AM, Reuven Lax  wrote:
>
>> There was a previous discussion about having generic attributes on
>> PCollection. Maybe this is a good driving use case?
>>
>> On Wed, Jun 13, 2018 at 4:36 PM Kenneth Knowles  wrote:
>>
>>> Another thing to consider is that we might return something like a
>>> "SqlPCollection" that is the PCollection plus additional metadata that
>>> is useful to the shell / enumerable converter (such as if the PCollection
>>> has a known finite size due to LIMIT, even if it is "unbounded", and the
>>> shell can return control to the user once it receives enough rows). After
>>> your proposed change this will be much more natural to do, so that's
>>> another point in favor of the refactor.
>>>
>>> Kenn
>>>
>>> On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud 
>>> wrote:
>>>
 One of my goals is to make the graph easier to read and map back to the
 SQL EXPLAIN output. The way the graph is currently built (`toPTransform` vs
 `toPCollection`) does make a big difference in that graph. I think it is
 also important to have a common function to do the apply with consistent
 naming. I think that will greatly help with ease of understanding. It
 sounds like what we really want is this in the BeamRelNode interface:

 PInput buildPInput(Pipeline pipeline);
 PTransform<PInput, PCollection<Row>> buildPTransform();

 default PCollection<Row> toPCollection(Pipeline pipeline) {
   return buildPInput(pipeline).apply(getStageName(),
       buildPTransform());
 }

 Andrew

 On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu  wrote:

> EXPLAIN shows the execution plan from the SQL perspective only. After
> converting to a Beam composite PTransform, there are more steps underneath,
> and each Runner reorganizes the Beam PTransforms again, which makes the final
> pipeline hard to read. In the SQL module itself, I don't see any difference
> between `toPTransform` and `toPCollection`. We could have an
> easy-to-understand step name when converting RelNodes, but it is the Runners
> that show the graph to developers.
>
> Mingmin
>
> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud 
> wrote:
>
>> That sounds correct. And because each rel node might have a different
>> input, there isn't a standard interface (like
>> PTransform<PCollection<Row>, PCollection<Row>> toPTransform()).
>>
>> Andrew
>>
>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles 
>> wrote:
>>
>>> Agree with that. It will be kind of tricky to generalize. I think
>>> there are some criteria in this case that might apply in other cases:
>>>
>>> 1. Each rel node (or construct of a DSL) should have a PTransform
>>> for how it computes its result from its inputs.
>>> 2. The inputs to that PTransform should actually be the inputs to
>>> the rel node!
>>>
>>> So I tried to improve #1 but I probably made #2 worse.
>>>
>>> Kenn
>>>
>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin 
>>> wrote:
>>>
 Not answering the original question, but doesn't "explain" satisfy
 the SQL use case?

 Going forward we probably want to solve this in a more general way.
 We have at least 3 ways to represent the pipeline:
  - how runner executes it;
  - what it looks like when constructed;
  - what the user was describing in DSL;
 And there will probably be more, if extra layers are built on top
 of DSLs.

 If possible, we probably should be able to map any level of
 abstraction to any other to better understand and debug the pipelines.


 On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles 
 wrote:

> In other words, revert
> 

Re: Invite to comment on the @RequiresStableInput design doc

2018-06-15 Thread Kenneth Knowles
Thanks for the write up. It is great to see someone pushing this through.

I wanted to bring Luke's high-level question back to the list for
visibility: what about portability and other SDKs?

Portability is probably trivial, but the "other SDKs" question is probably
best answered by the folks working on them, who will have opinions about how it
should work in their SDK's idioms.

Kenn


Jenkins build is back to normal : beam_Release_Gradle_NightlySnapshot #70

2018-06-15 Thread Apache Jenkins Server
See 




Re: Proposing interactive beam runner

2018-06-15 Thread Kenneth Knowles
Nice! As-is, this already looks useful for making Beam accessible.

Commented a bit on the doc to highlight where SQL differs from the Scio/Python
style. I think notebooks are the perfect target. Specifically, Python and
SQL in the same notebook would be amazing.

Kenn

On Thu, Jun 14, 2018 at 2:04 PM Sindy Li  wrote:

> Thanks Ahmet,
>
> We know quite a few teams at Google are interested in running interactive
> Beam pipelines, especially in Python for Machine Learning -- some are already
> using it interactively in their own way. So instead of each of those teams
> developing its own version of an interactive solution, we want one
> repository that people can contribute to. We could also provide better
> features like fast re-execution, as shown in the demo.
>
> Thanks,
> Sindy
>
> On Wed, Jun 13, 2018 at 5:48 PM, Ahmet Altay  wrote:
>
>> Thank you Sindy.
>>
>> I like the demo; it looks great. This would be interesting to a lot of
>> users. What are your plans for moving this forward? What kind of input
>> are you looking for?
>>
>> Ahmet
>>
>> On Wed, Jun 13, 2018 at 2:32 PM, Eugene Kirpichov 
>> wrote:
>>
>>> This is awesome, thanks Sindy! I hope that the questions related to
>>> portability will get resolved in a way that allows reusing some of the
>>> work for other interactive Beam experiences, including SQL as Andrew says,
>>> and providing a REPL, e.g. for users of Scala or other JVM-based languages.
>>>
>>> +Neville Li  Do I remember correctly that you guys
>>> had some sort of interactivity going in Scio but were looking forward to
>>> Beam developing a native solution?
>>>
>>> On Wed, Jun 13, 2018 at 2:22 PM Sindy Li  wrote:
>>>
 Thanks, Andrew!

 Here is a link to the demo on YouTube for people interested:
 https://www.youtube.com/watch?v=c5CjA1e3Cqw&feature=youtu.be

 On Wed, Jun 13, 2018 at 1:23 PM, Andrew Pilloud 
 wrote:

> This sounds really interesting, thanks for sharing! We've just begun
> to explore making Beam SQL interactive. The Interactive Runner you've
> proposed sounds like it would solve a bunch of the problems SQL faces as
> well. SQL is written in Java right now, so we can't immediately reuse any
> code.
>
> Andrew
>
> On Wed, Jun 13, 2018 at 11:48 AM Sindy Li  wrote:
>
>> Resending after subscribing to dev list.
>>
>> -- Forwarded message --
>> From: Sindy Li 
>> Date: Fri, Jun 8, 2018 at 5:57 PM
>> Subject: Proposing interactive beam runner
>> To: dev@beam.apache.org
>> Cc: Harsh Vardhan , Chamikara Jayalath <
>> chamik...@google.com>, Anand Iyer , Robert
>> Bradshaw 
>>
>>
>> Hello,
>>
>> We were exploring ways to provide an interactive notebook experience
>> for writing Beam Python pipelines. The design doc provides
>> an overview/vision of what we would like to achieve. The pull request
>> provides a prototype for the same. The document also provides demo
>> screenshots and instructions for running a demo in Jupyter. Please take a
>> look. We believe this would be a useful addition to Beam.
>>
>> Thanks!
>>
>>
>>
>>

>>
>


Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-15 Thread Jean-Baptiste Onofré
Thanks, I'm testing it as well.

Regards
JB

On 15/06/2018 10:25, Charles Chen wrote:
> Thank you and sorry for the delay.  Been testing the fix the past few
> hours.  This CP PR fixes the
> issue: https://github.com/apache/beam/pull/5658.
> 
> On Thu, Jun 14, 2018 at 10:25 PM Jean-Baptiste Onofré wrote:
> 
> OK, I started the RC2, but I'm stopping the process to cut a new one.
> 
> Is it ok from your side ?
> 
> Regards
> JB
> 
> On 15/06/2018 01:54, Charles Chen wrote:
> > Looks like there is something wrong with PR 5636, which we cherry-picked
> > above.  It breaks leaderboard examples which previously passed.  I've
> > reopened the issue and will update this thread shortly.
> >
> > On Thu, Jun 14, 2018 at 12:55 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> >     Sure, just in time ;)
> >
> >     Regards
> >     JB
> >
> >     On 14/06/2018 20:58, Charles Chen wrote:
> >     > Can you also merge the CP https://github.com/apache/beam/pull/5636 for
> >     > https://issues.apache.org/jira/browse/BEAM-4549?
> >     >
> >     > On Thu, Jun 14, 2018 at 6:52 AM Jean-Baptiste Onofré
> >     > <j...@nanthrax.net> wrote:
> >     >
> >     >     FYI, I'm starting RC2 right now.
> >     >
> >     >     Stay tuned !
> >     >
> >     >     Regards
> >     >     JB
> >     >
> >     >     On 06/06/2018 10:44, Jean-Baptiste Onofré wrote:
> >     >     > Hi everyone,
> >     >     >
> >     >     > Please review and vote on the release candidate #1 for the version
> >     >     > 2.5.0, as follows:
> >     >     >
> >     >     > [ ] +1, Approve the release
> >     >     > [ ] -1, Do not approve the release (please provide specific comments)
> >     >     >
> >     >     > NB: this is the first release using Gradle, so don't be too harsh ;) A
> >     >     > PR about the release guide will follow thanks to this release.
> >     >     >
> >     >     > The complete staging area is available for your review, which includes:
> >     >     > * JIRA release notes [1],
> >     >     > * the official Apache source release to be deployed to dist.apache.org
> >     >     > [2], which is signed with the key with fingerprint C8282E76 [3],
> >     >     > * all artifacts to be deployed to the Maven Central Repository [4],
> >     >     > * source code tag "v2.5.0-RC1" [5],
> >     >     > * website pull request listing the release and publishing the API
> >     >     > reference manual [6].
> >     >     > * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
> >     >     > JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
> >     >     > * Python artifacts are deployed along with the source release to the
> >     >     > dist.apache.org [2].
> >     >     >
> >     >     > The vote will be open for at least 72 hours. It is adopted by majority
> >     >     > approval, with at least 3 PMC affirmative votes.
> >     >     >
> >     >     > Thanks,
> >     >     > JB
> >     >     >
> >     >     > [1]
> >     >     > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
> >     >     > [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
> >     >     > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >     >     > [4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
> >     >     > [5] https://github.com/apache/beam/tree/v2.5.0-RC1
> >     >     > [6] https://github.com/apache/beam-site/pull/463
> >     >     >
> >     >
> >     >     --
> >     >     Jean-Baptiste Onofré
> >     >     jbono...@apache.org
> >     >     http://blog.nanthrax.net
> >     >     Talend - http://www.talend.com
> >     >
> >
> >     --
> >     Jean-Baptiste Onofré
> >     

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-15 Thread Charles Chen
Thank you and sorry for the delay. I've been testing the fix for the past few
hours. This CP PR fixes the issue: https://github.com/apache/beam/pull/5658.

On Thu, Jun 14, 2018 at 10:25 PM Jean-Baptiste Onofré 
wrote:

> OK, I started the RC2, but I'm stopping the process to cut a new one.
>
> Is it ok from your side ?
>
> Regards
> JB
>
> On 15/06/2018 01:54, Charles Chen wrote:
> > Looks like there is something wrong with PR 5636, which we cherry-picked
> > above.  It breaks leaderboard examples which previously passed.  I've
> > reopened the issue and will update this thread shortly.
> >
> > On Thu, Jun 14, 2018 at 12:55 PM Jean-Baptiste Onofré wrote:
> >
> > Sure, just in time ;)
> >
> > Regards
> > JB
> >
> > On 14/06/2018 20:58, Charles Chen wrote:
> > > Can you also merge the CP https://github.com/apache/beam/pull/5636 for
> > > https://issues.apache.org/jira/browse/BEAM-4549?
> > >
> > > On Thu, Jun 14, 2018 at 6:52 AM Jean-Baptiste Onofré
> > > <j...@nanthrax.net> wrote:
> > >
> > > FYI, I'm starting RC2 right now.
> > >
> > > Stay tuned !
> > >
> > > Regards
> > > JB
> > >
> > > On 06/06/2018 10:44, Jean-Baptiste Onofré wrote:
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #1 for the version
> > > > 2.5.0, as follows:
> > > >
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > NB: this is the first release using Gradle, so don't be too harsh ;) A
> > > > PR about the release guide will follow thanks to this release.
> > > >
> > > > The complete staging area is available for your review, which includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release to be deployed to dist.apache.org
> > > > [2], which is signed with the key with fingerprint C8282E76 [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "v2.5.0-RC1" [5],
> > > > * website pull request listing the release and publishing the API
> > > > reference manual [6].
> > > > * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
> > > > JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
> > > > * Python artifacts are deployed along with the source release to the
> > > > dist.apache.org [2].
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > > JB
> > > >
> > > > [1]
> > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
> > > > [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
> > > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > > > [4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
> > > > [5] https://github.com/apache/beam/tree/v2.5.0-RC1
> > > > [6] https://github.com/apache/beam-site/pull/463
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org 
> > >
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org 
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>