Re: Building and visualizing the Beam SQL graph

2018-06-18 Thread Andrew Pilloud
The name can't just be a constant string, as it needs to be unique for every
node in the graph. We currently build the name from the rel node class name
plus the unique ID from the Calcite graph (plus a unique ID from the Beam
graph, but that goes away with some bug fixes).
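
As a rough sketch of that scheme (illustrative only, not the actual PR code;
the helper class and its placement are invented):

import org.apache.calcite.rel.RelNode;

// Sketch: the stage name combines the rel node's class name with Calcite's
// unique relational id, e.g. "BeamCalcRel_42" for a Calc node whose id is 42.
final class StageNames {
  static String stageName(RelNode rel) {
    return rel.getClass().getSimpleName() + "_" + rel.getId();
  }
}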

I've written this up in a PR: https://github.com/apache/beam/pull/5673

Andrew

Re: Building and visualizing the Beam SQL graph

2018-06-15 Thread Kenneth Knowles
@Reuven: I think DSLs are better served by having their own wrappers than
by putting their data into generic attributes. They would need attributes
if they needed to put data in and have it come back out, but usually the
DSL has a higher-level view and no need for Beam to propagate data on its
behalf; in fact, it is simpler to do it directly at the DSL level. That is
the case for SQL and LIMIT.

@Mingmin: Agree. The name on each node is the portable way to describe
transforms. It should be locally unique, and the composite structure makes
the names globally unique. Do all our runners use it to make their UIs
pretty? I don't know. It would be great to check on that and improve it.
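
To illustrate the local-plus-composite uniqueness (a self-contained toy
pipeline, not SQL code; all step names here are invented):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

public class CompositeNamingExample {
  // A composite transform: names of its inner applies are prefixed by the
  // composite's own name, so "DropEmpty" becomes "MyQuery/DropEmpty".
  static class MyQuery extends PTransform<PBegin, PCollection<String>> {
    @Override
    public PCollection<String> expand(PBegin begin) {
      return begin
          .apply("ReadValues", Create.of("a", "", "b"))
          .apply("DropEmpty", Filter.by(s -> !s.isEmpty()));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    // The composite's name only needs to be locally unique; inner names
    // only need to be unique within the composite.
    p.apply("MyQuery", new MyQuery());
    p.run().waitUntilFinish();
  }
}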

@Andrew: Do we really want getStageName()? Can it just be a constant
string, with the composite structure giving context?

Kenn

Re: Building and visualizing the Beam SQL graph

2018-06-14 Thread Mingmin Xu
Is there a guideline about how the name provided in `PCollection.apply(
String name, PTransform<? super PCollection<T>, OutputT> t)` is
adopted by different runners? I suppose that should be the option, to have
a readable graph for all runners, instead of 'adjusting' it to make only the
Dataflow runner work.

Re: Building and visualizing the Beam SQL graph

2018-06-14 Thread Reuven Lax
There was a previous discussion about having generic attributes on
PCollection. Maybe this is a good driving use case?

Re: Building and visualizing the Beam SQL graph

2018-06-13 Thread Anton Kedin
From the visualization perspective, I really loved the interactive runner
demo where it shows the graph:
https://www.youtube.com/watch?v=c5CjA1e3Cqw&t=27s

Re: Building and visualizing the Beam SQL graph

2018-06-13 Thread Kenneth Knowles
Another thing to consider is that we might return something like a
"SqlPCollection<T>" that is the PCollection<T> plus additional metadata that
is useful to the shell / enumerable converter (such as whether the PCollection
has a known finite size due to LIMIT, even if it is "unbounded", so the
shell can return control to the user once it receives enough rows). After
your proposed change this will be much more natural to do, so that's
another point in favor of the refactor.
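
A hypothetical shape for such a wrapper (purely illustrative; the class and
all member names are invented, and no such class exists in Beam):

import org.apache.beam.sdk.values.PCollection;

// Purely illustrative: pairs the result PCollection with SQL-level metadata
// that a shell could use.
public class SqlPCollection<T> {
  private final PCollection<T> pCollection;
  // >= 0 means the result has at most this many rows (e.g. from LIMIT),
  // even when the underlying PCollection is "unbounded"; -1 means unknown.
  private final long knownRowLimit;

  public SqlPCollection(PCollection<T> pCollection, long knownRowLimit) {
    this.pCollection = pCollection;
    this.knownRowLimit = knownRowLimit;
  }

  public PCollection<T> getPCollection() { return pCollection; }
  public boolean hasKnownFiniteSize() { return knownRowLimit >= 0; }
  public long getKnownRowLimit() { return knownRowLimit; }
}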

Kenn

Re: Building and visualizing the Beam SQL graph

2018-06-13 Thread Andrew Pilloud
One of my goals is to make the graph easier to read and map back to the SQL
EXPLAIN output. The way the graph is currently built (`toPTransform` vs
`toPCollection`) does make a big difference in that graph. I think it is
also important to have a common function to do the apply with consistent
naming. I think that will greatly help with ease of understanding. It
sounds like what we really want is this in the BeamRelNode interface:

PInput buildPInput(Pipeline pipeline);
PTransform<PInput, PCollection<Row>> buildPTransform();

default PCollection<Row> toPCollection(Pipeline pipeline) {
  return buildPInput(pipeline).apply(getStageName(), buildPTransform());
}
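
To make the proposal concrete, a sketch of how a single-input node might
implement that interface (the class, field, predicate, and stage name are all
invented; this is not code from the thread or the PR):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PInput;
import org.apache.beam.sdk.values.Row;

public class HypotheticalFilterRel implements BeamRelNode {
  private final BeamRelNode input; // the child rel node

  HypotheticalFilterRel(BeamRelNode input) {
    this.input = input;
  }

  @Override
  public PInput buildPInput(Pipeline pipeline) {
    // Criterion 2 from earlier in the thread: the PTransform's input really
    // is the rel node's input, obtained by recursing into the child.
    return input.toPCollection(pipeline);
  }

  @Override
  public PTransform<PInput, PCollection<Row>> buildPTransform() {
    // Criterion 1: wrap only this node's own computation (the filter),
    // not its inputs.
    return new PTransform<PInput, PCollection<Row>>() {
      @Override
      public PCollection<Row> expand(PInput in) {
        @SuppressWarnings("unchecked")
        PCollection<Row> rows = (PCollection<Row>) in;
        return rows.apply(Filter.by(row -> row.getInt64("amount") > 0));
      }
    };
  }

  @Override
  public String getStageName() {
    // Invented; see the naming discussion elsewhere in this thread.
    return "HypotheticalFilterRel_" + System.identityHashCode(this);
  }
}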

Andrew

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Mingmin Xu
EXPLAIN shows the execution plan from the SQL perspective only. After converting
to a Beam composite PTransform, there are more steps underneath, and each runner
reorganizes the Beam PTransforms again, which makes the final pipeline hard to
read. In the SQL module itself, I don't see any difference between `toPTransform`
and `toPCollection`. We could have an easy-to-understand step name when
converting RelNodes, but it is the runners that show the graph to developers.

Mingmin

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Andrew Pilloud
That sounds correct. And because each rel node might have a different input,
there isn't a standard interface (like `PTransform<PCollection<Row>,
PCollection<Row>> toPTransform()`).

Andrew

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Andrew Pilloud
Not quite a revert: we still want to keep the actual transformation inside
a PTransform, but the input of that PTransform will be different for each
node type (joins, for example, have multiple inputs). We have this function
as our builder right now:

PTransform<PInput, PCollection<Row>> toPTransform();

When I'm done we will have this:

PCollection<Row> toPCollection(Pipeline pipeline);
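
For a multi-input node such as a join, the new shape lets each node
materialize its own inputs. A rough, invented sketch (Flatten stands in for
the real join logic purely to keep the fragment concrete):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.Row;

// Inside a hypothetical two-input rel node with child BeamRelNode fields
// leftInput and rightInput. Each node recursively materializes its inputs,
// so the graph is built input-first rather than by nesting.
public PCollection<Row> toPCollection(Pipeline pipeline) {
  PCollection<Row> left = leftInput.toPCollection(pipeline);
  PCollection<Row> right = rightInput.toPCollection(pipeline);
  return PCollectionList.of(left).and(right)
      .apply("HypotheticalJoinRel_7", Flatten.pCollections());
}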

Andrew

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Kenneth Knowles
Agree with that. It will be kind of tricky to generalize. I think there are
some criteria in this case that might apply in other cases:

1. Each rel node (or construct of a DSL) should have a PTransform for how
it computes its result from its inputs.
2. The inputs to that PTransform should actually be the inputs to the rel
node!

So I tried to improve #1 but I probably made #2 worse.

Kenn

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Anton Kedin
Not answering the original question, but doesn't "explain" satisfy the SQL
use case?

Going forward we probably want to solve this in a more general way. We have
at least 3 ways to represent the pipeline:
 - how the runner executes it;
 - what it looks like when constructed;
 - what the user described in the DSL.
And there will probably be more, if extra layers are built on top of DSLs.

If possible, we should probably be able to map any level of abstraction to
any other, to better understand and debug the pipelines.


Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Kenneth Knowles
In other words, revert https://github.com/apache/beam/pull/4705/files, at
least in spirit? I agree :-)

Kenn

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Huygaa Batsaikhan
I was also wondering the same thing. I don't think there is any
visualization tool for Beam. :(

Building and visualizing the Beam SQL graph

2018-06-11 Thread Andrew Pilloud
We are currently converting the Calcite Rel tree to Beam by recursively
building a tree of nested PTransforms. This results in a weird nested graph
in the Dataflow UI, where each node contains its inputs nested inside of it.
I'm going to change the internal data structure for converting the tree
from a PTransform to a PCollection, which will result in a more accurate
representation of the tree structure being built and should simplify the
code as well. This will not change the public interface to SQL, which will
remain a PTransform. Any thoughts or objections?

I was also wondering if there are tools for visualizing the Beam graph
aside from the dataflow runner UI. What other tools exist?

Andrew