Re: Beam SQL Alias issue while using With Clause

2023-03-02 Thread Andrew Pilloud via dev
Hi Talat,

I managed to turn your test case into a reproduction against Calcite
itself. It looks like there is a bug affecting tables that contain one or
more single-element structs and no multi-element structs. I've sent the
details to the Calcite mailing list here:
https://lists.apache.org/thread/tlr9hsmx09by79h91nwp2d4nv8jfwsto
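
For reference, here's a minimal sketch of the affected shape using Beam's
Schema API (field names borrowed from your test; illustrative only, not the
failing test itself):

import org.apache.beam.sdk.schemas.Schema;

public class SingleElementStructShape {
  // A table with a single-element struct ("user_info") and no multi-element
  // structs -- the shape that appears to trigger the Calcite bug.
  static final Schema INNER = Schema.builder().addStringField("name").build();
  static final Schema TABLE =
      Schema.builder()
          .addRowField("user_info", INNER) // the single-element struct
          .addInt64Field("id")
          .addStringField("value")
          .build();
}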

I'm experimenting with ideas on how to work around this, but a fix will
likely require a Calcite upgrade, which is not something I'd have time
to help with. (I'm not on the Google Beam team anymore.)

Andrew

On Wed, Feb 22, 2023 at 12:18 PM Talat Uyarer
 wrote:
>
> Hi @Andrew Pilloud
>
> Sorry for the late response. Yes, your test works fine as written. I
> changed the test's input structure to match our input structure, and now
> this test hits the same exception.
>
> Feb 21, 2023 2:02:28 PM org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner convertToBeamRel
> INFO: SQL:
> WITH `tempTable` AS (SELECT `panwRowTestTable`.`user_info`, 
> `panwRowTestTable`.`id`, `panwRowTestTable`.`value`
> FROM `beam`.`panwRowTestTable` AS `panwRowTestTable`
> WHERE `panwRowTestTable`.`user_info`.`name` = 'innerStr') (SELECT 
> `tempTable`.`user_info`, `tempTable`.`id`, `tempTable`.`value`
> FROM `tempTable` AS `tempTable`)
> Feb 21, 2023 2:02:28 PM org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner convertToBeamRel
> INFO: SQLPlan>
> LogicalProject(user_info=[ROW($0)], id=[$1], value=[$2])
>   LogicalFilter(condition=[=($0.name, 'innerStr')])
>     LogicalProject(name=[$0.name], id=[$1], value=[$2])
>       BeamIOSourceRel(table=[[beam, panwRowTestTable]])
>
>
> fieldList must not be null, type = VARCHAR
> java.lang.AssertionError: fieldList must not be null, type = VARCHAR
>
> I don't know what is different from yours. I am sharing my version of the
> test as well.
>
>
> Index: sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamComplexTypeTest.java
> IDEA additional info:
> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> <+>UTF-8
> ===
> diff --git a/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamComplexTypeTest.java b/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamComplexTypeTest.java
> --- a/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamComplexTypeTest.java (revision fd383fae1adc545b6b6a22b274902cda956fec49)
> +++ b/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamComplexTypeTest.java (date 1677017032324)
> @@ -54,6 +54,9 @@
>    private static final Schema innerRowSchema =
>        Schema.builder().addStringField("string_field").addInt64Field("long_field").build();
>
> +  private static final Schema innerPanwRowSchema =
> +      Schema.builder().addStringField("name").build();
> +
>    private static final Schema innerRowWithArraySchema =
>        Schema.builder()
>            .addStringField("string_field")
> @@ -127,8 +130,12 @@
>            .build()))
>        .put(
>            "basicRowTestTable",
> -          TestBoundedTable.of(FieldType.row(innerRowSchema), "col")
> -              .addRows(Row.withSchema(innerRowSchema).addValues("innerStr", 1L).build()))
> +          TestBoundedTable.of(FieldType.row(innerRowSchema), "col", FieldType.INT64, "field")
> +              .addRows(Row.withSchema(innerRowSchema).addValues("innerStr", 1L).build(), 1L))
> +      .put(
> +          "panwRowTestTable",
> +          TestBoundedTable.of(FieldType.row(innerPanwRowSchema), "user_info", FieldType.INT64, "id", FieldType.STRING, "value")
> +              .addRows(Row.withSchema(innerRowSchema).addValues("name", 1L).build(), 1L, "some_value"))
>        .put(
>            "rowWithArrayTestTable",
>            TestBoundedTable.of(FieldType.row(rowWithArraySchema), "col")
> @@ -219,6 +226,21 @@
>          .build());
>      pipeline.run().waitUntilFinish(Duration.standardMinutes(2));
>    }
> +
> +  @Test
> +  public void testBasicRowWhereField() {
> +    BeamSqlEnv sqlEnv = BeamSqlEnv.inMemory(readOnlyTableProvider);
> +    PCollection<Row> stream =
> +        BeamSqlRelUtils.toPCollection(
> +            pipeline, sqlEnv.parseQuery("WITH tempTable AS (SELECT * FROM panwRowTestTable WHERE panwRowTestTable.`user_info`.`name` = 'innerStr') SELECT * FROM tempTable"));
> +    Schema outputSchema = Schema.builder().addRowField("col", innerRowSchema).addInt64Field("field").build();
> +    PAssert.that(stream)
> +        .containsInAnyOrder(
> +            Row.withSchema(outputSchema)
> +                .addValues(Row.withSchema(innerRowSchema).addValues("name", 1L).build(), 1L)
> +                .build());
> +    pipeline.run().waitUntilFinish(Duration.standardMinutes(2));
> +  }
>
>    @Test

Re: Review ask for Flink Runner Backlog Metric Bug Fix

2023-02-23 Thread Andrew Pilloud via dev
The bot says there are no reviewers for Flink. Possibly you'll find a
volunteer to review it here?

On Thu, Feb 23, 2023 at 4:47 PM Talat Uyarer via dev 
wrote:

> Hi,
>
> I created a bugfix for the Flink Runner backlog metrics. I asked the
> OWNERS and tried to run the assign-reviewer command, but I am not sure I
> pressed the right button :)
>
> If you know someone who can review this change,
> https://github.com/apache/beam/pull/25554,
> could you assign them to this MR?
>
> Thanks
>


Re: Beam SQL Alias issue while using With Clause

2023-02-10 Thread Andrew Pilloud via dev
I have a test case that I believe should reproduce this on both head
and 2.43, but it ends up with a different logical plan. Can you provide your
input types?

We have a class of issues around complex types
(https://github.com/apache/beam/issues/19009). I don't believe the
"LogicalFilter(condition=[=($2.name, 'User1')])", particularly "$2.name", is
something that works; in my test it seems that the planner has flattened
the complex input and rebuilt the ROW at the output.

INFO: SQLPlan>
LogicalProject(col=[ROW($0, $1)], field=[$2])
  LogicalFilter(condition=[=($0, 'innerStr')])
    LogicalProject(string_field=[$0.string_field], long_field=[$0.long_field], field=[$1])
      BeamIOSourceRel(table=[[beam, basicRowTestTable]])

Feb 10, 2023 6:07:35 PM org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner convertToBeamRel
INFO: BEAMPlan>
BeamCalcRel(expr#0..1=[{inputs}], expr#2=[$t0.string_field], expr#3=[$t0.long_field], expr#4=[ROW($t2, $t3)], expr#5=['innerStr':VARCHAR], expr#6=[=($t2, $t5)], col=[$t4], field=[$t1], $condition=[$t6])
  BeamIOSourceRel(table=[[beam, basicRowTestTable]])

--- a/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamComplexTypeTest.java
+++ b/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamComplexTypeTest.java
@@ -127,8 +127,8 @@ public class BeamComplexTypeTest {
           .build()))
       .put(
           "basicRowTestTable",
-          TestBoundedTable.of(FieldType.row(innerRowSchema), "col")
-              .addRows(Row.withSchema(innerRowSchema).addValues("innerStr", 1L).build()))
+          TestBoundedTable.of(FieldType.row(innerRowSchema), "col", FieldType.INT64, "field")
+              .addRows(Row.withSchema(innerRowSchema).addValues("innerStr", 1L).build(), 1L))
       .put(
           "rowWithArrayTestTable",
           TestBoundedTable.of(FieldType.row(rowWithArraySchema), "col")
@@ -220,6 +220,21 @@ public class BeamComplexTypeTest {
     pipeline.run().waitUntilFinish(Duration.standardMinutes(2));
   }

+  @Test
+  public void testBasicRowWhereField() {
+    BeamSqlEnv sqlEnv = BeamSqlEnv.inMemory(readOnlyTableProvider);
+    PCollection<Row> stream =
+        BeamSqlRelUtils.toPCollection(
+            pipeline, sqlEnv.parseQuery("WITH tempTable AS (SELECT * FROM basicRowTestTable WHERE basicRowTestTable.col.string_field = 'innerStr') SELECT * FROM tempTable"));
+    Schema outputSchema = Schema.builder().addRowField("col", innerRowSchema).addInt64Field("field").build();
+    PAssert.that(stream)
+        .containsInAnyOrder(
+            Row.withSchema(outputSchema)
+                .addValues(Row.withSchema(innerRowSchema).addValues("innerStr", 1L).build(), 1L)
+                .build());
+    pipeline.run().waitUntilFinish(Duration.standardMinutes(2));
+  }
+
   @Test
   public void testArrayConstructor() {
     BeamSqlEnv sqlEnv = BeamSqlEnv.inMemory(readOnlyTableProvider);


On Fri, Feb 3, 2023 at 2:06 PM Talat Uyarer 
wrote:

> Hi Andrew,
>
> Thank you for your PR. I appreciate your help in solving the issue. I
> reran our tests and they are partially passing now with your fix. However,
> there is one more issue with the WITH clause.
>
> When I run the following query, somehow Beam loses the column type:
>
> WITH tempTable AS (SELECT * FROM PCOLLECTION WHERE
> PCOLLECTION.`user_info`.`name` = 'User1') SELECT * FROM tempTable
>
> I haven't tested on Beam master; I ran with your latest patch on our code
> base. This is the output:
>
> 14:00:30.095 [Test worker] INFO  o.a.b.sdk.extensions.sql.impl.CalciteQueryPlanner - SQL:
> WITH `tempTable` AS (SELECT `PCOLLECTION`.`id`, `PCOLLECTION`.`value`, `PCOLLECTION`.`user_info`
> FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
> WHERE `PCOLLECTION`.`user_info`.`name` = 'User1') (SELECT `tempTable`.`id`, `tempTable`.`value`, `tempTable`.`user_info`
> FROM `tempTable` AS `tempTable`)
> 14:00:30.106 [Test worker] DEBUG o.a.b.v.calcite.v1_28_0.org.apache.calcite.sql2rel - Plan after converting SqlNode to RelNode
> LogicalProject(id=[$0], value=[$1], user_info=[$2])
>   LogicalFilter(condition=[=($2.name, 'User1')])
>     BeamIOSourceRel(table=[[beam, PCOLLECTION]])
>
> 14:00:30.107 [Test worker] DEBUG o.a.b.v.calcite.v1_28_0.org.apache.calcite.sql2rel - Plan after converting SqlNode to RelNode
> LogicalProject(id=[$0], value=[$1], user_info=[$2])
>   LogicalFilter(condition=[=($2.name, 'User1')])
>     BeamIOSourceRel(table=[[beam, PCOLLECTION]])
>
> 14:00:30.109 [Test worker] INFO  o.a.b.sdk.extensions.sql.impl.CalciteQueryPlanner - SQLPlan>
> LogicalProject(id=[$0], value=[$1], user_info=[ROW($2)])
>   LogicalFilter(condition=[=($2.name, 'User1')])
>     LogicalProject(id=[$0], value=[$1], name=[$2.name])
>       BeamIOSourceRel(table=[[beam, PCOLLECTION]])
>
> 14:00:30.173 [Test worker] DEBUG o.a.b.v.c.v.org.apache.calcite.plan.RelOptPlanner - PLANNER =

Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-07 Thread Andrew Pilloud via dev
This sounds reasonable to me as well.

I've made swaps like this in the past; the base image of each is probably a
bigger factor than the JDK. The openjdk images were based on Debian 11. The
default eclipse-temurin images are based on Ubuntu 22.04, with an Alpine
option. Ubuntu is a Debian derivative, but the versions and package names
aren't exact matches and Ubuntu tends to update a little faster. For most
users I don't think this will matter, but users building custom containers
may need to make minor changes. The Alpine option will be much smaller
(which could be a significant improvement) but would be a more significant
change to the environment.

On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev 
wrote:

> Seems reasonable to me.
>
> On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user 
> wrote:
> >
> > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have
> > not been built or supported since July 2022. I have filed [2] to
> > track the resolution of this issue.
> >
> > Based upon [1], almost everyone is swapping to the eclipse-temurin
> > container[3] as their base, judging by the issues linked from the
> > deprecation notice[1]. The eclipse-temurin container is released under
> > these licenses:
> > - Apache License, Version 2.0
> > - Eclipse Distribution License 1.0 (BSD)
> > - Eclipse Public License 2.0
> > - (Secondary) GNU General Public License, version 2 with OpenJDK Assembly Exception
> > - (Secondary) GNU General Public License, version 2 with the GNU Classpath Exception
> >
> > I propose that we swap all our containers to the eclipse-temurin
> > containers[3].
> >
> > I'm open to other ideas, and it would be great to hear about your
> > experience in any other projects where you have had to make a similar
> > decision.
> >
> > 1: https://github.com/docker-library/openjdk/issues/505
> > 2: https://github.com/apache/beam/issues/25371
> > 3: https://hub.docker.com/_/eclipse-temurin
>


Re: Beam SQL Alias issue while using With Clause

2023-02-02 Thread Andrew Pilloud via dev
It looks like Calcite stopped considering field names in RelNode equality
as of Calcite 1.22 (which we use in Beam v2.34.0+). This can result in a
planner state where two nodes that differ only by field name are considered
equivalent.
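
To illustrate the failure mode, here's a sketch (illustrative only, written
against the plain Calcite 1.28 API rather than Beam's vendored copy):

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.sql.type.SqlTypeName;

class FieldNameEquality {
  static void illustrate(RelDataTypeFactory typeFactory) {
    // Two row types with the same physical layout but different field names.
    RelDataType a = typeFactory.builder()
        .add("id", SqlTypeName.INTEGER)
        .add("v", SqlTypeName.VARCHAR)
        .build();
    RelDataType b = typeFactory.builder()
        .add("id", SqlTypeName.INTEGER)
        .add("value", SqlTypeName.VARCHAR)
        .build();
    // If planner equality ignores field names, nodes producing these two
    // types are interchangeable, so a plan node whose output is named
    // (id, v) can replace one named (id, value) -- dropping the alias.
  }
}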

I have a fix for Beam in https://github.com/apache/beam/pull/25290 and I'll
send an email to the Calcite dev list with more details.

Andrew

On Fri, Jan 27, 2023 at 11:33 AM Andrew Pilloud  wrote:

> Also, this is at the very least a Beam bug. You can file a Beam issue if you
> want; otherwise I will when I get back.
>
> Andrew
>
> On Fri, Jan 27, 2023 at 11:27 AM Andrew Pilloud 
> wrote:
>
>> Hi Talat,
>>
>> I did get your test case running and added some logging to
>> RexProgramBuilder.mergePrograms. There is only one merge that occurs during
>> the test, and it has an output type of RecordType(JavaType(int) ID,
>> JavaType(class java.lang.String) V). This does seem like the correct output
>> name, but it doesn't match the final output name, so something is still
>> different from the Beam test case. I also modified mergePrograms to
>> purposely corrupt the output names; that did not cause the test to fail or
>> trip the 'assert mergedProg.getOutputRowType() ==
>> topProgram.getOutputRowType();' in mergePrograms. I could not find any
>> Calcite unit tests for RexProgramBuilder.mergePrograms or the
>> CoreRules.CALC_MERGE rule, so I think it is still probable that the problem
>> is in this area.
>>
>> One minor issue I encountered: it took me a while to get your test case
>> running. There don't appear to be any Calcite Gradle rules to run
>> CoreQuidemTest, and constructing the classpath manually was tedious. Did I
>> miss something?
>>
>> I'm still working on this, but I'm out today and Monday; it will probably
>> be Wednesday before I make any more progress.
>>
>> Andrew
>>
>> On Fri, Jan 27, 2023 at 10:40 AM Talat Uyarer <
>> tuya...@paloaltonetworks.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> Yes, this aligns with my debugging. In my reply to Kenn you can see
>>> a SQL test which I wrote in Calcite. Somehow Calcite does not have this
>>> issue in the 1.28 version.
>>>
>>> !use post
>>> !set outputformat mysql
>>>
>>> #Test aliases with with clause
>>> WITH tempTable(id, v) AS (select "hr"."emps"."empid" as id, 
>>> "hr"."emps"."name" as v from "hr"."emps")
>>> SELECT tempTable.id as id, tempTable.v as "value" FROM tempTable WHERE 
>>> tempTable.v <> '11' ;
>>> +-----+-----------+
>>> | ID  | value     |
>>> +-----+-----------+
>>> | 100 | Bill      |
>>> | 110 | Theodore  |
>>> | 150 | Sebastian |
>>> | 200 | Eric      |
>>> +-----+-----------+
>>> (4 rows)
>>>
>>> !ok
>>>
>>>
>>> On Wed, Jan 25, 2023 at 6:08 PM Andrew Pilloud 
>>> wrote:
>>>
 Yes, that worked.

 The issue does not occur if I disable all of the following planner
 rules: CoreRules.FILTER_CALC_MERGE, CoreRules.PROJECT_CALC_MERGE,
 LogicalCalcMergeRule.INSTANCE (which wraps CoreRules.CALC_MERGE),
 and BeamCalcMergeRule.INSTANCE (which wraps CoreRules.CALC_MERGE).

 All the rules share a common call to RexProgramBuilder.mergePrograms,
 so I suspect the problem lies there. I spent some time looking but wasn't
 able to find it by code inspection; it looks like this code path is doing
 the right thing with names. I'll spend some time tomorrow trying to
 reproduce this on pure Calcite.

 Andrew


 On Tue, Jan 24, 2023 at 8:24 PM Talat Uyarer <
 tuya...@paloaltonetworks.com> wrote:

> Hi Andrew,
>
> Thanks for writing a test for this use case. Without the WHERE clause it
> works as expected on our test cases too. Please add a WHERE clause to the
> second SELECT. With the query below it does not return column names. I
> tested on my local as well.
>
> WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM
> PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable WHERE
> id > 1
>
> Thanks
>
> On Tue, Jan 24, 2023 at 5:28 PM Andrew Pilloud 
> wrote:
>
>> +dev@beam.apache.org 
>>
>> I tried reproducing this but was not successful; the output schema
>> was as expected. I added the following to BeamSqlMultipleSchemasTest.java
>> at head. (I did discover
>> that PAssert.that(result).containsInAnyOrder(output) doesn't validate
>> column names, however.)
>>
>>   @Test
>>   public void testSelectAs() {
>>     PCollection<Row> input = pipeline.apply(create(row(1, "strstr")));
>>
>>     PCollection<Row> result =
>>         input.apply(SqlTransform.query("WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable"));
>>
>>     Schema output_schema =
>>         Schema.builder().addInt32Field("fout_int").addStringField("fout_string").build();
>>     assertThat(result.getSchema(), equalTo(output_schema));
>>
>>     Row 

Re: Beam SQL Alias issue while using With Clause

2023-01-27 Thread Andrew Pilloud via dev
Also, this is at the very least a Beam bug. You can file a Beam issue if you
want; otherwise I will when I get back.

Andrew

On Fri, Jan 27, 2023 at 11:27 AM Andrew Pilloud  wrote:

> Hi Talat,
>
> I did get your test case running and added some logging to
> RexProgramBuilder.mergePrograms. There is only one merge that occurs during
> the test, and it has an output type of RecordType(JavaType(int) ID,
> JavaType(class java.lang.String) V). This does seem like the correct output
> name, but it doesn't match the final output name, so something is still
> different from the Beam test case. I also modified mergePrograms to
> purposely corrupt the output names; that did not cause the test to fail or
> trip the 'assert mergedProg.getOutputRowType() ==
> topProgram.getOutputRowType();' in mergePrograms. I could not find any
> Calcite unit tests for RexProgramBuilder.mergePrograms or the
> CoreRules.CALC_MERGE rule, so I think it is still probable that the problem
> is in this area.
>
> One minor issue I encountered: it took me a while to get your test case
> running. There don't appear to be any Calcite Gradle rules to run
> CoreQuidemTest, and constructing the classpath manually was tedious. Did I
> miss something?
>
> I'm still working on this, but I'm out today and Monday; it will probably
> be Wednesday before I make any more progress.
>
> Andrew
>
> On Fri, Jan 27, 2023 at 10:40 AM Talat Uyarer <
> tuya...@paloaltonetworks.com> wrote:
>
>> Hi Andrew,
>>
>> Yes, this aligns with my debugging. In my reply to Kenn you can see a
>> SQL test which I wrote in Calcite. Somehow Calcite does not have this issue
>> in the 1.28 version.
>>
>> !use post
>> !set outputformat mysql
>>
>> #Test aliases with with clause
>> WITH tempTable(id, v) AS (select "hr"."emps"."empid" as id, 
>> "hr"."emps"."name" as v from "hr"."emps")
>> SELECT tempTable.id as id, tempTable.v as "value" FROM tempTable WHERE 
>> tempTable.v <> '11' ;
>> +-----+-----------+
>> | ID  | value     |
>> +-----+-----------+
>> | 100 | Bill      |
>> | 110 | Theodore  |
>> | 150 | Sebastian |
>> | 200 | Eric      |
>> +-----+-----------+
>> (4 rows)
>>
>> !ok
>>
>>
>> On Wed, Jan 25, 2023 at 6:08 PM Andrew Pilloud 
>> wrote:
>>
>>> Yes, that worked.
>>>
>>> The issue does not occur if I disable all of the following planner
>>> rules: CoreRules.FILTER_CALC_MERGE, CoreRules.PROJECT_CALC_MERGE,
>>> LogicalCalcMergeRule.INSTANCE (which wraps CoreRules.CALC_MERGE),
>>> and BeamCalcMergeRule.INSTANCE (which wraps CoreRules.CALC_MERGE).
>>>
>>> All the rules share a common call to RexProgramBuilder.mergePrograms, so
>>> I suspect the problem lies there. I spent some time looking but wasn't able
>>> to find it by code inspection; it looks like this code path is doing the
>>> right thing with names. I'll spend some time tomorrow trying to reproduce
>>> this on pure Calcite.
>>>
>>> Andrew
>>>
>>>
>>> On Tue, Jan 24, 2023 at 8:24 PM Talat Uyarer <
>>> tuya...@paloaltonetworks.com> wrote:
>>>
 Hi Andrew,

 Thanks for writing a test for this use case. Without the WHERE clause it
 works as expected on our test cases too. Please add a WHERE clause to the
 second SELECT. With the query below it does not return column names. I
 tested on my local as well.

 WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM
 PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable WHERE
 id > 1

 Thanks

 On Tue, Jan 24, 2023 at 5:28 PM Andrew Pilloud 
 wrote:

> +dev@beam.apache.org 
>
> I tried reproducing this but was not successful; the output schema was
> as expected. I added the following to BeamSqlMultipleSchemasTest.java at
> head. (I did discover that PAssert.that(result).containsInAnyOrder(output)
> doesn't validate column names, however.)
>
>   @Test
>   public void testSelectAs() {
>     PCollection<Row> input = pipeline.apply(create(row(1, "strstr")));
>
>     PCollection<Row> result =
>         input.apply(SqlTransform.query("WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable"));
>
>     Schema output_schema =
>         Schema.builder().addInt32Field("fout_int").addStringField("fout_string").build();
>     assertThat(result.getSchema(), equalTo(output_schema));
>
>     Row output = Row.withSchema(output_schema).addValues(1, "strstr").build();
>     PAssert.that(result).containsInAnyOrder(output);
>     pipeline.run();
>   }
>
> On Tue, Jan 24, 2023 at 8:13 AM Talat Uyarer <
> tuya...@paloaltonetworks.com> wrote:
>
>> Hi Kenn,
>>
>> Thank you for replying to my email.
>>
>> I was under the same impression about Calcite, but I wrote a test on
>> Calcite 1.28 too. It works without the issue that I see on Beam.
>>
>> Here is my test case. If you want you can 

Re: Beam SQL Alias issue while using With Clause

2023-01-27 Thread Andrew Pilloud via dev
Hi Talat,

I did get your test case running and added some logging to
RexProgramBuilder.mergePrograms. There is only one merge that occurs during
the test, and it has an output type of RecordType(JavaType(int) ID,
JavaType(class java.lang.String) V). This does seem like the correct output
name, but it doesn't match the final output name, so something is still
different from the Beam test case. I also modified mergePrograms to
purposely corrupt the output names; that did not cause the test to fail or
trip the 'assert mergedProg.getOutputRowType() ==
topProgram.getOutputRowType();' in mergePrograms. I could not find any
Calcite unit tests for RexProgramBuilder.mergePrograms or the
CoreRules.CALC_MERGE rule, so I think it is still probable that the problem
is in this area.

One minor issue I encountered: it took me a while to get your test case
running. There don't appear to be any Calcite Gradle rules to run
CoreQuidemTest, and constructing the classpath manually was tedious. Did I
miss something?

I'm still working on this, but I'm out today and Monday; it will probably
be Wednesday before I make any more progress.

Andrew

On Fri, Jan 27, 2023 at 10:40 AM Talat Uyarer 
wrote:

> Hi Andrew,
>
> Yes, this aligns with my debugging. In my reply to Kenn you can see a
> SQL test which I wrote in Calcite. Somehow Calcite does not have this issue
> in the 1.28 version.
>
> !use post
> !set outputformat mysql
>
> #Test aliases with with clause
> WITH tempTable(id, v) AS (select "hr"."emps"."empid" as id, 
> "hr"."emps"."name" as v from "hr"."emps")
> SELECT tempTable.id as id, tempTable.v as "value" FROM tempTable WHERE 
> tempTable.v <> '11' ;
> +-----+-----------+
> | ID  | value     |
> +-----+-----------+
> | 100 | Bill      |
> | 110 | Theodore  |
> | 150 | Sebastian |
> | 200 | Eric      |
> +-----+-----------+
> (4 rows)
>
> !ok
>
>
> On Wed, Jan 25, 2023 at 6:08 PM Andrew Pilloud 
> wrote:
>
>> Yes, that worked.
>>
>> The issue does not occur if I disable all of the following planner rules:
>> CoreRules.FILTER_CALC_MERGE, CoreRules.PROJECT_CALC_MERGE,
>> LogicalCalcMergeRule.INSTANCE (which wraps CoreRules.CALC_MERGE),
>> and BeamCalcMergeRule.INSTANCE (which wraps CoreRules.CALC_MERGE).
>>
>> All the rules share a common call to RexProgramBuilder.mergePrograms, so
>> I suspect the problem lies there. I spent some time looking but wasn't able
>> to find it by code inspection; it looks like this code path is doing the
>> right thing with names. I'll spend some time tomorrow trying to reproduce
>> this on pure Calcite.
>>
>> Andrew
>>
>>
>> On Tue, Jan 24, 2023 at 8:24 PM Talat Uyarer <
>> tuya...@paloaltonetworks.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> Thanks for writing a test for this use case. Without the WHERE clause it
>>> works as expected on our test cases too. Please add a WHERE clause to the
>>> second SELECT. With the query below it does not return column names. I
>>> tested on my local as well.
>>>
>>> WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM
>>> PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable WHERE
>>> id > 1
>>>
>>> Thanks
>>>
>>> On Tue, Jan 24, 2023 at 5:28 PM Andrew Pilloud 
>>> wrote:
>>>
 +dev@beam.apache.org 

 I tried reproducing this but was not successful; the output schema was
 as expected. I added the following to BeamSqlMultipleSchemasTest.java at
 head. (I did discover that PAssert.that(result).containsInAnyOrder(output)
 doesn't validate column names, however.)

   @Test
   public void testSelectAs() {
     PCollection<Row> input = pipeline.apply(create(row(1, "strstr")));

     PCollection<Row> result =
         input.apply(SqlTransform.query("WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable"));

     Schema output_schema =
         Schema.builder().addInt32Field("fout_int").addStringField("fout_string").build();
     assertThat(result.getSchema(), equalTo(output_schema));

     Row output = Row.withSchema(output_schema).addValues(1, "strstr").build();
     PAssert.that(result).containsInAnyOrder(output);
     pipeline.run();
   }

 On Tue, Jan 24, 2023 at 8:13 AM Talat Uyarer <
 tuya...@paloaltonetworks.com> wrote:

> Hi Kenn,
>
> Thank you for replying to my email.
>
> I was under the same impression about Calcite, but I wrote a test on
> Calcite 1.28 too. It works without the issue that I see on Beam.
>
> Here is my test case. If you want, you can also run it on Calcite. Please
> put it under core/src/test/resources/sql as a text file and run the
> CoreQuidemTest class.
>
> !use post
> !set outputformat mysql
>
> #Test aliases with with clause
> WITH tempTable(id, v) AS (select "hr"."emps"."empid" as id, 
> "hr"."emps"."name" as v from "hr"."emps")
> SELECT tempTable.id as id, tempTable.v as 

Re: Beam SQL Alias issue while using With Clause

2023-01-25 Thread Andrew Pilloud via dev
Yes, that worked.

The issue does not occur if I disable all of the following planner rules:
CoreRules.FILTER_CALC_MERGE, CoreRules.PROJECT_CALC_MERGE,
LogicalCalcMergeRule.INSTANCE (which wraps CoreRules.CALC_MERGE),
and BeamCalcMergeRule.INSTANCE (which wraps CoreRules.CALC_MERGE).

All the rules share a common call to RexProgramBuilder.mergePrograms, so I
suspect the problem lies there. I spent some time looking but wasn't able
to find it by code inspection; it looks like this code path is doing the
right thing with names. I'll spend some time tomorrow trying to reproduce
this on pure Calcite.
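
As a sketch of the property I'm checking around that call (Calcite 1.28
API; the programs and RexBuilder come from the planner, so treat this as
illustrative rather than a self-standing test):

import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexProgram;
import org.apache.calcite.rex.RexProgramBuilder;

class MergeProgramsCheck {
  // mergePrograms is the code path shared by all four rules above. If field
  // names survive the merge, the merged output row type should carry the top
  // program's names (e.g. "id", "value"), not the bottom one's ("id", "v").
  static RexProgram mergeAndCheck(
      RexProgram topProgram, RexProgram bottomProgram, RexBuilder rexBuilder) {
    RexProgram merged =
        RexProgramBuilder.mergePrograms(topProgram, bottomProgram, rexBuilder);
    assert merged.getOutputRowType().getFieldNames()
        .equals(topProgram.getOutputRowType().getFieldNames());
    return merged;
  }
}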

Andrew


On Tue, Jan 24, 2023 at 8:24 PM Talat Uyarer 
wrote:

> Hi Andrew,
>
> Thanks for writing a test for this use case. Without the WHERE clause it
> works as expected on our test cases too. Please add a WHERE clause to the
> second SELECT. With the query below it does not return column names. I
> tested on my local as well.
>
> WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM
> PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable WHERE
> id > 1
>
> Thanks
>
> On Tue, Jan 24, 2023 at 5:28 PM Andrew Pilloud 
> wrote:
>
>> +dev@beam.apache.org 
>>
>> I tried reproducing this but was not successful; the output schema was as
>> expected. I added the following to BeamSqlMultipleSchemasTest.java at head.
>> (I did discover that PAssert.that(result).containsInAnyOrder(output)
>> doesn't validate column names, however.)
>>
>>   @Test
>>   public void testSelectAs() {
>>     PCollection<Row> input = pipeline.apply(create(row(1, "strstr")));
>>
>>     PCollection<Row> result =
>>         input.apply(SqlTransform.query("WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable"));
>>
>>     Schema output_schema =
>>         Schema.builder().addInt32Field("fout_int").addStringField("fout_string").build();
>>     assertThat(result.getSchema(), equalTo(output_schema));
>>
>>     Row output = Row.withSchema(output_schema).addValues(1, "strstr").build();
>>     PAssert.that(result).containsInAnyOrder(output);
>>     pipeline.run();
>>   }
>>
>> On Tue, Jan 24, 2023 at 8:13 AM Talat Uyarer <
>> tuya...@paloaltonetworks.com> wrote:
>>
>>> Hi Kenn,
>>>
>>> Thank you for replying to my email.
>>>
>>> I was under the same impression about Calcite, but I wrote a test on
>>> Calcite 1.28 too. It works without the issue that I see on Beam.
>>>
>>> Here is my test case. If you want, you can also run it on Calcite. Please
>>> put it under core/src/test/resources/sql as a text file and run the
>>> CoreQuidemTest class.
>>>
>>> !use post
>>> !set outputformat mysql
>>>
>>> #Test aliases with with clause
>>> WITH tempTable(id, v) AS (select "hr"."emps"."empid" as id, 
>>> "hr"."emps"."name" as v from "hr"."emps")
>>> SELECT tempTable.id as id, tempTable.v as "value" FROM tempTable WHERE 
>>> tempTable.v <> '11' ;
>>> +-----+-----------+
>>> | ID  | value     |
>>> +-----+-----------+
>>> | 100 | Bill      |
>>> | 110 | Theodore  |
>>> | 150 | Sebastian |
>>> | 200 | Eric      |
>>> +-----+-----------+
>>> (4 rows)
>>>
>>> !ok
>>>
>>>
>>> On Mon, Jan 23, 2023 at 10:16 AM Kenneth Knowles 
>>> wrote:
>>>
 Looking at the code that turns a logical CalcRel into a BeamCalcRel I
 do not see any obvious cause for this:
 https://github.com/apache/beam/blob/b3aa2e89489898f8c760294ba4dba2310ac53e70/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamCalcRule.java#L69
 

 I don't like to guess that upstream libraries have the bug, but in this
 case I wonder if the alias is lost in the Calcite optimizer rule for
 merging the projects and filters into a Calc.

 Kenn

 On Mon, Jan 23, 2023 at 10:13 AM Kenneth Knowles 
 wrote:

> I am not sure I understand the question, but I do see an issue.
>
> Context: "CalcRel" is an optimized relational operation that is
> somewhat like ParDo, with a small snippet of a single-assignment DSL
> embedded in it. Calcite will choose to merge all the projects and filters
> into the node, and then generate Java bytecode to directly execute the
> DSL.
>
> Problem: it looks like the CalcRel has output columns with aliases
> "id" and "v" where it should have output columns with aliases "id" and
> "value".
>
> Kenn
>
> On Thu, Jan 19, 2023 at 6:01 PM Ahmet Altay  wrote:
>
>> Adding: @Andrew Pilloud  @Kenneth Knowles
>> 
>>
>> On Thu, Jan 12, 

Re: Beam SQL Alias issue while using With Clause

2023-01-24 Thread Andrew Pilloud via dev
+dev@beam.apache.org 

I tried reproducing this but was not successful; the output schema was as
expected. I added the following to BeamSqlMultipleSchemasTest.java at head.
(I did discover that PAssert.that(result).containsInAnyOrder(output)
doesn't validate column names, however.)

  @Test
  public void testSelectAs() {
    PCollection<Row> input = pipeline.apply(create(row(1, "strstr")));

    PCollection<Row> result =
        input.apply(SqlTransform.query("WITH tempTable (id, v) AS (SELECT f_int as id, f_string as v FROM PCOLLECTION) SELECT id AS fout_int, v AS fout_string FROM tempTable"));

    Schema output_schema =
        Schema.builder().addInt32Field("fout_int").addStringField("fout_string").build();
    assertThat(result.getSchema(), equalTo(output_schema));

    Row output = Row.withSchema(output_schema).addValues(1, "strstr").build();
    PAssert.that(result).containsInAnyOrder(output);
    pipeline.run();
  }

On Tue, Jan 24, 2023 at 8:13 AM Talat Uyarer 
wrote:

> Hi Kenn,
>
> Thank you for replying to my email.
>
> I was under the same impression about Calcite, but I wrote a test on
> Calcite 1.28 too. It works without the issue that I see on Beam.
>
> Here is my test case. If you want, you can also run it on Calcite. Please
> put it under core/src/test/resources/sql as a text file and run the
> CoreQuidemTest class.
>
> !use post
> !set outputformat mysql
>
> #Test aliases with with clause
> WITH tempTable(id, v) AS (select "hr"."emps"."empid" as id, 
> "hr"."emps"."name" as v from "hr"."emps")
> SELECT tempTable.id as id, tempTable.v as "value" FROM tempTable WHERE 
> tempTable.v <> '11' ;
> +-----+-----------+
> | ID  | value     |
> +-----+-----------+
> | 100 | Bill      |
> | 110 | Theodore  |
> | 150 | Sebastian |
> | 200 | Eric      |
> +-----+-----------+
> (4 rows)
>
> !ok
>
>
> On Mon, Jan 23, 2023 at 10:16 AM Kenneth Knowles  wrote:
>
>> Looking at the code that turns a logical CalcRel into a BeamCalcRel I do
>> not see any obvious cause for this:
>> https://github.com/apache/beam/blob/b3aa2e89489898f8c760294ba4dba2310ac53e70/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamCalcRule.java#L69
>> 
>>
>> I don't like to guess that upstream libraries have the bug, but in this
>> case I wonder if the alias is lost in the Calcite optimizer rule for
>> merging the projects and filters into a Calc.
>>
>> Kenn
>>
>> On Mon, Jan 23, 2023 at 10:13 AM Kenneth Knowles  wrote:
>>
>>> I am not sure I understand the question, but I do see an issue.
>>>
>>> Context: "CalcRel" is an optimized relational operation that is somewhat
>>> like ParDo, with a small snippet of a single-assignment DSL embedded in it.
>>> Calcite will choose to merge all the projects and filters into the node,
>>> and then generate Java bytecode to directly execute the DSL.
>>>
>>> Problem: it looks like the CalcRel has output columns with aliases "id"
>>> and "v" where it should have output columns with aliases "id" and "value".
>>>
>>> Kenn
>>>
>>> On Thu, Jan 19, 2023 at 6:01 PM Ahmet Altay  wrote:
>>>
 Adding: @Andrew Pilloud  @Kenneth Knowles
 

 On Thu, Jan 12, 2023 at 12:31 PM Talat Uyarer via user <
 u...@beam.apache.org> wrote:

> Hi All,
>
> I am using Beam 2.43 with Calcite SQL and Java.
>
> I have a query with a WITH clause and some aliasing. It looks like the
> Beam query optimizer, after optimizing my query, drops the SELECT
> statement's aliases. Can you help me identify where the problem is?
>
> This is my query:
> INFO: SQL:
> WITH `tempTable` (`id`, `v`) AS (SELECT
> `PCOLLECTION`.`f_nestedRow`.`f_nestedInt` AS `id`,
> `PCOLLECTION`.`f_nestedRow`.`f_nestedString` AS `v`
> FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`) (SELECT `tempTable`.`id`
> AS `id`, `tempTable`.`v` AS `value`
> FROM `tempTable` AS `tempTable`
> WHERE `tempTable`.`v` <> '11')
>
> This is the Calcite plan; note the LogicalProject(id=[$0], value=[$1]) in
> the SQL plan.
>
> Jan 12, 2023 12:19:08 PM org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner convertToBeamRel
> INFO: SQLPlan>
> LogicalProject(id=[$0], value=[$1])
>   LogicalFilter(condition=[<>($1, '11')])
>     LogicalProject(id=[$1.f_nestedInt], v=[$1.f_nestedString])
>       BeamIOSourceRel(table=[[beam, PCOLLECTION]])
>
> But the Beam plan does not have a LogicalProject(id=[$0], value=[$1]) or
> anything similar.
>
> Jan 12, 2023 12:19:08 PM
> 

Re: [Proposal] Change to Default PubsubMessage Coder

2022-12-19 Thread Andrew Pilloud via dev
I think the Dataflow update concern is the biggest one, and as you point
out it can be easily overcome. I generally believe that users who aren't
setting the coder don't actually care as long as it works, so changing the
default across Beam versions seems relatively low risk. Another mitigating
factor is that both concerns require users to actually be using the coder
(likely via Reshuffle.viaRandomKey) rather than consuming the output in a
fused ParDo (which I think is what we would recommend).

As a counter proposal: is there something that stops us from using RowCoder
by default here? I assume all forms of "PubsubMessage" can be encoded with
RowCoder, it provides flexibility for future changes, and PubSub will be
able to take advantage of future work to make RowCoder more efficient. (If
we can't switch to RowCoder, that seems like a useful feature request for
RowCoder.)
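
Either way, users who care about update compatibility can pin today's
default explicitly. A minimal sketch ("subscription" is a placeholder):

// Pin the current default coder so a future change of default cannot
// affect a Dataflow update.
PCollection<PubsubMessage> messages =
    pipeline
        .apply(PubsubIO.readMessagesWithAttributes().fromSubscription(subscription))
        .setCoder(PubsubMessageWithAttributesCoder.of());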

Andrew

On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin  wrote:

> Bump 
>
> Any other risks or drawbacks associated with altering the default coder
> for PubsubMessage to be the most inclusive coder with respect to possible
> fields?
>
> Thanks,
> Evan
>
>
> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin  wrote:
>
>> Hi folks,
>>
>> I'd like to solicit feedback on the notion of using
>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
>> default coder for Pubsub messages instead of the current default of
>> PubsubMessageWithAttributesCoder.
>>
>> Not long ago, support for reading and writing Pubsub messages including
>> an OrderingKey was added to Beam[2]. Part of this change involved adding
>> a new Coder for PubsubMessage in order to capture and propagate the
>> orderingKey[1]. This change illuminated that in cases where the coder type
>> for PubsubMessage is inferred, it is possible to accidentally and silently
>> nullify fields like MessageId and OrderingKey in a way that is not at all
>> obvious to users[3].
>>
>> So far two potential drawbacks of this proposal have been identified:
>> 1. Update compatibility for pipelines using PubsubIO might require users
>> to explicitly specify the current default coder
>> (PubsubMessageWithAttributesCoder)
>> 2. Messages would require a larger number of bytes to store as compared
>> to the current default (which could again be overcome by users specifying
>> the current default coder)
>>
>> What other potential drawbacks might there be? I look forward to hearing
>> others' input!
>>
>> Thanks,
>> Evan
>>
>> [1]
>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
>> [2] https://github.com/apache/beam/pull/22216
>> [3] https://github.com/apache/beam/issues/23525
>>
>


Re: [Proposal] Adopt a Beam I/O Standard

2022-12-16 Thread Andrew Pilloud via dev
By "Relational" I mean things like: Column Pruning, Filter Pushdown, Table
Statistics, Partition Metadata, Metastore. We have a bunch of one-off
implementations in various IOs (mostly BigQueryIO) and have been waiting
for IO standards to push them out to all IOs. This was section "F5 -
Relational" from https://s.apache.org/beam-io-api-standard-documentation

On Thu, Dec 15, 2022 at 6:50 PM Herman Mak  wrote:

> Hey all,
>
> Firstly apologies for the confusion.
>
> The scope of this effort is to *finalize and have this added to the Beam
> public documentation* to be used as a PR reference once we have resolved
> the comments.
> YES this document is a continuation of the below docs with some additional
> components such as testing!
>
> The idea is to convert this to a MD file and add a page under "Developing
> new I/O connectors" with some small cleanup work around this area in other
> pages.
>
>
>
>
> Docs that this is a continuation of:
> https://s.apache.org/beam-io-api-standard-documentation
> https://s.apache.org/beam-io-api-standard
>
>
> @Andrew Pilloud  Totally not intending to start from
> the beginning here; by relational, do you mean having this hosted in the
> Beam confluence?
>
> Thanks all, and keep the feedback on the docs coming
>
> Herman Mak |  Customer Engineer, Hong Kong, Google Cloud |
> herman...@google.com |  +852-3923-5417 <+852%203923%205417>
>
>
>
>
>
> On Fri, Dec 16, 2022 at 1:36 AM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Thu, Dec 15, 2022, 8:33 AM Alexey Romanenko 
>> wrote:
>>
>>> Cham, do you remember what the reason was for not finalizing that doc?
>>>
>>
>> I think this is a continuation of those docs (so we are trying to
>> finalize), but Herman can probably explain better.
>>
>>
>>> Personally, I find having such standards very useful (if they stay
>>> flexible over time, of course), especially for new developers and PR
>>> reviewers, and it'd be great to finally have such a doc as part of the
>>> contribution guide.
>>>
>>
>> +1
>>
>> Thanks,
>> Cham
>>
>>>
>>> —
>>> Alexey
>>>
>>> On 13 Dec 2022, at 04:32, Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>> Yeah, I don't think we either finalized or documented (on the website) the
>>> previous iteration. This doc seems to contain details from the documents
>>> shared in the previous iteration.
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>>
>>> On Mon, Dec 12, 2022 at 6:49 PM Robert Burke  wrote:
>>>
>>>> I think ultimately: until the docs are clearly available on the Beam site
>>>> itself, it's not documentation. See also, design docs, previous emails, and
>>>> similar.
>>>>
>>>> On Mon, Dec 12, 2022, 6:07 PM Andrew Pilloud via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> I believe the previous iteration was here:
>>>>> https://lists.apache.org/thread/3o8glwkn70kqjrf6wm4dyf8bt27s52hk
>>>>>
>>>>> The associated docs are:
>>>>> https://s.apache.org/beam-io-api-standard-documentation
>>>>> https://s.apache.org/beam-io-api-standard
>>>>>
>>>>> This is missing all the relational stuff that was in those docs; this
>>>>> appears to be another attempt starting from the beginning?
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Mon, Dec 12, 2022 at 9:57 AM Alexey Romanenko <
>>>>> aromanenko@gmail.com> wrote:
>>>>>
>>>>>> Thanks for writing this!
>>>>>>
>>>>>> IIRC, a similar design doc was sent for review here a while ago. Is
>>>>>> this an updated version or a new one?
>>>>>>
>>>>>> —
>>>>>> Alexey
>>>>>>
>>>>>> On 11 Dec 2022, at 15:16, Herman Mak via dev 
>>>>>> wrote:
>>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> *TLDR*
>>>>>>
>>>>>> Should we adopt a set of standards that Connector I/Os should adhere
>>>>>> to?
>>>>>> Attached is a first version of a Beam I/O Standards guideline that
>>>>>> includes opinionated best practices across important components

Re: [Proposal] Adopt a Beam I/O Standard

2022-12-12 Thread Andrew Pilloud via dev
I believe the previous iteration was here:
https://lists.apache.org/thread/3o8glwkn70kqjrf6wm4dyf8bt27s52hk

The associated docs are:
https://s.apache.org/beam-io-api-standard-documentation
https://s.apache.org/beam-io-api-standard

This is missing all the relational stuff that was in those docs; this
appears to be another attempt starting from the beginning?

Andrew


On Mon, Dec 12, 2022 at 9:57 AM Alexey Romanenko 
wrote:

> Thanks for writing this!
>
> IIRC, a similar design doc was sent for review here a while ago. Is this
> an updated version or a new one?
>
> —
> Alexey
>
> On 11 Dec 2022, at 15:16, Herman Mak via dev  wrote:
>
> Hello Everyone,
>
> *TLDR*
>
> Should we adopt a set of standards that Connector I/Os should adhere to?
> Attached is a first version of a Beam I/O Standards guideline that
> includes opinionated best practices across important components of a
> Connector I/O, namely Documentation, Development and Testing.
>
> *The Long Version*
>
> Apache Beam is a unified open-source programming model for both batch and
> streaming. It runs on multiple platform runners and integrates with over 50
> services using individually developed I/O Connectors.
>
> Given that Apache Beam connectors are written by many different developers
> and at varying points in time, they vary in syntax style, documentation
> completeness and testing done. For a new adopter of Apache Beam, that can
> definitely cause some uncertainty.
>
> So should we adopt a set of standards that Connector I/Os should adhere
> to?
> Attached is a first version, in Doc format, of a Beam I/O Standards
> guideline that includes opinionated best practices across important
> components of a Connector I/O, namely Documentation, Development and
> Testing. The aim is to incorporate this into the documentation and to
> have it referenced as the standard for new Connector I/Os (and ideally have
> existing Connectors upgraded over time). If it looks helpful, the immediate
> next step is that we can convert it into a .md file as a PR into the Beam repo!
>
> Thanks and looking forward to feedbacks and discussion,
>
>  [PUBLIC] Beam I/O Standards
> 
>
> Herman Mak |  Customer Engineer, Hong Kong, Google Cloud |
> herman...@google.com |  +852-3923-5417 <+852%203923%205417>
>
>
>
>


Re: Performance and Cost benchmarking

2022-09-26 Thread Andrew Pilloud via dev
Hi Pranav,

I left some comments on your design. Your doc discusses a bunch of
details about infrastructure such as testing frameworks, automation,
and performance databases, but doesn't describe how it will fit in
with our existing infrastructure (Load Tests, Nexmark, Jenkins,
InfluxDB, Grafana). I suspect we actually have most of the
infrastructure already built?

What I didn't see (and expected to see) was details on how the tests
would actually interact with IOs. Will there be a generic Schema IO
test harness, or do you plan to write one for each IO? Will you be
comparing different data types (data stored as byte[] vs. more complex
structures)? What about IO-specific optimizations (data
sharding, pushdown)?

Andrew

On Mon, Sep 26, 2022 at 9:07 AM Pranav Bhandari
 wrote:
>
> Hello,
>
> Hope this email finds you well. I have attached a link to a doc which
> discusses the design for a performance and cost benchmarking framework to be
> used by Beam IOs and Google-provided Dataflow templates.
>
> Please feel free to comment on the doc with any questions, concerns or ideas 
> you might have.
>
> Thank you,
> Pranav Bhandari
>
>
> https://docs.google.com/document/d/14GatBilwuR4jJGb-ZNpYeuB-KkVmDvEm/edit?usp=sharing=102139643796739130048=true=true